In pursuit of happiness!: June 2013

I was researching graph analytics and its probable applications. Graph analysis and big data are overlapping areas, and along the way I came across this piece of text, which beautifully summarizes the difficulty of discovering the unknown.

"Many Big Data problems are about searching for things you know that you want to find. It's challenging because the volumes of data make it like searching for a needle in a haystack. But that’s easy because a needle and a piece of hay, though similar, do not look exactly alike.

However, discovering problems are about finding what you don't know. Imagine trying to find a needle in a stack of needles. That's even harder. How can you find the right needle if you don't know what it looks like? How can you discover something new if you don't know what you're looking for?

In order to find the unknown, you often have to know the right question to ask. It takes time and effort to ask every question and you keep learning as you continue to ask questions ....."

This happens to be true for any evolving technology that solves an existing problem in a novel way and, through its sheer sophistication, opens the door for us to ask for more, to be inquisitive and finally to discover.

Original source: YarcData

This article discusses the architecture of a cloud-based log management service. The LaaS (Logging-as-a-Service), code-named ‘Trolly’, will provide log management and data analysis. We will cover technical requirements, high-level architecture, data flow and technology selection based on open source software.

1.0 Modelling

The proposed architecture is a layered architecture, and each layer has components. Within layers, pipeline and event-driven architecture are employed.

1.1 Layered architecture

A multi-layered software architecture divides a system into layers and allocates a distinct set of responsibilities to each layer.

1.2 Pipeline architecture

A pipeline consists of a chain of processing elements (processes, threads etc.), arranged so that the output of each element is the input of the next.

A pipe is a message queue. A message can be anything. A filter is a process, thread, or other component that perpetually reads messages from an input pipe, one at a time, processes each message, then writes the result to an output pipe. Thus, it is possible to form pipelines of filters connected by pipes.
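The pipe-and-filter idea above can be sketched in a few lines of Python. This is only an illustration, not part of Trolly: the names `run_filter`, `build_pipeline`, `run_pipeline` and the `STOP` sentinel are made up for the example, and an in-process `queue.Queue` stands in for a real message queue.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel that tells a filter to shut down


def run_filter(transform, inbox, outbox):
    """Read messages from inbox one at a time, transform, write to outbox."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)  # propagate shutdown to the next filter
            break
        outbox.put(transform(message))


def build_pipeline(transforms):
    """Connect one filter per transform with pipes; return first and last pipe."""
    pipes = [queue.Queue() for _ in range(len(transforms) + 1)]
    for i, transform in enumerate(transforms):
        threading.Thread(
            target=run_filter,
            args=(transform, pipes[i], pipes[i + 1]),
            daemon=True,
        ).start()
    return pipes[0], pipes[-1]


def run_pipeline(transforms, messages):
    """Feed messages through the pipeline and collect the results in order."""
    source, sink = build_pipeline(transforms)
    for message in messages:
        source.put(message)
    source.put(STOP)
    results = []
    while True:
        item = sink.get()
        if item is STOP:
            return results
        results.append(item)
```

Each filter knows only its input and output pipe, so a stage can be replaced or a new one inserted without touching the others.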

1.3 Event driven architecture

Event-driven architecture (EDA) is a software architecture pattern promoting the production, detection, consumption of, and reaction to events.

An event can be defined as "a significant change in state". For this system, a line in a log file represents an event.

2. Motivation

Layered architecture helps in:

· Clear separation of responsibilities – search, alerting and visualization.

· Ability to replace the implementation of one or more layers with minimum effort and side effects.

An event-driven architecture is extremely loosely coupled and well distributed. Event-driven architectures are loosely coupled in space, time and synchronization, providing a scalable infrastructure for information exchange and distributed workflows.

3. Requirements

3.1 In scope

· Data Visualization

· Search

· Custom Alerts

3.2 Technical requirements

· Near-real time processing

· Multi-tenant – Support multiple clients

· High availability

· Scalable – The system must scale for the following reasons:

o New clients sign up

o Existing clients upgrade

3.3 Not covered in this article

· Client Registration

· Billing, Reporting and Archival

· User Management

· Third party integration

· Self-service – Automatic provisioning of hardware and software

· Multi-data center deployment and redundancy

· Client and Service geographic proximity

4.0 Trolly Architecture

Trolly High Level Architecture

Trolly has the following components:

i. Data collection

a. Adaptors for accepting input log streams

ii. Data access

a. Portal

b. API gateway

iii. Data pre-processing

a. Core processor

iv. Alerting

a. Complex event processing engine

v. Search service

a. Distributed indexer and search

vi. Log storage

a. Distributed data store

Each component is explained in the following sections.

Data Flow and Components Diagram

5. Concepts

5.1.1 Definitions

Client: A user of Trolly, uniquely identified by a Client Id.

Source: A log source, e.g. Apache HTTP logs. A client can configure multiple sources.

Stream: A log stream sent by a client and originating from a pre-defined source. A stream is keyed by a Stream Id and maps to exactly one source.
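One possible way to model these three definitions, as a minimal sketch. The class and field names (`source_id`, `description`, `open_stream` and so on) are assumptions made for illustration; the article does not prescribe a schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    source_id: str
    description: str  # e.g. "Apache HTTP logs"


@dataclass(frozen=True)
class Stream:
    stream_id: str
    source_id: str  # each stream maps to exactly one source


@dataclass
class Client:
    client_id: str
    sources: dict = field(default_factory=dict)  # source_id -> Source
    streams: dict = field(default_factory=dict)  # stream_id -> Stream

    def add_source(self, source):
        """Register a log source the client wants to send."""
        self.sources[source.source_id] = source

    def open_stream(self, stream_id, source_id):
        """Create a stream bound to one already configured source."""
        if source_id not in self.sources:
            raise KeyError(f"unknown source: {source_id}")
        stream = Stream(stream_id, source_id)
        self.streams[stream_id] = stream
        return stream
```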

5.1.2 Data Collection

The API gateway will accept logs from multiple end points such as rsyslog and REST over HTTP/HTTPS. It will authenticate the client and queue the log event for further pre-processing.

RabbitMQ will be used for messaging. A cluster setup with mirrored queues provides high availability.
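The authenticate-then-queue step could look roughly like the sketch below. It makes several assumptions that are not in the design: clients authenticate with an API key, the key store is a plain dictionary (`API_KEYS`), and an in-process `queue.Queue` stands in for the RabbitMQ queue so the example runs without a broker.

```python
import hmac
import queue

# Hypothetical credential store: client id -> API key.
API_KEYS = {"acme": "s3cret"}

# Stand-in for the RabbitMQ queue that feeds the core processor.
preprocess_queue = queue.Queue()


def authenticate(client_id, api_key):
    """Return True when the key matches the one on record for the client."""
    expected = API_KEYS.get(client_id)
    if expected is None:
        return False
    # compare_digest avoids leaking key contents through timing differences
    return hmac.compare_digest(expected.encode(), api_key.encode())


def accept_log(client_id, api_key, stream_id, line):
    """Authenticate the caller, then queue the raw log line."""
    if not authenticate(client_id, api_key):
        return False
    preprocess_queue.put(
        {"client_id": client_id, "stream_id": stream_id, "raw": line}
    )
    return True
```

In a real deployment the `put` call would be replaced by a publish to the broker; the gateway logic itself would stay the same.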

5.1.3 Data pre-processing

The core processor cleans the data, applies a regular expression chosen for the input format and converts each log event to JSON. Each processed log event is forked to three different locations: the CEP engine, the indexing engine and the data store.

The JSON event is:

i. Sent to CEP engine for alerting.

ii. Sent to indexing service.

iii. Stored in a distributed datastore.
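A minimal sketch of this step, assuming the input is in Apache Common Log Format. The regular expression, the field names and the `fan_out` helper are illustrative choices, and the three sinks are plain callables standing in for the CEP engine, the indexing service and the data store.

```python
import json
import re

# Apache Common Log Format: host ident user [time] "request" status size
COMMON_LOG = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def to_json(raw_line, client_id, stream_id):
    """Parse one raw log line into a JSON string, or None if it does not match."""
    match = COMMON_LOG.match(raw_line.strip())
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    event["client_id"] = client_id
    event["stream_id"] = stream_id
    return json.dumps(event, sort_keys=True)


def fan_out(json_event, sinks):
    """Send the same JSON event to every downstream sink."""
    for sink in sinks:
        sink(json_event)
```

Lines that do not match the expected format return `None` here; how such lines should be handled (dropped, stored raw, or flagged) is left open.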

5.1.4 Custom Alerts

Here we choose software that provides low-latency, high-throughput complex event processing and supports an HA configuration. Esper is a complex event processing library. It is stream oriented, supports HA and can process 100,000 to 500,000 events per second at an average latency of 3 microseconds, which makes it well suited to our requirements.

The system will create a new queue for every Stream defined by a client. This queue will act as a source for Esper. Custom user-supplied alerts are converted to Esper's DSL (EPL). Subscribers receive output events, which are:

i. Persisted in a MySQL database, and

ii. Published to third-party integration end-points, e.g. an SMS gateway.
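To make the kind of rule a client might define more concrete, here is a plain-Python stand-in for a sliding-window alert such as "more than N server errors within 60 seconds". It is not Esper code and does not use EPL; the `ThresholdAlert` class and its parameters are invented for this example, and Esper would evaluate an equivalent rule inside its own engine.

```python
from collections import deque


class ThresholdAlert:
    """Fire when more than `limit` matching events arrive within `window` seconds."""

    def __init__(self, predicate, limit, window):
        self.predicate = predicate
        self.limit = limit
        self.window = window
        self.timestamps = deque()

    def on_event(self, event, now):
        """Return True if this event pushes the window over the limit."""
        if not self.predicate(event):
            return False
        self.timestamps.append(now)
        # Drop matches that have fallen out of the time window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.limit
```

A `True` result corresponds to an output event that a subscriber would persist and publish to the integration end-points.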

5.1.5 Indexing and Search

The design must support near-real time indexing, high availability (HA) and multi-tenancy. To address scalability and HA, we use a distributed, cluster-based design. ElasticSearch is a good fit for the above requirements.

ElasticSearch supports multi-tenancy. Multiple indexes can be stored in a single ElasticSearch server, and data of multiple indexes can be searched with a single query.

5.1.5.1 Index Setup

The system will use indices based on time. For example, the index name can be the date in a format like <Client Id>-DD-MM-YYYY. Client Id is a unique identifier assigned to every client. The timeframe depends on how long the client intends to keep logs.

For storing a week's worth of logs, the system will maintain one index per day; for a year's worth of logs, it will maintain one index per month. This helps narrow searches to the relevant indices. Searches will often be for recent logs, so a “quick search” option in the GUI can look only in the latest index.
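The naming scheme can be sketched as follows. The daily format follows the <Client Id>-DD-MM-YYYY pattern given above; the monthly format (<Client Id>-MM-YYYY) and the function names are assumptions. ElasticSearch index names must be lowercase, so the sketch assumes lowercase client ids.

```python
from datetime import date, timedelta


def index_name(client_id, day, monthly=False):
    """Daily indices use <Client Id>-DD-MM-YYYY; monthly ones drop the day."""
    suffix = day.strftime("%m-%Y") if monthly else day.strftime("%d-%m-%Y")
    return f"{client_id}-{suffix}"


def indices_for_range(client_id, start, end, monthly=False):
    """List the distinct indices covering start..end inclusive, oldest first."""
    names = []
    day = start
    while day <= end:
        name = index_name(client_id, day, monthly)
        if not names or names[-1] != name:
            names.append(name)
        day += timedelta(days=1)
    return names
```

A search over a date range then only needs to query the indices returned by `indices_for_range`, and a “quick search” would query the last entry alone.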

Further, the BookKeeping module will rotate and optimize logs as well as delete old indexes.

5.1.5.2 High Availability

Each index will be divided into 5 or more shards. Each shard will have one replica. Replicas help the cluster continue running when one or more nodes go down, and they also serve search requests.
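As a rough illustration, the shard and replica counts map onto the `number_of_shards` and `number_of_replicas` index settings in ElasticSearch. The helper names below are made up for the example, and the values are the starting point suggested above rather than tuned numbers.

```python
import json


def index_settings(shards=5, replicas=1):
    """Build the settings body used when creating one time-based index."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


def total_shard_copies(shards, replicas):
    """Primaries plus replicas: the number of shard copies the cluster hosts."""
    return shards * (1 + replicas)


if __name__ == "__main__":
    print(json.dumps(index_settings(), indent=2))
```

With 5 shards and one replica each, every index occupies 10 shard copies, which is worth keeping in mind when many clients each get daily indices.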

5.1.5.3 Performance considerations

· Empirically determine the number of shards and replicas

· Refresh indexes asynchronously and automatically every few seconds

· Enable bulk indexing
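Bulk indexing sends many events in one request instead of one request per event. The sketch below only builds the newline-delimited request body that ElasticSearch's bulk API expects (an action line followed by a document line for each event); the function name is hypothetical, and older ElasticSearch versions may also require a `_type` in the action line.

```python
import json


def bulk_body(index, events):
    """Build a newline-delimited bulk request body for a batch of events."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event))  # document line
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```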


Proposed Elastic Search Topology

5.1.6 Log Analysis and Storage

If the log volume is low, the original log line can be compressed and stored in the search index. Since low volume is unlikely, the raw logs will instead be written to a distributed, highly available data store.

5.1.6.1 Log Analysis Software Stack

Hadoop is an open source framework that allows users to process very large data sets in parallel. The Hadoop core is mainly divided into two modules, HDFS and Map-Reduce; HBase and Pig build on top of them:

o HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster.

o Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is tightly coupled with HDFS.

o The database will be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, as we can keep historical processed data for reporting purposes. HBase is an open source, column-oriented NoSQL database that stores its data in HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets: HBase can hold very large tables with millions of rows. It is a distributed database and can also keep multiple timestamped versions of each cell.

o The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who would otherwise write Map-Reduce jobs by hand, typically in Java. In contrast, Pig lets users write jobs in a scripting language (Pig Latin).
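To show the shape of a Map-Reduce job without a Hadoop cluster, here is a single-process Python sketch that counts log events per HTTP status. It only mimics the map and reduce phases; the function names and the `status` field are assumptions, and on Hadoop the same logic would run as a distributed job or as a short Pig script that groups and counts.

```python
import json
from collections import defaultdict


def map_status(json_event):
    """Map step: emit a (status, 1) pair for one JSON log event."""
    event = json.loads(json_event)
    yield event["status"], 1


def reduce_counts(pairs):
    """Shuffle and reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_by_status(json_events):
    """Run the map step over every event, then reduce the emitted pairs."""
    pairs = (pair for line in json_events for pair in map_status(line))
    return reduce_counts(pairs)
```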

5.1.6.2 Data Flow

As discussed above, raw log lines are processed and stored as JSON in HBase. JSON will serve as a common input format for all visualization tools.

5.1.7 Data Access

5.1.7.1 Reporting and Visualization

For basic reporting, Google Charts or Highcharts are a good fit. For data visualization, D3 or the JavaScript InfoVis Toolkit can be used.

o For analyzing recent log events, HBase will be queried directly. For advanced analysis, HBase will hold data pre-processed by Pig scripts, ready for reporting and further slicing and dicing.

o The data-access Web service is a REST-based service that eases access and integration for data clients.

6. Summary

The proposed architecture attempts to show how large amounts of unstructured log events can be analyzed using Apache Hadoop, ElasticSearch and Esper.

Feel free to comment and share your thoughts about the proposed architecture.


Friday, June 28, 2013

The story of 'Big Data'

Wednesday, June 12, 2013

Graph Analytics: Discovering the undiscovered!

Monday, June 10, 2013

Architecture: Cloud based Log Management Service