Introduction

In the era of the Internet of Things and mobility, with huge volumes of data arriving at high velocity, there is a clear need for an efficient analytics system.

In addition, data arrives from a variety of sources and in a variety of formats: sensors, logs, structured records from an RDBMS, and so on. In the past few years, the generation of new data has increased drastically; more applications are being built, and they are generating more data at a faster rate.

Earlier, data storage was costly and there was no technology that could process the data efficiently. Now storage has become much cheaper, and the technology to process Big Data is readily available.

What is Big Data?

According to Dr. Kirk Borne, Principal Data Scientist, Big Data is "everything, quantified and tracked." Let's pick that apart -

All of these quantified and tracked data streams will enable -

Big Data defines three D2D's

The 10 V’s of Big Data

Big Data Framework

The best way to approach a solution is to "split the problem". A Big Data solution can be well understood using a layered architecture, which is divided into different layers where each layer performs a particular function.

This architecture helps in designing a data pipeline around the requirements of either a batch processing system or a stream processing system. It consists of six layers, which ensure a secure flow of data.

  1. Data Ingestion Layer — This layer is where data coming from various sources begins its journey. Data is prioritised and categorised here, which makes it flow smoothly through the later layers.
  2. Data Collector Layer — The focus in this layer is on transporting data from the ingestion layer to the rest of the data pipeline. This is the layer where components are decoupled so that analytic capabilities can begin.
  3. Data Processing Layer — The main focus of this layer is the pipeline's processing system: the data collected in the previous layer is processed here. We route the data to different destinations and classify the data flows, and it is the first point where analytics may take place.
  4. Data Storage Layer — Storage becomes a challenge when the size of the data you are dealing with grows large, and several solutions can address this problem. This layer focuses on where to store such large data efficiently.
  5. Data Query Layer — This is the layer where strong analytic processing takes place. The main focus here is to gather the value in the data so that it is more useful to the next layer.
  6. Data Visualization Layer — The visualization, or presentation, tier is probably the most important one, because it is where the pipeline's users experience the value of the data. We need something that grabs people's attention, pulls them in, and makes the findings well understood.
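To make the layering concrete, here is a minimal, purely illustrative Python sketch in which each layer is a small function and records flow through them in order. The function names and the sample record are assumptions for this sketch, not part of any specific product.

```python
# Toy pipeline: each stage stands in for one of the six layers.

def ingest():                      # Data Ingestion Layer: accept raw events
    yield {"source": "sensor-1", "temp_c": 21.5}
    yield {"source": "sensor-2", "temp_c": 34.0}

def collect(records):              # Data Collector Layer: hand records to the pipeline
    for r in records:
        yield r

def process(records):              # Data Processing Layer: classify / route
    for r in records:
        r["alert"] = r["temp_c"] > 30
        yield r

def store(records):                # Data Storage Layer: persist (here, an in-memory list)
    return list(records)

def query(stored):                 # Data Query Layer: analytic question over stored data
    return [r for r in stored if r["alert"]]

def visualize(results):            # Data Visualization Layer: present the findings
    print(f"{len(results)} alert(s) found")

visualize(query(store(process(collect(ingest())))))
```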

1. Data Ingestion Layer

Data ingestion is the first step in building a data pipeline, and also one of the toughest tasks in a Big Data system. In this layer we plan how to ingest data flowing from hundreds or thousands of sources into the data center, since the data arrives from multiple sources at variable speeds and in different formats.

That is why the data should be ingested properly to support successful business decision making. It is rightly said that if the start goes well, half of the work is already done.

1.1 What is Big Data Ingestion?

Big Data ingestion involves connecting to various data sources, extracting the data, and detecting changed data. It is about moving data, especially unstructured data, from where it originates into a system where it can be stored and analyzed.

We can also say that data ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed. It is the beginning of the data pipeline, where data is obtained or imported for immediate use.

Data can be streamed in real time or ingested in batches. When data is ingested in real time, each item is ingested as soon as it arrives. When data is ingested in batches, items are ingested in chunks at periodic intervals. Either way, ingestion is the process of bringing data into the data processing system.
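As a rough illustration of the difference, here is a minimal Python sketch; the simulated event source and the chunk size are assumptions, not part of any specific tool. The streaming path forwards each item the moment it arrives, while the batch path accumulates items and hands them over in chunks.

```python
import itertools
import time

def event_source():
    """Stand-in for incoming data; in practice this could be sensors, logs, or an API."""
    for i in itertools.count():
        yield {"id": i, "payload": f"event-{i}"}
        time.sleep(0.01)  # simulate arrival rate

def ingest_streaming(source, sink):
    """Real-time ingestion: forward each record as soon as it arrives."""
    for record in source:
        sink(record)

def ingest_batch(source, sink, chunk_size=100):
    """Batch ingestion: accumulate records and hand them over in chunks."""
    batch = []
    for record in source:
        batch.append(record)
        if len(batch) >= chunk_size:
            sink(batch)
            batch = []
    if batch:  # flush the final partial chunk
        sink(batch)

# Ingest the first 250 simulated events in batches of 100 (prints 100, 100, 50).
ingest_batch(
    itertools.islice(event_source(), 250),
    sink=lambda b: print(len(b), "records"),
    chunk_size=100,
)
```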

An effective data ingestion process begins by prioritizing data sources, validating individual files, and routing data items to the correct destination.

1.2 Challenges Faced with Data Ingestion

As the number of IoT devices increases, both the volume and the variety of data sources are expanding rapidly, so extracting the data in a form the destination system can use is a significant challenge in terms of time and resources. Some of the other challenges faced by data ingestion are -

That is why the ingestion process should be well designed, ensuring the following things -

1.3 Data Ingestion Parameters

1.4 Big Data Ingestion Key Principles

To carry out data ingestion, we should use the right tools, and most importantly, those tools should support the key principles listed below -

1.5 Data Serialization

Different types of users have different data consumption needs. Because we want to share varied data, we must plan how users can access it in a meaningful way; that is why a single, common representation of the data, one that keeps it readable, matters.

Approaches used for this are -
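To make the trade-off concrete, here is a minimal Python sketch that serializes the same record to human-readable JSON and to a compact binary form using only the standard library. The record itself is an illustrative assumption; in Big Data pipelines, cross-language binary formats such as Avro or Protocol Buffers typically play the second role.

```python
import json
import pickle

record = {"source": "sensor-1", "temp_c": 21.5, "ts": 1700000000}  # illustrative record

# Text serialization: human readable and language neutral, but comparatively verbose.
as_json = json.dumps(record)

# Binary serialization: compact and fast, but not human readable
# (pickle is also Python-specific; shared pipelines would use Avro, Protobuf, etc.).
as_binary = pickle.dumps(record)

print(len(as_json.encode("utf-8")), "bytes as JSON")
print(len(as_binary), "bytes as binary")
print(json.loads(as_json) == pickle.loads(as_binary))  # both round-trip to the same record
```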

1.6 Data Ingestion Tools

1.6.1 Apache Flume — Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.

It uses a simple, extensible data model that allows for online analytic applications. Its functions are -

1.6.2 Apache NiFi — Apache NiFi provides an easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. Its functions are -

1.6.3 Elastic Logstash — Logstash is an open-source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your "stash", i.e. Elasticsearch.

It easily ingests data from your logs, metrics, web applications, data stores, and various AWS services, all in a continuous, streaming fashion. It can ingest data of all shapes, sizes, and sources.

2. Data Collector Layer

In this layer, the focus is on transporting data from the ingestion layer to the rest of the data pipeline. Here we use a messaging system that acts as a mediator between all the programs that send and receive messages.

The tool used here is Apache Kafka, a newer approach to message-oriented middleware.

2.1 Apache Kafka

It is used for building real-time data pipelines and streaming apps. It can process streams of data in real time and store them safely in a distributed, replicated cluster.

Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.

2.2 What is Data Pipeline?

2.2.1 Functions of Data Pipeline

2.2.2 Need Of Data Pipeline

A Data Pipeline is software that takes data from multiple sources and makes it available to be used strategically for making business decisions.

The primary reason a data pipeline is needed is that monitoring data migration and managing data errors is very hard otherwise. Other reasons are listed below -

2.2.3 Use cases for Data Pipeline

A data pipeline is useful to a number of roles, including CTOs, CIOs, data scientists, data engineers, BI analysts, SQL analysts, and anyone else who derives value from a unified, real-time stream of user, web, and mobile engagement data. Some use cases for a data pipeline are given below -

2.3 Apache Kafka is Good for 2 Things

2.3.1 Common use cases of Apache Kafka -

2.3.2 Features of Apache Kafka

2.3.3 How Apache Kafka Works

Kafka is designed as a distributed commit log, where incoming data is written sequentially to disk. There are four main components involved in moving data in and out of Apache Kafka -
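Producers and consumers are two of those components. Here is a minimal sketch using the kafka-python client; the broker address, topic name, and payload are assumptions for the example, not part of the article's setup.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: append JSON events to a topic (broker and topic names are assumptions).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-events", {"source": "sensor-1", "temp_c": 21.5})
producer.flush()

# Consumer: read the topic from the beginning and process records in arrival order.
consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```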

3. Data Processing Layer

In the previous layer, we gathered the data from different sources and made it available to the rest of the pipeline.

In this layer, our task is to work with that data: the data collected in the previous layer is processed here and routed to different destinations. The main focus is on specializing the pipeline's processing system.

Processing can be done in three ways -

3.1 Batch Processing System

This is a pure batch processing system for offline analytics. The tool used for this is Apache Sqoop.

3.2 Apache Sqoop

It efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores.

Apache Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
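As a rough sketch of what such a batch transfer looks like, the snippet below invokes Sqoop's import command from Python. The JDBC URL, credentials, table name, HDFS target directory, and mapper count are illustrative assumptions.

```python
import subprocess

# Pull the "orders" table from a MySQL database into HDFS as a batch job.
# Connection details, table, target directory, and mapper count are assumptions.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/sales",
        "--username", "etl_user",
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4",
    ],
    check=True,  # raise an error if the Sqoop job fails
)
```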

3.2.1 Functions of Apache Sqoop are -

3.3 Near Real Time Processing System

This is a pure online processing system for online analytics. The tool used for this type of processing is Apache Storm. A Storm cluster can, for example, decide how critical an event is and send alerts to an alerting system (a dashboard, e-mail, or other monitoring systems).

3.3.1 Apache Storm — It is a system for processing streaming data in real time. It adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.

3.3.2 Features of Apache Storm

3.4 Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared data set in Hadoop.
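As a brief illustration, here is a minimal PySpark sketch that loads a dataset and runs a simple aggregation; the file path and column name are assumptions for the example.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a cluster this would run on YARN.
spark = SparkSession.builder.appName("processing-layer-demo").getOrCreate()

# Load a JSON dataset from HDFS (path and schema are assumptions for this sketch).
events = spark.read.json("hdfs:///data/raw/events")

# A simple batch aggregation: count events per source.
events.groupBy("source").count().orderBy("count", ascending=False).show()

spark.stop()
```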

3.5 Apache Flink

Flink is an open-source framework for distributed stream processing that provides accurate results, even with out-of-order or late-arriving data. Some of its features are -
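For a first impression of the API, here is a minimal sketch using PyFlink's DataStream API on a small in-memory collection; the sample records and job name are assumptions, and a real job would read from a connector such as Kafka.

```python
from pyflink.datastream import StreamExecutionEnvironment

# Set up the streaming environment (runs locally here; on a cluster it is distributed).
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# A tiny in-memory stream standing in for a real source such as a Kafka topic.
readings = env.from_collection([("sensor-1", 21.5), ("sensor-2", 34.0), ("sensor-1", 19.9)])

# Keep only readings above a threshold and tag them.
alerts = readings.filter(lambda r: r[1] > 30.0).map(lambda r: f"ALERT {r[0]}: {r[1]}")

alerts.print()               # sink: write results to stdout
env.execute("flink-demo")    # launch the job
```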

3.5.1 Apache Flink Use Cases

4. Data Storage Layer

Next, the major issue is keeping data in the right place based on how it is used. Relational databases have been a successful place to store our data for years.

But with new, Big Data driven enterprise applications, you should no longer assume that your persistence layer must be relational.

We need different databases to handle different varieties of data, but using several databases creates overhead. That is why a new concept has been introduced in the database world: polyglot persistence.

4.1 Polyglot Persistence

Polyglot persistence is the idea of using multiple databases to power a single application: you divide your data across several databases and leverage their power together.

It takes advantage of the strengths of different databases, with different types of data arranged in different ways. In short, it means picking the right tool for the right use case.

It is the same idea as polyglot programming: applications should be written in a mix of languages to take advantage of the fact that different languages suit different problems.
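As a small sketch of the idea, the snippet below uses two stores side by side in one application: a key-value store (Redis) for fast, denormalized lookups and a relational database (PostgreSQL) for durable order records. The connection details, table, and keys are assumptions for the example.

```python
import psycopg2          # relational store: durable, transactional order data
import redis             # key-value store: low-latency counters and session data

# Connection details are assumptions for this sketch.
sessions = redis.Redis(host="localhost", port=6379)
orders_db = psycopg2.connect("dbname=shop user=app password=secret host=localhost")

def record_order(user_id: int, amount: float) -> None:
    """Write the durable fact to PostgreSQL and a fast-access counter to Redis."""
    with orders_db, orders_db.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (user_id, amount) VALUES (%s, %s)",
            (user_id, amount),
        )
    sessions.incr(f"user:{user_id}:order_count")  # cheap, denormalized lookup value

record_order(42, 19.99)
print(sessions.get("user:42:order_count"))
```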

4.1.1 Advantages of Polyglot Persistence -

4.2 Tools used for Data Storage

4.2.1 HDFS

4.2.1.1 Features of HDFS

4.2.2 GlusterFS (Gluster File System)

As we know, a good storage solution must provide elasticity in both capacity and performance without affecting active operations.

GlusterFS is a scalable network filesystem. Scale-out storage systems based on GlusterFS are suitable for unstructured data such as documents, images, audio and video files, and log files.

Using this, we can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks.

4.2.2.1 Use Cases For GlusterFS include

4.2.3 Amazon S3

5. Data Query Layer

This is the layer where strong analytic processing takes place. It is an area where interactive queries are necessary, and it is a zone traditionally dominated by expert SQL developers. Before Hadoop, storage was very limited, which made the analytics process slow.
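One common way to serve such interactive SQL queries over the stored data is through a distributed SQL engine. Here is a minimal sketch using Spark SQL; the path, view name, and query are assumptions, and engines such as Hive or Presto could play the same role.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-layer-demo").getOrCreate()

# Expose stored data to SQL users (the path and view name are assumptions).
spark.read.parquet("hdfs:///data/warehouse/orders").createOrReplaceTempView("orders")

# An interactive analytic query over the stored data.
top_customers = spark.sql("""
    SELECT user_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY user_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
top_customers.show()
```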

Continue reading the full article at XenonStack.com/Blog.