Overview of Fundamental Technology

Apache Hadoop is a software framework for storing and processing large volumes of data. It is designed for batch processing that harnesses the computational power of more than one physical machine, and for storing datasets that do not fit on a single node.

You may need to use Hadoop if your needs meet the following criteria:

Examples of tasks currently implemented on Hadoop:


Prerequisites for Hadoop Across Multiple Data Centers

As a rule, Hadoop lives in a single data center. But what if we need a highly available Hadoop-like service that can withstand the shutdown of an entire data center? Or perhaps you simply do not have enough racks in one data center and need to grow both storage and compute capacity. We faced a choice: place consumers on separate Hadoop clusters in different data centers (DCs), or build something special.

We chose to do something special, and so the concept we developed, tested, and implemented was born: Hadoop Multi Data Center (HMDC).

Hadoop Multi Data Center

As the name suggests, HMDC lives in multiple data centers. This solution makes the cluster resilient to the loss of capacity in any individual data center and distributes resources efficiently among consumers. HMDC provides two main services to consumers: the distributed file system HDFS and the resource manager YARN, which also acts as the task scheduler.

This technology allows for a much more fault-tolerant Hadoop cluster, although it has certain features, a brief overview of which is given below.

In each data center, there is one Master node and a set of worker Slave nodes. The Master nodes have the following components installed:

Both services operate in High-Availability mode on the Active-Standby principle: at any given moment, exactly one Master node serves client requests. If it becomes unavailable, another Master node switches to "Active" mode and takes over serving client requests. In this way, the cluster survives the outage of a data center (for example, DC-1) without losing functionality.
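The Active-Standby setup described above can be sketched as an HDFS NameNode HA configuration. This is a minimal illustration, not the authors' actual config: the nameservice id `hmdc`, the hostnames `master-dc1/2/3`, and the journal node addresses are assumed for the example; running three NameNodes under one nameservice requires Hadoop 3.x, which supports multiple standby NameNodes.

```xml
<!-- Sketch of hdfs-site.xml for NameNode HA spanning three DCs.
     All hostnames and the "hmdc" nameservice id are illustrative. -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>hmdc</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.hmdc</name>
    <value>nn1,nn2,nn3</value> <!-- one NameNode per data center -->
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hmdc.nn1</name>
    <value>master-dc1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hmdc.nn2</name>
    <value>master-dc2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hmdc.nn3</name>
    <value>master-dc3:8020</value>
  </property>
  <property>
    <!-- shared edit log via JournalNodes, one per DC -->
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://master-dc1:8485;master-dc2:8485;master-dc3:8485/hmdc</value>
  </property>
  <property>
    <!-- ZKFC performs the automatic Active/Standby switch -->
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.hmdc</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```

With this layout, clients address the logical nameservice `hdfs://hmdc/` rather than a concrete host, so a failover between master nodes is transparent to them.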

Other advantages of this “stretched” architecture include:


Distinctive features of HMDC

HDFS DataNodes (DN) and YARN NodeManagers (NM) run on machines functionally grouped as slave nodes. These are the main working machines of the cluster: the sum of the allocated disk volume (taking replication into account), RAM, and virtual cores on these machines makes up the total resource pool of the HMDC cluster.

Control services (HDFS NameNode, ZKFC, YARN ResourceManager, HiveServer2, and Hive Metastore) run on machines grouped as master nodes. Each data center has one master node, and each service is configured so that the three master nodes together provide High Availability for the service as a whole.
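For the YARN side of the master nodes, the same three-way HA idea can be sketched with ResourceManager HA. Again, this is an illustrative fragment, not the authors' config: the rm ids, hostnames, and ZooKeeper addresses are assumptions.

```xml
<!-- Sketch of yarn-site.xml for ResourceManager HA across three DCs.
     Hostnames and ZooKeeper quorum addresses are illustrative. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>hmdc</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2,rm3</value> <!-- one ResourceManager per data center -->
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master-dc1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master-dc2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm3</name>
    <value>master-dc3</value>
  </property>
  <property>
    <!-- leader election and state are coordinated via ZooKeeper -->
    <name>yarn.resourcemanager.zk-address</name>
    <value>master-dc1:2181,master-dc2:2181,master-dc3:2181</value>
  </property>
</configuration>
```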

Main Components of the System

Let's take a closer look at the main components of the system.

Benefits of HMDC

As I said at the very beginning, we were faced with a choice: create independent clusters, with each consumer located in their own data center (in our case there are three), or implement the HMDC concept. It is worth noting that despite the separation of consumers by data center, many consumers need the same data, so the data itself cannot be divided.

Let's compare these two paradigms:

|                           | Independent DCs          | HMDC                    |
|---------------------------|--------------------------|-------------------------|
| Data management           | ×3, mapping              | As a single cluster     |
| Error with data           | Possible loss            | Possible loss           |
| Data volume               | ×3 to ×9                 | ×3                      |
| Migration on DC loss      | Several weeks            | Hours, config editing   |
| Inter-DC data replication | Custom service           | HDFS functionality      |
| Support                   | 3 different domains      | Unified cluster         |
| Inequality of clusters    | Flexibility in uploading | Block Placement Policies |
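Block Placement Policies rely on HDFS knowing which DC and rack each DataNode belongs to, which is supplied by a topology script (`net.topology.script.file.name`). Below is a minimal sketch of such a script; the hostname pattern `slave-dc<N>-rack<M>-...` is an assumption for illustration, not the naming scheme from the article.

```python
#!/usr/bin/env python3
# Sketch of an HDFS topology script: Hadoop invokes it with DataNode
# hostnames/IPs as arguments and expects one /<dc>/<rack> path per host
# on stdout. The hostname pattern below is purely illustrative.
import re
import sys

DEFAULT_RACK = "/default/default-rack"

def to_rack(host: str) -> str:
    """Map a worker hostname like 'slave-dc1-rack3-node07' to '/dc1/rack3'."""
    m = re.match(r"slave-dc(\d+)-rack(\d+)", host)
    if not m:
        return DEFAULT_RACK
    return f"/dc{m.group(1)}/rack{m.group(2)}"

if __name__ == "__main__":
    # Hadoop may pass several hosts at once; answers are space-separated.
    print(" ".join(to_rack(h) for h in sys.argv[1:]))
```

With DC-aware paths like these, a block placement policy can be chosen or written so that replicas of each block end up in different data centers, which is what lets HMDC treat inter-DC replication as ordinary HDFS functionality.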


What Does HMDC Give Us as a Whole?