For most of us who have already heard of Big Data, the term Hadoop is nothing new. For those who are unaware of it, this blog post gives a brief overview of Hadoop, its architecture, and its use in the industry.
Hadoop: Hadoop is a software framework for processing large data sets in a distributed fashion across clusters of machines. It supports massive data storage and can run a very large number of tasks in parallel. Hadoop is open source software developed under the Apache Software Foundation. The framework breaks data into chunks (blocks) and stores them on commodity hardware. It is written mainly in Java. The Apache Hadoop framework includes four core modules. They are:
1. Hadoop Common: Contains the libraries and utilities used by the other Hadoop modules.
2. HDFS (Hadoop Distributed File System): A distributed file system that stores data on commodity hardware.
3. Hadoop YARN: YARN stands for ‘Yet Another Resource Negotiator’. It is responsible for managing compute resources on the cluster and scheduling jobs on them.
4. MapReduce: A programming model for large-scale data processing.
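To make the MapReduce programming model a little more concrete, here is a minimal sketch in plain Python of the classic word count example. It does not use Hadoop at all: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in a single process, whereas Hadoop would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

The point of the split is that each map call only needs one line and each reduce call only needs the values for one key, which is what lets a framework spread the work across machines.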
Along with these, there are additional software components that are installed and run on top of the Hadoop framework. All of these are part of the Hadoop ecosystem. They are:
Oozie: A workflow management system for scheduling Hadoop jobs.
Hive: Hive is the data warehousing layer of Hadoop. It provides a SQL-like query language called HiveQL, so Hive programming is similar to SQL. It was initially developed at Facebook.
HBase: HBase is a non-relational, distributed database. It is written in Java and is modeled on Google’s Bigtable.
Pig: Pig provides a language called Pig Latin for data extraction, transformation and loading. Pig scripts are turned into MapReduce jobs for you, so there is no need to write MapReduce programs by hand.
Sqoop: Sqoop is a tool for transferring data between Hadoop and traditional relational database systems, in both directions (import and export).
Flume: Flume is used mainly for unstructured data, usually from real-time streaming sources such as logs. Flume moves this data into the Hadoop Distributed File System.
Mahout: Apache Mahout provides implementations of machine learning algorithms such as clustering, classification and collaborative filtering.
Advantages and Disadvantages of Hadoop:
Advantages:
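1. Scalable: Capacity grows by adding more commodity machines to the cluster, rather than by replacing existing machines with bigger ones.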
2. Fast, flexible and fault tolerant: Because work runs in parallel across the cluster, Hadoop can process very large data sets quickly. The commonly quoted figure is terabytes in minutes and petabytes in hours, though actual times depend on cluster size and workload. It can store and process both structured and unstructured data from different sources. In HDFS the default replication factor is 3, which means each block is stored as 3 copies. These copies are spread across more than one server rack, so the data survives the loss of a node or even a whole rack.
3. Cost-effective: Hadoop uses a scale-out architecture on commodity hardware, so the cost of adding capacity such as storage space is low compared to the business insights the platform can provide.
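The replication factor of 3 mentioned under fault tolerance has a direct cost in raw storage, which a little arithmetic makes visible. The sketch below is only an illustration: the 128 MB block size is an assumption (it is configurable per cluster), and the helper names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # assumed block size; configurable per cluster
REPLICATION_FACTOR = 3   # replication factor from the text above


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Raw disk space used once every block has been replicated."""
    return file_size_mb * replication


def tolerated_replica_losses(replication=REPLICATION_FACTOR):
    """How many copies of a block can be lost while one copy still survives."""
    return replication - 1


if __name__ == "__main__":
    size_mb = 1024  # a 1 GB file
    print(block_count(size_mb), "blocks")
    print(raw_storage_mb(size_mb), "MB of raw storage")
    print(tolerated_replica_losses(), "replicas can be lost per block")
```

In other words, fault tolerance is paid for with disk: under these assumptions every gigabyte stored takes three gigabytes of raw capacity, which is one reason cheap commodity storage matters so much to the cost argument.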
Disadvantages:
1. Security: Hadoop’s security model is disabled by default because of the complexities involved. If whoever is managing the platform lacks the know-how to enable it, your data could be at serious risk. Early versions of Hadoop also lacked encryption at the storage and network levels, which is a major concern for government agencies and others that need to keep their data under wraps. Later releases added encryption, but it still has to be configured explicitly. Since the framework is written in Java, which has been a frequent target of exploits, there is also some added risk of data being compromised.
2. Not fit for small data:
Because Hadoop is designed for high capacity, HDFS does not handle small amounts of data efficiently. The latency for processing small data is close to that of processing big data. Small companies that do not need big data processing may therefore not find Hadoop a suitable platform, and it is not recommended for organizations with small quantities of data.
3. Stability issues:
Hadoop is an open source framework with contributions from many developers, so it does have its share of stability issues.
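The small-data point above can be illustrated with a toy cost model: if every job pays a fixed startup cost, a tiny job takes almost as long as a much larger one. The numbers below are purely hypothetical and are not measurements of any real cluster. They are chosen only to show the shape of the effect.

```python
STARTUP_OVERHEAD_S = 30.0    # hypothetical fixed cost per job (scheduling, task launch)
THROUGHPUT_MB_PER_S = 100.0  # hypothetical aggregate processing rate of the cluster


def job_time_s(data_mb, overhead_s=STARTUP_OVERHEAD_S, throughput=THROUGHPUT_MB_PER_S):
    """Toy model: total time = fixed overhead + time spent actually processing data."""
    return overhead_s + data_mb / throughput


def overhead_share(data_mb, overhead_s=STARTUP_OVERHEAD_S, throughput=THROUGHPUT_MB_PER_S):
    """Fraction of the total job time that is pure overhead."""
    return overhead_s / job_time_s(data_mb, overhead_s, throughput)


if __name__ == "__main__":
    for size_mb in (1, 1000, 100000):
        print(size_mb, "MB:", job_time_s(size_mb), "s,",
              round(overhead_share(size_mb) * 100, 1), "% overhead")
```

Under these made-up numbers, going from 1 MB to 1000 MB barely changes the total time, because the fixed cost dominates until the data gets large.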
In the next article, I will try to share more information on each of the modules in the Hadoop ecosystem and on the probable alternatives to Hadoop. I will also share insights into real-time analytics, which is fast becoming predominant (Hadoop does not process real-time data. It only processes stored data in batches).