Apache Hadoop


We are in the age of data, and there has been a precipitous increase in the size and scale of web data. Big data comes in distinct and complex unstructured formats, including information from social media, websites, email, and video. Organizations therefore need technologies that can accurately and efficiently analyze these big data formats to extract valuable intelligence from them. Google is one of the companies that processes data at this scale, using its Google File System and MapReduce frameworks (Team YS, 2012). These scalable frameworks motivated the Hadoop initiative. Apache Hadoop enables the distributed processing of big data across many machines.


Apache Hadoop consists of two subprojects: MapReduce and the Hadoop Distributed File System (HDFS). MapReduce is a software framework and programming model for writing applications that process large amounts of data in parallel on clusters of compute nodes. HDFS is the storage system that Hadoop applications use.
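The map, shuffle, and reduce phases that the framework runs across a cluster can be illustrated with a minimal single-process sketch, here as a word count in Python (the function names are illustrative stand-ins, not Hadoop's actual Java API):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all counts observed for one word."""
    return (key, sum(values))

lines = ["Hadoop stores big data", "Hadoop processes big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In a real Hadoop job, the map and reduce functions run on many nodes at once and the framework handles the shuffle, fault tolerance, and reading input splits from HDFS.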

The popularity of Hadoop has grown significantly because it meets the needs of most organizations. The framework provides flexible data analysis with an unmatched price-performance curve, and this flexibility applies to unstructured, semi-structured, and structured data alike (thinkbiganalytics.com, 2013). Hadoop's importance has been recognized wherever massive server farms collect data from many sources, because it can run large parallel queries as background jobs on a single server farm. The Hadoop ecosystem incorporates a number of tools to address specific needs, including Hive (a SQL dialect), ZooKeeper (a coordination service for distributed systems), Oozie (a workflow scheduler), and Avro, Thrift, and Protobuf (data description formats).

Hadoop has several advantages, since it can handle data from disparate systems regardless of the data's native format. Data that lives in unrelated systems can simply be dumped into a Hadoop cluster without applying a schema up front (thinkbiganalytics.com, 2013). Hadoop is also cost-effective compared with legacy systems, which were not engineered for today's large data sets and are expensive to scale. Hadoop keeps costs down by relying on redundant, replicated storage across industry-standard servers, whereas legacy systems are typically deployed on expensive, specialized data storage hardware.
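This "store now, structure later" approach is often called schema-on-read: raw records are written as-is, and a schema is imposed only at query time. A minimal sketch of the idea (the field names and records here are hypothetical, purely for illustration):

```python
import json

# Raw records dumped into storage as-is, with no schema enforced at write time.
raw_store = [
    '{"user": "ana", "clicks": 3}',
    '{"user": "bo", "clicks": "7"}',  # inconsistent type: clicks arrived as a string
]

def read_with_schema(raw_lines):
    """Apply a schema only when reading, coercing fields to expected types."""
    for line in raw_lines:
        record = json.loads(line)
        yield {"user": str(record["user"]), "clicks": int(record["clicks"])}

records = list(read_with_schema(raw_store))
total_clicks = sum(r["clicks"] for r in records)
print(total_clicks)  # 10
```

A traditional database would have rejected the second record at insert time; schema-on-read defers that decision, which is what makes it easy to land heterogeneous data in a cluster first and reconcile formats later.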


The Hadoop framework has proved its efficiency in many organizations, but as fast and cost-efficient as it is, legacy systems will still be required to complement it. The features of Hadoop make the technology attractive; it is therefore up to organizations to take advantage of the framework. With Hadoop, organizations can make important predictions by sorting and analyzing big data. There is no doubt that Hadoop is the core platform for structuring big data sets.
