An Introduction to Hadoop's History

Jim Sun

Hadoop grew out of the Nutch search engine project, created by Doug Cutting and Mike Cafarella. Inspired by two Google papers, on the Google File System and MapReduce, Cutting and Cafarella realized in 2005 that they could place Nutch on top of a processing framework that ran across many computers rather than on a single workstation. This became the initial shape of Hadoop's distributed processing framework. Since then, Hadoop has become an open-source project that attracts a large community of part-time developers to contribute to it.

In 2006, Hadoop was split out of Nutch to become a standalone project, and Yahoo established a dedicated team led by Cutting to develop and maintain it. By early 2008, Hadoop had reached web scale, running on thousands of nodes. In April 2008, one of Yahoo's Hadoop clusters sorted 1 TB of data in 209 seconds, beating the previous record of 297 seconds; it was the first time that either a Java program or an open-source program had won the benchmark.

Starting in 2009, Hadoop began to grow, with many more components being added to the Hadoop family:

  • Hadoop Core was renamed Hadoop Common.
  • MapReduce was separated from HDFS.
  • Avro and Chukwa were added as new Hadoop subprojects.

Hadoop's set of components expanded rapidly from 2010 onward, with the end result being the Hadoop landscape you see today:

  • In May 2010, Avro and HBase became top-level Apache projects.
  • In September 2010, Hive and Pig became top-level projects.
  • In January 2011, ZooKeeper became a top-level project.

In December 2011, Hadoop version 1.0.0 was released, and on November 18, 2014, the latest version as of this writing, 2.6.0, followed. Hadoop could not have developed so fast without everyone's effort: its many part-time developers have made this open-source project one of the most influential big data tools in the world. Among Hadoop's contributors, there are also some "full-time ones":

  • Cloudera: its distribution (CDH) packages Hadoop as a secure, commercially supported platform for enterprise use. CDH offers configuration tools to help users deploy Hadoop more easily, and Cloudera has published a great deal of code for both Hadoop beginners and experienced developers.
  • Datameer: Datameer Analytics Solution (DAS) provides an easy BI solution with a straightforward UI. It connects Hadoop to almost any data source through JDBC, Hive, or other standard connectors.
  • Hortonworks: 50 of Hadoop's earliest and most prolific contributors, drawn from the original Yahoo Hadoop team, founded this independent company. That team developed a large part of the code for the Hadoop platform and will continue to provide guidance for the platform in the future.
  • Karmasphere: Karmasphere focuses on data mining and unstructured data analysis, covering web, mobile, social media data, and so on.
  • Oracle, IBM, Microsoft, HP, Amazon...: I list these IT giants here because, although their businesses are not focused entirely on big data, each of them has published products for the big data market.