Understand Hadoop Terminology 101

Maria Wei

Understand Big Data Hadoop Terminology

Flume – Flume allows users to:

  • Stream data from multiple sources into Hadoop for analysis
  • Collect high-volume Web logs in real time
  • Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
  • Guarantee data delivery Scale horizontally to handle additional data volume

Sqoop (SQL+Hadoop) –Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:

  • Allows data imports from external datastores and enterprise data warehouses into Hadoop
  • Parallelizes data transfer for fast performance and optimal system utilization
  • Copies data quickly from external systems to Hadoop
  • Makes data analysis more efficient
  • Mitigates excessive loads to external systems.

HDFS – Hadoop distributed file system. MapReduce – a programming model and associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

HBase – HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases.

Pig – a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets

Hive – data warehouse tool, use SQL like language (HiveQL), good for structured data.

Mahout – a machine learning framework, used to develop social network/E- commerce recommendations.

Apache oozie – workflow scheduler and management tool, can schedule and run Hadoop jobs in parallel.

Compare Hadoop Databases



Learn how to use hadoop to get social media data please click Below