Understand Hadoop Terminology 101
Understand Big Data Hadoop Terminology
Flume – Flume allows users to:
- Stream data from multiple sources into Hadoop for analysis
- Collect high-volume Web logs in real time
- Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
- Guarantee data delivery
- Scale horizontally to handle additional data volume
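The idea of insulating a destination from transient spikes can be illustrated with a small buffer that sits between a fast source and a slower sink. This is only a minimal sketch in plain Python, not Flume's actual API: the `BufferedChannel` class, its `put` and `drain` methods, and the capacity value are all hypothetical names chosen for illustration.

```python
from collections import deque


class BufferedChannel:
    """Hypothetical in-memory buffer between a fast source and a slower sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        # Refuse the event when the buffer is full, so the source can
        # retry later instead of the event being silently dropped.
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def drain(self, batch_size):
        # The sink removes up to batch_size events at its own pace.
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch


channel = BufferedChannel(capacity=5)
# A burst of 7 events arrives faster than the sink can write them.
accepted = [channel.put(f"log line {i}") for i in range(7)]
# The sink later writes a first batch of 2 events.
first_batch = channel.drain(batch_size=2)
```

In this sketch the buffer absorbs the burst up to its capacity, and the rejected events are signalled back to the source rather than lost, which is the general shape of the behavior described above.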
Sqoop (SQL+Hadoop) – Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:
- Allows data imports from external datastores and enterprise data warehouses into Hadoop
- Parallelizes data transfer for fast performance and optimal system utilization
- Copies data quickly from external systems to Hadoop
- Makes data analysis more efficient
- Mitigates excessive load on external systems
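One way to picture parallelized transfer is to divide a table's key range into contiguous slices, so that each parallel task copies only its own slice. The sketch below is plain Python under that assumption; `split_key_range` and its parameters are hypothetical names and are not part of Sqoop itself.

```python
def split_key_range(min_id, max_id, num_tasks):
    """Divide the inclusive range [min_id, max_id] into contiguous slices,
    one per parallel task (fewer if there are not enough keys)."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_tasks)
    ranges = []
    start = min_id
    for i in range(num_tasks):
        # The first `extra` tasks take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Four tasks share a table whose keys run from 1 to 10.
slices = split_key_range(1, 10, 4)
```

Each task would then issue a bounded query for its own slice, which spreads the work evenly and keeps any single query against the external system small.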
HDFS – the Hadoop Distributed File System, which stores data across the machines of a cluster.
MapReduce – a programming model and associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
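The MapReduce programming model is often introduced with a word count. The following is a minimal single-process sketch in Python, not Hadoop's Java API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input lines are assumptions made for illustration. On a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


lines = ["big data big cluster", "data on a cluster"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many workers without changing the result.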
HBase – HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases.
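To see why a column-oriented layout suits sparse data, consider a store that records only the cells that actually have values. The sketch below uses nested Python dictionaries as a stand-in; it is not the HBase client API, and the `put` and `get` helpers, row keys, and column names are hypothetical.

```python
# Maps row key -> {column name -> value}; absent cells take no space.
table = {}


def put(row_key, column, value):
    # Store a single cell, creating the row on first use.
    table.setdefault(row_key, {})[column] = value


def get(row_key, column):
    # Return the cell value, or None if the cell was never written.
    return table.get(row_key, {}).get(column)


put("user1", "info:name", "Ana")
put("user1", "info:email", "ana@example.com")
put("user2", "info:name", "Raj")  # user2 has no email cell at all

stored_cells = sum(len(columns) for columns in table.values())
missing = get("user2", "info:email")
```

A fixed-schema table with the same two rows and two columns would reserve four cells, while this layout stores only the three that exist.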
Pig – a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Hive – a data warehouse tool that uses a SQL-like language (HiveQL) and is well suited to structured data.
Mahout – a machine learning framework, used to develop social network and e-commerce recommendations.
Apache Oozie – a workflow scheduler and management tool that can schedule and run Hadoop jobs in parallel.