What Is Hortonworks and How Do We Install It?

Selwyn Zhou

What is Hortonworks:

Before we answer this question, we need a basic understanding of a recent buzzword: Hadoop.

What is Hadoop:

  • Hadoop is an open-source framework designed to address the three Vs of Big Data (volume, variety, and velocity). It enables applications to run across thousands of computationally independent machines, processing petabytes of data.
  • Hadoop handles petabytes of data, including most forms of unstructured data.
  • The velocity challenge of Big Data can be addressed by integrating appropriate tools with the Hadoop ecosystem, such as Vertica, SAP HANA, and others.

Since “Big Data” has been an increasingly hot trend and will remain so, many people regard Hadoop as the first choice for exploring it. Virtually all big data companies use Apache Hadoop in one way or another. To simplify the process of using Apache Hadoop, Hortonworks developed its own distribution, the Hortonworks Data Platform (HDP), whose core components are HDFS and MapReduce.
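To make the MapReduce model at HDP's core concrete, here is a minimal local Python sketch of a MapReduce-style word count. No Hadoop cluster is involved; the function names and sample documents are purely illustrative of the map, shuffle, and reduce phases.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # 3
```

On a real cluster, the mappers and reducers run in parallel on different nodes and read their input from HDFS, but the data flow is the same three stages shown here.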

Today, Hortonworks is one of the biggest vendors in the Hadoop ecosystem. It has also built a large network of partnerships with other Hadoop vendors, data visualization tools, database vendors, and more. On February 12th, the company announced that it had reached the impressive milestone of 1,000 partners. Meanwhile, two months after its IPO, its stock price had risen more than 50%, indicating positive market sentiment toward the company and its solutions.

Below is the architecture of Hortonworks' latest release, HDP 2.2:

Apache Hadoop YARN

  • Slide existing services onto YARN through ‘Slider’
  • GA release of HBase, Accumulo, and Storm on YARN
  • Support long running services: handling of logs, containers not killed when AM dies, secure token renewal, YARN Labels for tagging nodes for specific workloads
  • Support for CPU Scheduling and CPU Resource Isolation through CGroups

Apache Hadoop HDFS

  • Heterogeneous storage: Support for archival tier
  • Rolling Upgrade (This is an item that applies to the entire HDP Stack. YARN, Hive, HBase, everything. We now support comprehensive Rolling Upgrade across the HDP Stack).
  • Multi-NIC Support
  • Heterogeneous storage: Support memory as a storage tier (Tech Preview)
  • HDFS Transparent Data Encryption (Tech Preview)

Apache Hive, Apache Pig, and Apache Tez

  • Hive Cost Based Optimizer: Function Pushdown & Join re-ordering support for other join types: star & bushy.
  • Hive SQL enhancements, including:
      • ACID support: Insert, Update, Delete
      • Temporary tables
      • Metadata-only queries return instantly
  • Pig on Tez
  • Including DataFu for use with Pig
  • Vectorized shuffle
  • Tez Debug Tooling & UI
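Hive's new ACID support means standard INSERT, UPDATE, and DELETE statements now work on transactional tables. The SQL shape of these operations can be sketched locally with Python's built-in sqlite3 module; the table and column names below are made up for illustration, and real Hive additionally requires the table to be stored as ORC, bucketed, and declared transactional.

```python
import sqlite3

# In-memory database standing in for a transactional Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# INSERT, UPDATE, DELETE: the three statements Hive's ACID support covers.
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")
conn.execute("UPDATE users SET name = 'carol' WHERE id = 2")
conn.execute("DELETE FROM users WHERE id = 1")
conn.commit()

rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)  # [(2, 'carol')]
```

The point is that row-level mutation, long missing from Hive, can now be expressed in the familiar SQL vocabulary rather than by rewriting whole partitions.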

Apache HBase, Apache Phoenix, & Apache Accumulo

  • HBase & Accumulo on YARN via Slider
  • HBase HA
  • Replicas update in real-time
  • Fully supports region split/merge
  • Scan API now supports standby RegionServers
  • HBase Block cache compression
  • HBase optimizations for low latency
  • Phoenix Robust Secondary Indexes
  • Performance enhancements for bulk import into Phoenix
  • Hive over HBase Snapshots
  • Hive Connector to Accumulo
  • HBase & Accumulo wire-level encryption
  • Accumulo multi-datacenter replication

Apache Storm

  • Storm-on-YARN via Slider
  • Ingest & notification for JMS (IBM MQ not supported)
  • Kafka bolt for Storm – supports sophisticated chaining of topologies through Kafka
  • Kerberos support
  • Hive update support – Streaming Ingest
  • Connector improvements for HBase and HDFS
  • Deliver Kafka as a companion component
  • Kafka install, start/stop via Ambari
  • Security Authorization Integration with Ranger
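A Storm topology is a chain of spouts (stream sources) and bolts (stream transformations), which is what "chaining of topologies through Kafka" builds on. Storm itself is JVM-based; the following is only a minimal local Python sketch of the spout-to-bolt data flow, with illustrative names and sample data.

```python
def spout():
    # Spout: emits an unbounded stream of tuples (here, a few log lines).
    for line in ["error disk full", "info ok", "error timeout"]:
        yield line

def split_bolt(stream):
    # Bolt 1: split each incoming line into words.
    for line in stream:
        for word in line.split():
            yield word

def count_bolt(stream):
    # Bolt 2 (terminal): keep a running count per word.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_bolt(split_bolt(spout()))
print(counts["error"])  # 2
```

In real Storm, each spout and bolt runs as many parallel tasks across the cluster, and a Kafka bolt at the end of one topology can feed the Kafka spout of the next.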

Apache Spark

  • Refreshed Tech Preview to Spark 1.1.0 (available now)
  • ORC File support & Hive 0.13 integration
  • Planned for GA of Spark 1.2.0
  • Operations integration via YARN ATS and Ambari
  • Security: Authentication
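Spark expresses computation as chained transformations (map, filter, reduce) over distributed collections called RDDs. The chain below is a plain-Python emulation of that style so it runs without a Spark installation; in PySpark the same pipeline would be written against an RDD object.

```python
from functools import reduce

data = [1, 2, 3, 4, 5]  # stands in for an RDD partitioned across the cluster

# Chained transformations, mirroring rdd.map(...).filter(...).reduce(...):
squared = map(lambda x: x * x, data)          # square every element
even = filter(lambda x: x % 2 == 0, squared)  # keep only even squares
total = reduce(lambda a, b: a + b, even)      # sum what remains
print(total)  # 4 + 16 = 20
```

The key difference from MapReduce is that Spark keeps intermediate results in memory between stages, which is why it suits iterative and interactive workloads.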

Apache Solr

  • Added Banana, a rich and flexible UI for visualizing time series data indexed in Solr


Cascading
  • Cascading 3.0 on Tez distributed with HDP — coming soon


Hue
  • Support for HiveServer 2
  • Support for Resource Manager HA

Apache Falcon

  • Authentication Integration
  • Lineage – now GA (previously a tech preview feature)
  • Improve UI for pipeline management & editing: list, detail, and create new (from existing elements)
  • Replicate to Cloud – Azure & S3

Apache Sqoop, Apache Flume & Apache Oozie

  • Sqoop import support for Hive types via HCatalog
  • Secure Windows cluster support: Sqoop, Flume, Oozie
  • Flume streaming support: sink to HCat on secure cluster
  • Oozie HA now supports secure clusters
  • Oozie Rolling Upgrade
  • Operational improvements for Oozie to better support Falcon
  • Capture workflow job logs in HDFS
  • Don’t start new workflows for re-run
  • Allow job property updates on running jobs

Apache Knox & Apache Ranger (Argus) & HDP Security

  • Apache Ranger – Support authorization and auditing for Storm and Knox
  • Introducing REST APIs for managing policies in Apache Ranger
  • Apache Ranger – Support native grant/revoke permissions in Hive and HBase
  • Apache Ranger – Support Oracle DB and storing of audit logs in HDFS
  • Apache Ranger to run on Windows environment
  • Apache Knox to protect YARN RM
  • Apache Knox support for HDFS HA
  • Apache Ambari install, start/stop of Knox

Apache Slider

  • On-demand creation and running of different versions of heterogeneous applications
  • Allow users to configure different application instances differently
  • Manage operational lifecycle of application instances
  • Expand / shrink application instances
  • Provide application registry for publish and discovery

Apache Ambari

  • Support for HDP 2.2 Stack, including support for Kafka, Knox and Slider
  • Enhancements to Ambari Web configuration management including: versioning, history and revert, setting final properties and downloading client configurations
  • Launch and monitor HDFS rebalance
  • Perform Capacity Scheduler queue refresh
  • Configure High Availability for ResourceManager
  • Ambari Administration framework for managing user and group access to Ambari
  • Ambari Views development framework for customizing the Ambari Web user experience
  • Ambari Stacks for extending Ambari to bring custom Services under Ambari management
  • Ambari Blueprints for automating cluster deployments
  • Performance improvements and enterprise usability guardrails
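An Ambari Blueprint is a JSON document describing a cluster layout, which is POSTed to Ambari's REST API to automate deployment. The sketch below shows the general shape of such a document; the blueprint name, host-group name, and component list are illustrative, and the full schema is defined by the Ambari documentation.

```python
import json

# Minimal blueprint: one host group running HDFS daemons on the HDP 2.2 stack.
blueprint = {
    "Blueprints": {
        "blueprint_name": "single-node",
        "stack_name": "HDP",
        "stack_version": "2.2",
    },
    "host_groups": [{
        "name": "master",
        "cardinality": "1",
        "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],
    }],
}

# In practice this payload would be POSTed to the Ambari REST endpoint
# for blueprints, then a cluster-creation request would map real hosts
# onto the "master" host group.
payload = json.dumps(blueprint)
print(len(json.loads(payload)["host_groups"]))  # 1
```

Because the whole cluster topology lives in one declarative document, the same blueprint can be replayed to stand up identical dev, test, and production clusters.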

Overall, it is important to take a strategic, methodical approach to your big data goals. Start with your objectives, and weigh your options. A big data partner like ATCG Solutions can help you.

To find out more about Hortonworks and Hadoop, and get the step-by-step installation guide, please click