Tutorial of “How to Refine and Visualize Consumer Sentiment Data”

Author: 
Selwyn Zhou

INTRODUCTION:

This post is mostly based on the Hortonworks tutorial “how to refine and visualize sentiment data”: http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-senti.... But there are several mistakes in the original blogs. In this blog, I will show you the whole step-by-step process and demonstrate solutions for the errors you may encounter so that you can easily analyze consumer sentiment data.

Now let’s get started.

Prerequisites:

  • Hortonworks Sandbox v2.1, Oracle VM
  • Hortonworks ODBC Driver, Raw Sentiment Data (from Hortonworks), WinSCP

Step 1: Download and Extract the Sentiment Tutorial Files

Step 2: Load Twitter Data into the Hortonworks Sandbox

Step 3: Copy a Hive Script to the Sandbox

Windows 7: Copy the files to the Sandbox

On Windows you will need to download and install the free WinSCP application.

  • Open WinSCP and type in the following settings, then click Login.
  • Host name: 127.0.0.1
  • Port: 2222
  • User name: root
  • Password: Hadoop

Use the WinSCP file browser to navigate to the SentimentFiles\hive folder in the left-hand pane, and to the Sandbox /root folder in the right-hand pane. Drag-and-drop the hiveddl.sql file from the SentimentFiles\hive folder to the /root folder on the Sandbox.

Step 4: Run the Hive Script to Refine the Raw Data;

In the Hortonworks Sandbox virtual machine (VM) console window, press the Alt and F5 keys, then log into the Sandbox using the following user name and password:Login: root\ Password: hadoop

After you log in, the command prompt will appear with the prefix [root@sandbox ~]#:

At the command prompt, type in the following command, then press the Enter key:hive -f hiveddl.sql

Lines of text appear as the script runs a series of MapReduce jobs. It will take a few minutes for the script to finish running. When the script has finished running, the time taken is displayed, and the normal command prompt appears.

The hiveddl.sql script has performed the following steps to refine the data:

  • Converted the raw Twitter data into a tabular format.
  • Used the dictionary file to score the sentiment of each Tweet by the number of positive words compared to the number of negative words, and then assigned a positive, negative, or neutral sentiment value to each Tweet.
  • Created a new table that includes the consumer sentiment data value for each Tweet.

We can look at the data using the Hive command line. We can start Hive by typing hive at the prompt.

Let’s use HCatalog to take a quick look at the data.

  • Open the Sandbox HUE user interface, then click HCatalog in the menu at the top of the page. Select the check box next to the “tweetsbi” table, then click Browse Data. The “tweetsbi” table is the table created by the Hive script that added a column with the sentiment value for each tweet. (Note, you may need to scroll right to see all of the columns).

Now that we have refined Twitter data in a tabular format with sentiment ratings, we can access the data with Excel.

Step 5: Access the Refined Sentiment Data with Excel

  • In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.
  • At first, we have to make sure we have the connection built correctly.
  • Select the System DSN tab. The Sample Hortonworks Hive DSN should be selected by default; if not, select it.
  • Click Configure to continue.
  • On the Hortonworks Hive ODBC Driver DSN Setup window, type in the settings as shown in the image below. Type the IP address of the Hortonworks sandbox in the Host box. The Authentication mechanism should be set to User Name, and the sandbox user name should be entered in the User Name box (in this case the default user name, “hue”).
  • Click Test to test the configuration settings. If the test is successful, a confirmation message appears. Click OK to close the message box.
  • On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK. The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC.

    After the connection to the Sandbox is established, the Query Wizard appears. Select the “tweetsbi” table in the Available tables and columns box, then click the right arrow button to add the entire “tweetsbi” table to the query. Click Next to continue.

  • Select the “text” column in the “Columns in your query” box, then click the left arrow button to remove the text column.
  • After the “text” column has been removed, click Next to continue.
  • On the Filter Data screen, click Next to continue without filtering the data.
  • On the Sort Order screen, click Next to continue without setting a sort order.
  • Click Finish on the Query Wizard Finish screen to retrieve the query data from the Sandbox and import it into Excel.
  • On the Import Data dialog box, click OK to accept the default settings and import the data as a table.
  • The imported query data appears in the Excel workbook.

This is the procedure of the sample consumer sentiment data analysis. Of course, we had a lot of bugs and errors when we tried to finish it. There are three major problems you have to fix:

  • Execution error
  • Build ODBC connector
  • Load data from Hive to Microsoft query

In our previous article "Typical Issues with Hortonworks Database Configuration"(here), we have a detailed document to clarify these three mistakes, here I will just provide some supplementary data. For the first problem, we have to fix the bug of the .sql document which contained in the sample data, and for the third problem, we have to command to “select ” command to start loading. But for the second problem, here is another solution;

In this solution, I used the bridged adapter to build the ODBC connector. Finally, it works well and here is the screen shot of the procedure.

Then we type command “ifconfig” to get the latest IP address of our sandbox, here it is 192.168.0.5

Then we open the ODBC administrator to set up a new connection; It will succeed directly:

Then follow the same step to load the data from hive to excel, if you do not type the command “grant SELECT on table tweetsbi to user hue; You need to face third mistake, if not, you will find it works well.

Here is the result.

Hope this blog gave you more useful information and inspiration.

Now we know when you will have these issues and how to fix them. Once you master the ability to get data out of Twitter, you may want to explore it visually . You can find out how to do that with our FREE Guide on data visualuzation.