Comet Big Data Tutorial¶
Goal: Deploy an analytics cluster and run Spark examples.
Time: ~ 15 min
Status of this Tutorial¶
Released for PEARC17 Comet VC Tutorial
Overview¶
In this tutorial you will deploy an analytics stack consisting of Hadoop and Spark and run some toy analyses on the cluster.
Requirements¶
- Basic knowledge of git, shell scripting, and Python.
- A Comet cluster
- IP addresses of the nodes
- An SSH-reachable user with admin privileges
Note
This tutorial uses vct01 as an example cluster name. Please substitute the name of your own cluster.
Preparation¶
On your VC front-end node clone the Big Data Stack repository:
FE $ git clone --recursive --branch pearc17 git://github.com/futuresystems/big-data-stack
FE $ cd big-data-stack
Next, create a virtual environment and install the dependencies:
FE $ virtualenv ~/BDS
FE $ source ~/BDS/bin/activate
(BDS) FE $ pip install -r requirements.txt
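To confirm that the dependencies were installed, you can check that Ansible is now available inside the virtual environment (a quick sanity check, assuming Ansible is among the packages pinned in requirements.txt):
(BDS) FE $ ansible --version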
The proper internal network interface has already been set for this tutorial. In the general case you may need to configure Ansible to use the internal network interface, which is done by setting bds_internal_iface to ens3 in group_vars/all.yml.
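For example, a minimal entry in group_vars/all.yml would look like the line below; ens3 is the interface name assumed in this tutorial, so verify the actual name on your nodes (e.g. with ip addr) before setting it:
bds_internal_iface: ens3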
The Ansible remote_user has been set properly for this tutorial, assuming you followed all the previous steps to set up the VC. For general use, please make sure to configure authentication yourself: edit ansible.cfg and ensure the following are set in the [defaults] section:
remote_user=YOUR_COMPUTE_USERNAME
become_ask_pass=True
Finally, generate the Ansible inventory file from your cluster's IP addresses. Note that your IP addresses may differ from those shown here.
(BDS) FE $ python mk-inventory -n bds- 10.0.0.1 10.0.0.2 >inventory.txt
Sanity Check¶
At this point you should be good to go, but make sure that Ansible is correctly configured by pinging the cluster.
(BDS) FE $ ansible all -m ping
bds-0 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
bds-1 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
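If a host does not respond, a direct SSH check helps distinguish a connectivity problem from an Ansible configuration problem (the hostname below is the one used later in this tutorial; substitute your own node name):
FE $ ssh vm-vct01-00 hostname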
Deploy the Stack¶
Once you’ve prepared your head node and generated the inventory, you can deploy Hadoop, Spark, and, optionally, other addons. The addons are available in the addons directory and include packages such as Apache Drill, HBase, and others. For this tutorial we will only need Spark.
The following command will deploy Hadoop and Spark using Ansible. This will take several minutes.
(BDS) FE $ ansible-playbook -K pearc17.yml
Usage¶
Once the stack has been deployed, you may log in to the first address you passed to mk-inventory and switch to the hadoop user.
(BDS) FE $ ssh vm-vct01-00
comet@vm-vct01-00 $ sudo su - hadoop
hadoop@vm-vct01-00 $
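Before running any jobs, you can optionally confirm that HDFS is up and the datanodes have registered. This check is not part of the original walkthrough; it simply uses the standard HDFS admin report:
hadoop@vm-vct01-00 $ hdfs dfsadmin -report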
Word Count¶
You can download the complete works of Shakespeare and upload them to HDFS:
hadoop@vm-vct01-00 $ wget https://raw.githubusercontent.com/cloudmesh/cloudmesh.comet.vcdeploy/master/examples/shakespeare.txt
hadoop@vm-vct01-00 $ hadoop fs -put shakespeare.txt /
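You can verify the upload by listing the HDFS root directory:
hadoop@vm-vct01-00 $ hadoop fs -ls /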
Next, you can use the following program (adapted from the Spark website) to analyze Shakespeare’s works. The analysis consists of the following steps:
- split the text into words
- reduce by counting each word
- sort the result in descending order
- save the results to HDFS
from pyspark import SparkContext

sc = SparkContext()

# Read the text from HDFS
txt = sc.textFile('hdfs:///shakespeare.txt')

# Split each line into words, count each word, and sort by count (descending)
counts = txt.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .sortBy(lambda t: t[1], ascending=False)

# Save the results to HDFS
counts.saveAsTextFile('hdfs:///shakespeare-wordcount.txt')
You can download this example as follows:
hadoop@vm-vct01-00 $ wget https://raw.githubusercontent.com/cloudmesh/cloudmesh.comet.vcdeploy/master/examples/spark-shakespeare.py
You can run the analysis locally with the following invocation:
hadoop@vm-vct01-00 $ spark-submit spark-shakespeare.py
Or you can submit to the cluster by invoking:
hadoop@vm-vct01-00 $ spark-submit --master yarn --deploy-mode cluster spark-shakespeare.py
Make sure to clean up before rerunning; otherwise the task will fail:
hadoop@vm-vct01-00 $ hadoop fs -rm -r /shakespeare-wordcount.txt
You can then view the top ten words by running:
hadoop@vm-vct01-00 $ hadoop fs -cat /shakespeare-wordcount.txt/part-00000 | head
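If you would rather have the complete result on the local filesystem, the standard getmerge command concatenates all output partitions into a single local file (the local filename here is just an example):
hadoop@vm-vct01-00 $ hadoop fs -getmerge /shakespeare-wordcount.txt wordcount-local.txt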
What’s next?¶
- Try a larger data set and compare the performance of a local run against a cluster run
- Try more sophisticated textual analytics
- Deploy other components. This tutorial is tailored specifically for the PEARC17 Comet VC tutorial to minimize user intervention and customization while showing the essence of big data stack deployment. For general use, please refer to the main repository.