Additional Programming Assignments

Programming: Hadoop Cluster

You will be provided with a Hadoop cluster running MapReduce. Your goal will be to use the Hadoop cluster to run a “Big Data” computation. One possible approach is the Terabyte Sort procedure. The components are:

TeraGen: create the data
TeraSort: sort the data using MapReduce
TeraValidate: validate that the output is sorted

Access to the cluster

See the Hadoop Cluster Access section of this document (HadoopClusterAccess.html).

Invocation

The teragen command accepts two parameters:

number of 100-byte rows
the output directory

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar \
    teragen $COUNT /user/$USER/tera-gen

The terasort and teravalidate commands each take an input directory and an output directory:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar \
    terasort /user/$USER/tera-gen /user/$USER/tera-sort
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar \
    teravalidate /user/$USER/tera-sort /user/$USER/tera-validate
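The three invocations share the same jar and chain their directories: the output of teragen is the input of terasort, and the output of terasort is the input of teravalidate. A minimal Python sketch of that chaining follows. The function name tera_commands and the sample values (1000 rows, user albert, /opt/hadoop) are illustrative assumptions, not part of the course material.

```python
# Relative jar path copied from the commands above.
EXAMPLES_JAR = "share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar"


def tera_commands(count, user, hadoop_home):
    """Return the teragen, terasort and teravalidate argument lists.

    tera_commands is a hypothetical helper; it only builds the command
    lines and does not run anything.
    """
    jar = f"{hadoop_home}/{EXAMPLES_JAR}"
    base = f"/user/{user}"
    gen_dir = f"{base}/tera-gen"
    sort_dir = f"{base}/tera-sort"
    validate_dir = f"{base}/tera-validate"
    return {
        "teragen": ["hadoop", "jar", jar, "teragen", str(count), gen_dir],
        "terasort": ["hadoop", "jar", jar, "terasort", gen_dir, sort_dir],
        "teravalidate": ["hadoop", "jar", jar, "teravalidate", sort_dir, validate_dir],
    }


if __name__ == "__main__":
    # Sample values are assumptions chosen for illustration only.
    for name, cmd in tera_commands(1000, "albert", "/opt/hadoop").items():
        print(name, "->", " ".join(cmd))
```

Returning argument lists rather than one shell string makes it possible to pass each command to subprocess.run without shell quoting.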

Exercise

Run the Terabyte Sort procedure for various sizes of data:

1 GB
10 GB
100 GB

For each component (tera{gen,sort,validate}), report the execution time and the amount of data read and written (in GB), as well as the cumulative totals across all three components.
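Because teragen takes a row count rather than a size, each target size has to be converted into a number of 100-byte rows. The sketch below does that conversion and shows one possible way to measure wall-clock time for a command. It assumes decimal gigabytes (1 GB = 10^9 bytes); rows_for_gb and timed_run are hypothetical helper names, and the timed command here is a harmless placeholder rather than a Hadoop job.

```python
import subprocess
import sys
import time

ROW_BYTES = 100   # teragen writes 100-byte rows
GB = 10**9        # assumption: decimal gigabytes


def rows_for_gb(size_gb):
    """Number of 100-byte rows needed for size_gb gigabytes."""
    return size_gb * GB // ROW_BYTES


def timed_run(cmd):
    """Run cmd and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return result.returncode, time.perf_counter() - start


if __name__ == "__main__":
    for size in (1, 10, 100):
        print(f"{size} GB -> {rows_for_gb(size)} rows")
    # Placeholder command; on the cluster this would be a hadoop invocation.
    code, seconds = timed_run([sys.executable, "-c", "pass"])
    print(f"exit code {code}, {seconds:.2f} s")
```

Under the decimal assumption, 1 GB corresponds to 10,000,000 rows. The amounts of data read and written are not measured by this sketch and still have to be collected from the job output.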

Programming: Using futuresystems.org

In this homework, you are expected to run Python or Java programs on FutureSystems or on your local machine. The beginner examples below show how to write and run Java or Python programs in your environment.

We will print some elementary system information such as the date, time, user name, and hostname (machine name), which is important when you describe your infrastructure in your reports. You will likely need to add more information such as processor type, number of cores, and clock frequency.

Java

Here is a simple program in Java.

Download: FirstProgramWithSystemInfo.java:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * Sample program that prints basic system information.
 *
 * Compile: javac FirstProgramWithSystemInfo.java
 * Run:     java FirstProgramWithSystemInfo
 */
public class FirstProgramWithSystemInfo {
    public static void main(String[] args) {
        System.out.println("My first program with System Information!");
        // Print the current date and time
        DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
        Date date = new Date();
        System.out.println("Today is: " + dateFormat.format(date));
        // Print the user name from the JVM system properties
        System.out.println("Username is: " + System.getProperty("user.name"));
        // Print the hostname; the lookup fails if the local host cannot be resolved
        try {
            InetAddress localMachine = InetAddress.getLocalHost();
            System.out.println("Hostname is: " + localMachine.getHostName());
        } catch (UnknownHostException e) {
            e.printStackTrace();
            System.out.println("No host name: " + e.getMessage());
        }
    }
}

Compiling and Execution:

javac FirstProgramWithSystemInfo.java
java FirstProgramWithSystemInfo

My first program with System Information!
Today is: 2015/01/01 18:54:10
Username is: albert
Hostname is: bigdata-host

Python

Let’s write a simple program in Python.

Create the following program, FirstProgramWithSystemInfo.py:

############################################
# Run: python FirstProgramWithSystemInfo.py
############################################
from datetime import datetime
import getpass
import socket

print('My first program with System Information!')
print("Today is: " + str(datetime.now()))
print("Username is: " + getpass.getuser())
print("Hostname is: " + socket.gethostname())

Execution:

Python code does not need to be compiled. You can run it directly with the python command:
python FirstProgramWithSystemInfo.py

The output looks like this:

python FirstProgramWithSystemInfo.py
My first program with System Information!
Today is: 2015-01-01 18:58:10.937227
Username is: albert
Hostname is: bigdata-host
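Reports usually also need the processor type, number of cores, and clock frequency. The sketch below collects the first two with the Python standard library and makes a best-effort attempt at the third. The function names system_report and cpu_mhz_linux are illustrative, and reading /proc/cpuinfo is an assumption that only holds on Linux systems that expose a "cpu MHz" line; elsewhere the function returns None.

```python
import os
import platform


def system_report():
    """Collect basic hardware and OS information from the standard library."""
    return {
        # platform.processor() may be empty on some systems; fall back
        # to the machine architecture in that case.
        "processor": platform.processor() or platform.machine(),
        "architecture": platform.machine(),
        "cores": os.cpu_count(),
        "os": platform.system() + " " + platform.release(),
    }


def cpu_mhz_linux(path="/proc/cpuinfo"):
    """Best-effort clock frequency in MHz; None when it cannot be read."""
    try:
        with open(path) as f:
            for line in f:
                if line.lower().startswith("cpu mhz"):
                    return float(line.split(":", 1)[1])
    except (OSError, ValueError):
        pass
    return None


if __name__ == "__main__":
    for key, value in system_report().items():
        print(f"{key}: {value}")
    print("cpu MHz:", cpu_mhz_linux())
```

The values differ from machine to machine, so no sample output is shown here.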

Challenge tasks

Run any Java or Python program on a FutureSystems OpenStack instance
Run the NumPyTutorial Python code in an IPython Notebook

Code Examples

Preview Course Examples

The Elusive Mr. Higgs [Java][Python]
Number Theory [Python]
Calculated Dice Roll [Java][Python]
KNN [Java][Python]
PageRank [Java][Python]
KMeans [Java][Python]

Hadoop Cluster Access

This document describes getting access to the Hadoop cluster for the course.

You will need:

An account with FutureSystems
Membership in an active project on FutureSystems (fg511)
An SSH key uploaded to the portal

The cluster frontend is located at <IP_ADDRESS>. Log in using ssh:

ssh -i $PATH_TO_SSH_PRIVATE_KEY $PORTAL_USERNAME@$HADOOP_IP

In the above:

$PATH_TO_SSH_PRIVATE_KEY is the location of the private key whose matching public key has been added to the FutureSystems portal (ssh -i expects the private key file)
$PORTAL_USERNAME is the username on the FutureSystems portal
$HADOOP_IP is the IP address of the Hadoop frontend node

Hadoop is installed under /opt/hadoop, and you can refer to this location using $HADOOP_HOME. See:

hadoop fs

and:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar
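The wildcard in the jar name avoids hard-coding the Hadoop version. If you drive the cluster from a script, the same lookup can be done in Python, as in the sketch below. The helper name find_examples_jar is hypothetical, and the /opt/hadoop fallback is taken from the installation path stated above.

```python
import glob
import os


def find_examples_jar(hadoop_home):
    """Return the first examples jar under hadoop_home, or None if absent."""
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "mapreduce",
        "hadoop-mapreduce-examples*.jar",
    )
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


if __name__ == "__main__":
    # Fall back to the documented install location when HADOOP_HOME is unset.
    home = os.environ.get("HADOOP_HOME", "/opt/hadoop")
    print(find_examples_jar(home))
```

Returning None rather than raising lets the caller decide how to report a missing installation.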