Stack

About BDS

BDS is a collection of Ansible playbooks to deploy a stack of data analytics software. The current development version of BDS can be found online here: https://github.com/futuresystems/big-data-stack/tree/unstable

BDS Requirements

  • Python 2.7
  • Virtualenv
  • Pip
  • Git
  • ssh client
  • ssh-keys in github (currently a bug and needs to be fixed)
  • IP address of nodes to be controlled with privileged ssh user

Using BDS

BDS is not a Python library or program and therefore cannot be installed using pip or other tools. It is currently used as follows:

  1. git clone the bds repository
  2. ./mk-inventory with the IP address to create the inventory file
  3. ansible-playbook play-hadoop.yml addons/... to install hadoop and any addons

Integrating BDS with Cloudmesh Client

Proposed Cloudmesh Changes

Additional commands:

  1. cm stack
  2. cm hadoop

Additional yaml file dir:

  • .cloudmesh/stack.yml

cm stack

cm stack provides the low-level tools to manage the BDS. This includes:

  • check: sanity-checking to ensure that all requirements are met
  • cloning and updating the local cache of BDS
  • creating and setting up a clone of BDS for the current project/deployment
  • deploying software onto pre-configured nodes

cm hadoop

cm hadoop wraps several steps in order to deploy a virtual cluster. This includes:

  1. starting the machines on various providers (EC2, Chameleon, FutureSystems, etc)
  2. using cm stack to initialize, sanity check, and configure current project
  3. deploying software using cm stack

.cloudmesh/stack.yml

This file identifies the stacks that may be installed and used. For example:

$ cat ~/.cloudmesh/stack.yml
stack:
  bds:
    repo: git://github.com/futuresystems/big-data-stack
    checkout: unstable

This will allow cm stack to easily learn about different deployment stacks in the future.
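
As a rough illustration of how cm stack might consume this file, the sketch below looks up a stack entry and derives a git clone command from it. The function names (lookup_stack, clone_command) are hypothetical, and the mapping is written inline rather than parsed; in practice it would be loaded from ~/.cloudmesh/stack.yml with a YAML parser.

```python
# Sketch only: this mapping mirrors the stack.yml example above. A real
# implementation would load it from ~/.cloudmesh/stack.yml with a YAML parser.
STACKS = {
    "stack": {
        "bds": {
            "repo": "git://github.com/futuresystems/big-data-stack",
            "checkout": "unstable",
        }
    }
}


def lookup_stack(config, name):
    """Return (repo, checkout) for a named stack; raises KeyError if unknown."""
    entry = config["stack"][name]
    return entry["repo"], entry["checkout"]


def clone_command(config, name, dest):
    """Build the git argv that clones the stack at its configured branch."""
    repo, checkout = lookup_stack(config, name)
    return ["git", "clone", "--branch", checkout, repo, dest]
```

Adding a second stack would then only require a new entry under stack: with its own repo and checkout values.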

Use Case: Hadoop with Spark, HBase, Drill

This should be achievable with a single line:

$ cm hadoop \
    --nodes 5 \
    --cloud chameleon \
    --with spark hbase drill \
    --define spark_version=1.7.0 spark_package_type=src

This will:

  • start 5 nodes (--nodes 5) on the chameleon cloud (--cloud chameleon)
  • install and configure hadoop
  • install and configure the apache spark, hbase, and drill packages
  • override ansible variables spark_version and spark_package_type (NOTE: the values passed must be supported by BDS).
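
One possible way to parse this command line is sketched below with argparse from the Python standard library. This is an assumption for illustration only: the text does not say which parser cm uses, and build_parser and parse_defines are hypothetical names.

```python
import argparse


def build_parser():
    """Parser for the proposed `cm hadoop` options shown above."""
    parser = argparse.ArgumentParser(prog="cm hadoop")
    parser.add_argument("--nodes", type=int, required=True,
                        help="number of nodes to start")
    parser.add_argument("--cloud", required=True,
                        help="cloud provider to start the nodes on")
    # `with` is a Python keyword, so store the addon names under `addons`.
    parser.add_argument("--with", dest="addons", nargs="*", default=[],
                        help="addons to install on top of hadoop")
    parser.add_argument("--define", nargs="*", default=[], metavar="KEY=VALUE",
                        help="ansible variables to override")
    return parser


def parse_defines(pairs):
    """Turn ['a=1', 'b=2'] into {'a': '1', 'b': '2'}."""
    defines = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected KEY=VALUE, got %r" % pair)
        defines[key] = value
    return defines
```

The sketch does not validate the values against what BDS supports; that check would still be needed before they are handed to ansible.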

Implementation Overview

This section describes possible implementation approaches.

Sanity Check cm stack check

Example success:

$ cm stack check
python.......OK
virtualenv...OK
pip..........OK
ansible......OK
git..........OK
ssh..........OK
github.......OK

Example failure:

$ cm stack check
python.......OK
virtualenv...OK
pip..........FAILED
ansible......FAILED
git..........OK
ssh..........OK
github.......FAILED

The following errors were detected:

* Pip is not installed correctly
  > `pip` not found in $PATH
* Ansible is not installed correctly
  > `ansible` related commands not found in $PATH
* Authentication to github.com failed
  > did you add your public key to https://github.com/settings/ssh?

cm stack check MUST:

  • verify that the python ecosystem and ansible are installed. Do this by ensuring that the following commands are in the $PATH and checking versions if applicable:

    • python (must be 2.7)
    • virtualenv
    • pip
    • ansible
    • ansible-playbook
    • ansible-vault
    • git
    • ssh
  • verify that keys are added to github. Do this by ensuring that the following command exits with 1:

    $ ssh -T git@github.com
    Hi badi! You've successfully authenticated, but GitHub does not provide shell access.
    $ echo $?
    1
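
The two checks above could be sketched roughly as follows. The helper names (check_commands, github_key_ok, format_line) are hypothetical, the lookup and the subprocess call are injectable so the logic can be exercised without touching the system, and the Python 2.7 version check is left out.

```python
import subprocess

try:
    from shutil import which  # Python 3.3+
except ImportError:  # Python 2.7, which BDS currently requires
    from distutils.spawn import find_executable as which

REQUIRED_COMMANDS = ["python", "virtualenv", "pip", "ansible",
                     "ansible-playbook", "ansible-vault", "git", "ssh"]


def check_commands(names, find=which):
    """Map each command name to True if it is found in $PATH."""
    return dict((name, find(name) is not None) for name in names)


def github_key_ok(call=subprocess.call):
    """True if `ssh -T git@github.com` exits with 1, as described above."""
    return call(["ssh", "-T", "git@github.com"]) == 1


def format_line(label, ok, width=13):
    """Render one report line, e.g. 'pip..........FAILED'."""
    return label.ljust(width, ".") + ("OK" if ok else "FAILED")
```

A caller would run check_commands(REQUIRED_COMMANDS), fold the three ansible commands into a single ansible line, and print one format_line per entry followed by the collected error messages.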
    

Initialization cm stack init

Example:

$ cm stack init --branch unstable --user ubuntu 10.0.0.10 10.0.0.11 10.0.0.12

cm stack init MUST:

  • accept --branch <branchname> to specify the branch name of the repository (eg master [default], unstable)
  • accept --user <username> to specify the ssh-login username on the nodes. This user MUST have privileges to manage the node.
  • accept a list of IP addresses as the nodes to control
  • accept --name <project name> to specify the name of this project. If not given, a default one must be chosen or generated. This project name is referred to below as $PROJ

Note

.cloudmesh refers to $HOME/.cloudmesh or $PWD/.cloudmesh, or wherever the .cloudmesh directory is found.

Note

$BDS below refers to .cloudmesh/stack/bds

  • clone BDS from github to a local cache directory. This should be in $BDS/cache/bds.git.

  • clone $BDS/cache/bds.git to $BDS/projects/$PROJ and checkout the branch that $BDS/cache/bds.git was on (default) or switch to the branch specified by --branch.

  • within $BDS/projects/$PROJ run ./mk-inventory -n $USER-$PROJ $IP1 $IP2 ... >inventory.txt where $IPN... is the list of ip addresses and $USER is the username of the owner of the local machine.

  • write the following information to $BDS/projects/$PROJ/.cloudmesh.yml:

    • the parameter of --user
    • the list of ip addresses

    This will allow other programs to inspect properties about this specific project.
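
The steps above can be sketched as a plan of commands that a caller would then execute. Everything here is illustrative: Step, init_plan, and project_metadata are hypothetical names, the key names written to .cloudmesh.yml are assumptions, and the sketch uses a plain git clone for the cache since the text does not say whether it should be a bare clone.

```python
import os
from collections import namedtuple

# One planned action: the argv to run, the working directory (None for the
# current one), and an optional file that receives standard output.
Step = namedtuple("Step", ["argv", "cwd", "stdout"])


def init_plan(bds, proj, repo, ips, local_user, branch=None):
    """Return the ordered steps described above; nothing is executed here."""
    cache = os.path.join(bds, "cache", "bds.git")
    project = os.path.join(bds, "projects", proj)
    plan = [
        Step(["git", "clone", repo, cache], None, None),
        Step(["git", "clone", cache, project], None, None),
    ]
    if branch is not None:
        # Without --branch the clone stays on the branch the cache was on.
        plan.append(Step(["git", "checkout", branch], project, None))
    inventory_name = "%s-%s" % (local_user, proj)
    plan.append(Step(["./mk-inventory", "-n", inventory_name] + list(ips),
                     project, "inventory.txt"))
    return plan


def project_metadata(user, ips):
    """Render .cloudmesh.yml contents (the key names are hypothetical)."""
    lines = ["user: %s" % user, "nodes:"]
    lines += ["  - %s" % ip for ip in ips]
    return "\n".join(lines) + "\n"
```

Returning a plan instead of running the commands directly keeps the ordering easy to inspect and leaves error handling to the caller.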

Listing Stacks cm stack list

Example:

$ cm stack list
Deployment Stacks
- BDS (<version or branchname>)  ~/.cloudmesh/stack/bds/cache/bds.git

Projects
- > foo    [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/foo
-   test-1 [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/test-1
-   p1     [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/p1
-   p2     [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/p2

cm stack list provides an interface to list the deployment stacks (eg BDS or others) and all the projects using a stack.

cm stack list MUST:

  • accept --sort <field> where field can be date, or stack, or name (default: date)
  • accept --list <field,...> to list a subset of (stack, project)
  • accept --json which will cause the output to be rendered using json so that other programs may easily parse the output
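
A minimal sketch of the --sort, --list, and --json behaviour follows. It assumes each project is represented as a dictionary with name, stack, and date keys; that representation and the function names are hypothetical.

```python
import json

SORT_FIELDS = ("date", "stack", "name")


def sort_projects(projects, field="date"):
    """Sort project records by one of the fields accepted by --sort."""
    if field not in SORT_FIELDS:
        raise ValueError("cannot sort by %r" % field)
    return sorted(projects, key=lambda project: project[field])


def render_json(stacks, projects, fields=("stack", "project")):
    """Render the sections selected with --list as a JSON document."""
    output = {}
    if "stack" in fields:
        output["stacks"] = stacks
    if "project" in fields:
        output["projects"] = projects
    return json.dumps(output, indent=2, sort_keys=True)
```

Dates are assumed to be stored in a form that sorts chronologically as text, such as ISO 8601.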

Switching Projects cm stack project

Example:

$ cm stack list --list project
Projects
-   test-1 [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/test-1
- > p1     [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/p1


$ cm stack project
p1

$ cm stack project test-1
Switched to project `test-1`

$ cm stack project
test-1

$ cm stack list --list project
Projects
- > test-1 [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/test-1
-   p1     [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/p1

Deploying Onto Nodes cm stack deploy

Example:

$ cm stack project
p1

$ cm stack deploy bds \
    --plays play-hadoop.yml addons/spark.yml addons/hbase.yml \
    --define spark_version=1.7.0
Verifying that nodes are reachable...........OK
Deploying play-hadoop.yml....................OK
Deploying addons/spark.yml...................OK
Deploying addons/hbase.yml...................OK

Done.

This corresponds to the following steps:

  1. os.chdir($BDS/projects/$PROJ)
  2. Verify nodes are reachable: until ansible all -m ping -u <username>; do sleep 5; done
  3. Deploy hadoop: ansible-playbook play-hadoop.yml -e spark_version=1.7.0
  4. Deploy spark: ansible-playbook addons/spark.yml -e spark_version=1.7.0
  5. Deploy hbase: ansible-playbook addons/hbase.yml -e spark_version=1.7.0
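
The steps above might be assembled as in the sketch below. The function names are hypothetical, and the command runner and sleep function are passed in so that the retry loop can be exercised without ansible installed; a real caller could pass subprocess.call and time.sleep after changing into the project directory.

```python
def deploy_commands(username, plays, defines):
    """Return the ping argv and one ansible-playbook argv per play."""
    extra = []
    for key in sorted(defines):
        # Each --define KEY=VALUE becomes an ansible extra variable.
        extra += ["-e", "%s=%s" % (key, defines[key])]
    ping = ["ansible", "all", "-m", "ping", "-u", username]
    playbooks = [["ansible-playbook", play] + extra for play in plays]
    return ping, playbooks


def wait_until_reachable(ping, call, sleep, delay=5):
    """Repeat the ping until it exits with 0, sleeping between attempts."""
    while call(ping) != 0:
        sleep(delay)


def deploy(username, plays, defines, call, sleep):
    """Run the ping loop, then each play in order; stop at the first failure."""
    ping, playbooks = deploy_commands(username, plays, defines)
    wait_until_reachable(ping, call, sleep)
    for argv in playbooks:
        if call(argv) != 0:
            return False
    return True
```

Like the until loop in step 2, this retries indefinitely; a production version would probably want an upper bound on the number of attempts.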

Deploying Hadoop with Addons cm hadoop

Example:

$ cm hadoop --nodes 5 --cloud chameleon --with spark hbase drill