About BDS

BDS is a collection of Ansible playbooks that deploy a stack of data analytics software. The current development version of BDS can be found online here:

BDS Requirements

  • Python 2.7
  • Virtualenv
  • Pip
  • Git
  • ssh client
  • SSH keys registered with GitHub (this requirement is currently a bug and needs to be fixed)
  • IP address of nodes to be controlled with privileged ssh user

Using BDS

BDS is not a Python library or program and therefore cannot be installed using pip or similar tools. It is currently used as follows:

  1. git clone the bds repository
  2. ./mk-inventory with the IP address to create the inventory file
  3. ansible-playbook play-hadoop.yml addons/... to install hadoop and any addons

Integrating BDS with Cloudmesh Client

Proposed Cloudmesh Changes

Additional commands:

  1. cm stack
  2. cm hadoop

Additional yaml file dir:

  • .cloudmesh/stack.yml

cm stack

cm stack provides the low-level tools to manage BDS. This includes:

  • check: sanity-checking to ensure that all requirements are met
  • cloning and updating the local cache of BDS
  • creating and setting up a clone of BDS for the current project/deployment
  • deploying software onto pre-configured nodes

cm hadoop

cm hadoop wraps several steps in order to deploy a virtual cluster. This includes:

  1. starting the machines on various providers (EC2, Chameleon, FutureSystems, etc)
  2. using cm stack to initialize, sanity check, and configure current project
  3. deploy software using cm stack


The .cloudmesh/stack.yml file identifies the stacks that may be installed and used. For example:

$ cat ~/.cloudmesh/stack.yml
    repo: git://
    checkout: unstable

This will allow cm stack to easily learn about different deployment stacks in the future.
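
As an illustration of how cm stack might read this file, the following is a minimal sketch that parses flat key: value lines into a dictionary. The function name parse_stack_config is hypothetical, and a real implementation would more likely use a YAML library; this sketch only handles the flat layout shown above.

```python
def parse_stack_config(text):
    """Parse flat `key: value` lines into a dict.

    Hypothetical helper: it handles only the flat layout shown in the
    example above, not general YAML.
    """
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("expected 'key: value', got %r" % raw_line)
        config[key.strip()] = value.strip()
    return config
```

Given the example file above, the result would contain the keys repo and checkout.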

Use Case: Hadoop with Spark, HBase, Drill

This should be achievable with a single line:

$ cm hadoop \
    --nodes 5 \
    --cloud chameleon \
    --with spark hbase drill \
    --define spark_version=1.7.0 spark_package_type=src

This will:

  • start 5 nodes (--nodes 5) on the chameleon cloud (--cloud chameleon)
  • install and configure hadoop
  • install and configure the apache spark, hbase, and drill packages
  • override ansible variables spark_version and spark_package_type (NOTE: the values passed must be supported by BDS).
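
The option handling for this use case could be sketched as follows. This is an assumption-laden illustration, not the Cloudmesh implementation: it uses argparse from the Python standard library, the names build_parser and parse_defines are hypothetical, and treating --nodes and --cloud as required is a guess that the text above does not confirm.

```python
import argparse


def build_parser():
    """Build a parser for the proposed `cm hadoop` options (sketch)."""
    parser = argparse.ArgumentParser(prog="cm hadoop")
    # Assumption: both options are mandatory; the proposal does not say.
    parser.add_argument("--nodes", type=int, required=True,
                        help="number of nodes to start")
    parser.add_argument("--cloud", required=True,
                        help="cloud provider to start the nodes on")
    # `with` is a Python keyword, so store the values under `addons`.
    parser.add_argument("--with", dest="addons", nargs="*", default=[],
                        help="addons to install on top of hadoop")
    parser.add_argument("--define", nargs="*", default=[],
                        metavar="KEY=VALUE",
                        help="ansible variables to override")
    return parser


def parse_defines(pairs):
    """Turn ['key=value', ...] into a dict, rejecting malformed items."""
    defines = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected KEY=VALUE, got %r" % pair)
        defines[key] = value
    return defines
```

Checking that the passed values are supported by BDS is left out here, since the set of supported values is defined by the playbooks themselves.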

Implementation Overview

This section describes possible implementation approaches.

Sanity Check cm stack check

Example success:

$ cm stack check

Example failure:

$ cm stack check

The following errors were detected:

* Pip is not installed correctly
  > `pip` not found in $PATH
* Ansible is not installed correctly
  > `ansible` related commands not found in $PATH
* Authentication to failed
  > did you add your public key to

cm stack check MUST:

  • verify that the python ecosystem and ansible are installed. Do this by ensuring that the following commands are in the $PATH and checking versions if applicable:

    • python (must be 2.7)
    • virtualenv
    • pip
    • ansible
    • ansible-playbook
    • ansible-vault
    • git
    • ssh
  • verify that keys are added to GitHub. Do this by ensuring that the following command exits with 1:

    $ ssh -T
    Hi badi! You've successfully authenticated, but GitHub does not provide shell access.
    $ echo $?
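
The $PATH portion of this check could be sketched as below. This is a minimal illustration written for Python 3 (shutil.which is not available in Python 2.7); the names REQUIRED_COMMANDS, find_missing_commands, and format_report are hypothetical. Version checks and the GitHub authentication test are deliberately omitted.

```python
import shutil

# Commands that `cm stack check` must find in $PATH, per the list above.
REQUIRED_COMMANDS = (
    "python", "virtualenv", "pip", "ansible",
    "ansible-playbook", "ansible-vault", "git", "ssh",
)


def find_missing_commands(commands=REQUIRED_COMMANDS, which=shutil.which):
    """Return the commands that cannot be located in $PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in commands if which(name) is None]


def format_report(missing):
    """Render missing commands in the style of the failure example."""
    if not missing:
        return ""
    lines = ["The following errors were detected:", ""]
    for name in missing:
        lines.append("* %s is not installed correctly" % name)
        lines.append("  > `%s` not found in $PATH" % name)
    return "\n".join(lines)
```

An empty report corresponds to the success case, where cm stack check prints nothing.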

Initialization cm stack init


$ cm stack init --branch unstable --user ubuntu

cm stack init MUST:

  • accept --branch <branchname> to specify the branch name of the repository (eg master [default], unstable)
  • accept --user <username> to specify the ssh-login username on the nodes. This user MUST have privileges to manage the node.
  • accept a list of IP addresses as the nodes to control
  • accept --name <project name> to specify the name of this project. If not given, a default one must be chosen or generated. This project name is referred to below as $PROJ


.cloudmesh refers to $HOME/.cloudmesh or $PWD/.cloudmesh, or wherever the .cloudmesh directory is found.


$BDS below refers to .cloudmesh/stack/bds

  • clone BDS from GitHub to a local cache directory. This should be in $BDS/cache/bds.git.

  • clone $BDS/cache/bds.git to $BDS/projects/$PROJ and check out the branch that $BDS/cache/bds.git was on (default) or switch to the branch specified by --branch.

  • within $BDS/projects/$PROJ run ./mk-inventory -n $USER-$PROJ $IP1 $IP2 ... >inventory.txt where $IPN... is the list of ip addresses and $USER is the username of the owner of the local machine.

  • write the following information to $BDS/projects/$PROJ/.cloudmesh.yml:

    • the parameter of --user
    • the list of ip addresses

    This will allow other programs to inspect properties of this specific project.
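
The clone and inventory steps above can be sketched as a function that only plans the commands to run, without executing them. This is a hedged illustration: plan_init and its parameters are hypothetical names, and repo_url is passed in by the caller because the repository location is not fixed here. Writing .cloudmesh.yml is not covered.

```python
import os


def plan_init(bds_dir, repo_url, project, local_user, ips, branch=None):
    """Return a list of (working_directory, argv) steps for `cm stack init`.

    A working directory of None means "run from the current directory".
    Nothing is executed here; the caller decides how to run each step.
    """
    cache = os.path.join(bds_dir, "cache", "bds.git")
    project_dir = os.path.join(bds_dir, "projects", project)

    # Clone the cache into the project, optionally switching branches.
    clone_project = ["git", "clone"]
    if branch:
        clone_project += ["--branch", branch]
    clone_project += [cache, project_dir]

    # The caller is expected to redirect stdout of this step to
    # inventory.txt inside the project directory.
    inventory = ["./mk-inventory", "-n", "%s-%s" % (local_user, project)]
    inventory += list(ips)

    return [
        (None, ["git", "clone", repo_url, cache]),
        (None, clone_project),
        (project_dir, inventory),
    ]
```

Keeping the planning separate from execution makes it possible to show the user what will happen, or to test the logic, before any repository is touched.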

Listing Stacks cm stack list


$ cm stack list
Deployment Stacks
- BDS (<version or branchname>)  ~/.cloudmesh/stack/bds/cache/bds.git

- > foo    [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/foo
-   test-1 [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/test-1
-   p1     [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/p1
-   p2     [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/p2

cm stack list provides an interface to list the deployment stacks (eg BDS or others) and all the projects using a stack.

cm stack list MUST:

  • accept --sort <field> where field can be date, stack, or name (default: date)
  • accept --list <field,...> to list a subset of (stack, project)
  • accept --json which will cause the output to be rendered using json so that other programs may easily parse the output
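
The --sort and --json behaviour could be sketched as follows. The record layout (dictionaries with name, stack, date, path, and active keys) is an assumption made for this illustration, as is the function name render_projects; dates are assumed to be ISO 8601 strings so that sorting them as text matches chronological order.

```python
import json

SORT_FIELDS = ("date", "stack", "name")


def render_projects(projects, sort="date", as_json=False):
    """Render project records as text lines or as a JSON document."""
    if sort not in SORT_FIELDS:
        raise ValueError("cannot sort by %r" % sort)
    ordered = sorted(projects, key=lambda project: project[sort])
    if as_json:
        return json.dumps(ordered, indent=2, sort_keys=True)
    lines = []
    for project in ordered:
        # Mark the currently active project with ">".
        marker = ">" if project.get("active") else " "
        lines.append("- %s %s [%s] [%s] %s" % (
            marker, project["name"], project["stack"],
            project["date"], project["path"]))
    return "\n".join(lines)
```

Rendering JSON from the same sorted records keeps the two output modes consistent with each other.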

Switching Projects cm stack project


$ cm stack list --list project
-   test-1 [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/test-1
- > p1     [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/p1

$ cm stack project

$ cm stack project test-1
Switched to project `test-1`

$ cm stack project

$ cm stack list --list project
- > test-1 [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/test-1
-   p1     [<stack name eg BDS>]  [<date created>]     ~/.cloudmesh/stack/projects/p1

Deploying Onto Nodes cm stack deploy


$ cm stack project

$ cm stack deploy bds \
    --plays play-hadoop.yml addons/spark.yml addons/hbase.yml \
    --define spark_version=1.7.0
Verifying that nodes are reachable...........OK
Deploying play-hadoop.yml....................OK
Deploying addons/spark.yml...................OK
Deploying addons/hbase.yml...................OK

  1. os.chdir($BDS/projects/$PROJ)
  2. Verify nodes are reachable: until ansible all -m ping -u <username>; do sleep 5; done
  3. Deploy hadoop: ansible-playbook play-hadoop.yml -e spark_version=1.7.0
  4. Deploy spark: ansible-playbook addons/spark.yml -e spark_version=1.7.0
  5. Deploy hbase: ansible-playbook addons/hbase.yml -e spark_version=1.7.0
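
The command lines in the steps above can be generated from the --plays and --define arguments. The sketch below only builds the argument lists; the function name ansible_commands is hypothetical, and it assumes, as the steps do, that the inventory is picked up from the project directory.

```python
def ansible_commands(plays, defines, user):
    """Build the ping command and one ansible-playbook command per play.

    `defines` maps ansible variable names to values; every variable is
    passed to every play with `-e`, as in the steps above.
    """
    extra_vars = []
    for key in sorted(defines):
        extra_vars += ["-e", "%s=%s" % (key, defines[key])]
    ping = ["ansible", "all", "-m", "ping", "-u", user]
    deploys = [["ansible-playbook", play] + extra_vars for play in plays]
    return ping, deploys
```

A caller could run the ping command in a retry loop and then execute each deploy command in order from within the project directory, stopping at the first failure.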

Deploying Hadoop with Addons cm hadoop


$ cm hadoop --nodes 5 --cloud chameleon --with spark hbase drill