Stack¶
About BDS¶
BDS is a collection of Ansible playbooks to deploy a stack of data analytics software. The current development version of BDS can be found online here: https://github.com/futuresystems/big-data-stack/tree/unstable
BDS Requirements¶
- Python 2.7
- Virtualenv
- Pip
- Git
- ssh client
- ssh-keys in github (currently a bug and needs to be fixed)
- IP addresses of the nodes to be controlled, with a privileged ssh user on each
Using BDS¶
BDS is not a Python library or program and therefore cannot be installed using pip or other tools. It currently works by:
- git clone the bds repository
- ./mk-inventory with the IP addresses to create the inventory file
- ansible-playbook play-hadoop.yml addons/... to install hadoop and any addons
Integrating BDS with Cloudmesh Client¶
Proposed Cloudmesh Changes¶
Additional commands:
- cm stack
- cm hadoop
Additional yaml file:
- .cloudmesh/stack.yml
cm stack¶
cm stack provides the low-level tools to manage the BDS. This includes:
- check: sanity-checking to ensure that all requirements are met
- cloning and updating the local cache of BDS
- creating and setting up a clone of BDS for the current project/deployment
- deploying software onto pre-configured nodes
cm hadoop¶
cm hadoop wraps several steps in order to deploy a virtual cluster. This includes:
- starting the machines on various providers (EC2, Chameleon, FutureSystems, etc)
- using cm stack to initialize, sanity check, and configure the current project
- deploying software using cm stack
.cloudmesh/stack.yml¶
This file identifies the stacks that may be installed and used. For example:
$ cat ~/.cloudmesh/stack.yml
stack:
  bds:
    repo: git://github.com/futuresystems/big-data-stack
    checkout: unstable
This will allow cm stack to easily learn about different deployment stacks in the future.
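As a rough illustration of how cm stack might consume this file, the sketch below looks up a stack entry and builds the matching git command. It assumes the YAML has already been loaded into a dictionary by some YAML parser; the names `STACK_CONFIG`, `lookup_stack`, and `clone_command` and the destination path are hypothetical and not part of Cloudmesh.

```python
# The stack.yml content above, as it would look after loading it with a
# YAML parser (the loading step itself is left out of this sketch).
STACK_CONFIG = {
    "stack": {
        "bds": {
            "repo": "git://github.com/futuresystems/big-data-stack",
            "checkout": "unstable",
        }
    }
}


def lookup_stack(config, name):
    """Return (repo, checkout) for a named stack, or raise KeyError."""
    stacks = config.get("stack", {})
    if name not in stacks:
        known = ", ".join(sorted(stacks)) or "none"
        raise KeyError("unknown stack %r (known: %s)" % (name, known))
    entry = stacks[name]
    return entry["repo"], entry["checkout"]


def clone_command(config, name, dest):
    """Build the git argv that clones a stack at its configured branch."""
    repo, checkout = lookup_stack(config, name)
    return ["git", "clone", "--branch", checkout, repo, dest]


# Hypothetical destination directory, chosen only for the example.
print(clone_command(STACK_CONFIG, "bds", "bds.git"))
```

Adding another deployment stack would then only require a new entry under `stack:`, with no change to the lookup code.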
Use Case: Hadoop with Spark, HBase, Drill¶
This should be achievable with a single line:
$ cm hadoop \
--nodes 5 \
--cloud chameleon \
--with spark hbase drill \
--define spark_version=1.7.0 spark_package_type=src
This will:
- start 5 nodes (--nodes 5) on the chameleon cloud (--cloud chameleon)
- install and configure hadoop
- install and configure the apache spark, hbase, and drill packages
- override the ansible variables spark_version and spark_package_type (NOTE: the values passed must be supported by BDS).
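One possible way to parse this command line is sketched below with Python's standard `argparse` module. The option names come from the example above; everything else (the `build_parser` and `parse_defines` helpers, treating `--nodes` and `--cloud` as required, and storing `--with` under the attribute `addons`) is an assumption made for illustration.

```python
import argparse


def parse_defines(pairs):
    """Turn ['key=value', ...] into a dict, rejecting malformed entries."""
    defines = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % pair)
        defines[key] = value
    return defines


def build_parser():
    """Build a parser for the options used in the cm hadoop example."""
    parser = argparse.ArgumentParser(prog="cm hadoop")
    parser.add_argument("--nodes", type=int, required=True,
                        help="number of nodes to start")
    parser.add_argument("--cloud", required=True,
                        help="cloud provider to start the nodes on")
    # 'with' is a Python keyword, so the values are stored as 'addons'.
    parser.add_argument("--with", dest="addons", nargs="*", default=[],
                        help="addons to install after hadoop")
    parser.add_argument("--define", nargs="*", default=[],
                        help="ansible variable overrides as key=value")
    return parser


args = build_parser().parse_args([
    "--nodes", "5", "--cloud", "chameleon",
    "--with", "spark", "hbase", "drill",
    "--define", "spark_version=1.7.0", "spark_package_type=src",
])
overrides = parse_defines(args.define)
print(args.nodes, args.cloud, args.addons, overrides)
```

This sketch does not check whether the override values are supported by BDS.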
Implementation Overview¶
This section describes possible implementation approaches.
Sanity Check cm stack check¶
Example success:
$ cm stack check
python.......OK
virtualenv...OK
pip..........OK
ansible......OK
git..........OK
ssh..........OK
github.......OK
Example failure:
$ cm stack check
python.......OK
virtualenv...OK
pip..........FAILED
ansible......FAILED
git..........OK
ssh..........OK
github.......FAILED
The following errors were detected:
* Pip is not installed correctly
> `pip` not found in $PATH
* Ansible is not installed correctly
> `ansible` related commands not found in $PATH
* Authentication to github.com failed
> did you add your public key to https://github.com/settings/ssh?
cm stack check MUST:
- verify that the python ecosystem and ansible are installed. Do this by ensuring that the following commands are in the $PATH and checking versions if applicable:
  - python (must be 2.7)
  - virtualenv
  - pip
  - ansible
  - ansible-playbook
  - ansible-vault
  - git
  - ssh
- verify that keys are added to github. Do this by ensuring that the following command exits with 1:
$ ssh -T git@github.com
Hi badi! You've successfully authenticated, but GitHub does not provide shell access.
$ echo $?
1
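The $PATH part of this check could look roughly like the sketch below. It uses `shutil.which`, which exists in Python 3 only, so a Python 2.7 implementation would need a different lookup. Grouping the three ansible commands under a single `ansible` row is an assumption based on the example output, and the names `CHECKS`, `run_checks`, and `format_report` are hypothetical. The version check and the github check are not covered here.

```python
import shutil

# Report label -> commands that must all be present on $PATH.
CHECKS = [
    ("python", ["python"]),
    ("virtualenv", ["virtualenv"]),
    ("pip", ["pip"]),
    ("ansible", ["ansible", "ansible-playbook", "ansible-vault"]),
    ("git", ["git"]),
    ("ssh", ["ssh"]),
]


def run_checks(checks, which=shutil.which):
    """Return (label, ok) pairs; ok is True if every command is found."""
    return [(label, all(which(cmd) is not None for cmd in commands))
            for label, commands in checks]


def format_report(results, width=13):
    """Render results in the dotted 'name.......OK' style shown above."""
    return "\n".join(label.ljust(width, ".") + ("OK" if ok else "FAILED")
                     for label, ok in results)


print(format_report(run_checks(CHECKS)))
```

Passing the lookup function as the `which` parameter keeps the check easy to exercise without changing the real $PATH.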
Initialization cm stack init¶
Example:
$ cm stack init --branch unstable --user ubuntu 10.0.0.10 10.0.0.11 10.0.0.12
cm stack init MUST:
- accept --branch <branchname> to specify the branch name of the repository (eg master [default], unstable)
- accept --user <username> to specify the ssh-login username on the nodes. This user MUST have privileges to manage the node.
- accept a list of IP addresses as the nodes to control
- accept --name <project name> to specify the name of this project. If not given, a default one must be chosen or generated. This project name is referred to below as $PROJ
Note
.cloudmesh refers to $HOME/.cloudmesh or $PWD/.cloudmesh, or wherever the .cloudmesh directory is found.
Note
$BDS below refers to .cloudmesh/stack/bds
- clone BDS from github to a local cache directory. This should be in $BDS/cache/bds.git.
- clone $BDS/cache/bds.git to $BDS/projects/$PROJ and checkout the branch that $BDS/cache/bds.git was on (default) or switch to the branch specified by --branch.
- within $BDS/projects/$PROJ run ./mk-inventory -n $USER-$PROJ $IP1 $IP2 ... >inventory.txt where $IPN... is the list of ip addresses and $USER is the username of the owner of the local machine.
- write the following information to $BDS/projects/$PROJ/.cloudmesh.yml:
  - the parameter of --user
  - the list of ip addresses
  This will allow other programs to inspect properties about this specific project
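The first three initialization steps can be sketched as a function that only builds the shell command lines and leaves running them to the caller. The function name `init_plan`, its parameters, and the use of `git -C` to switch branches are assumptions; the inputs are assumed not to need shell quoting, and writing .cloudmesh.yml is left out.

```python
import posixpath

# Repository URL as given in the stack.yml example.
BDS_REPO = "git://github.com/futuresystems/big-data-stack"


def init_plan(bds, proj, ips, local_user, branch=None, repo=BDS_REPO):
    """Return the shell command lines for the init steps, in order."""
    cache = posixpath.join(bds, "cache", "bds.git")
    project = posixpath.join(bds, "projects", proj)
    plan = [
        # Populate the local cache from github.
        "git clone %s %s" % (repo, cache),
        # Clone the cache into the project directory.
        "git clone %s %s" % (cache, project),
    ]
    if branch:
        # Only switch branches when --branch was given.
        plan.append("git -C %s checkout %s" % (project, branch))
    # Generate the inventory inside the project directory.
    plan.append("cd %s && ./mk-inventory -n %s-%s %s >inventory.txt"
                % (project, local_user, proj, " ".join(ips)))
    return plan


# Hypothetical values, used only to show the resulting plan.
for line in init_plan("/tmp/bds", "p1", ["10.0.0.10", "10.0.0.11"],
                      "alice", branch="unstable"):
    print(line)
```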
Listing Stacks cm stack list¶
Example:
$ cm stack list
Deployment Stacks
- BDS (<version or branchname>) ~/.cloudmesh/stack/bds/cache/bds.git
Projects
- > foo [<stack name eg BDS>] [<date created>] ~/.cloudmesh/stack/projects/foo
- test-1 [<stack name eg BDS>] [<date created>] ~/.cloudmesh/stack/projects/test-1
- p1 [<stack name eg BDS>] [<date created>] ~/.cloudmesh/stack/projects/p1
- p2 [<stack name eg BDS>] [<date created>] ~/.cloudmesh/stack/projects/p2
cm stack list provides an interface to list the deployment stacks (eg BDS or others) and all the projects using a stack.
cm stack list MUST:
- accept --sort <field> where field can be date, or stack, or name (default: date)
- accept --list <field,...> to list a subset of (stack, project)
- accept --json which will cause the output to be rendered using json so that other programs may easily parse the output
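A minimal sketch of the --sort and --json behaviour is shown below. The record layout (`name`, `stack`, `date` keys), the sample projects and their dates, and the function name `list_projects` are all hypothetical; dates are assumed to be ISO-formatted strings so that they sort correctly as text.

```python
import json

# Hypothetical project records, invented for this example.
PROJECTS = [
    {"name": "foo", "stack": "bds", "date": "2016-03-02"},
    {"name": "test-1", "stack": "bds", "date": "2016-03-01"},
    {"name": "p1", "stack": "bds", "date": "2016-03-03"},
]

SORT_FIELDS = ("date", "stack", "name")


def list_projects(projects, sort="date", as_json=False):
    """Return the project listing as plain text or as a JSON string."""
    if sort not in SORT_FIELDS:
        raise ValueError("cannot sort by %r, expected one of %s"
                         % (sort, ", ".join(SORT_FIELDS)))
    ordered = sorted(projects, key=lambda project: project[sort])
    if as_json:
        return json.dumps(ordered, indent=2, sort_keys=True)
    return "\n".join("- %s [%s] [%s]" % (p["name"], p["stack"], p["date"])
                     for p in ordered)


print(list_projects(PROJECTS))
print(list_projects(PROJECTS, sort="name", as_json=True))
```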
Switching Projects cm stack project¶
Example:
$ cm stack list --list project
Projects
- test-1 [<stack name eg BDS>] [<date created>] ~/.cloudmesh/stack/projects/test-1
- > p1 [<stack name eg BDS>] [<date created>] ~/.cloudmesh/stack/projects/p1
$ cm stack project
p1
$ cm stack project test-1
Switched to project `test-1`
$ cm stack project
test-1
$ cm stack list --list project
Projects
- > test-1 [<stack name eg BDS>] [<date created>] ~/.cloudmesh/stack/projects/test-1
- p1 [<stack name eg BDS>] [<date created>] ~/.cloudmesh/stack/projects/p1
Deploying Onto Nodes cm stack deploy¶
Example:
$ cm stack project
p1
$ cm stack deploy bds \
--plays play-hadoop.yml addons/spark.yml addons/hbase.yml \
--define spark_version=1.7.0
Verifying that nodes are reachable...........OK
Deploying play-hadoop.yml....................OK
Deploying addons/spark.yml...................OK
Deploying addons/hbase.yml...................OK
Done.
- Change into the project directory:
os.chdir($BDS/projects/$PROJ)
- Verify nodes are reachable:
until ansible all -m ping -u <username>; do sleep 5; done
- Deploy hadoop:
ansible-playbook play-hadoop.yml -e spark_version=1.7.0
- Deploy spark:
ansible-playbook addons/spark.yml -e spark_version=1.7.0
- Deploy hbase:
ansible-playbook addons/hbase.yml -e spark_version=1.7.0
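The mapping from the `--plays` and `--define` arguments to these commands can be sketched as follows. The function name `deploy_commands` and its parameters are hypothetical. The sketch only builds argument lists; retrying the ping until the nodes answer and changing into the project directory are left to the caller.

```python
def deploy_commands(plays, defines=None, user=None):
    """Return argv lists: one reachability check, then one run per play."""
    ping = ["ansible", "all", "-m", "ping"]
    if user:
        ping += ["-u", user]
    # Every play receives the same set of variable overrides.
    extra = []
    for key, value in sorted((defines or {}).items()):
        extra += ["-e", "%s=%s" % (key, value)]
    return [ping] + [["ansible-playbook", play] + extra for play in plays]


commands = deploy_commands(
    ["play-hadoop.yml", "addons/spark.yml", "addons/hbase.yml"],
    defines={"spark_version": "1.7.0"},
    user="ubuntu",
)
for argv in commands:
    print(" ".join(argv))
```

Returning argument lists rather than one shell string means each command can be handed to a process runner without further quoting.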
Deploying Hadoop with Addons cm hadoop¶
Example:
$ cm hadoop --nodes 5 --cloud chameleon --with spark hbase drill