Virtual Cluster (in progress)¶
This is a tool used to submit jobs to remote hosts in parallel and contains the following subcommands:
cms vcluster create virtual-cluster VIRTUALCLUSTER_NAME --clusters=CLUSTERS_LIST [--computers=COMPUTERS_LIST] [--debug]
cms vcluster destroy virtual-cluster VIRTUALCLUSTER_NAME
cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params out:stdout [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params out:file [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params+file out:stdout [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params+file out:file [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params+file out:stdout+file [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
cms vcluster set-param runtime-config CONFIG_NAME PARAMETER VALUE
cms vcluster destroy runtime-config CONFIG_NAME
cms vcluster list virtual-clusters [DEPTH [default:1]]
cms vcluster list runtime-configs [DEPTH [default:1]]
cms vcluster run-script --script-path=SCRIPT_PATH --job-name=JOB_NAME --vcluster-name=VIRTUALCLUSTER_NAME --config-name=CONFIG_NAME --arguments=SET_OF_PARAMS --remote-path=REMOTE_PATH --local-path=LOCAL_PATH [--argfile-path=ARGUMENT_FILE_PATH] [--outfile-name=OUTPUT_FILE_NAME] [--suffix=SUFFIX] [--overwrite]
cms vcluster fetch JOB_NAME
cms vcluster clean-remote JOB_NAME PROCESS_NUM
cms vcluster test-connection VIRTUALCLUSTER_NAME PROCESS_NUM
As can be seen, the vcluster command can be called with eight possible options:
create
    virtual-cluster
    runtime-config
destroy
    virtual-cluster
    runtime-config
list
    virtual-clusters
    runtime-configs
set-param
    virtual-cluster
    runtime-config
run-script
fetch
clean-remote
test-connection
The information needed to create a virtual cluster is extracted from the
yaml file of cloudmesh v4, aka cms; however, the tool does not
modify that file. Instead, it creates a new configuration file in a
folder called vcluster_workspace. This newly generated configuration
file contains all of the information about the virtual clusters, runtime
configurations, and submitted jobs, and the file is therefore
crucial for fetching the results of previous runs. Although it is
possible to modify this file directly, it is highly recommended to use
the set-param command instead.
When creating a virtual cluster, you can pick your nodes of
interest from the cloudmesh configuration and simply pass them as
arguments to create virtual-cluster, and your Virtual
Cluster will be created this way. When you are done with a Virtual Cluster, aka
vcluster, you can simply destroy it.
Creating a Virtual Cluster and testing connections¶
Consider the following two dummy clusters in the cloudmesh.yaml
file:
cloudmesh:
  ...
  vcluster_test1:
    computer_a:
      name: machine1
      label: one
      address: localhost
      credentials:
        sshconfigpath: ~/vms/ubuntu14/sshconfig1
    computer_b:
      name: computer_a
      label: one
      address: localhost
      credentials:
        username: TBD
        publickey: ~/.ssh/id_rsa.pub
  vcluster_test2:
    c2:
      name: machine2
      label: two
      address: localhost
      credentials:
        sshconfigpath: ~/vms/ubuntu14/sshconfig2
  ...
Suppose you want to create a virtual cluster called vcluster1
using computer_a from vcluster_test1 and c2 from
vcluster_test2. This can be achieved using the following command:
$ cms vcluster create virtual-cluster vcluster1 --clusters=vcluster_test1,vcluster_test2 --computers=computer_a,c2
Virtual cluster created/replaced successfully.
This command will create the vcluster.yaml
file in the
vcluster_workspace
folder and will keep the information about the
virtual cluster in there. Now, we can get the information about the
virtual cluster that we just created:
$ cms vcluster list virtual-clusters
vcluster1:
    computer_a
    c2
By passing a depth higher than one as an extra argument, you can get more information about the virtual clusters:
$ cms vcluster list virtual-clusters 2
vcluster1:
    computer_a:
        name:
            machine1
        label:
            one
        address:
            localhost
        credentials:
            sshconfigpath
    c2:
        name:
            machine2
        label:
            two
        address:
            localhost
        credentials:
            sshconfigpath
Now that the virtual cluster is created, we can test the connection to the remote nodes. We will try that using 2 processes in parallel:
$ cms vcluster test-connection vcluster1 2
Node computer_a is accessible.
Node c2 is accessible.
The output indicates that both nodes in vcluster1 are
accessible. If you no longer need vcluster1, you can
easily remove it using:
$ cms vcluster destroy virtual-cluster vcluster1
Virtual-cluster vcluster1 destroyed successfully.
Creating a runtime-configuration¶
Next, we have to create a runtime-configuration, which defines the
type of input and output for a set of jobs that will be
submitted later. In the next example, we create a runtime
configuration for jobs that we want to run remotely using 5 processes
and whose results we fetch using 3 processes. The script that we want
to run remotely takes only parameters as input (the parameter list
could be left empty for no parameters), and the output of the script
is printed on the standard output:
$ cms vcluster create runtime-config ParamInStdOut 5 in:params out:stdout --fetch-proc-num=3
Runtime-configuration created/replaced successfully.
Let’s get the list of runtime configurations to make sure our configuration is created as we expected:
$ cms vcluster list runtime-configs 2
ParamInStdOut:
    proc_num:
        5
    download_proc_num:
        3
    download-later:
        False
    input-type:
        params
    output-type:
        stdout
Similar to the virtual cluster, you can remove a runtime-configuration
using the destroy
sub-command:
$ cms vcluster destroy runtime-config ParamInStdOut
Runtime-configuration ParamInStdOut destroyed successfully.
Running Parallel Remote Jobs¶
Now that we have both the virtual cluster and the runtime configuration
ready, we can submit a batch job to our virtual cluster using
cms vcluster run-script. This is by far the most complicated
sub-command of vcluster; however, the argument names are fairly
self-explanatory, and by looking at them you should be able to find
your way. In the next example, we submit the
inf_script_stdin_stdout.sh file to the nodes of vcluster1 and,
using the ParamInStdOut configuration, run 10 instances of that
script on the virtual cluster. The script will be copied to and run in
the home directory of the remote nodes (~/). Note that even though the
remote path is set to the home directory, a folder with a
unique suffix is created for each job to avoid conflicts. Also note
that this script does not take any arguments, but we pass 10 underscores
(_) separated by commas as meaningless arguments. This tells the tool
that you need 10 instances of this script to be executed:
$ cms vcluster run-script --script-path=./cloudmesh-cloud/vcluster/sample_scripts/inf_script_stdin_stdout.sh --job-name=TestJob1 --vcluster-name=vcluster1 --config-name=ParamInStdOut --arguments=_,_,_,_,_,_,_,_,_,_ --remote-path=~/ --local-path=./cloudmesh-cloud/vcluster/sample_output --overwrite
Remote Pid on c2: 10104
Remote Pid on c2: 10109
Remote Pid on c2: 10402
Remote Pid on computer_a: 8973
Remote Pid on computer_a: 8979
Remote Pid on computer_a: 8983
Remote Pid on computer_a: 9464
Remote Pid on c2: 10884
Remote Pid on c2: 10993
Remote Pid on computer_a: 9592
collecting results
waiting for other results if any...
Results collected from c2.
Results collected from c2.
Results collected from c2.
Results collected from computer_a.
Results collected from computer_a.
Results collected from computer_a.
Results collected from computer_a.
Results collected from c2.
Results collected from c2.
Results collected from computer_a.
waiting for other results if any...
All of the remote results collected.
As you can see all of the jobs were submitted (using 5 processes) and results were collected afterward (using 3 processes). We can check the existence of the results:
$ ll ./cloudmesh-cloud/vcluster/sample_output/
total 48
drwxr-xr-x 2 corriel 4096 Oct 31 22:12 ./
drwxr-xr-x 8 corriel 4096 Oct 31 22:12 ../
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_0_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_1_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_2_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_3_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_4_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_5_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_6_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_7_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_8_20181031_22123465
-rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_9_20181031_22123465
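Note that all ten files share one per-run timestamp suffix, so outputs from different runs never collide. A hypothetical sketch of such a naming scheme (the exact format vcluster uses is an assumption):

```python
import time

def output_name(index, run_suffix=None):
    # one shared timestamp suffix per run keeps files from
    # different runs apart; each instance gets its own index
    if run_suffix is None:
        run_suffix = time.strftime("%Y%m%d_%H%M%S")
    return f"outputfile_{index}_{run_suffix}"
```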
Now, suppose the jobs were going to take so long that we could not wait
for the results and had to download them later. To prepare for this
scenario, we can set the download-later attribute of the runtime
configuration to true:
$ cms vcluster set-param runtime-config ParamInStdOut download-later true
Runtime-configuration parameter download-later set to true successfully.
Now that we set this parameter, we can submit the jobs and this time the tool will not wait for the results:
$ cms vcluster run-script --script-path=./cloudmesh-cloud/vcluster/sample_scripts/inf_script_stdin_stdout.sh --job-name=TestJob1 --vcluster-name=vcluster1 --config-name=ParamInStdOut --arguments=_,_,_,_,_,_,_,_,_,_ --remote-path=~/ --local-path=./cloudmesh-cloud/vcluster/sample_output --overwrite
Remote Pid on c2: 12981
Remote Pid on c2: 12987
Remote Pid on c2: 13280
Remote Pid on computer_a: 11858
Remote Pid on computer_a: 11942
Remote Pid on computer_a: 11945
Remote Pid on computer_a: 12300
Remote Pid on c2: 13795
Remote Pid on computer_a: 12427
Remote Pid on c2: 13871
As you can see, the jobs are submitted and the script finishes without
waiting for the results. Note that since a job with that exact job name
already exists, you cannot submit the job unless you use the
--overwrite flag. Once the submitted jobs have produced their results,
we can fetch them using the fetch command; all results will be
collected using the same number of processes that was indicated in the
runtime-configuration with which the job was submitted in the first
place:
$ cms vcluster fetch TestJob1
collecting results
Results collected from c2.
Results collected from c2.
Results collected from c2.
Results collected from computer_a.
Results collected from computer_a.
Results collected from computer_a.
Results collected from c2.
Results collected from computer_a.
Results collected from computer_a.
Results collected from c2.
waiting for other results if any...
All of the remote results collected.
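The "Remote Pid" lines above suggest that each script is detached on the remote node so the submitting process can return immediately. One common way to achieve this is to wrap the remote command in nohup and echo its pid; this is only a sketch of the idea, not vcluster's actual implementation, and the runner hook is a test stand-in for subprocess.run:

```python
import subprocess

def submit_background(node, remote_cmd, runner=subprocess.run):
    # detach the job on the remote side and print its pid so the
    # local process does not have to wait for the job to finish
    wrapped = f"nohup {remote_cmd} > out.txt 2>&1 & echo $!"
    result = runner(["ssh", node, wrapped], capture_output=True, text=True)
    return int(result.stdout.strip())
```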
Cleaning the remote¶
By default, the Virtual Cluster tool does not clean the remotes
automatically; this task is left to be performed manually, since
important results might otherwise be lost by mistake. To clean the
remotes, the user has to explicitly use the clean-remote command for a
specific job; this way, only the results of that particular job will
be removed from ALL remotes, using the indicated number of parallel
processes (4 in this example):
$ cms vcluster clean-remote TestJob1 4
Node c2 cleaned successfully.
Node computer_a cleaned successfully.
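Conceptually, cleaning amounts to deleting the named job's unique-suffix folders on every node, again with a bounded worker pool. A hedged sketch, assuming job folders on the remote are named after the job (the directory layout and the rm command are assumptions, not vcluster's actual code):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def clean_remote(nodes, job_name, proc_num, runner=subprocess.run):
    def clean(node):
        # remove only this job's unique-suffix folders on the node
        runner(["ssh", node, f"rm -rf ~/{job_name}_*"])
        return node
    with ThreadPoolExecutor(max_workers=proc_num) as pool:
        return list(pool.map(clean, nodes))
```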