Virtual Cluster (in progress)
=============================

This tool is used to submit jobs to remote hosts in parallel and provides the
following subcommands:

.. code:: console

    cms vcluster create virtual-cluster VIRTUALCLUSTER_NAME --clusters=CLUSTERS_LIST [--computers=COMPUTERS_LIST] [--debug]
    cms vcluster destroy virtual-cluster VIRTUALCLUSTER_NAME
    cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params out:stdout [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
    cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params out:file [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
    cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params+file out:stdout [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
    cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params+file out:file [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
    cms vcluster create runtime-config CONFIG_NAME PROCESS_NUM in:params+file out:stdout+file [--fetch-proc-num=FETCH_PROCESS_NUM [default=1]] [--download-now [default=True]] [--debug]
    cms vcluster set-param runtime-config CONFIG_NAME PARAMETER VALUE
    cms vcluster destroy runtime-config CONFIG_NAME
    cms vcluster list virtual-clusters [DEPTH [default:1]]
    cms vcluster list runtime-configs [DEPTH [default:1]]
    cms vcluster run-script --script-path=SCRIPT_PATH --job-name=JOB_NAME --vcluster-name=VIRTUALCLUSTER_NAME --config-name=CONFIG_NAME --arguments=SET_OF_PARAMS --remote-path=REMOTE_PATH --local-path=LOCAL_PATH [--argfile-path=ARGUMENT_FILE_PATH] [--outfile-name=OUTPUT_FILE_NAME] [--suffix=SUFFIX] [--overwrite]
    cms vcluster fetch JOB_NAME
    cms vcluster clean-remote JOB_NAME PROCESS_NUM
    cms vcluster test-connection VIRTUALCLUSTER_NAME PROCESS_NUM

As can be seen, the ``vcluster`` command can be called with eight possible
options:

- create

  - virtual-cluster
  - runtime-config

- destroy

  - virtual-cluster
  - runtime-config

- list

  - virtual-clusters
  - runtime-configs

- set-param

  - virtual-cluster
  - runtime-config

- run-script
- fetch
- clean-remote
- test-connection

The information needed to create a virtual cluster is extracted from the
``yaml`` file of cloudmesh v4, aka ``cms``; however, the tool does not modify
that file. Instead, it creates a new configuration file in a folder called
``vcluster_workspace``. This newly generated configuration file contains all
the information about the virtual clusters, runtime configurations, and
submitted jobs, and it is therefore crucial for fetching the results of
previous runs. Although possible, modifying this file directly is strongly
discouraged; use the ``set-param`` command instead.
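For example, the ``download-later`` parameter of a runtime configuration
named ``ParamInStdOut`` (a configuration with this name is created later in
this walkthrough) would be changed like this:

.. code:: console

    $ cms vcluster set-param runtime-config ParamInStdOut download-later true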
When creating a virtual cluster, you can *pick* your nodes of interest from
the cloudmesh configuration and pass them as arguments to
``create virtual-cluster``; your *Virtual Cluster* is created from exactly
those nodes. When you are done with a Virtual Cluster, aka ``vcluster``, you
can simply destroy it.

Creating a Virtual Cluster and testing connections
--------------------------------------------------

Consider the following two dummy clusters in the ``cloudmesh.yaml`` file:

::

    cloudmesh:
      ...
      vcluster_test1:
        computer_a:
          name: machine1
          label: one
          address: localhost
          credentials:
            sshconfigpath: ~/vms/ubuntu14/sshconfig1
        computer_b:
          name: computer_a
          label: one
          address: localhost
          credentials:
            username: TBD
            publickey: ~/.ssh/id_rsa.pub
      vcluster_test2:
        c2:
          name: machine2
          label: two
          address: localhost
          credentials:
            sshconfigpath: ~/vms/ubuntu14/sshconfig2
      ...

Suppose you want to create a virtual cluster called ``vcluster1`` using
``computer_a`` from ``vcluster_test1`` and ``c2`` from ``vcluster_test2``.
This can be achieved with the following command:

.. code:: console

    $ cms vcluster create virtual-cluster vcluster1 --clusters=vcluster_test1,vcluster_test2 --computers=computer_a,c2
    Virtual cluster created/replaced successfully.

This command creates the ``vcluster.yaml`` file in the ``vcluster_workspace``
folder and stores the information about the virtual cluster there. Now we can
get the information about the virtual cluster that we just created:

.. code:: console

    $ cms vcluster list virtual-clusters
    vcluster1:
        computer_a
        c2

By passing a depth higher than one as an extra argument, you can get more
information about the virtual clusters:

.. code:: console

    $ cms vcluster list virtual-clusters 2
    vcluster1:
        computer_a:
            name: machine1
            label: one
            address: localhost
            credentials:
                sshconfigpath
        c2:
            name: machine2
            label: two
            address: localhost
            credentials:
                sshconfigpath

Now that the virtual cluster is created, we can test the connection to the
remote nodes. We will try that using 2 processes in parallel:

.. code:: console

    $ cms vcluster test-connection vcluster1 2
    Node computer_a is accessible.
    Node c2 is accessible.

The output indicates that both nodes of ``vcluster1`` are accessible. If you
no longer need ``vcluster1``, you can easily remove it:

.. code:: console

    $ cms vcluster destroy virtual-cluster vcluster1
    Virtual-cluster vcluster1 destroyed successfully.

Creating a runtime-configuration
--------------------------------

Next, we have to create a ``runtime-configuration``, which defines the type
of input and output for a set of jobs that will be submitted later. In the
next example, we create a runtime configuration for jobs that are run
remotely using 5 processes and whose results are fetched using 3 processes.
The script we want to run remotely takes only parameters as input (which can
be left empty for no parameters), and its output is printed on standard
output. With the default settings, the results are downloaded as soon as the
jobs finish; downloading them later is demonstrated further below:

.. code:: console

    $ cms vcluster create runtime-config ParamInStdOut 5 in:params out:stdout --fetch-proc-num=3
    Runtime-configuration created/replaced successfully.

Let's list the runtime configurations to make sure our configuration was
created as expected:

.. code:: console

    $ cms vcluster list runtime-configs 2
    ParamInStdOut:
        proc_num: 5
        download_proc_num: 1
        download-later: False
        input-type: params
        output-type: stdout

Similar to the virtual cluster, you can remove a runtime-configuration using
the ``destroy`` sub-command:

.. code:: console

    $ cms vcluster destroy runtime-config ParamInStdOut
    Runtime-configuration ParamInStdOut destroyed successfully.

Running Parallel Remote Jobs
----------------------------

Now that we have both the virtual cluster and the runtime configuration ready
(recreate ``vcluster1`` and ``ParamInStdOut`` if you destroyed them above),
we can submit a batch job to our virtual cluster using
``cms vcluster run-script``. This is by far the most complicated sub-command
of ``vcluster``; however, the argument names are self-explanatory, and you
should be able to find your way by looking at them.

In the next example, we submit the ``inf_script_stdin_stdout.sh`` script to
the nodes of ``vcluster1`` and, using the ``ParamInStdOut`` configuration,
run 10 instances of it on the virtual cluster. The script is copied to and
run in the home directory of the remote nodes (``~/``). Note that even though
the remote path is set to the home directory, a folder with a unique suffix
is created for each job to avoid conflicts.
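The sample script itself is not reproduced in this document. Purely as an
illustration, a minimal stand-in that is consistent with its description
(no arguments, output written to standard output, matching the ``out:stdout``
setting of the configuration) might look like the sketch below; the actual
``inf_script_stdin_stdout.sh`` shipped with the repository may differ:

.. code:: bash

    #!/usr/bin/env bash
    # Hypothetical stand-in for the sample script: it reads no arguments,
    # and whatever it writes to standard output is what the tool collects
    # for an ``out:stdout`` runtime configuration.
    echo "job started on $(hostname) at $(date)"
    sleep 5   # simulate some work
    echo "job finished at $(date)"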
Also note that although the script does not take any arguments, we pass 10
underscores (``_``) separated by commas as placeholder arguments. This tells
the tool that 10 instances of the script should be executed:

.. code:: console

    $ cms vcluster run-script --script-path=./cm4/vcluster/sample_scripts/inf_script_stdin_stdout.sh \
          --job-name=TestJob1 --vcluster-name=vcluster1 --config-name=ParamInStdOut \
          --arguments=_,_,_,_,_,_,_,_,_,_ --remote-path=~/ \
          --local-path=./cm4/vcluster/sample_output --overwrite
    Remote Pid on c2: 10104
    Remote Pid on c2: 10109
    Remote Pid on c2: 10402
    Remote Pid on computer_a: 8973
    Remote Pid on computer_a: 8979
    Remote Pid on computer_a: 8983
    Remote Pid on computer_a: 9464
    Remote Pid on c2: 10884
    Remote Pid on c2: 10993
    Remote Pid on computer_a: 9592
    collecting results
    waiting for other results if any...
    Results collected from c2.
    Results collected from c2.
    Results collected from c2.
    Results collected from computer_a.
    Results collected from computer_a.
    Results collected from computer_a.
    Results collected from computer_a.
    Results collected from c2.
    Results collected from c2.
    Results collected from computer_a.
    waiting for other results if any...
    All of the remote results collected.

As you can see, all of the jobs were submitted (using 5 processes) and the
results were collected afterwards (using 3 processes). We can check that the
results exist:

.. code:: console

    $ ll ./cloudmesh-cloud/vcluster/sample_output/
    total 48
    drwxr-xr-x 2 corriel 4096 Oct 31 22:12 ./
    drwxr-xr-x 8 corriel 4096 Oct 31 22:12 ../
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_0_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_1_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_2_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_3_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_4_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_5_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_6_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_7_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_8_20181031_22123465
    -rw-r--r-- 1 corriel 255 Oct 31 22:12 outputfile_9_20181031_22123465

Now suppose the jobs take so long that we cannot wait for their results and
have to download them later. To prepare for this scenario, we set the
``download-later`` attribute of the runtime configuration to ``true``:

.. code:: console

    $ cms vcluster set-param runtime-config ParamInStdOut download-later true
    Runtime-configuration parameter download-later set to true successfully.
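To double-check the change, you can list the runtime configurations again,
exactly as shown earlier:

.. code:: console

    $ cms vcluster list runtime-configs 2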
Now that this parameter is set, we can submit the jobs again, and this time
the tool will not wait for the results:

.. code:: console

    $ cms vcluster run-script --script-path=./cloudmesh-cloud/vcluster/sample_scripts/inf_script_stdin_stdout.sh \
          --job-name=TestJob1 --vcluster-name=vcluster1 --config-name=ParamInStdOut \
          --arguments=_,_,_,_,_,_,_,_,_,_ --remote-path=~/ \
          --local-path=./cloudmesh-cloud/vcluster/sample_output --overwrite
    Remote Pid on c2: 12981
    Remote Pid on c2: 12987
    Remote Pid on c2: 13280
    Remote Pid on computer_a: 11858
    Remote Pid on computer_a: 11942
    Remote Pid on computer_a: 11945
    Remote Pid on computer_a: 12300
    Remote Pid on c2: 13795
    Remote Pid on computer_a: 12427
    Remote Pid on c2: 13871

As you can see, the jobs are submitted and the command returns right away.
Note that because a job with this exact job name already exists, the
submission is only accepted when the ``--overwrite`` flag is used.

Once the submitted jobs have finished and their results are ready, we can
fetch them with the ``fetch`` command. All results are collected using the
same number of processes that was specified in the runtime configuration
under which the job was originally submitted:

.. code:: console

    $ cms vcluster fetch TestJob1
    collecting results
    Results collected from c2.
    Results collected from c2.
    Results collected from c2.
    Results collected from computer_a.
    Results collected from computer_a.
    Results collected from computer_a.
    Results collected from c2.
    Results collected from computer_a.
    Results collected from computer_a.
    Results collected from c2.
    waiting for other results if any...
    All of the remote results collected.

Cleaning the remote
-------------------

By default, the Virtual Cluster tool does not clean the remote nodes
automatically; this task is left to be done manually so that important
results are not lost by mistake. To clean the remotes, explicitly run the
``clean-remote`` command for a specific job. Only the results of that
particular job are removed from **ALL** remotes, using the given number of
parallel processes (4 in the example below):

.. code:: console

    $ cms vcluster clean-remote TestJob1 4
    Node c2 cleaned successfully.
    Node computer_a cleaned successfully.
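Finally, once the results have been fetched and the remotes cleaned, the
runtime configuration and the virtual cluster themselves can be removed with
the ``destroy`` sub-commands shown earlier, for example:

.. code:: console

    $ cms vcluster destroy runtime-config ParamInStdOut
    $ cms vcluster destroy virtual-cluster vcluster1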