
Motivation

Some researchers run complex computational workflows on their own computers. Individual tasks of these workflows can be compute-intensive, and some workflows launch a whole batch of computations, e.g. a parameter sweep, which can easily overload a single computer. Re-implementing and running the entire workflow in the Pan cluster environment may prove difficult. For that reason, the Centre for eResearch provides a set of tools that offload compute-intensive tasks to the Pan cluster and that can be integrated into a workflow running in the researcher's environment.

These tools provide functionality to

  • Upload files to the cluster file system
  • Submit a batch of jobs onto the cluster
  • Cancel a batch of jobs running on the cluster
  • Download files from the cluster file system
  • Clean up job directories on the cluster file system

Features: Command-line executables for Windows, Mac and Linux.

Installation

Microsoft Windows

MSI installer

Installation on network drives is not supported. If you need to install on a network drive, you'll be better off downloading the zip archive with the executables, as described below.

Download the CerToolkit Microsoft Installer, and double-click it to run the installer. This will install the rjm tools and a few software packages they depend on.

During the installation the following properties are configurable:

  • Installation directory
  • Adding the rjm tools to the system path.
    By default, the installation directory is not added to the system path, because modifying the system path is controversial and often considered bad practice.

After the installation has finished, there will be a new entry in the Start Menu: CeR Toolkit --> CeR Toolkit Shell.
If you launch this program, a Windows command window will open in which the rjm command-line scripts are on the path, i.e. you don't have to call them using the absolute path to the installation directory.

The environment variable %CER_TOOLKIT_PATH% is set during the installation. The tools use it to find the dependency tools puttygen and pageant. It may also make your life easier if you don't want to append the installation directory to the system path and cannot use the CeR Toolkit Shell either.

Download zip archive with executables

If the MSI installer doesn't allow for the flexibility you need, you can download the executables in a zip archive, and extract them to a directory of your choice.

This directory must be on the system path though.

Other operating systems

Please contact the support team if you are interested in installing these tools on other operating systems, like Linux or Mac.

Configuration

In order to use the tools, a one-time configuration step has to be performed by each user who will use the tools, on each computer the tools are installed on.

You'll be asked for the following information:

  • Passphrase for private key
    The tools use an SSH key pair to securely manage your jobs on the cluster and to upload and download files. The private key of that SSH key pair must be protected by a passphrase. If it were not protected, anybody who got hold of your key pair, e.g. a hacker, could access the cluster using your account and use your data. Your passphrase will not be stored anywhere; it is only used in the creation of the private key.
  • Name of cluster login node
    Remember, you'll run these tools on your computer, and the tools have to connect to the cluster. So the tools need to know which computer to talk to. Choose the cluster login node.
  • Your cluster login name (UPI)
    The identity used to connect to the cluster. Set it to your cluster account name (your UPI if you are affiliated with the University of Auckland).
  • Default project code
    In order to submit jobs, you have to specify a project code. You can specify the project code each time you submit jobs, but you can also configure a default project code to be used.
    Sample project code: uoa00042
  • Default remote directory
    For each job, a directory will be created on the cluster file system. Choose a default remote directory in which these job directories will be created.
    Just like with the project code, you can specify a remote directory each time you submit jobs, in case you don't want to use the default remote directory. A default remote directory will make your life easier though.
  • Name of file in each job directory to specify files to be uploaded
    Name of the file that contains a list of all files to be uploaded from your computer to the cluster file system before the job is submitted. This file is stored in each job directory of the batch on your computer. It doesn't have to exist. If it doesn't exist, no files will be uploaded.
    Default name: rjm_uploads.txt
  • Name of file in each job directory to specify files to be downloaded
    Name of the file that contains a list of all files to be downloaded from the cluster file system to your computer after the job has finished. This file is stored in each job directory of the batch on your computer. It doesn't have to exist. If it doesn't exist, no files will be downloaded.
    Default name: rjm_downloads.txt
  • University Password which goes with your UPI
    In order to set up SSH key-based access, the public key of the newly generated key pair needs to be copied to the cluster file system. Your password is required for this. Your password will not be stored anywhere; it is only used to copy your public SSH key in the initial configuration step.

Microsoft Windows

In the Start Menu click on CeR Toolkit --> CeR Toolkit Shell and run the following command in the new Windows command window:

rjm_configure

Follow the instructions in the command window to complete the configuration step.

Don't worry if a few windows pop up during generation of the SSH key pair.

A central configuration file will be created in %USERPROFILE%\.remote_jobs\config.ini to store configuration parameters. If you wish to change some of the configuration parameters, edit this file.
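Before editing the file by hand, it can help to see what it currently contains. The following is a minimal sketch, assuming the file uses standard INI syntax as the .ini extension suggests. The function name load_rjm_config is hypothetical and not part of the rjm tools, and no particular section or key names are assumed.

```python
import configparser
import os
from pathlib import Path


def load_rjm_config(path=None):
    """Parse the configuration file and return it as {section: {key: value}}.

    Hypothetical helper, not part of the rjm tools.
    """
    if path is None:
        # Location given in the text above. USERPROFILE is only set on
        # Windows, so fall back to the home directory elsewhere.
        base = os.environ.get("USERPROFILE", str(Path.home()))
        path = Path(base) / ".remote_jobs" / "config.ini"
    # Disable interpolation so that values containing '%' are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}
```

Calling load_rjm_config() without arguments reads the default location and raises FileNotFoundError if rjm_configure has not been run yet.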

Other operating systems

Please contact the support team if you are interested in configuration of these tools on other operating systems, like Linux or Mac.

Test the Installation

Download the test archive, which consists of 5 simple jobs, and a batch script to

  • Submit the jobs. Each job runs a very simple Octave (free Matlab clone) script to transpose a matrix. Each job works on a different matrix.
  • Wait until the jobs are done and download the results
  • Clean up the remote job directories on the cluster file system

Microsoft Windows

Extract the test archive and launch the CeR Toolkit Shell from the Start Menu. Change into the folder where you extracted the archive, and run run.bat.

If you run this script for the first time, you will be prompted for the passphrase of your private key. In subsequent calls of the rjm tools, the passphrase will be fetched from the ssh agent, until you reboot the machine, or restart the ssh agent.

Inspect run.bat and the other files and folders to get a feel for how things work.

Other operating systems

Please contact the support team if you are interested in installing and testing these tools on another operating system, like Linux or Mac.

Batch Workflow - High-Level Overview

There are many ways to submit a batch of jobs. This section describes the conventions and mechanisms used by the rjm tools. Understanding these conventions will make the usage of the tools easier.

Submit a batch of jobs

Each job to be submitted to the cluster must have its own directory on your computer. Specify the names of these directories in a file, one name per line.

The submission tool will read this file and for each directory will

  • Create a matching job directory on the cluster file system and create a job description
  • Optionally upload all input files required by the job. Input files for a job are read from an input file list, located in the job directory on your computer. Relative or absolute paths can be specified in this list.
  • Launch the job on the cluster
  • Create a job configuration file in the job directory on your computer, containing the remote job directory and the job id
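The local layout that the submission tool expects can be prepared by a script in your own workflow. The following is a minimal sketch for a small parameter sweep. The names prepare_batch, jobdirs.txt, job_n<size> and input.txt are hypothetical, and the one-path-per-line format of the input file list is an assumption. rjm_uploads.txt is the default name mentioned in the Configuration section. Absolute paths are written so that the result does not depend on how relative paths are resolved.

```python
from pathlib import Path


def prepare_batch(root, matrix_sizes, uploads_name="rjm_uploads.txt"):
    """Create one local job directory per parameter value.

    Returns the path of the file that lists all job directories,
    one per line. All file and directory names are illustrative.
    """
    root = Path(root)
    job_dirs = []
    for size in matrix_sizes:
        job_dir = root / f"job_n{size}"
        job_dir.mkdir(parents=True, exist_ok=True)

        # Hypothetical input file holding this job's parameter.
        input_file = job_dir / "input.txt"
        input_file.write_text(f"{size}\n", encoding="utf-8")

        # Input file list (assumed format: one path per line).
        uploads = job_dir / uploads_name
        uploads.write_text(f"{input_file.resolve()}\n", encoding="utf-8")
        job_dirs.append(job_dir)

    # File with the names of the job directories, one name per line.
    list_file = root / "jobdirs.txt"
    list_file.write_text(
        "".join(f"{d.resolve()}\n" for d in job_dirs), encoding="utf-8"
    )
    return list_file
```

The returned file is the list of directories that is then handed to the submission tool.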

Download results when jobs have finished

Once the jobs have been submitted to the cluster, you typically want to download the results once the jobs are done.

The tool to do this uses the list of directories. For each job it periodically checks the status on the cluster and downloads the results once the job is done. It uses the job configuration file in the local job directory to identify the job id and the remote job directory on the cluster.

Files to be downloaded must be specified in an output file list, located in the job directory on your computer. Relative or absolute paths can be specified in this list (typically relative paths make more sense, because presumably all files created are located in the job directory on the cluster filesystem).
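The output file lists can be generated for a whole batch in one go. The following is a minimal sketch, assuming the list holds one path per line and that every job produces files with the same names. The function name write_download_lists is hypothetical; rjm_downloads.txt is the default name mentioned in the Configuration section.

```python
from pathlib import Path


def write_download_lists(list_file, output_names,
                         downloads_name="rjm_downloads.txt"):
    """Write an output file list into every job directory in list_file.

    output_names are paths relative to the remote job directory.
    Returns the paths of the files that were written.
    """
    written = []
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the directory list
        target = Path(line) / downloads_name
        # Assumed format: one path per line.
        target.write_text(
            "".join(f"{name}\n" for name in output_names), encoding="utf-8"
        )
        written.append(target)
    return written
```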

Cancel a batch of jobs

Sometimes you submit a batch of jobs, and realize later that something went wrong, and want to cancel all these jobs.

The tool to cancel a batch of jobs uses the list of directories and for each directory cancels the matching job on the cluster. It waits until all jobs have been cancelled, but does not download any results.

Clean up a batch of jobs

After a batch of jobs has finished, the results have been downloaded, and you have verified that everything looks good, you may want to delete the job directories on the cluster filesystem to save disk space.

The tool to clean up a batch of jobs on the cluster filesystem uses the list of directories and for each directory will remove the remote job directory on the cluster filesystem.
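Because cleaning up removes the remote copies, a quick local check beforehand can be useful. The following is a minimal sketch that reports files named in the output file lists that are not present locally. It assumes that relative entries are downloaded into the local job directory under the same relative path, which this page does not state, and it skips absolute (cluster) paths because their local location is unknown. The function name missing_results is hypothetical.

```python
from pathlib import Path, PurePosixPath


def missing_results(list_file, downloads_name="rjm_downloads.txt"):
    """Return expected result files that are missing on the local computer."""
    missing = []
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        job_dir = Path(line)
        downloads = job_dir / downloads_name
        if not downloads.exists():
            continue  # no output file list, so nothing was to be downloaded
        for entry in downloads.read_text(encoding="utf-8").splitlines():
            entry = entry.strip()
            # Skip blank lines and absolute paths on the cluster.
            if not entry or PurePosixPath(entry).is_absolute():
                continue
            # Assumption: the file lands in the local job directory.
            if not (job_dir / entry).exists():
                missing.append(job_dir / entry)
    return missing
```

If the returned list is empty, every relative entry was found locally. Whether the contents are correct still has to be judged by you.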

Overview Diagram

[Diagram: rjm_architecture]

Batch Workflow - By Example

See here for concrete examples of how the tools are used.

Command-line Options Explained

See here for a detailed overview of the command-line options of the tools.

Selected Concepts In Detail

See here for a detailed overview of certain concepts, like logging, security, configuration files, etc.

Code Repository

https://github.com/mondkaefer/rjm

Todo List / Issues

https://github.com/mondkaefer/rjm/issues

 
