13. Additional Rocoto Information
The tasks in the SRW Application (Table 4.6) are typically run using the Rocoto Workflow Manager. Rocoto is a Ruby program that interfaces with the batch system on an HPC system to run and manage dependencies between the tasks. Rocoto submits jobs to the HPC batch system as the task dependencies allow, and runs one instance of the workflow for a set of user-defined cycles. More information on Rocoto can be found at https://github.com/christopherwharrop/rocoto/wiki/documentation.
The SRW App workflow is defined in a Jinja-enabled Rocoto XML template, which resides in the
regional_workflow/ufs/templates directory. When the workflow generation
script is run, the
fill_jinja_template.py script is called, and the parameters in the template file
are filled in. The completed file contains the workflow task names, parameters needed by the job scheduler,
and task interdependencies. The generated XML file is then copied to the experiment directory ($EXPTDIR).
A number of Rocoto commands are available to run and monitor the workflow; they are described in the complete Rocoto documentation. The most commonly used commands are described below, with examples.
The rocotorun command is used to run the workflow by submitting tasks to the batch system. It will
automatically resubmit failed tasks and can recover from system outages without user intervention.
An example is:
rocotorun -w /path/to/workflow/xml/file -d /path/to/workflow/database/file -v 10
where:
-w specifies the name of the workflow definition file. This must be an XML file.
-d specifies the name of the database file that is to be used to store the state of the workflow. The database file is a binary file created and used only by Rocoto and need not exist prior to the first time the command is run.
-v (optional) specifies the level of verbosity. If no level is specified, a level of 1 is used.
When run from the $EXPTDIR directory, the
rocotorun command for the workflow would be:
rocotorun -w FV3LAM_wflow.xml -d FV3LAM_wflow.db
It is important to note that the
rocotorun process is iterative; the command must be executed
many times before the entire workflow is completed, usually every 2-10 minutes. This command can be
placed in the user’s crontab and cron will call it with a specified frequency. More information on
this command can be found at https://github.com/christopherwharrop/rocoto/wiki/documentation.
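As a rough sketch of the cron approach described above, the hypothetical helper below prints a crontab entry that changes to the experiment directory and calls rocotorun at a fixed interval. The function name cron_entry, the interval, and the example path are illustrative assumptions; the workflow file names are the ones used elsewhere in this chapter.

```shell
# Hypothetical helper (not part of Rocoto or the SRW App):
# print a crontab line that runs rocotorun every N minutes.
#   $1 = interval in minutes (for example, 3)
#   $2 = path to the experiment directory ($EXPTDIR)
cron_entry() {
  printf '*/%s * * * * cd %s && rocotorun -w FV3LAM_wflow.xml -d FV3LAM_wflow.db\n' "$1" "$2"
}
```

For example, `cron_entry 3 /path/to/experiment/directory` prints an entry that fires every 3 minutes; the printed line could then be pasted into the table opened by `crontab -e`. Because cron usually runs jobs with a minimal environment, the directory containing rocotorun may need to be added to PATH in the crontab.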
The first time the
rocotorun command is executed for a workflow, the files
FV3LAM_wflow.db and FV3LAM_wflow_lock.db are created. There is usually no need for the user to modify these files.
Each time this command is executed, the last known state of the workflow is read from the
FV3LAM_wflow.db file, the batch system is queried, jobs are submitted for tasks whose dependencies have been satisfied,
and the current state of the workflow is saved in
FV3LAM_wflow.db. If there is a need to relaunch
the workflow from scratch, both database files can be deleted, and the workflow can be run using
rocotorun or the launch script
launch_FV3LAM_wflow.sh (executed multiple times as described above).
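The relaunch-from-scratch step can be sketched as a small shell function. This is a minimal sketch, assuming the two database files live directly in the experiment directory; the function name reset_wflow_db is hypothetical.

```shell
# Hypothetical helper: delete both Rocoto database files so that the
# next rocotorun call starts the workflow from scratch.
#   $1 = path to the experiment directory ($EXPTDIR)
reset_wflow_db() {
  # -f: do not complain if a file is already absent
  rm -f "$1/FV3LAM_wflow.db" "$1/FV3LAM_wflow_lock.db"
}
```

After `reset_wflow_db "$EXPTDIR"`, the workflow would be restarted with rocotorun (or the launch script) as described above. The workflow XML file is left untouched.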
rocotostat is a tool for querying the status of tasks in an active Rocoto workflow. Once the
workflow has been started with the
rocotorun command, Rocoto can also check the status of the
workflow using the rocotostat command:
rocotostat -w /path/to/workflow/xml/file -d /path/to/workflow/database/file
Executing this command will generate a workflow status table similar to the following:
CYCLE          TASK             JOBID                 STATE        EXIT STATUS   TRIES   DURATION
=================================================================================================
201907010000   make_grid        175805                QUEUED       -             0       0.0
201907010000   make_orog        -                     -            -             -       -
201907010000   make_sfc_climo   -                     -            -             -       -
201907010000   get_extrn_ics    druby://hfe01:36261   SUBMITTING   -             0       0.0
201907010000   get_extrn_lbcs   druby://hfe01:36261   SUBMITTING   -             0       0.0
201907010000   make_ics         -                     -            -             -       -
201907010000   make_lbcs        -                     -            -             -       -
201907010000   run_fcst         -                     -            -             -       -
201907010000   run_post_f000    -                     -            -             -       -
201907010000   run_post_f001    -                     -            -             -       -
201907010000   run_post_f002    -                     -            -             -       -
201907010000   run_post_f003    -                     -            -             -       -
201907010000   run_post_f004    -                     -            -             -       -
201907010000   run_post_f005    -                     -            -             -       -
201907010000   run_post_f006    -                     -            -             -       -
This table indicates that the
make_grid task was sent to the batch system and is now queued, while the
get_extrn_ics and get_extrn_lbcs tasks for the
201907010000 cycle are in the process of being
submitted to the batch system.
Note that issuing a
rocotostat command without an intervening
rocotorun command will not result in an
updated workflow status table; it will print out the same table. It is the
rocotorun command that updates
the workflow database file (in this case
FV3LAM_wflow.db, located in
$EXPTDIR). The rocotostat command only reads the database file and prints the table to the screen. To see an updated table, the
rocotorun command must be executed, followed by the rocotostat command.
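Since rocotostat only reports what the last rocotorun call recorded, the two commands are often issued together. A minimal sketch of that pairing follows; the wrapper name wflow_status is hypothetical, and only the -w and -d options shown earlier are used.

```shell
# Hypothetical wrapper: update the workflow state, then print the status table.
#   $1 = workflow XML file, $2 = workflow database file
wflow_status() {
  # rocotostat runs only if rocotorun exits successfully
  rocotorun -w "$1" -d "$2" && rocotostat -w "$1" -d "$2"
}
```

From the $EXPTDIR directory, this could be called as `wflow_status FV3LAM_wflow.xml FV3LAM_wflow.db`.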
After issuing the
rocotorun command several times (over the course of several minutes or longer, depending
on your grid size and computational resources), the output of the
rocotostat command should look like this:
CYCLE          TASK             JOBID    STATE       EXIT STATUS   TRIES   DURATION
===================================================================================
201907010000   make_grid        175805   SUCCEEDED   0             1       10.0
201907010000   make_orog        175810   SUCCEEDED   0             1       27.0
201907010000   make_sfc_climo   175822   SUCCEEDED   0             1       38.0
201907010000   get_extrn_ics    175806   SUCCEEDED   0             1       37.0
201907010000   get_extrn_lbcs   175807   SUCCEEDED   0             1       53.0
201907010000   make_ics         175825   SUCCEEDED   0             1       99.0
201907010000   make_lbcs        175826   SUCCEEDED   0             1       90.0
201907010000   run_fcst         175937   RUNNING     -             0       0.0
201907010000   run_post_f000    -        -           -             -       -
201907010000   run_post_f001    -        -           -             -       -
201907010000   run_post_f002    -        -           -             -       -
201907010000   run_post_f003    -        -           -             -       -
201907010000   run_post_f004    -        -           -             -       -
201907010000   run_post_f005    -        -           -             -       -
201907010000   run_post_f006    -        -           -             -       -
When the workflow runs to completion, all tasks will be marked as SUCCEEDED. The log files from the tasks
are located in
$EXPTDIR/log. If any tasks fail, the corresponding log file can be checked for error
messages. Optional arguments for the
rocotostat command can be found at https://github.com/christopherwharrop/rocoto/wiki/documentation.
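To find which log files to inspect, the status table can be filtered for failed tasks. The sketch below assumes the table layout shown above, with STATE as the fourth whitespace-separated column, and assumes that failed tasks are reported with the state DEAD; the function name dead_tasks is hypothetical.

```shell
# Hypothetical filter: read a rocotostat table on standard input and
# print "CYCLE TASK" for every row whose STATE column is DEAD.
dead_tasks() {
  # The header and separator lines never have "DEAD" in field 4, so they are skipped.
  awk '$4 == "DEAD" { print $1, $2 }'
}
```

A possible use from the $EXPTDIR directory is `rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db | dead_tasks`; the log files for the listed tasks can then be checked in $EXPTDIR/log.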
Sometimes, issuing a
rocotorun command will not cause the next task to launch.
rocotocheck is a
tool that can be used to query detailed information about a task or cycle in the Rocoto workflow. To
determine the cause of a particular task not being submitted, the
rocotocheck command can be used
from the $EXPTDIR directory as follows:
rocotocheck -w /path/to/workflow/xml/file -d /path/to/workflow/database/file -c YYYYMMDDHHMM -t taskname
where:
-c is the cycle to query
-t is the task name
A specific example is:
rocotocheck -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 -c 201907010000 -t run_fcst
This will result in output similar to the following:
Task: run_fcst
  account: gsd-fv3
  command: /scratch2/BMC/det/$USER/ufs-srweather-app/regional_workflow/ush/load_modules_run_task.sh "run_fcst" "/scratch2/BMC/det/$USER/ufs-srweather-app/regional_workflow/jobs/JREGIONAL_RUN_FCST"
  cores: 24
  final: false
  jobname: run_FV3
  join: /scratch2/BMC/det/$USER/expt_dirs/test_community/log/run_fcst_2019070100.log
  maxtries: 3
  name: run_fcst
  nodes: 1:ppn=24
  queue: batch
  throttle: 9999999
  walltime: 04:30:00
  environment
    CDATE ==> 2019070100
    CYCLE_DIR ==> /scratch2/BMC/det/$USER/UFS_CAM/expt_dirs/test_community/2019070100
    PDY ==> 20190701
    SCRIPT_VAR_DEFNS_FP ==> /scratch2/BMC/det/$USER/expt_dirs/test_community/var_defns.sh
  dependencies
    AND is satisfied
      make_ICS_surf_LBC0 of cycle 201907010000 is SUCCEEDED
      make_LBC1_to_LBCN of cycle 201907010000 is SUCCEEDED

Cycle: 201907010000
  Valid for this task: YES
  State: active
  Activated: 2019-10-29 18:13:10 UTC
  Completed: -
  Expired: -

Job: 513615
  State: DEAD (FAILED)
  Exit Status: 1
  Tries: 3
  Unknown count: 0
  Duration: 58.0
This shows that although all dependencies for this task are satisfied (see the dependencies section of the output),
it cannot run because its
maxtries value is 3. Rocoto will attempt to launch it at most 3 times,
and it has already been tried 3 times (see the
Tries value under Job).
The output of the
rocotocheck command is often useful in determining whether the dependencies for a given task
have been met. If not, the dependencies section in the output of
rocotocheck will indicate this by stating that a
dependency “is NOT satisfied”.
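The dependency check described above can be sketched as a filter over the rocotocheck output. The only assumption about the output format is the phrase "is NOT satisfied" quoted above; the function name unmet_deps is hypothetical.

```shell
# Hypothetical filter: read rocotocheck output on standard input and print
# the lines that report an unsatisfied dependency.
# Exit status is 0 if at least one such line is found, non-zero otherwise.
unmet_deps() {
  grep 'is NOT satisfied'
}
```

For example, `rocotocheck -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -c 201907010000 -t run_fcst | unmet_deps` would print nothing for the output shown above, because the dependencies section there reports "is satisfied".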
rocotorewind is a tool that attempts to undo the effects of running a task and is commonly used to rerun part
of a workflow that has failed. If a task fails to run (the STATE is DEAD) and needs to be restarted, the
rocotorewind command will rerun tasks in the workflow. The command line options are the same as those described in
section 13.3 (rocotocheck), and the general usage statement looks like:
rocotorewind -w /path/to/workflow/xml/file -d /path/to/workflow/database/file -c YYYYMMDDHHMM -t taskname
Running this command will edit the Rocoto database file
FV3LAM_wflow.db to remove evidence that the job has been run.
rocotorewind is recommended over
rocotoboot for restarting a task, since
rocotoboot will force a specific
task to run, ignoring all dependencies and throttle limits. The throttle limit, denoted by the variable cyclethrottle
in the FV3LAM_wflow.xml file, limits how many cycles can be active at one time. An example of how to use the
rocotorewind command to rerun the forecast task from $EXPTDIR is:
rocotorewind -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 -c 201907010000 -t run_fcst
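When several tasks of one cycle need to be rerun, the rewind step can be repeated per task. The sketch below is one possible way to do that, calling rocotorewind once per task with the -w, -d, -c, and -t options shown above; the function name rewind_tasks is hypothetical, and the variables it sets are ordinary shell globals.

```shell
# Hypothetical helper: rewind one or more tasks of a single cycle.
#   $1 = workflow XML file, $2 = workflow database file,
#   $3 = cycle (YYYYMMDDHHMM), remaining arguments = task names
rewind_tasks() {
  xml=$1
  db=$2
  cycle=$3
  shift 3
  for task in "$@"; do
    # Stop at the first task that cannot be rewound
    rocotorewind -w "$xml" -d "$db" -c "$cycle" -t "$task" || return 1
  done
}
```

For example, `rewind_tasks FV3LAM_wflow.xml FV3LAM_wflow.db 201907010000 make_ics make_lbcs` rewinds both tasks. Since it is rocotorun that submits jobs, a later rocotorun call would presumably be needed before the rewound tasks actually run again.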
rocotoboot will force a specific task of a cycle in a Rocoto workflow to run. All dependencies and throttle
limits are ignored, and it is generally recommended to use
rocotorewind instead. An example of how to
use this command to rerun the
make_ics task from $EXPTDIR is:
rocotoboot -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 -c 201907010000 -t make_ics