3. Container-Based Quick Start Guide
This Container-Based Quick Start Guide will help users build and run the “out-of-the-box” case for the Unified Forecast System (UFS) Short-Range Weather (SRW) Application using a Singularity container. The container approach provides a uniform environment in which to build and run the SRW App. Normally, the details of building and running the SRW App vary from system to system due to the many possible combinations of operating systems, compilers, MPIs, and package versions available. Installation via Singularity container reduces this variability and allows for a smoother SRW App build experience. Typically, containers can only run on a single compute node and are not compatible with the Rocoto workflow manager, so users must run each task in the workflow manually. However, the Singularity container described in this chapter has been adapted so that it can run across multiple nodes using Rocoto. This makes it an excellent starting point for beginners. The non-container build approach may still be more appropriate for users who desire additional customizability, particularly if they already have experience running the SRW App.
The “out-of-the-box” SRW App case described in this User’s Guide builds a weather forecast for June 15-16, 2019. Multiple convective weather events during these two days produced over 200 filtered storm reports. Severe weather was clustered in two areas: the Upper Midwest through the Ohio Valley and the Southern Great Plains. This forecast uses a predefined 25-km Continental United States (CONUS) grid (RRFS_CONUS_25km), the Global Forecast System (GFS) version 16 physics suite (FV3_GFS_v16 CCPP), and FV3-based GFS raw external model data for initialization.
Attention
The SRW Application has four levels of support. The steps described in this chapter will work most smoothly on preconfigured (Level 1) systems. However, this guide can serve as a starting point for running the SRW App on other systems, too.
This chapter of the User’s Guide should only be used for container builds. For non-container builds, see Chapter 2 for a Quick Start Guide or Chapter 4 for a detailed guide to building the SRW App without a container.
3.1. Download the Container
3.1.1. Prerequisites:
Users must have an Intel compiler and MPI (available for free as part of Intel’s oneAPI Toolkit) in order to run the SRW App in the container using the method described in this chapter. Additionally, it is recommended that users install the Rocoto workflow manager on their system in order to take advantage of automated workflow options. Although it is possible to run an experiment without Rocoto, and some tips are provided, the only fully supported and tested container option for the develop branch assumes that Rocoto is pre-installed.
3.1.1.1. Install Singularity
To build and run the SRW App using a Singularity container, first install the Singularity package according to the Singularity Installation Guide. This includes installing dependencies and the Go programming language. SingularityCE Version 3.7 or above is recommended.
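To confirm that the installation succeeded and that the installed version meets this recommendation, users can run:
singularity --version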
Warning
Docker containers can only be run with root privileges, and users cannot have root privileges on HPCs. Therefore, it is not possible to build the SRW App, which uses the HPC-Stack, inside a Docker container on an HPC system. However, a Singularity image may be built directly from a Docker image for use on the system.
3.1.2. Working in the Cloud or on HPC Systems
For users working on systems with limited disk space in their /home directory, it is recommended to set the SINGULARITY_CACHEDIR and SINGULARITY_TMPDIR environment variables to point to a location with adequate disk space. For example:
export SINGULARITY_CACHEDIR=</absolute/path/to/writable/directory/cache>
export SINGULARITY_TMPDIR=</absolute/path/to/writable/directory/tmp>
where /absolute/path/to/writable/directory/ refers to a writable directory (usually a project or user directory within /lustre, /work, /scratch, or /glade on NOAA Level 1 systems). If the cache and tmp directories do not exist already, they must be created with a mkdir command.
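For instance, a user with access to a scratch or project space might run something like the following (the paths below are hypothetical; substitute any writable directory with sufficient space):
mkdir -p /scratch/$USER/singularity/cache /scratch/$USER/singularity/tmp
export SINGULARITY_CACHEDIR=/scratch/$USER/singularity/cache
export SINGULARITY_TMPDIR=/scratch/$USER/singularity/tmp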
On NOAA Cloud systems, the sudo su command may also be required:
mkdir /lustre/cache
mkdir /lustre/tmp
sudo su
export SINGULARITY_CACHEDIR=/lustre/cache
export SINGULARITY_TMPDIR=/lustre/tmp
exit
Note
/lustre is a fast but non-persistent file system used on NOAA Cloud systems. To retain work completed in this directory, tar the files and move them to the /contrib directory, which is much slower but persistent.
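For example, the following commands create an archive of an experiment directory on /lustre and copy it to /contrib (the archive name, experiment directory, and destination subdirectory shown here are hypothetical):
cd /lustre
tar -czf test_community.tar.gz expt_dirs/test_community
cp test_community.tar.gz /contrib/$USER/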
3.1.3. Build the Container
Hint
If a singularity: command not found error message appears in any of the following steps, try running module load singularity.
3.1.3.1. Level 1 Systems
On most Level 1 systems, a container named ubuntu20.04-intel-srwapp-develop.img has already been built at the following locations:
| Machine | File location |
|---|---|
| Cheyenne | /glade/scratch/epicufsrt/containers |
| Hera | /scratch1/NCEPDEV/nems/role.epic/containers |
| Jet | /mnt/lfs4/HFIP/hfv3gfs/role.epic/containers |
| NOAA Cloud | /contrib/EPIC/containers |
| Orion | /work/noaa/epic-ps/role-epic-ps/containers |
Note
Singularity is not available on Gaea, and therefore container use is not supported on Gaea.
Users can simply copy the container to their local working directory. For example, on Hera:
cp /scratch1/NCEPDEV/nems/role.epic/containers/ubuntu20.04-intel-srwapp-develop.img .
Users may convert the container .img file to a writable sandbox. This step is required when running on Cheyenne but is optional on other systems:
singularity build --sandbox ubuntu20.04-intel-srwapp ubuntu20.04-intel-srwapp-develop.img
When making a writable sandbox on Level 1 systems, the following warnings commonly appear and can be ignored:
INFO: Starting build...
INFO: Verifying bootstrap image ubuntu20.04-intel-srwapp-develop.img
WARNING: integrity: signature not found for object group 1
WARNING: Bootstrap image could not be verified, but build will continue.
3.1.3.2. Level 2-4 Systems
On non-Level 1 systems, users should build the container in a writable sandbox:
sudo singularity build --sandbox ubuntu20.04-intel-srwapp docker://noaaepic/ubuntu20.04-intel-srwapp:develop
Some users may prefer to issue the command without the sudo prefix. Whether sudo is required is system-dependent.
Note
Users can choose to build a release version of the container (SRW App v2.1.0) using a similar command:
sudo singularity build --sandbox ubuntu20.04-intel-srwapp docker://noaaepic/ubuntu20.04-intel-srwapp:release-public-v2.1.0
3.1.4. Allocate a Compute Node
Users working on HPC systems that do not have Rocoto installed must install Rocoto or allocate a compute node. All other users may skip to the next step.
Note
All NOAA Level 1 systems have Rocoto pre-installed.
The appropriate commands for allocating a compute node will vary based on the user’s system and resource manager (e.g., Slurm, PBS). If the user’s system has the Slurm resource manager, the allocation command will follow this pattern:
salloc -N 1 -n <cores-per-node> -A <account> -t <time> -q <queue/qos> --partition=<system> [-M <cluster>]
For more information on the salloc command options, see Slurm’s documentation.
If users have the PBS resource manager installed on their system, the allocation command will follow this pattern:
qsub -I -lwalltime=<time> -A <account> -q <destination> -lselect=1:ncpus=36:mpiprocs=36
For more information on the qsub command options, see the PBS Manual §2.59.3 (p. 1416).
These commands should output a hostname. Users can then run ssh <hostname> to connect to the compute node, and from there they can run the container. To run larger experiments, it may be necessary to allocate multiple compute nodes.
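For example, on a Slurm-based system, a one-node interactive allocation might look like the following (the account, queue, and partition names below are placeholders; substitute values that are valid on the user’s system):
salloc -N 1 -n 24 -A myaccount -t 01:00:00 -q batch --partition=compute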
3.1.5. Start Up the Container
Copy stage-srw.sh from the container to the local working directory:
singularity exec -B /<local_base_dir>:/<container_dir> ./<container_name> cp /opt/ufs-srweather-app/container-scripts/stage-srw.sh .
where <container_name> is the name of the sandbox directory (i.e., ubuntu20.04-intel-srwapp) or the name of the .img container file.
Hint
On Jet, users may need to bind to an lfs directory (e.g., /lfs4) rather than /mnt.
If the command worked properly, stage-srw.sh should appear in the local directory. The command above also binds the local directory to the container so that data can be shared between them. On Level 1 systems, <local_base_dir> is usually the topmost directory (e.g., /lustre, /contrib, /work, or /home). Additional directories can be bound by adding another -B /<local_base_dir>:/<container_dir> argument before the name of the container. In general, it is recommended that the local base directory and container directory have the same name. For example, if the host system’s top-level directory is /user1234, the user can create a user1234 directory in the container sandbox and then bind it:
mkdir <path/to/container>/user1234
singularity exec -B /user1234:/user1234 ./ubuntu20.04-intel-srwapp cp /opt/ufs-srweather-app/container-scripts/stage-srw.sh .
Attention
Be sure to bind the directory that contains the experiment data!
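For instance, to bind both a working directory and a separate data directory in a single command, add a second -B argument (both paths below are hypothetical):
singularity exec -B /user1234:/user1234 -B /data/input_model_data:/data/input_model_data ./ubuntu20.04-intel-srwapp cp /opt/ufs-srweather-app/container-scripts/stage-srw.sh .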
To explore the container and view available directories, users can either cd into the container and run ls (if it was built as a sandbox) or run the following commands:
singularity shell ./ubuntu20.04-intel-srwapp-develop.img
cd /
ls
The list of directories printed will be similar to this:
bin discover lfs lib media run singularity usr
boot environment lfs1 lib32 mnt sbin srv var
contrib etc lfs2 lib64 opt scratch sys work
data glade lfs3 libx32 proc scratch1 tmp
dev home lfs4 lustre root scratch2 u
Users can run exit to exit the shell.
3.2. Download and Stage the Data
The SRW App requires input files to run. These include static datasets, initial and boundary condition files, and model configuration files. On Level 1 systems, the data required to run SRW App tests are already available as long as the bind argument (starting with -B) in Section 3.1.5 included the directory with the input model data. See Table 5.1 for Level 1 data locations. For Level 2-4 systems, the data must be added manually by the user. Detailed instructions on how to add the data can be found in Section 9.3. Sections 9.1 and 9.2 contain useful background information on the input and output files used in the SRW App.
3.3. Generate the Forecast Experiment
To generate the forecast experiment, users must:
1. Activate the regional workflow
2. Set experiment parameters by configuring the workflow
3. Run a script to generate the experiment workflow
The first two steps depend on the platform being used and are described here for Level 1 platforms. Users on Level 2-4 platforms will need to adjust the instructions for their machine.
3.3.1. Activate the Regional Workflow
Copy the container’s modulefiles to the local working directory so that the files can be accessed outside of the container:
singularity exec -B /<local_base_dir>:/<container_dir> ./<container_name> cp -r /opt/ufs-srweather-app/modulefiles .
After this command runs, the local working directory should contain the modulefiles directory.
To activate the regional workflow, run the following commands:
module use <path/to/modulefiles>
module load wflow_<platform>
where:
<path/to/modulefiles> is replaced with the actual path to the modulefiles on the user’s local system (often $PWD/modulefiles), and
<platform> is a valid, lowercased machine/platform name (see the MACHINE variable in Section 11.1).
The wflow_<platform> modulefile will then output instructions to activate the regional workflow. The user should run the commands specified in the modulefile output. For example, if the output says:
Please do the following to activate conda:
> conda activate regional_workflow
then the user should run conda activate regional_workflow. This will activate the regional_workflow conda environment. The command(s) will vary from system to system, but the user should see (regional_workflow) in front of the Terminal prompt at this point.
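Putting these steps together on Hera, for example, and assuming the modulefiles directory was copied into the current working directory, the activation sequence might look like this:
module use $PWD/modulefiles
module load wflow_hera
conda activate regional_workflow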
3.3.2. Configure the Workflow
Run stage-srw.sh:
./stage-srw.sh -c=<compiler> -m=<mpi_implementation> -p=<platform> -i=<container_name>
where:
-c indicates the compiler on the user’s local machine (e.g., intel/2022.1.2)
-m indicates the MPI on the user’s local machine (e.g., impi/2022.1.2)
-p indicates the platform name of the local machine (e.g., hera, jet, noaacloud, mac). See MACHINE in Section 11.1 for a full list of options.
-i indicates the name of the container image that was built in Step 3.1.3 (ubuntu20.04-intel-srwapp or ubuntu20.04-intel-srwapp-develop.img by default).
For example, on Hera, the command would be:
./stage-srw.sh -c=intel/2022.1.2 -m=impi/2022.1.2 -p=hera -i=ubuntu20.04-intel-srwapp-develop.img
Attention
The user must have an Intel compiler and MPI on their system because the container uses an Intel compiler and MPI. Intel compilers are now available for free as part of Intel’s oneAPI Toolkit.
After this command runs, the working directory should contain srw.sh and a ufs-srweather-app directory.
From here, users can follow the steps below to configure the out-of-the-box SRW App case with an automated Rocoto workflow. For more detailed instructions on experiment configuration, users can refer to Section 5.3.2.2.
1. Copy the out-of-the-box case from config.community.yaml to config.yaml. This file contains basic information (e.g., forecast date, grid, physics suite) required for the experiment:

   cd ufs-srweather-app/ush
   cp config.community.yaml config.yaml

   The default settings include a predefined 25-km CONUS grid (RRFS_CONUS_25km), the GFS v16 physics suite (FV3_GFS_v16 CCPP), and FV3-based GFS raw external model data for initialization.

2. Edit the MACHINE and ACCOUNT variables in the user: section of config.yaml. See Section 11.1 for details on valid values.

   Note
   On Jet, users must also add PARTITION_DEFAULT: xjet and PARTITION_FCST: xjet to the platform: section of the config.yaml file.

3. To automate the workflow, add these two lines to the workflow: section of config.yaml:

   USE_CRON_TO_RELAUNCH: TRUE
   CRON_RELAUNCH_INTVL_MNTS: 3

   There are instructions for running the experiment via additional methods in Section 5.4. However, this technique (automation via crontab) is the simplest option.

   Note
   On Orion, cron is only available on the orion-login-1 node, so users will need to work on that node when running cron jobs on Orion.

4. Edit the task_get_extrn_ics: section of config.yaml to include the correct data paths to the initial conditions files. For example, on Hera, add:

   USE_USER_STAGED_EXTRN_FILES: true
   EXTRN_MDL_SOURCE_BASEDIR_ICS: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/${yyyymmddhh}
   EXTRN_MDL_FILES_ICS: []
   EXTRN_MDL_DATA_STORES: disk

   On other systems, users will need to change the paths for EXTRN_MDL_SOURCE_BASEDIR_ICS and EXTRN_MDL_SOURCE_BASEDIR_LBCS (below) to reflect the location of the system’s data. These locations are listed in Table 5.1 for Level 1 systems. Alternatively, the user can add the path to their local data if they downloaded it as described in Section 9.3.2.

5. Edit the task_get_extrn_lbcs: section of config.yaml to include the correct data paths to the lateral boundary conditions files. For example, on Hera, add:

   USE_USER_STAGED_EXTRN_FILES: true
   EXTRN_MDL_SOURCE_BASEDIR_LBCS: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/${yyyymmddhh}
   EXTRN_MDL_FILES_LBCS: []
   EXTRN_MDL_DATA_STORES: disk
3.3.3. Generate the Workflow
Attention
This section assumes that Rocoto is installed on the user’s machine. If it is not, the user will need to allocate a compute node (described in Section 3.1.4) and run the workflow using standalone scripts as described in Section 5.4.2.
Run the following command to generate the workflow:
./generate_FV3LAM_wflow.py
This workflow generation script creates an experiment directory and populates it with all the data needed to run through the workflow. The last line of output from this script should start with */3 * * * * (or similar).
The generated workflow will be in the experiment directory specified in the config.yaml file in Step 3.3.2. The default location is expt_dirs/test_community. To view experiment progress, users can cd to the experiment directory from ufs-srweather-app/ush and run the rocotostat command to check the experiment’s status:
cd ../../expt_dirs/test_community
rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10
Users can track the experiment’s progress by reissuing the rocotostat command above every so often until the experiment runs to completion. For users who do not have Rocoto installed, see Section 5.4.2 for information on how to run the workflow without Rocoto.
3.3.4. Troubleshooting
If a task goes DEAD, it will be necessary to restart it according to the instructions in Section 17.5. To determine what caused the task to go DEAD, users should view the log file for the task in $EXPTDIR/log/<task_log>, where <task_log> refers to the name of the task’s log file. After fixing the problem and clearing the DEAD task, it is sometimes necessary to reinitialize the crontab. Users can copy-paste the crontab command from the bottom of the $EXPTDIR/log.generate_FV3LAM_wflow file into the crontab:
crontab -e
i
*/3 * * * * cd /<path/to>/expt_dirs/test_community && ./launch_FV3LAM_wflow.sh called_from_cron="TRUE"
esc
:wq
enter
where:
<path/to> is replaced by the actual path to the user’s experiment directory, and
esc and enter refer to the escape and enter keys (not typed commands).
3.4. New Experiment
To run a new experiment in the container at a later time, users will need to rerun the commands in Section 3.3.1 to reactivate the regional workflow. Then, users can configure a new experiment by updating the environment variables in config.yaml to reflect the desired experiment configuration. Basic instructions appear in Section 3.3.2 above, and detailed instructions can be viewed in Section 5.3.2.2. After adjusting the configuration file, regenerate the experiment by running ./generate_FV3LAM_wflow.py.
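As a brief illustration, a user might adjust entries such as the following in the workflow: section of config.yaml to change the grid, physics suite, or forecast length (the variable names shown here are assumptions based on the standard out-of-the-box configuration and should be confirmed against Section 5.3.2.2):
workflow:
  PREDEF_GRID_NAME: RRFS_CONUS_25km    # assumed variable name for the predefined grid
  CCPP_PHYS_SUITE: FV3_GFS_v16         # assumed variable name for the physics suite
  FCST_LEN_HRS: 12                     # assumed variable name for forecast length (hours)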