1.2.2. Container-Based Quick Start Guide

This Container-Based Quick Start Guide will help users build and run the “out-of-the-box” case for the Unified Forecast System (UFS) Short-Range Weather (SRW) Application using a Singularity/Apptainer container. The container approach provides a uniform environment in which to build and run the SRW App. Normally, the details of building and running the SRW App vary from system to system due to the many possible combinations of operating systems, compilers, MPIs, and package versions available. Installation via container reduces this variability and allows for a smoother SRW App build experience.

The basic “out-of-the-box” case described in this User’s Guide builds a weather forecast for June 15-16, 2019. Multiple convective weather events during these two days produced over 200 filtered storm reports. Severe weather was clustered in two areas: the Upper Midwest through the Ohio Valley and the Southern Great Plains. This forecast uses a predefined 25-km Continental United States (CONUS) grid (RRFS_CONUS_25km), the Global Forecast System (GFS) version 16 physics suite (FV3_GFS_v16 CCPP), and FV3-based GFS raw external model data for initialization.

Attention

  • The SRW Application has four levels of support. The steps described in this chapter will work most smoothly on preconfigured (Level 1) systems. However, this guide can serve as a starting point for running the SRW App on other systems, too.

  • This chapter of the User’s Guide should only be used for container builds. For non-container builds, see Section 1.2.1 for a Quick Start Guide or Section 1.2.3 for a detailed guide to building the SRW App without a container.

1.2.2.1. Download the Container

1.2.2.1.1. Prerequisites

Intel Compiler and MPI

Users must have an Intel compiler and MPI (available for free here) in order to run the SRW App inside the provided container using the method described in this chapter. Additionally, it is recommended that users install the Rocoto workflow manager on their system in order to take advantage of automated workflow options. Although it is possible to run an experiment without Rocoto, and some tips are provided, the only fully supported and tested container option assumes that Rocoto is preinstalled.

Install Singularity/Apptainer

To build and run the SRW App using a Singularity/Apptainer container, first install the software according to the Apptainer Installation Guide. This will include the installation of all dependencies.

Note

As of November 2021, the Linux-supported version of Singularity has been renamed to Apptainer. Apptainer has maintained compatibility with Singularity, so singularity commands should work with either Singularity or Apptainer (see compatibility details here).

Attention

Docker containers can only be run with root privileges, and users cannot have root privileges on HPCs. Therefore, it is not possible to build the SRW App, which uses the spack-stack, inside a Docker container on an HPC system. However, a Singularity/Apptainer image may be built directly from a Docker image for use on the system.

1.2.2.1.2. Working in the Cloud or on HPC Systems

Users working on systems with limited disk space in their /home directory may need to set the SINGULARITY_CACHEDIR and SINGULARITY_TMPDIR environment variables to point to a location with adequate disk space. For example:

export SINGULARITY_CACHEDIR=/absolute/path/to/writable/directory/cache
export SINGULARITY_TMPDIR=/absolute/path/to/writable/directory/tmp

where /absolute/path/to/writable/directory/ refers to the absolute path to a writable directory with sufficient disk space. If the cache and tmp directories do not exist already, they must be created with a mkdir command. See Section 1.2.2.5.1 to view an example of how this can be done.

1.2.2.1.3. Build the Container

Hint

If a singularity: command not found error message appears when working on Level 1 platforms, try running: module load singularity or (on Derecho) module load apptainer.

1.2.2.1.3.1. Level 1 Systems

On most Level 1 systems, a container named ubuntu20.04-intel-ue-1.4.1-srw-dev.img has already been built at the following locations:

Table 1.7 Locations of pre-built containers

Machine               File Location
--------------------  --------------------------------------------
Derecho [1]           /glade/work/epicufsrt/contrib/containers
Gaea [1]              /lustre/f2/dev/role.epic/containers
Hera                  /scratch1/NCEPDEV/nems/role.epic/containers
Jet                   /mnt/lfs4/HFIP/hfv3gfs/role.epic/containers
NOAA Cloud            /contrib/EPIC/containers
Orion/Hercules [1]    /work/noaa/epic/role-epic/contrib/containers

Note

  • On Gaea, Singularity/Apptainer is only available on the C5 partition, and therefore container use is only supported on Gaea C5.

  • The NOAA Cloud containers are accessible only to those with EPIC resources.

Users can simply set an environment variable to point to the container:

export img=/path/to/ubuntu20.04-intel-ue-1.4.1-srw-dev.img

Users may convert the container .img file to a writable sandbox:

singularity build --sandbox ubuntu20.04-intel-srwapp $img

When making a writable sandbox on Level 1 systems, the following warnings commonly appear and can be ignored:

INFO:    Starting build...
INFO:    Verifying bootstrap image ubuntu20.04-intel-ue-1.4.1-srw-dev.img
WARNING: integrity: signature not found for object group 1
WARNING: Bootstrap image could not be verified, but build will continue.

1.2.2.1.3.2. Level 2-4 Systems

On non-Level 1 systems, users should build the container in a writable sandbox:

sudo singularity build --sandbox ubuntu20.04-intel-srwapp docker://noaaepic/ubuntu20.04-intel-srwapp:develop

Some users may prefer to issue the command without the sudo prefix. Whether sudo is required is system-dependent.

Note

Users can choose to build a release version of the container using a similar command:

sudo singularity build --sandbox ubuntu20.04-intel-srwapp docker://noaaepic/ubuntu20.04-intel-srwapp:release-public-v2.2.0

For easier reference, users can set an environment variable to point to the container:

export img=/path/to/ubuntu20.04-intel-srwapp

1.2.2.1.4. Start Up the Container

Copy stage-srw.sh from the container to the local working directory:

singularity exec -B /<local_base_dir>:/<container_dir> $img cp /opt/ufs-srweather-app/container-scripts/stage-srw.sh .

If the command worked properly, stage-srw.sh should appear in the local directory. The command above also binds the local directory to the container so that data can be shared between them. On Level 1 systems, <local_base_dir> is usually the topmost directory (e.g., /lustre, /contrib, /work, or /home). Additional directories can be bound by adding another -B /<local_base_dir>:/<container_dir> argument before the name of the container. In general, it is recommended that the local base directory and container directory have the same name. For example, if the host system’s top-level directory is /user1234, the user can create a user1234 directory in the writable container sandbox and then bind it:

mkdir /path/to/container/user1234
singularity exec -B /user1234:/user1234 $img cp /opt/ufs-srweather-app/container-scripts/stage-srw.sh .

Attention

Be sure to bind the directory that contains the experiment data!
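For example, a hypothetical invocation that binds both a home directory and a staged-data directory (the paths below are placeholders and should be replaced with locations that exist on the user's system) might look like this:

singularity exec -B /home/$USER:/home/$USER -B /path/to/input_model_data:/path/to/input_model_data $img ls /path/to/input_model_data

If the bind succeeded, the ls command will list the contents of the data directory as seen from inside the container.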

To explore the container and view available directories, users can either cd into the container and run ls (if it was built as a sandbox) or run the following commands:

singularity shell $img
cd /
ls

The list of directories printed will be similar to this:

bin      discover       lfs   lib     media  run         singularity    usr
boot     environment    lfs1  lib32   mnt    sbin        srv            var
contrib  etc            lfs2  lib64   opt    scratch     sys            work
data     glade          lfs3  libx32  proc   scratch1    tmp
dev      home           lfs4  lustre  root   scratch2    u

Users can run exit to exit the shell.

1.2.2.2. Download and Stage the Data

The SRW App requires input files to run. These include static datasets, initial and boundary condition files, and model configuration files. On Level 1 systems, the data required to run SRW App tests are already available as long as the bind argument (starting with -B) in Step 1.2.2.1.4 included the directory with the input model data. See Table 1.10 for Level 1 data locations. For Level 2-4 systems, the data must be added manually by the user. In general, users can download fix file data and experiment data (ICs/LBCs) from the SRW App Data Bucket and then untar it:

wget https://noaa-ufs-srw-pds.s3.amazonaws.com/current_srw_release_data/fix_data.tgz
wget https://noaa-ufs-srw-pds.s3.amazonaws.com/current_srw_release_data/gst_data.tgz
tar -xzf fix_data.tgz
tar -xzf gst_data.tgz

More detailed information can be found in Section 1.3.2.3. Sections 1.3.2.1 and 1.3.2.2 contain useful background information on the input and output files used in the SRW App.

1.2.2.3. Generate the Forecast Experiment

To generate the forecast experiment, users must:

  1. Activate the workflow

  2. Set experiment parameters to configure the workflow

  3. Run a script to generate the experiment workflow

The first two steps depend on the platform being used and are described here for Level 1 platforms. Users will need to adjust the instructions to match their machine configuration if their local machine is a Level 2-4 platform.

1.2.2.3.1. Activate the Workflow

Copy the container’s modulefiles to the local working directory so that the files can be accessed outside of the container:

singularity exec -B /<local_base_dir>:/<container_dir> $img cp -r /opt/ufs-srweather-app/modulefiles .

After this command runs, the local working directory should contain the modulefiles directory.

To activate the workflow, run the following commands:

module use /path/to/modulefiles
module load wflow_<platform>

where:

  • /path/to/modulefiles is replaced with the actual path to the modulefiles on the user’s local system (often $PWD/modulefiles), and

  • <platform> is a valid, lowercased machine/platform name (see the MACHINE variable in Section 1.3.1.1).

The wflow_<platform> modulefile will then output instructions to activate the workflow. The user should run the commands specified in the modulefile output. For example, if the output says:

Please do the following to activate conda:
    > conda activate srw_app

then the user should run conda activate srw_app to activate the srw_app conda environment. The exact command may vary from system to system, but the user should see (srw_app) in front of the Terminal prompt once the environment is active.
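As an illustrative sketch, assuming a Hera-like Level 1 system and that the modulefiles directory was copied into the current working directory, the full activation sequence might look like this:

module use $PWD/modulefiles
module load wflow_hera
conda activate srw_app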

1.2.2.3.2. Configure the Workflow

Run stage-srw.sh:

./stage-srw.sh -c=<compiler> -m=<mpi_implementation> -p=<platform> -i=$img

where:

  • -c indicates the compiler on the user’s local machine (e.g., intel/2022.1.2)

  • -m indicates the MPI on the user’s local machine (e.g., impi/2022.1.2)

  • -p indicates the machine/platform name for the user’s local system (e.g., hera, jet, noaacloud, macos, linux). See MACHINE in Section 1.3.1.1 for a full list of options.

  • -i indicates the container image that was built in Step 1.2.2.1.3 (ubuntu20.04-intel-srwapp or ubuntu20.04-intel-ue-1.4.1-srw-dev.img by default).

For example, on Hera, the command would be:

./stage-srw.sh -c=intel/2022.1.2 -m=impi/2022.1.2 -p=hera -i=ubuntu20.04-intel-ue-1.4.1-srw-dev.img

Attention

The user must have an Intel compiler and MPI on their system because the container uses an Intel compiler and MPI. Intel compilers are now available for free as part of the Intel oneAPI Toolkit.

After this command runs, the working directory should contain srw.sh, a ufs-srweather-app directory, and an ush directory.

From here, users can follow the steps below to configure the out-of-the-box SRW App case with an automated Rocoto workflow. For more detailed instructions on experiment configuration, users can refer to Section 1.2.4.3.2.2.

  1. Copy the out-of-the-box case from config.community.yaml to config.yaml. This file contains basic information (e.g., forecast date, grid, physics suite) required for the experiment.

    cd ufs-srweather-app/ush
    cp config.community.yaml config.yaml
    

    The default settings include a predefined 25-km CONUS grid (RRFS_CONUS_25km), the GFS v16 physics suite (FV3_GFS_v16 CCPP), and FV3-based GFS raw external model data for initialization.

  2. Edit the MACHINE and ACCOUNT variables in the user: section of config.yaml. See Section 1.3.1.1 for details on valid values.

    Note

    On Jet, users must also add PARTITION_DEFAULT: xjet and PARTITION_FCST: xjet to the platform: section of the config.yaml file.

  3. To automate the workflow, add these two lines to the workflow: section of config.yaml:

    USE_CRON_TO_RELAUNCH: TRUE
    CRON_RELAUNCH_INTVL_MNTS: 3
    

    There are instructions for running the experiment via additional methods in Section 1.2.4.4. However, this technique (automation via crontab) is the simplest option.

    Note

    On Orion, cron is only available on the orion-login-1 node, so users will need to work on that node when running cron jobs on Orion.

  4. Edit the task_get_extrn_ics: section of the config.yaml to include the correct data paths to the initial conditions files. For example, on Hera, add:

    USE_USER_STAGED_EXTRN_FILES: true
    EXTRN_MDL_SOURCE_BASEDIR_ICS: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/${yyyymmddhh}
    

    On other systems, users will need to change the path for EXTRN_MDL_SOURCE_BASEDIR_ICS and EXTRN_MDL_SOURCE_BASEDIR_LBCS (below) to reflect the location of the system’s data. The location of the machine’s global data can be viewed here for Level 1 systems. Alternatively, the user can add the path to their local data if they downloaded it as described in Section 1.3.2.3.2.

  5. Edit the task_get_extrn_lbcs: section of the config.yaml to include the correct data paths to the lateral boundary conditions files. For example, on Hera, add:

    USE_USER_STAGED_EXTRN_FILES: true
    EXTRN_MDL_SOURCE_BASEDIR_LBCS: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/${yyyymmddhh}
    

1.2.2.3.3. Generate the Workflow

Attention

This section assumes that Rocoto is installed on the user’s machine. If it is not, the user will need to allocate a compute node (described in the Appendix) and run the workflow using standalone scripts as described in Section 1.2.4.4.2.

Run the following command to generate the workflow:

./generate_FV3LAM_wflow.py

This workflow generation script creates an experiment directory and populates it with all the data needed to run through the workflow. The last line of output from this script should start with */3 * * * * (or similar).

The generated workflow will be in the experiment directory specified in the config.yaml file in Step 1.2.2.3.2. The default location is expt_dirs/test_community. To view experiment progress, users can cd to the experiment directory from ufs-srweather-app/ush and run the rocotostat command to check the experiment’s status:

cd ../../expt_dirs/test_community
rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10

Users can track the experiment’s progress by reissuing the rocotostat command above every so often until the experiment runs to completion. The following message usually means that the experiment is still getting set up:

08/04/23 17:34:32 UTC :: FV3LAM_wflow.xml :: ERROR: Can not open FV3LAM_wflow.db read-only because it does not exist

After a few (3-5) minutes, rocotostat should show a status-monitoring table:

       CYCLE             TASK      JOBID    STATE   EXIT STATUS   TRIES   DURATION
==================================================================================
201906151800        make_grid   53583094   QUEUED             -       0        0.0
201906151800        make_orog          -        -             -       -          -
201906151800   make_sfc_climo          -        -             -       -          -
201906151800    get_extrn_ics   53583095   QUEUED             -       0        0.0
201906151800   get_extrn_lbcs   53583096   QUEUED             -       0        0.0
201906151800         make_ics          -        -             -       -          -
201906151800        make_lbcs          -        -             -       -          -
201906151800         run_fcst          -        -             -       -          -
201906151800    run_post_f000          -        -             -       -          -
...
201906151800    run_post_f012          -        -             -       -          -

When all tasks show SUCCEEDED, the experiment has completed successfully.
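Rather than reissuing the command by hand, users may find it convenient to poll the workflow with a simple shell loop such as the sketch below (the 180-second interval is an arbitrary choice; interrupt the loop with Ctrl-C once all tasks report SUCCEEDED):

while true; do
    rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10
    sleep 180
done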

For users who do not have Rocoto installed, see Section 1.2.4.4.2 for guidance on how to run the workflow without Rocoto.

1.2.2.3.4. Troubleshooting

If a task goes DEAD, it will be necessary to restart it according to the instructions in Section 1.4.2.3.1. To determine what caused the task to go DEAD, users should view the task’s log file in $EXPTDIR/log/<task_log>, where <task_log> refers to the name of the task’s log file. After fixing the problem and clearing the DEAD task, it is sometimes necessary to reinitialize the crontab. To do so, run crontab -e to open the crontab in an editor, then copy-paste the crontab command from the bottom of the $EXPTDIR/log.generate_FV3LAM_wflow file into the crontab:

crontab -e
*/3 * * * * cd /path/to/expt_dirs/test_community && ./launch_FV3LAM_wflow.sh called_from_cron="TRUE"

where /path/to is replaced by the actual path to the user’s experiment directory.

1.2.2.4. New Experiment

To run a new experiment in the container at a later time, users will need to rerun the commands in Section 1.2.2.3.1 to reactivate the workflow. Then, users can configure a new experiment by updating the experiment variables in config.yaml to reflect the desired experiment configuration. Basic instructions appear in Section 1.2.2.3.2 above, and detailed instructions can be viewed in Section 1.2.4.3.2.2. After adjusting the configuration file, regenerate the experiment by running ./generate_FV3LAM_wflow.py.
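As a condensed sketch of that sequence (all paths and the platform name are placeholders), a new experiment might be set up as follows:

module use /path/to/modulefiles
module load wflow_<platform>
conda activate srw_app
cd /path/to/ufs-srweather-app/ush
# Edit config.yaml to reflect the new experiment configuration, then regenerate the workflow:
./generate_FV3LAM_wflow.py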

1.2.2.5. Appendix

1.2.2.5.1. Sample Commands for Working in the Cloud or on HPC Systems

Users working on systems with limited disk space in their /home directory may set the SINGULARITY_CACHEDIR and SINGULARITY_TMPDIR environment variables to point to a location with adequate disk space. On NOAA Cloud systems, the sudo su/exit commands may also be required; users on other systems may be able to omit these. For example:

mkdir /lustre/cache
mkdir /lustre/tmp
sudo su
export SINGULARITY_CACHEDIR=/lustre/cache
export SINGULARITY_TMPDIR=/lustre/tmp
exit

Note

/lustre is a fast but non-persistent file system used on NOAA Cloud systems. To retain work completed in this directory, tar the files and move them to the /contrib directory, which is much slower but persistent.
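For instance, a hypothetical archiving command (the destination path under /contrib and the experiment directory location are placeholders) might look like this:

tar -czf /contrib/<username>/test_community.tgz -C /path/to/expt_dirs test_community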

1.2.2.5.2. Allocate a Compute Node

Users working on HPC systems that do not have Rocoto installed must install Rocoto or allocate a compute node. All other users may continue to start up the container.

Note

All NOAA Level 1 systems have Rocoto pre-installed.

The appropriate commands for allocating a compute node will vary based on the user’s system and resource manager (e.g., Slurm, PBS). If the user’s system has the Slurm resource manager, the allocation command will follow this pattern:

salloc -N 1 -n <cores-per-node> -A <account> -t <time> -q <queue/qos> --partition=<system> [-M <cluster>]

For more information on the salloc command options, see Slurm’s documentation.
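As a purely illustrative example, with placeholder values for the account, QOS, and partition, a Slurm allocation might look like this:

salloc -N 1 -n 24 -A <account> -t 02:00:00 -q <qos> --partition=<partition>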

If users have the PBS resource manager installed on their system, the allocation command will follow this pattern:

qsub -I -lwalltime=<time> -A <account> -q <destination> -lselect=1:ncpus=36:mpiprocs=36

For more information on the qsub command options, see the PBS Manual §2.59.3 (p. 1416).
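For illustration only, with placeholder values for the walltime, account, and queue (and ncpus/mpiprocs values that should match the target node), a PBS request might look like this:

qsub -I -lwalltime=02:00:00 -A <account> -q <queue> -lselect=1:ncpus=36:mpiprocs=36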

These commands should output a hostname. Users can then run ssh <hostname>. After “ssh-ing” to the compute node, they can run the container from that node. To run larger experiments, it may be necessary to allocate multiple compute nodes.