Running NONMEM in Parallel: bbr Tips and Tricks

Introduction

This file shows some helpful patterns for using bbr with NONMEM, mostly relating to running models with parallel execution, either multiple cores on a local machine (head node, laptop, etc.) or by executing on a grid.

library(bbr)

MODEL_DIR <- system.file("model", "nonmem" , "pk-parallel", package = "bbr")

Global settings in bbi.yaml

The bbi.yaml file is created when bbi_init() is originally called. It usually lives in the same folder as your control streams. This file contains “global” settings for running bbi.

You can set a default number of threads (the number of cores bbi will use for each model) in this file. This will be used whenever you call submit_model(), but can be overriden by passing submit_model(..., .bbi_args = list(threads = SOME_INTEGER)).
It defaults to parallel: false so you need to change to parallel: true or you will need to pass .bbi_args = list(parallel=TRUE) to each of your submit_model() calls to make them respect the threads argument.
You can also pass through .bbi_args to bbi_init() to set the global defaults when you first create the bbi.yaml

bbi_init(
  .dir = MODEL_DIR,
  .nonmem_dir = "/opt/NONMEM",
  .nonmem_version = "nm75",
  .bbi_args = list(
    parallel = TRUE,
    threads = 4
  )
)

Local execution

For small runs, use a 4-core or 8-core head node and just do submit_model(..., .mode = "local").

This will execute the model on your head node, means you will not have to wait for worker nodes to spin up.
This is especially nice for new models that are fairly simple and you just want a quick result back.

mod <- read_model(file.path(MODEL_DIR, 100))
submit_model(
  mod,
  .mode = "local",
  .bbi_args = list(parallel = TRUE, threads = 4) # not needed if set in bbi.yaml
)

Running:
  bbi nonmem run local /data/home/sethg/bbr/inst/model/nonmem/pk-parallel/100.ctl --overwrite --threads=4 --parallel
In /data/home/sethg/bbr/inst/model/nonmem/pk-parallel
Process finished.

You can also set options("bbr.bbi_exe_mode" = "local") to default to local execution for all models, though this will reset with each new R session.

.wait = FALSE

You can pass use submit_model(..., .wait = FALSE) to get you console back immediately.

This is only relevant for .mode = "local" because grid jobs give you your console as soon as the job is submitted to the grid (which typically takes less than a second).

mod <- read_model(file.path(MODEL_DIR, 200))
proc <- submit_model(
  mod,
  .mode = "local",
  .wait = FALSE,        # this is what gets you your console back
  .bbi_args = list(parallel = TRUE, threads = 8) # not needed if set in bbi.yaml
)

There are a few ways to check on your model while it runs:

Print model object. The first entry with either say “Not Run”, “Incomplete Run”, or “Finished Running”.

mod


── Status ──────────────────────────────────────────────────────────────────────
• Incomplete Run

── Absolute Model Path ─────────────────────────────────────────────────────────
• /data/home/sethg/bbr/inst/model/nonmem/pk-parallel/200

── YAML & Model Files ──────────────────────────────────────────────────────────
• /data/home/sethg/bbr/inst/model/nonmem/pk-parallel/200.yaml
• /data/home/sethg/bbr/inst/model/nonmem/pk-parallel/200.ctl

── Description ─────────────────────────────────────────────────────────────────
• original acop model

── Tags ────────────────────────────────────────────────────────────────────────
• acop tag
• other tag

── BBI Args ────────────────────────────────────────────────────────────────────
• overwrite: TRUE
• threads: 4

You can use bbr::tail_*() helpers to check on the .lst or OUTPUT files.

tail_lst(mod)

Mon Sep 27 12:37:49 EDT 2021
$PROBLEM From bbr: see 107.yaml for details

...

 WARNINGS AND ERRORS (IF ANY) FOR PROBLEM    1

 (WARNING  2) NM-TRAN INFERS THAT THE DATA ARE POPULATION.

tail_output(mod, .head = 0, .tail = 10)

...
             1.1744E-01 -2.8718E+02 -1.4260E+02 -1.2769E+02  3.1651E+02

0ITERATION NO.:    1    OBJECTIVE VALUE:   31860.4697608811        NO. OF FUNC. EVALS.:  15
 CUMULATIVE NO. OF FUNC. EVALS.:       26
 NPARAMETR:  4.9843E-01  3.6746E+00  1.0057E+00  4.4079E+00  1.9381E+00  9.9679E-01  9.9702E-01  4.9976E-01  2.0001E-01  9.9998E-03
             1.0000E-02  2.0002E-01  1.0003E-02  2.0001E-01  4.9994E-02
 PARAMETER:  9.9686E-02  1.0499E-01  1.0057E-01  1.1020E-01  9.6903E-02  9.9679E-02  9.9702E-02  9.9952E-02  1.0001E-01  9.9996E-02
             1.0000E-01  1.0005E-01  1.0003E-01  1.0002E-01  9.9941E-02
 GRADIENT:   2.9997E+03 -1.2157E+04 -3.9326E+03  5.4698E+04  2.2187E+04  1.5492E+03  1.5490E+03  2.5233E+02 -2.5887E+02  2.7454E+00
             3.2905E+01  3.4014E+01 -9.9727E+01 -1.5446E+02  2.2149E+02

You can get the path to these files and use tail -f in the terminal to “follow” the files as they change.

lst_path <- build_path_from_model(mod, ".lst")
cat(paste("tail -f", lst_path))

tail -f /data/home/sethg/bbr/inst/model/nonmem/pk-parallel/200/200.lst

You can call methods of the processx object including in the bbi_process object.
- To do this, you need to remember to assign the output of submit_model() to a variable (proc, above)
- See the processx docs for details on available methods.

proc$process$is_alive()

TRUE

Warm up the grid

You can submit a dummy job to the grid to get it to bring up a worker, while you finish preparing your run. For SGE, this is done with qsub.

system("echo 'sleep 10' | qsub")

Unable to run job: warning: sethg's job is not allowed to run in any queue
Your job 1 ("STDIN") has been submitted
Exiting.

This message is just telling you that there are no worker nodes currently up, but one will begin to spin up now.

Try your model locally first

If you have a long-running model, it is often nice to launch it locally first, just to see if it starts.

This avoids waiting for the grid to come up before seeing your model die with something like DATASET ERROR or some silly syntax error.

mod <- read_model(file.path(MODEL_DIR, 300))
proc <- submit_model(
  mod,
  .mode = "local",
  .wait = FALSE        # this is what gets you your console back
)

Watch it until it gets to monitoring the search and then kill it and relaunch it on grid.

tail_output(mod, .tail = 15)

License Registered to: Metrum Research Group
Expiration Date:    14 JUL 2022
Current Date:       27 SEP 2021
...
 IWRS=IWRESI
 IPRD=IPREDI
 IRS=IRESI

 MONITORING OF SEARCH:


0ITERATION NO.:    0    OBJECTIVE VALUE:   32098.8565339901        NO. OF FUNC. EVALS.:  11
 CUMULATIVE NO. OF FUNC. EVALS.:       11
 NPARAMETR:  5.0000E-01  3.5000E+00  1.0000E+00  4.0000E+00  2.0000E+00  1.0000E+00  1.0000E+00  5.0000E-01  2.0000E-01  1.0000E-02
             1.0000E-02  2.0000E-01  1.0000E-02  2.0000E-01  5.0000E-02
 PARAMETER:  1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01
             1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01  1.0000E-01
 GRADIENT:   1.6936E+03 -2.6923E+04 -3.0654E+03 -5.5039E+04  1.6715E+04  1.7314E+03  1.6062E+03  2.6154E+02 -7.7264E+01  2.0742E+01
             1.1744E-01 -2.8718E+02 -1.4260E+02 -1.2769E+02  3.1651E+02

There are several ways to kill the process:

Use killall nonmem in the terminal (or with system) to kill all NONMEM jobs running locally.

system("killall nonmem")

You can kill with the kill() method of the processx object (see above about how to get to this object).

proc$process$kill()

Then relaunch the model on the grid, using .bbi_args = list(overwrite = TRUE) to overwrite the incomplete results.

submit_model(
  mod,
  .bbi_args = list(
    overwrite = TRUE,
    parallel = TRUE,
    threads = 8
  )
)

Running:
  bbi nonmem run sge /data/home/sethg/bbr/inst/model/nonmem/pk-parallel/300.ctl --overwrite --threads=8 --parallel
In /data/home/sethg/bbr/inst/model/nonmem/pk-parallel
Process finished.

Note: .mode = "sge" (submit to the grid) is the default for submit_model() so there is no need to pass it as an argument.

Monitoring

You can monitor your grid jobs with tail_lst() or tail_output() as above, or use qstat -f in the terminal, or with the Grafana dashboard (if you are on Metworx).

system("qstat -f")

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
      7 0.55500 Run_300    sethg        qw    09/27/2021 12:53:10     8

Submit a batch of models to the grid

You can use submit_models() to submit a batch of models to the grid.

Each model will be a separate grid job
The grid will “auto-scale”; bringing up new workers to handle all of the jobs
This is also a very easy way to use nested parallelism:
- Each job (each model) will be executed in parallel on a worker node.
- If you pass .bbi_args = list(parallel = TRUE, threads = SOME_INTEGER) then each model will use SOME_INTEGER cores on its worker node.

mods <- map(c(200, 300, 400), ~ read_model(file.path(MODEL_DIR, .x)))

submit_models(
  mods,
  .bbi_args = list(
    overwrite = TRUE,
    parallel = TRUE,
    threads = 8
  )
)

You can see all three jobs queuing up

system("qstat -f")

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
      8 0.55500 Run_200    sethg        qw    09/27/2021 12:56:18     8        
      9 0.55500 Run_300    sethg        qw    09/27/2021 12:56:18     8        
     10 0.55500 Run_400    sethg        qw    09/27/2021 12:56:18     8

You can kill jobs running on the grid using qdel in the terminal (or with system() from R).
- qdel SOME_NUMBER kills the job specified by it’s “job id” (the first column in the qstat -f output)
- qdel -u YOUR_USERNAME kills all jobs launched by you

this_user <- Sys.getenv("USER")
system(paste("qdel -u", this_user))

Testing different numbers of threads

For a big model, try running a few iterations with different numbers of threads. This way you can see the gains from adding more threads and decide how many to use as you continue to work on this model.

First create a dummy copy of your model, and cap the iterations to 10 or 20 by changing MAX_EVAL or NITER in your control stream, so that it finishes fairly quickly
Use the test_threads() function below to:
- Copy the model a few times with copy_model_from()
- Submit the model to run with a few different options for threads

#' Takes a model object and runs it with various threads values
#'
#' @param .mod bbi_model object to copy/test
#' @param thread_opts Integer vector of threads values to test
#' @param .mode Passed through to bbr::submit_models(.mode)
#'
#' @return A list of the model objects for the submitted models.
test_threads <- function(.mod, threads_opts, .mode = "sge") {
  mods <- map(threads_opts, ~ {
    .m <- copy_model_from(
      .mod,
      paste0(get_model_id(.mod), "_", .x, "threads")
    ) %>%
      add_bbi_args(list(parallel = TRUE, threads = .x))
  })

  submit_models(mods, .mode = .mode, .wait = FALSE)

  mods
}

# running it
mod <- read_model(file.path(MODEL_DIR, 200)) # this should be the copy with capped iterations
mods <- test_threads(mod, c(8, 16, 24))

Submitting 3 models with 3 unique configurations.

After the models all finish, use this check_threads() function to see how long each took to run.

#' Check estimation time for models run with various threads values
#'
#' @param mods list of bbi model objects created by `test_threads()`
#'
#' @return A tibble with columns `threads` (number of threads) and `time`
#'   (elapsed estimation time in seconds for test models).
check_threads <- function(mods) {
  purrr::map_dfr(mods, ~ {
    s <- model_summary(.x)
    threads <- as.numeric(stringr::str_extract(s$absolute_model_path, "\\d+(?=threads$)"))
    tibble::tibble(threads = threads, time = s$run_details$estimation_time)
  })
}

check_threads(mods)

# A tibble: 3 x 2
  threads    time
    <int>   <dbl>
1       8   149.5
2      16    88.2
3      24    65.9

Seth Green