🌱 [archived] Setting a seed in R, when using parallel simulation

Generally speaking, if the code does any simulations, it is a good practice to set a seed to make the code reproducible. Setting a seed ensures that the same (pseudo-)random numbers will be generated each time the script is executed. Surprisingly, I found really few posts dedicated to any convention, best practice, or routine of setting a seed in R. Further, when using multiple cores (parallelisation) for simulations, things can get slightly more complicated.

Disclaimer: This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.

Seeds in R

In base R there are two main objects to handle seeds: set.seed() and .Random.seed. For a vast number of problems it is enough to use set.seed(), which supplies an integer as a seed. The workflow, then, is as simple as follows:

runif(2)
# > [1] 0.8636221 0.1782020

set.seed(1991)
runif(2)
# > [1] 0.1506231 0.2308308

runif(2)
# > [1] 0.0134826 0.5340390

set.seed(1991)
runif(2) # exactly the same random numbers as before
# > [1] 0.1506231 0.2308308

The second object, .Random.seed, allows saving and restoring the random number generator (RNG) state. Under the hood .Random.seed is a simple atomic integer vector, the first element of which specifies the kind of RNG and normal generator. For instance, the first element of 207 is referred to "L'Ecuyer-CMRG" RNG method, and "Box-Muller" for normal distribution. The rest of the elements of .Random.seed store the current random seed.

This object I find of a particular use because it can be saved without an explicit seed setting. What I mean is one does not need to provide an integer to set.seed(), which might be annoying, but rather just saving current seed:

seed <- .Random.seed
runif(2)
# > [1] 0.5696378 0.3737989

runif(2)
# > [1] 0.7199003 0.5540470

runif(2)
# > [1] 0.4383970 0.6494643

.Random.seed <- seed
runif(2) # exactly the same random numbers as before
# > [1] 0.5696378 0.3737989

The object .Random.seed lives in the global environment, and therefore, should be set there. It can cause some issues if you trying to set the .Random.seed inside the function, not caring too much about environments. It means that changing it in the function by simple assignment won’t change a seed (the value will be set in execution environment). The following straightforward idea can be used for saving a current seed or setting a custom one inside the function:

reproducible_runif <- function(seed = NULL) {

    if(is.null(seed)) {
        seed <- .Random.seed
    } else {
        # .Random.seed <<- seed
        # mind the double arrow to assign in the parent enviroment or
        assign(x = ".Random.seed", value = seed, envir = .GlobalEnv)
    }

    return(list(x = runif(1), seed = seed))

}

Then, this function will return a random number, that can be reproduced:

r1 <- reproducible_runif()
r1$x
# > [1] 0.4215304

runif(10)
# >  [1] 0.1862207 0.2660995 0.5863689 0.1063663 0.5530690 0.9392229 0.9710050
#    [8] 0.1265786 0.1526233 0.1713895


r2 <- reproducible_runif(seed = r1$seed) # use the seed from the initial call
r2$x # exactly the same as for r1
# > [1] 0.4215304

References:

Seeds for parallel

The story is slightly different when using multiple cores (parallel execution). In this post I use a base package parallel and macOS, but the concept is pretty much the same for other packages and non-unix systems. The idea here to run independent simulations on each core.

Before stepping into details, let’s consider an illustrative example. We run a classical function parallel::mclapply() that returns a random uniform number for each iteration. This function supplies a vector of ten elements as X argument, a simple wrapper around runif(1) to ignore elements of X, the number of cores (in my case 2 physical cores), and also we set mc.set.seed = FALSE. We run this expression two times (unlist is used for a more compact representation):

library(parallel)

rn1 <- unlist(
    mclapply(X = 1:10,
             FUN = function(x) runif(1),
             mc.cores = 2,
             mc.set.seed = FALSE)
)

rn1
# > [1] 0.3495050 0.3495050 0.4159384 0.4159384 0.5376814 0.5376814 0.3279605
#   [8] 0.3279605 0.1527834 0.1527834

identical(rn1[seq(1, 10, by = 2)], rn1[seq(1, 10, by = 2)])
# > [1] TRUE

One can immediately notice a suspicious thing – every second element equals to the previous one. The explanation is very simple: the same workspace is restored from the master process for each worker (or process). It means that .Random.seed will be extracted from the parent process, and therefore, RNG state will be the same for each worker. As result, the same sequence of random numbers will be generated by each of workers.

Of course this issue is not desirable. The alternative method is to have separate (distinct) seeds for each worker. The potential problem would be that the generated numbers might get into steps (i.e. been periodically repeated, therefore, correlated between streams). To resolve this parallel package utilizes "L'Ecuyer-CMRG" RNG, which has a quite long period with a small seed, ensuring streams do not get into steps easily. To set the RND to "L'Ecuyer-CMRG" one runs RNGkind("L'Ecuyer-CMRG"), also changing argument mc.set.seed of mclapply to TRUE:

RNGkind("L'Ecuyer-CMRG")

rng1 <- unlist(
    mclapply(X = 1:10,
             FUN = function(x) runif(1),
             mc.cores = 2,
             mc.set.seed = TRUE)
)

rng1

# > [1] 0.67681994 0.54730337 0.05398847 0.19480448 0.94954659 0.35727778
#   [7] 0.17057359 0.83029494 0.37063552 0.24445617

Elements now are different, and "L'Ecuyer-CMRG" uses nextRNGStream() to generate a next “uncorrelated” seed. The pseudo code (taken from vignette("parallel")) explains this concept:

# > RNGkind("L'Ecuyer-CMRG")
# > set.seed(2002) # something
# > M <- 16 ## start M workers
# > s <- .Random.seed
# > for (i in 1:M) {
# +     s <- nextRNGStream(s)
# +     # send s to worker i as .Random.seed
# + }

Let’s run the same expression one more time:

rng2 <- unlist(
    mclapply(X = 1:10,
             FUN = function(x) runif(1),
             mc.cores = 2,
             mc.set.seed = TRUE)
)

rng2

# > [1] 0.67681994 0.54730337 0.05398847 0.19480448 0.94954659 0.35727778
#   [7] 0.17057359 0.83029494 0.37063552 0.24445617

identical(rng1, rng2)
# > [1] TRUE

The second rng2 is absolutely identical to rng1 that was run before. Coincidence? Nope. The thing is the .Random.seed of master process is NOT affected by worker processes (see pseudo-code). That is why we will have the same numbers during a second, third, and any other run, unless the .Random.seed will be changed (e.g. by runif(1) in a master process).

Note that even if mc.set.seed is TRUE, but RNG is different from "L'Ecuyer-CMRG", then using set.seed() won’t establish reproducibility.

In the end, the package parallel is a little bit vague when it comes to RNG, so that I have to read vignette("parallel") (Section 6), dozens of cross-refereed helps (?mclapply RNG section refers to ?mcparallel, which requires to read ?nextRNGStream), and finally deep dive into sours non-exported function via parallel:::mc.set.seed.