Iegor Rudnytskyi, PhD

📦 [archived] Managing dependencies in packages

2019-12-03T00:00:00+00:00

Managing the usual dependencies of a package is clearly covered in R packages by Hadley Wickham. Typically, that would be the end of a tutorial or a post. However, while recently teaching how to develop a package, I encountered a couple of super interesting and non-trivial questions that do not have a conventional solution. I guess this post is a perfect place to share my thoughts on that matter, as well as a nice excuse to restart blogging.

Non-CRAN packages

When developing a package, the standard place to list dependencies (i.e., external packages that your package needs) is Imports: in DESCRIPTION. Full stop here. These packages must be installed for your package to work, and they will be installed automatically when installing your package via install.packages() (see the default behavior of the dependencies argument). However, packages in the Imports: field are supposed to be published on CRAN. That could be an issue if your package uses functionality from packages that are not (yet) published on CRAN. This is the exact question I was asked by one of my students: where do I specify non-CRAN dependencies?

I was sure that there exists a common workflow to do it. After a minute of extensive research, I found out that the CRAN policy explains it quite vaguely. Further, there were three Stack Overflow questions about it (see below in References). The answer that I found was quite satisfactory: Dirk Eddelbuettel proposes to list the package in Suggests: and specify the additional repository in the special free-form field Additional_repositories:. He also suggests using the drat package to create a CRAN-like R package repository, which in my view is a bit of an overkill. So my solution would be to list the name of the package in Suggests: and mention the link to its GitHub repo (almost surely the source is stored on GitHub) in Additional_repositories:.

Update: As it was kindly pointed out by Sébastien Rochette, devtools supports a Remotes: field exactly for that purpose. Simply specify the repos in the format username/reponame separated by commas (one can also add the type of the source if it is not GitHub, e.g., gitlab::username/reponame). And that is it.

That would be a nice end to the story, but how would you let the end user know that this package needs to be pre-installed? The workaround I found is to raise a message from the function where this dependency is used and ask the user to install it, for example:

my_function <- function() {
    if (!("nonCRANpkg" %in% rownames(installed.packages()))) {
        message("Please install package nonCRANpkg.")
    }
}
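
A possible variation of this check, as a minimal sketch: requireNamespace() tests for a single package without listing the whole library, and stop() makes the missing dependency impossible to overlook. The helper name check_dependency() and the repo username/nonCRANpkg are hypothetical, used only for illustration.

```r
# Hypothetical helper: fail with an actionable error if a package is missing.
check_dependency <- function(pkg, repo) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        stop(
            "Package `", pkg, "` is required. Install it with ",
            "`devtools::install_github('", repo, "')`.",
            call. = FALSE
        )
    }
    invisible(TRUE)
}

my_function <- function() {
    # Check first, then use the dependency via nonCRANpkg::some_function().
    check_dependency("nonCRANpkg", "username/nonCRANpkg")
}
```

Whether a message or an error is more appropriate depends on whether the function can do anything useful without the dependency.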

The problem is that the user has to come back to the installation process at the point when they use my_function(). In addition, the message probably affects the expected output of the function, and it is even worse if the function is an internal one that is not exported into the namespace. That is why, in my personal view, the installation of all dependencies should be tackled way before the first call of my_function(). And here the function .onAttach() comes in handy. This function allows displaying messages when the package is attached. We simply need to inform the user that they need to install the dependency before using our package (mind the difference between message() and packageStartupMessage()):

.onAttach <- function(libname, pkgname) {

    if (!("nonCRANpkg" %in% rownames(installed.packages()))) {
        packageStartupMessage(
            paste0(
                "Please install `nonCRANpkg` by",
                " `devtools::install_github('username/nonCRANpkg')`"
            )
        )
    }

}

To summarize in a nutshell: mention the package name in the field Suggests: of DESCRIPTION, link to its repo in Remotes: (in the same file), and write a simple .onAttach() function (it should be stored in the file zzz.R).
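
To make the DESCRIPTION part of this summary concrete, here is a small sketch that writes a hypothetical DESCRIPTION to a temporary file and reads the two relevant fields back with read.dcf(). The package name mypkg, its title, and the repo username/nonCRANpkg are placeholders, not taken from a real package.

```r
# Hypothetical DESCRIPTION of a package that depends on a GitHub-only package.
description <- c(
    "Package: mypkg",
    "Title: A Placeholder Title",
    "Version: 0.0.1",
    "Imports: stats",
    "Suggests: nonCRANpkg",
    "Remotes: username/nonCRANpkg"
)

path <- file.path(tempdir(), "DESCRIPTION")
writeLines(description, path)

# DESCRIPTION is a DCF file, so base R can read selected fields back.
fields <- read.dcf(path, fields = c("Suggests", "Remotes"))
fields[1, "Suggests"]
fields[1, "Remotes"]
```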

Shiny demo app

It is always a cool idea to complement the package with a Shiny app so that a user has an interactive interface to play around with the functionality of the package. We typically store the scripts of those demo apps in inst/shiny-examples/name_of_app and add a function runDemo() to run them (see a wonderful post by Dean Attali in the references for details). Those apps are very likely to have their own dependencies, and they definitely require the shiny namespace to be loaded. That is why we see all these library() calls at the beginning of Shiny apps’ scripts.

Obviously, (1) we want to ensure that the user has all required packages installed, and (2) avoid using library() in package’s scripts. The solution is very simple – specify all Shiny app dependencies in Imports: and use the usual :: to access functions from respective namespaces.
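
A minimal sketch of what such a runDemo() could look like, assuming the app lives in inst/shiny-examples/name_of_app and the package is called dummypkg (both defaults are only illustrative). The contents of inst/ are copied to the top level of the installed package, which is why system.file() is called without the inst part.

```r
runDemo <- function(app = "name_of_app", package = "dummypkg") {
    # system.file() returns "" if the package or the folder cannot be found.
    app_dir <- system.file("shiny-examples", app, package = package)
    if (app_dir == "") {
        stop(
            "Could not find the demo app `", app, "` in package `", package, "`.",
            call. = FALSE
        )
    }
    # shiny is listed in Imports:, so it is accessed via :: and not library().
    shiny::runApp(app_dir, display.mode = "normal")
}
```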

To sum up all the previous take-home points, I created a dummypkg for illustration, which is stored in the GitHub repo irudnyts/dummypkg. It contains a bare-bones example of non-CRAN dependencies, as well as a tiny Shiny app with dependencies. Managing those dependencies is super important since we do not want our packages to look like jack-in-the-boxes.

Many thanks go to Ana Lucy Bejarano Montalvo, who inspired me by asking those questions, and to Sébastien Rochette for pointing out the Remotes: field.

References

🖊 [archived] R Coding Style Guide

2019-01-14T00:00:00+00:00

Language is a tool that allows human beings to interact and communicate with each other. The clearer we express ourselves, the better the idea is transferred from our mind to another. The same applies to programming languages: concise, clear, and consistent code is easier to read and edit. It is especially important if you have collaborators who depend on your code. However, even if you don’t, keep in mind that at some point you might come back to your code, for example, to fix an error. And if you did not follow your coding style consistently, reviewing your code can take much longer than expected. In this context, taking care of your audience means making your code as readable as possible.

There is no such thing as a “correct” coding style, as there is no such thing as the best color. At the end of the day, coding style is a set of developers’ preferences. If you are coding alone, sticking to your coding style and being consistent is more than enough. The story is a bit different if you are working in a team: it is crucial to agree on a convention beforehand and make sure that everyone follows it.

Even though there is no official style guide, R is mature and steady enough to have an “unofficial” convention. In this post, you will learn these “unofficial” rules, their deviations, and most common styles.

Naming

Naming files

The convention actually depends on whether you develop a file for a package, or as a part of data analysis process. There are, however, common rules:

File names should use .R extension.
```
  # Good
  read.R

  # Bad
  read
```

File names should be meaningful.

  # Good
  model.R

  # Bad
  Untitled1.R

File names should not contain / and spaces. Instead, a dash (-) or underscore (_) should be used.
```
  # Good
  fit_regression.R
  fit-regression.R

  # Bad
  fit regression.R
```
File names should use letters from Basic Latin, and NOT from Latin-1 Supplement.
```
  # Good
  tidy.R

  # Bad
  rangé.R
```

If the file is a part of data analysis, then it makes sense to follow these recommendations:

There should be no files in the same folder that differ only by letter case, and file names should be lowercase. There is nothing bad in having capitalized names, just bear in mind the case sensitivity and case preservation of your system. Case sensitivity means that test.R and Test.R can coexist in the same folder. For instance, the macOS file system (APFS) is not case sensitive by default.
```
  # Good
  analyse.R

  # Bad
  Analyse.R
```

Use meaningful verbs for file names.

  # Good
  validate-vbm.R

  # Bad
  regression.R

If files should be run in a particular order, then use ascending names.
```
  01-read.R
  02-clean.R
  03-plot.R
```
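
The mechanical part of these rules (the .R extension, no spaces, Basic Latin only, lowercase) can be checked automatically. Below is a rough sketch; the function name is_good_file_name() is hypothetical, and whether a name is meaningful still has to be judged by a human.

```r
# Accept only lowercase Basic Latin letters, digits, "_" and "-",
# followed by the .R extension.
is_good_file_name <- function(file_name) {
    grepl("^[a-z0-9_-]+\\.R$", file_name, perl = TRUE)
}

is_good_file_name(c("fit-regression.R", "fit regression.R", "Analyse.R", "read"))
# [1]  TRUE FALSE FALSE FALSE
```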

If the file is used in a package, then slightly different rules should be followed:

Mind special names:
- AllClasses.R (or AllClass.R), a file that stores all S4 classes definitions.
- AllGenerics.R (or AllGeneric.R), a file that stores all S4 generic functions.
- zzz.R, a file that contains .onLoad() and friends.
If the file contains only one function, name it by the function name.
Use methods- prefix for S4 class methods.

Naming variables

Generally, names should be nouns that are as short as possible while still being meaningful.

  # Good
  fit_rt
  split_1
  imdb_page

  # Bad
  fit_regression_tree
  cross_validation_split_one
  foo

Variable names should be typically lowercase.
```
  # Good
  event

  # Bad
  Event
```
NEVER separate words within the name by . (reserved for an S3 dispatch) or use CamelCase (reserved for S4 classes definitions). Instead, use an underscore (_).
```
  # Good
  event_window

  # Bad
  event.window
  EventWindow
```

DO NOT use names of existing functions and variables (especially built-in ones).

  # Bad
  T <- 10 # T is a shortcut of TRUE in R
  c <- "constant"
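
A tiny illustration of why this matters, written as a sketch: once T is reassigned, every piece of code that relies on T as a shortcut for TRUE silently changes its meaning.

```r
T <- 0                           # masks the built-in shortcut for TRUE
masked <- isTRUE(as.logical(T))  # FALSE: T no longer means TRUE

rm(T)                            # remove the mask from the global environment
restored <- isTRUE(T)            # TRUE: the built-in T is visible again
```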

Naming functions

Many points of naming variables are similar for naming functions:

Generally, function names should be verbs.
```
  # Good
  add()

  # Bad
  addition()
```

Use . ONLY for dispatching S3 generic.

  # Good
  bw_test()

  # Bad
  bw.test()
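
To see why the dot is better left to S3, here is a short sketch of S3 dispatch: R looks for a function called generic.class, so the dot separates the generic from the class. The generic describe() and the class event_window are made up for this example.

```r
# A generic and two methods; the part after the dot is the class name.
describe <- function(x, ...) UseMethod("describe")

describe.default <- function(x, ...) {
    "a generic object"
}

describe.event_window <- function(x, ...) {
    paste("event window of", x$length, "days")
}

ew <- structure(list(length = 5), class = "event_window")
describe(ew)
# [1] "event window of 5 days"
```

With a name like bw.test(), a reader cannot tell whether it is a plain function or a bw method for objects of class test.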

Add the underscore (_) suffix to a standard evaluation (SE) equivalent of a function (summarize vs summarize_).

Naming S4 classes

Class names should be nouns in CamelCase with initial capital case letter.

Syntax

Line length

The maximum length of lines is limited to 80 characters (thanks to IBM Punch Card).

It is possible to display the margin in RStudio Source editor:

Go to Tools -> Global Options… -> Code -> Display
Click on “Show margin”
Set “Margin column” to 80

Spacing

Put spaces around all infix binary operators (=, +, *, ==, &&, <-, %*%, etc.).

  # Good
  x == y
  a <- a ^ 2 + 1

  # Bad
  x==y
  a<-a^2+1

Put spaces around “=” in function calls (except for Bioconductor).

  # Good
  mean(x = c(1, NA, 2), na.rm = TRUE)

  # Bad
  mean(x=c(1, NA, 2), na.rm=TRUE)

Do NOT put spaces around subsetting operators ($ and @), namespace operators (:: and :::), or the sequence operator (:).

  # Good
  car$cyl
  dplyr::select
  1:10

  # Bad
  car $cyl
  dplyr:: select
  1: 10

Put a space after a comma.

  # Good
  mtcars[, "cyl"]
  mtcars[1, ]
  mean(x = c(1, NA, 2), na.rm = TRUE)

  # Bad
  mtcars[,"cyl"]
  mtcars[1 ,]
  mean(x = c(1, NA, 2),na.rm = TRUE)

Use a space before left parentheses, except in a function call.

  # Good
  for (element in element_list)
  if (grade == 5.5)
  sum(1:10)

  # Bad
  for(element in element_list)
  if(grade == 5.5)
  sum (1:10)

No spacing around code in parentheses or square brackets.

  # Good
  if (debug) message("debug mode")
  species["tiger", ]

  # Bad
  if ( debug ) message("debug mode")
  species[ "tiger" ,]

Curly braces

An opening curly brace should NEVER go on its own line and should always be followed by a new line.

  # Good
  if (is_used) {
      # do something
  }

  if (is_used) {
      # do something
  } else {
      # do something else
  }

  # Bad
  if (is_used)
  {
      # do something
  }

  if (is_used) { # do something }
  else { # do something else }

A closing curly brace should always go on its own line, unless it’s followed by else.

  # Good
  if (is_used) {
      # do something
  } else {
      # do something else
  }

  # Bad
  if (is_used) {
      # do something
  }
  else {
      # do something else
  }

Always indent the code inside curly braces (see next section).

  # Good
  if (is_used) {
      # do something
      # and then something else
  }

  # Bad
  if (is_used) {
  # do something
  # and then something else
  }

Curly braces and new lines can be avoided, if a statement after if is very short.
```
  # Good
  if (is_used) return(rval)
```

Indentation

ALWAYS indent your code!

No tabs or mixes of tabs and spaces.
There are two common indentation widths: two spaces (Hadley and others) and four spaces (Bioconductor). My own rule of thumb: I use four-space indentation for data analysis scripts, and two spaces while developing packages.
Choose the number of spaces for indentation upfront and stick to it. Never mix different numbers of spaces in one project.
To set the number of spaces in the project, go to Tools -> Global options… -> Code -> Editing. Check the following boxes: “Insert spaces for tab” (with “Tab width” equal to chosen number), “Auto-indent code after paste”, and “Vertically align arguments in auto-indent”.

Magic shortcut: Command+I (Ctrl+I for Windows/Linux) will indent a selected chunk of code. Together with Command+A (select all) it is a very powerful tool, which saves time.

Try a little exercise: paste the following code in your RStudio source editor, select it, and hit Command+I:

for(i in 1:10) {
if(i %% 2 == 0)
print(paste(i, "is even"))
}

New line

Very often a function definition does not fit into one line. In this case, the excess arguments should be moved to a new line and aligned with the opening parenthesis.

  long_function_name <- function(arg1, arg2, arg3, arg4,
                                 long_argument_name1 = TRUE)

If arguments expand more than into two lines, then each argument should be placed on a separate line.

  long_function_name <- function(long_argument_name1 = c("value1", "value2"),
                                 long_argument_name2 = TRUE,
                                 long_argument_name3 = NULL,
                                 long_argument_name4 = FALSE)

The same applies to a function call: if two lines are sufficient, the excess arguments should be aligned with the opening parenthesis.
```
  plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
       main = "rpois(100, lambda = 5)")
```

Otherwise, each argument can go into a separate line, starting with a new line after the opening parenthesis.

  list(
      mean = mean(x),
      sd = sd(x),
      var = var(x),
      min = min(x),
      max = max(x),
      median = median(x)
  )

If the condition in an if statement expands into several lines, then each line should end with a logical operator, NOT start with it.

  # Good
  if (some_very_long_name_1 == 1 &&
      some_very_long_name_2 == 1 ||
      some_very_long_name_3 %in% some_very_long_name_4)

  # Bad
  if (some_very_long_name_1 == 1
      && some_very_long_name_2 == 1
      || some_very_long_name_3 %in% some_very_long_name_4)

I know some people who are completely against it. See the next item why I believe it is better.

If the statement, which contains operators, expands into several lines, then each line should end with an operator and not begin with it. Sometimes, it makes sense to split a formula into meaningful chunks.
```
  # Good
  normal_pdf <- 1 / sqrt(2 * pi * d_sigma ^ 2) *
      exp(-(x - d_mean) ^ 2 / 2 / d_sigma ^ 2)

  # Bad
  normal_pdf <- 1 / sqrt(2 * pi * d_sigma ^ 2)
      * exp(-(x - d_mean) ^ 2 / 2 / d_sigma ^ 2)
```
Not only is it ugly, it is also syntactically wrong. In the second case, R will consider these two lines as two distinct statements: the first line will assign the value of 1 / sqrt(2 * pi * d_sigma ^ 2) to normal_pdf, and the second line will throw an error, since * does not have the first argument.
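
This can be verified without breaking a script by parsing both variants from text. The sketch below uses a simpler expression than the formula above. Note that when a whole file is parsed at once, for example by source(), the syntax error is reported before any line is evaluated.

```r
# A line that starts with an operator cannot be parsed.
bad <- "y <- 1 / 2\n    * 3"
bad_result <- tryCatch(
    parse(text = bad),
    error = function(e) "syntax error"
)

# A line that ends with an operator tells R that the statement continues.
good <- "y <- 1 / 2 *\n    3"
good_result <- eval(parse(text = good))
```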

Each grammar statement of dplyr (after %>%) and ggplot2 (after +) should start with a new line.

  mtcars %>%
      filter(cyl == 4) %>%
      group_by(am) %>%
      summarize(avg_mpg = mean(mpg))

  ggplot(mtcars) +
      geom_point(aes(x = mpg, y = qsec, color = factor(am))) +
      geom_line(aes(x = mpg, y = qsec, color = factor(am)))

Comments

Comment your code. Always. Your collaborators and future-you will be very grateful. Comments start with # followed by space and text of the comment.
```
  # This is a comment.
```
Comments should explain the why, not the what. Comments should not replicate the code in plain language, but rather explain the overall intention of the command.
```
  # Good
  # define iterator
  i <- 1

  # Bad
  # set i to 1
  i <- 1
```

Short comments can be placed on the same line of the code.

  plot(price, weight) # plot a scatter chart of price and weight

To comment/uncomment a selected chunk, use Command+Shift+C (Ctrl+Shift+C for Windows/Linux).
Use roxygen2 comments for a package development (i.e., #') to comment functions.

It makes sense to split the source into logical chunks by # followed by - or =.

  # Read data
  #---------------------------------------------------------------------------

  # Tidy data
  #---------------------------------------------------------------------------

Other recommendations

Use <- for assignment, NOT =.
Use library() instead of require(), unless it is a conscious choice. Package names should be character strings (avoid NSE, non-standard evaluation).
```
  # Good
  library("dplyr")

  # Bad
  require(dplyr)
```
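
The practical difference between the two is what happens when the package is missing, as the sketch below shows with a deliberately non-existent package name (notARealPackage is made up): require() only returns FALSE with a warning and lets the script continue, while library() stops with an error right away.

```r
# require() returns FALSE (with a warning) and the script goes on.
loaded <- suppressWarnings(
    require("notARealPackage", character.only = TRUE, quietly = TRUE)
)

# library() fails immediately, which is usually what you want in a script.
result <- tryCatch(
    library("notARealPackage", character.only = TRUE),
    error = function(e) "error"
)
```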
In a function call, arguments can be specified by position, complete name, or partial name. Never specify by partial name, and never place a named argument before positional ones.
```
  # Good
  mean(x, na.rm = TRUE)
  rnorm(10, 0.2, 0.3)

  # Bad
  mean(x, na = TRUE)
  rnorm(mean = 0.2, 10, 0.3)
```
While developing a package, specify arguments by name.
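
Here is a sketch of how partial names can bite later on. The function rescale() and its arguments are invented for this example: a call that relies on partial matching works until a second argument with the same prefix is added.

```r
# Version 1: "cen" partially matches the only candidate, "center".
rescale <- function(x, center = TRUE) {
    if (center) x - mean(x) else x
}
before <- rescale(c(1, 2, 3), cen = FALSE)

# Version 2 adds "censor": now "cen" is ambiguous and the same call fails.
rescale <- function(x, center = TRUE, censor = Inf) {
    x <- pmin(x, censor)
    if (center) x - mean(x) else x
}
after <- tryCatch(
    rescale(c(1, 2, 3), cen = FALSE),
    error = function(e) conditionMessage(e)
)
```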

The required (with no default value) arguments should be first, followed by optional arguments.

  # Good
  raise_to_power(x, power = 2.7)

  # Bad
  raise_to_power(power = 2.7, x)

The ... argument should either be in the beginning or in the end.

  # Good
  standardize(..., scale = TRUE, center = TRUE)
  save_chart(chart, file, width, height, ...)

  # Bad
  standardize(scale = TRUE, ..., center = TRUE)
  save_chart(chart, ..., file, width, height)

A good practice is to set default arguments inside the function using the NULL idiom, and to avoid dependence between arguments:

  # Good
  histogram <- function(x, bins = NULL) {
      if (is.null(bins)) bins <- nclass.Sturges(x)
      ...
  }

  # Bad
  histogram <- function(x, bins = nclass.Sturges(x)) {
      ...
  }
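
A runnable variant of the NULL idiom, as a sketch: the function count_bins() is hypothetical and simply counts observations per bin instead of drawing a histogram.

```r
count_bins <- function(x, bins = NULL) {
    # The default is computed inside the body, so the signature stays simple.
    if (is.null(bins)) bins <- grDevices::nclass.Sturges(x)
    table(cut(x, breaks = bins))
}

default_bins <- length(count_bins(1:16))         # Sturges' rule gives 5 bins
custom_bins <- length(count_bins(1:16, bins = 4))
```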

Always validate arguments in a function.
While developing a package, specify the namespace of each function used, except if it is from the base package.
Do NOT put more than one statement (command) per line. Do NOT use semicolon as termination of the command.
```
  # Good
  x <- 1
  x <- x + 1

  # Bad
  x <- 1; x <- x + 1
```
Avoid using setwd("/Users/irudnyts/path/that/only/I/have"). Almost surely your collaborators will have different paths, which makes the project not portable. Instead, use the here::here() function from the here package.
Avoid using rm(list = ls()). This statement deletes all objects from the global environment and gives you an illusion of a fresh R start: attached packages, changed options, and the working directory stay as they were.
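
For the recommendation to always validate arguments, one lightweight option in base R is stopifnot(). The sketch below reuses the raise_to_power() name from the example above; its body is my own assumption.

```r
raise_to_power <- function(x, power = 2.7) {
    # Fail early, with the failing condition shown in the error message.
    stopifnot(is.numeric(x), is.numeric(power), length(power) == 1)
    x ^ power
}

valid <- raise_to_power(2, power = 3)
invalid <- tryCatch(
    raise_to_power("2", power = 3),
    error = function(e) "invalid input"
)
```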

If you have read this far, you deserve a treat. There is a magic key combination, Command+Shift+A, that reformats the selected code: it adds spaces and indents it. Do not use it excessively though!

References

Advanced R
Google’s R Style Guide
Bioconductor Coding Style
Efficient R programming
Colin Gillespie’s R style guide
The State of Naming Conventions in R
Consistent naming conventions in R
Project-oriented workflow
Picture is taken from R Memes For Statistical Fiends Facebook page.

📁 [archived] Project-oriented workflow

2019-01-07T00:00:00+00:00

Be honest with yourself: how many times have you wanted to restart an ongoing project from scratch, throwing away the current folder? Or how many times have you had to rename files and adjust the folder structure to make your project simple and clear? Not to mention all those thousands of versions of your scripts dangling around in your mailbox. Tired of this? Then get on board and read my comments on how to make your project reproducible, portable, and self-contained.

Introduction

We start by working up some intuition about these three key aspects rather than trying to grasp explicit technical definitions. In a data science context, reproducibility means that the whole analysis can be recreated (or repeated) from scratch: executing scripts based on raw data must yield exactly the same results. It means, for instance, that if the analysis involves generating random numbers, then one has to set a seed (an initial state of a random generator) to obtain the same random split each time. Ideally, everyone should also have access to the data and software needed to replicate your analysis (this is not always the case, since data can be private), but this is already the domain of open science.
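
The seed example can be sketched in a few lines. The seed value and the split sizes below are arbitrary choices for illustration.

```r
# Setting the same seed before sampling reproduces the same random split.
set.seed(2019)
first_split <- sample(seq_len(10), size = 7)

set.seed(2019)
second_split <- sample(seq_len(10), size = 7)

identical(first_split, second_split)
# [1] TRUE
```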

Portability means that, given minimal prerequisites, the project should work regardless of the operating system or the computer. For instance, if the project uses a particular package that works only on Windows, then it is not portable. The project is also not considered portable if it relies on particular computer settings, such as absolute paths instead of paths relative to your project folder (e.g., when reading the data or saving plots to files). Normally, you should be able to run the code on your collaborator’s machine without changing any lines in the scripts.
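
As a small sketch of the path issue: building paths relative to the project folder with file.path() keeps the script independent of the machine. The file name calls.csv is hypothetical.

```r
# Not portable: an absolute path that exists on one machine only.
# calls <- read.csv("/Users/irudnyts/path/that/only/I/have/calls.csv")

# Portable: a path relative to the project folder.
raw_calls_path <- file.path("data", "raw", "calls.csv")
raw_calls_path
# [1] "data/raw/calls.csv"
```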

We call a project self-contained when you have everything you need at hand (i.e., in the folder of your project) and your project does not affect anything it did not create. It is a bad idea to use a function that has been defined in another of your projects. Not only will anyone who does not have that second project suffer, but so will you, when your current project is used on another machine. Furthermore, if you need, for instance, to save processed data, then it should be saved separately and not overwrite the raw data. There is another term with a similar meaning, isolated, which is related to the dependencies of the project. This topic is extensively covered in the section on the packrat dependency management system.

This post is an attempt to summarize the use of “sexy” tools and techniques that significantly improve the above-mentioned aspects of a project. Of course, one can immediately feel that these aspects are interrelated. As a consequence, the techniques and practices we consider further improve several elements at a time, rather than focusing on a particular one. For instance, using a consistent folder structure will make your project reproducible and portable, while properly managed dependencies will ensure that the project is self-contained and portable. That is why the following content is organized by tools rather than by stand-alone aspects. But do not get fooled, it is not yet another git / RStudio tutorial. There are dozens of tutorials, and I do not try to compete with them. Instead, I want to give an overview of useful things based entirely on my experience.

Now, you might ask yourself: why is it such a big deal? Well, first off, it gives more credibility to the research, because it can be verified and validated by a third party (your peers). Furthermore, keeping the flow of analysis reproducible, portable, and self-contained makes it easier to proceed and to extend the project. At first glance, it might look like you spend more time organizing your project than doing actual analysis. However, in the long run you will save much more time than you can anticipate.

Version control system

If you are reading this post, I bet you have heard of (if not used) a version control system. It allows you to manage changes to files, especially the history of source code. Naming all advantages of VCS would be a hard task, and I only wish to emphasize the main ones. To begin with, VCS allows storing the versions of files properly. One can always revert to any previous version of any file of the project without keeping tons of copies of the same file. If you keep your project on a hosting service, then it also backs up your most important files. Furthermore, distributed VCS makes it possible to collaborate in a straightforward way: your fellows have access to the latest version of any file of the project at any time. Let’s face the truth, sending files via email or Dropbox is too messy. It is not even dangerous to work on the same file at the same time, because VCS can merge the changes afterwards. Finally, branching deserves a separate mention: it is the possibility to deviate from the main flow of the analysis by having an independent stream, which can be merged back afterwards. The most common VCS are git, SVN (Subversion), and Mercurial.

Remember I mentioned that your collaborators always have access to files? Well, it is only true if your machine is plugged into a network. Surely that might not always be the case. To cope with this issue, hosting services are used, such as GitHub (works with git and SVN), GitLab (git), Bitbucket (git and Mercurial), SourceForge (git, SVN), etc. These guys host your repository (repo for short, a folder with all your project files), making it possible to share and publish it. While most VCS are command line tools, hosting services provide very convenient web-based interfaces in addition to their own sweet features.

A long time ago, when people mostly used Emacs and Eclipse as IDEs for R, SourceForge in conjunction with SVN dominated. Nowadays, most R projects are hosted on GitHub and use git. GitHub has many nice features, like Issues (that can be used for bug tracking, to-do lists, etc.), Pull requests, an integration with the Slack messenger, etc. Also, GitHub is very easy to integrate with RStudio.

There is quite a number of tutorials on this topic. I personally find Hadley’s chapter in R packages a very concise yet explicit cookbook. It covers main skills you might need, e.g. how to write good commits, etc.

Speaking about git, it is hard to avoid the topic of collaborating workflows. In a nutshell, there are three main workflows: centralized workflow, feature branch workflow, and forking workflow. Typically, a research project involves only a small number of collaborators who trust each other. If that’s the case, it makes sense to employ the centralized one, when everyone pushes into the central master branch. To deep dive into details of other workflows, please see the Bitbucket tutorial.

Last but not least, it is very important to master git commands and use them via the shell. For simple commands one can still use the built-in RStudio git interface. However, once you are ready to use git extensively, the shell becomes essential.

Dependency management tool

It is very likely that your data science project depends on non-base R packages. R provides a very convenient way of installing packages via install.packages(), which by default stores all packages in one global library. In most cases it is more than enough. However, sometimes different projects may depend on different versions of packages. For instance, the first project uses a function that has been deprecated in the current version. At the same time, the second project utilizes a function that appears only in the recent version of the package. A good example of such a package would be ggplot2, which has evolved significantly over time and many functions of which have been deprecated.

The solution to this problem is to store these packages of specific versions in the local folder of the project so that each project has its own private package library. If you have previously used Python or Ruby, similar tools are virtual environments and Bundler, respectively. In R we have several tools, such as packrat, jetpack, and others. Of these, packrat is the more common and stable one, and below I briefly show how to use it.

Storing required packages in the folder of the project ensures the project is self-contained (meaning everything that the project needs is inside its folder), portable (you can move your project to another machine without worrying too much about dependencies), and reproducible (the same versions of packages yield the same result). On the official packrat web page, the term self-contained is replaced by isolated: indeed, this package manager not only makes sure that everything is at hand, but also ensures that things won’t be overwritten and other projects won’t be affected (for instance, by installing a newer version of a dependency).

Installation of packrat is effortless: install.packages("packrat") should do the trick (on macOS it might require Command Line Tools to be installed first). There are two ways to start using packrat: if you use RStudio, simply enable the packrat option when initializing a new project; otherwise, use the command packrat::init() in an existing project (mind the argument project, which by default is the working directory of R).

From now on all your packages will be stored in the packrat folder. You should not modify anything by hand in this directory. I am not going to go over each component of this folder (one can read about it here), but several of them are worth mentioning. The files packrat/packrat.lock and packrat/packrat.opts contain the list of dependencies and specify the options of the tool, respectively. Then, packrat/lib/ is the actual private package library, where your installed packages live. Finally, your bundled packages are located in packrat/src/.

There are two ways to configure packrat: either with packrat::set_opts() or via RStudio (Tools -> Project Options… -> Packrat). Both methods will modify the packrat/packrat.opts file. We make only one modification to the default options: we need to check Automatically snapshot local changes in RStudio or to invoke packrat::set_opts(auto.snapshot = TRUE). We also leave Git ignore packrat library and Git ignore packrat sources as is, that is, checked and unchecked, respectively. Installed packages in packrat/lib/ are platform-specific. Thus, carrying them to other platforms does not make any sense. At the same time, they can be installed from the bundled packages in packrat/src/, which will be transferred together with the other files of the project.

The workflow of installing, removing, and updating packages is the same as in normal R, that is, by install.packages(), etc. Since we set auto.snapshot to TRUE, you do not need to make a snapshot each time with packrat::snapshot(); packrat will do it for you automatically.
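
Put together, the commands mentioned so far could look roughly like the sketch below. It is wrapped in a function (the name packrat_workflow() is made up) so that nothing is installed by accident, it assumes packrat is installed and the working directory is a project already initialized with packrat::init(), and it uses ggplot2 only as an example of a dependency.

```r
packrat_workflow <- function() {
    # Let packrat record changes to the private library automatically.
    packrat::set_opts(auto.snapshot = TRUE)

    # Install packages as usual; they end up in the private library.
    install.packages("ggplot2")

    # A manual snapshot is only needed if auto.snapshot is switched off.
    packrat::snapshot()

    # Report differences between the library, the lockfile, and the code.
    packrat::status()
}
```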

The most amazing thing about packrat is if you move the project to the other computer, all you need to do is to start R from the project directory – packrat will set up the private package library automatically.

Project folder structure

The size of a project tends to grow quickly. A project that started as a harmless code snippet can easily pile up into a huge snowball of hundreds of files with an unstructured folder tree. To avoid this, it is important to define the folder structure before stepping into analysis. Depending on whether the project is a package or a case study, it should have a significantly different skeleton.

The folder structure of R packages is regulated by the community (CRAN and Bioconductor). It is well-defined and can be explored in the R packages book; therefore, I skip it in this post.

As opposed to R packages, there is no single right folder structure for analysis projects. Below, I present a simple yet extensible folder structure for a data analysis project, based on several references that cover this issue.

The parent folder that will contain all project’s subfolders should have the same name as your project. Pick a good one. Spending an extra 5 minutes will save you from regrets in the future. The name should be short, concise, written in lower-case, and not containing any special symbols. One can apply similar strategies as for naming packages.

name_of_project/
|-  data
|   |-  raw
|   |-  processed
|-  figures
|-  packrat
|-  reports
|-  results
|-  scripts
|   |- deprecated
|-  .gitignore
|-  name_of_project.Rproj
|-  README.md

The folder data typically contains two subfolders, namely, raw and processed. The raw directory contains data files of any kind, such as .csv, SAS, Excel, text and database files, etc. The content of this folder is read-only: no script should change the original files or create new ones inside it. For this purpose the processed directory is used: all processed, cleaned, and tidied datasets are saved here. It is a good practice to save files in an R-specific format rather than in .csv, since saving in .csv is a less efficient way of storing data (both in terms of space and time of reading/writing). The preference is given to .rds files over .RData (see why in Content of R files section). Again, files should have representative names (merged_calls.rds vs dataset_1.rds). Note that it should be possible to regenerate those datasets from the raw data. In other words, if you remove all files from this folder, it must be possible to restore all of them by executing your scripts that use only the data from the raw directory.
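
Saving and restoring a processed dataset as .rds takes two function calls. In the sketch below, the data frame is a toy example and a temporary directory stands in for the project folder.

```r
# A temporary directory plays the role of the project root here.
processed_dir <- file.path(tempdir(), "data", "processed")
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# A toy processed dataset with a representative name.
merged_calls <- data.frame(id = 1:3, duration = c(12.5, 3.1, 40.2))
rds_path <- file.path(processed_dir, "merged_calls.rds")

saveRDS(merged_calls, rds_path)

# readRDS() returns the object, so it can be assigned to any name.
restored_calls <- readRDS(rds_path)
```

The same pair of functions works for fitted models that are stored in results.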

The folder figures is the place where you may store plots, diagrams, and other figures. There is not much to say about it. Common extensions of such files are .eps, .png, .pdf, etc. Again, file names in this folder should be meaningful (the name img1.png does not represent anything).

All reports live in a directory with the corresponding name reports. These reports can be in any format, such as LaTeX, Markdown, R Markdown, Jupyter Notebooks, etc. Currently, more and more people prefer rich documents with text and executable code to LaTeX and such.

Not all output objects of the analysis are data files. For example, you have calibrated and fitted your deep learning network to the data, which took about an hour. Of course, it would be painful to retrain the model each time you run the script, so you want to save this model. It is then reasonable to save it in results with the .rds extension.
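One way to avoid refitting is a small "fit once, then reuse" pattern, sketched below. The results location, the file name, and the lm() call (a stand-in for a model that takes long to train) are assumptions for illustration only.

```r
# Hypothetical results folder; in a project this would be results/.
results_dir <- file.path(tempdir(), "results")
dir.create(results_dir, showWarnings = FALSE)
model_path <- file.path(results_dir, "speed_model.rds")

if (file.exists(model_path)) {
    # The model was fitted in an earlier run: just read it back.
    model <- readRDS(model_path)
} else {
    # Stand-in for an expensive fit; saved so that later runs can skip it.
    model <- lm(dist ~ speed, data = cars)
    saveRDS(model, model_path)
}
```

If the data or the model specification changes, the saved file has to be removed by hand, otherwise the stale model keeps being loaded.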

Perhaps the most important folder is scripts. There you keep all your R scripts and code. This is exactly the place to use number prefixes if files should be run in a particular order. If you have files in other scripting languages (e.g., Python), it is better to keep them in this folder as well. There is also an important subfolder called deprecated. Whenever you want to remove a script, it is a good idea to move it to deprecated first, and only later delete it. The script you want to remove may contain functions or analysis used by other collaborators. Moving it to deprecated first lets you make sure that the file is not used by other collaborators. It is not required, of course, because git keeps all versions, and it is always possible to revert. But in my experience, it is highly convenient.
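Number prefixes also make it easy to run the whole pipeline in order. The sketch below assumes a hypothetical naming scheme such as 01_load.R, 02_clean.R (zero-padded, so that alphabetical order equals execution order) and creates two toy scripts in a temporary folder just to have something to run.

```r
# Hypothetical scripts folder with two toy scripts.
scripts_dir <- file.path(tempdir(), "scripts")
dir.create(scripts_dir, showWarnings = FALSE)
writeLines("x <- 1", file.path(scripts_dir, "01_load.R"))
writeLines("y <- x + 1", file.path(scripts_dir, "02_clean.R"))

# Pick up only files that start with a number prefix; anything inside the
# deprecated subfolder is ignored because the listing is not recursive.
files <- sort(list.files(scripts_dir, pattern = "^[0-9]+_.*\\.R$",
                         full.names = TRUE))

# Run the scripts one after another in a shared environment.
run_env <- new.env()
for (f in files) source(f, local = run_env)
```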

There are three important files in the project folder: .gitignore, name_of_project.Rproj, and README.md. The file .gitignore lists files that won’t be tracked by git: LaTeX or C build artifacts, system files, very large files, or files generated for particular cases (e.g., packrat/lib). The name_of_project.Rproj file contains options and metadata of the project: encoding, the number of spaces used for indentation, whether or not to restore a workspace at launch, etc. The README.md file briefly describes all high-level information about the project.

The proposed folder structure is far from being exhaustive. You might need to introduce other folders, such as paper (where .tex version of a paper lives), sources (a place for your compiled code, e.g., C++), references, presentations, NEWS.md, TODO.md, etc. At the same time, keeping empty folders could be misleading, and it is better to remove them (unless you are planning to store anything in them in the future). Moreover, git does not track empty folders.

Several R packages, namely ProjectTemplate, template, and template are dedicated to project structures. It is also possible to construct a project tree by forking manuscriptPackage or sample-r-project repos. Using a package or forking a repo allows automated structure generation, but at the same time introduces many redundant and unnecessary folders and files.

Finally, some scientists believe that all R projects should take the shape of a package. Indeed, one can store data in data/, R scripts in R/, documentation in man/, and the paper in vignettes/. The nice thing about it is that anyone familiar with the R package structure can immediately grasp where each type of file is located. On the other hand, the structure of R packages is tailored to serve its purpose, which is to make a coherent tool for data scientists and not to produce a data product: there is no distinction between function definitions and applications, no proper place for reports, and finally no place for other scripting languages that you may use (e.g., Bash, Python, etc.).

Content of R files

While there are no rules on how to organize your R code, there are several dos and don’ts that most of the time are not taught explicitly. I list them below in no particular order:

Do not use the function install.packages() inside your scripts. You are not supposed to (re)install packages each time you run your files. By default it is assumed that all packages that are used by a script are already installed. If you use packrat, packages will be installed automatically from bundles.

If there are many packages to install and you do not use packrat, I suggest creating a file configure.R that installs all packages:
```
  pkgs <- c("ggplot2", "plyr")
  install.packages(pkgs)
```
The snippet above benefits from the fact that install.packages() is a vectorized function. Still, most of the time install.packages() is supposed to be called from the console, not from a script.
Do not use the function require(), unless it is a conscious choice. In contrast to library(), require() does not throw an error (only a warning) if the package is not installed.
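The difference is easy to see with a package that is not installed. In the sketch below the package name is made up on purpose (an assumption: nothing with this name is installed on your machine).

```r
# Deliberately non-existent package name (hypothetical).
pkg <- "aPackageThatIsSurelyNotInstalled"

# require() only warns and returns FALSE, so the script keeps running
# and fails later, at the first call to a function from that package.
ok <- suppressWarnings(require(pkg, character.only = TRUE))

# library() stops right away with an informative error.
msg <- tryCatch(
    library(pkg, character.only = TRUE),
    error = function(e) conditionMessage(e)
)
```

Here ok is FALSE, while msg holds the error message that library() would have stopped the script with.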

Use a character representation of the package name.

  # Good
  library("ggplot2")

  # Bad
  library(ggplot2)

Load only those packages that are actually used in the script. Load packages at the beginning of the script.
Do not use rm(list = ls()), which erases your global environment. First, it could accidentally delete an important object that took a long time to build. Second, it only gives the illusion of a fresh start of R: attached packages, changed options, and the working directory all survive it.
Do not use setwd("/Users/irudnyts/path/that/only/I/have"). It is very unlikely that anyone except you will have the same path to the project. Instead, use the here package and relative paths. The here package automatically recognizes the path to the project and starts from there:
```
  # Good
  library("here")

  cars <- read.csv(file = here("data", "raw", "cars.csv"))

  # Bad
  setwd("/Users/irudnyts/path/that/only/I/have/data/raw")
  cars <- read.csv(file = "cars.csv")
```
If your script involves random number generation, set a seed with the set.seed() function to get the same random numbers each time:
```
  # Good
  set.seed(1991)
  x <- rnorm(100)

  # Bad
  x <- rnorm(100)
```

Do not repeat yourself (DRY). In the R context it means the following: if the code is repeated more than two times, you had better wrap it into a function (the example is borrowed from Advanced R):

  # Better
  fix_missing <- function(x) {
      x[x == -99] <- NA
      x
  }
  df[] <- lapply(df, fix_missing)

  # Bad
  df$a[df$a == -99] <- NA
  df$b[df$b == -99] <- NA
  df$c[df$c == -98] <- NA
  df$d[df$d == -99] <- NA
  df$e[df$e == -99] <- NA
  df$f[df$g == -99] <- NA

Separate function definitions from their applications. I typically keep a file util.R, where all my functions are defined.
Use saveRDS() instead of save():
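A short sketch of the difference, using the built-in cars dataset and temporary files (both chosen only for illustration): saveRDS() stores a single object without its name, so the reader decides what to call it, whereas save() stores the name too and load() silently recreates it in the workspace.

```r
path_rds <- file.path(tempdir(), "cars.rds")
path_rdata <- file.path(tempdir(), "cars.RData")

# Good: the object is stored without a name and assigned explicitly on reading.
saveRDS(cars, path_rds)
my_cars <- readRDS(path_rds)

# Bad: load() restores the object under its original name and would
# overwrite any existing object called `cars` without asking.
save(cars, file = path_rdata)
sandbox <- new.env()
restored_names <- load(path_rdata, envir = sandbox)  # loaded into a sandbox here
```

In the second case restored_names is the only way to learn what has just appeared in the environment.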

Initializing a new data analysis project in RStudio and getting your things together

Prerequisites:

Installed and configured git
Installed R and RStudio
Existing account in GitHub
Installed and configured packrat

Steps:

Pick a good name (e.g., beer).
In RStudio create a project:
- Navigate to File -> New project…
- Select New Directory
- Select New project
- Insert your picked name into Directory name
- Check Create a git repository and Use packrat with this project
This creates a folder with the name of the project, initializes a git repo, generates an .Rproj file, initializes packrat, and creates .gitignore file.
Configure packrat as described above.
Populate folders with files. Typically, at the beginning, it is only data/raw.
Create a README.md file.
Launch Terminal and navigate your working directory (of Terminal, not R) to your project folder by, for instance, cd /Users/irudnyts/Documents/projects/beer.
Record changes by git add --all and commit by git commit -m "Initialize the project". Traditionally the message of the first commit is simply "First commit", but I prefer to write something more meaningful. Now all your changes are recorded locally. Note also that git does not record empty folders.
Create a new repo in GitHub:
- Fill in Repository name with the same name as your project.
- Fill in Description with one line that briefly explains the intent of the project and ends with a full stop.
- Hit Create repository.
Connect your local repo to your GitHub repo by
```
 git remote add origin git@github.com:irudnyts/beer.git
 git push -u origin master
```
Refresh the page in your browser to ensure that changes appear at GitHub repo.

Outro & acknowledgement

About a year ago I came across a brilliant post by Jenny Bryan. I was amazed by how elegantly she formalized and summarized many simple tricks that make the life of a data scientist more pleasant. I was so inspired that I could not miss the opportunity to present these ideas in tutorials to students during the fall semester. The idea that I contributed to making projects more thoughtfully organized was very satisfying, and based on these tutorials I am starting this series of posts.

Many ideas and concepts are based on the works of Hadley Wickham and Jenny Bryan. Many thanks!


🌱 [archived] Setting a seed in R, when using parallel simulation

2018-07-12T00:00:00+00:00

Generally speaking, if the code does any simulations, it is good practice to set a seed to make the code reproducible. Setting a seed ensures that the same (pseudo-)random numbers will be generated each time the script is executed. Surprisingly, I found very few posts dedicated to any convention, best practice, or routine of setting a seed in R. Further, when using multiple cores (parallelisation) for simulations, things can get slightly more complicated.

Seeds in R

In base R there are two main objects to handle seeds: set.seed() and .Random.seed. For a vast number of problems it is enough to use set.seed(), which takes an integer as a seed. The workflow is then as simple as follows:

runif(2)
# > [1] 0.8636221 0.1782020

set.seed(1991)
runif(2)
# > [1] 0.1506231 0.2308308

runif(2)
# > [1] 0.0134826 0.5340390

set.seed(1991)
runif(2) # exactly the same random numbers as before
# > [1] 0.1506231 0.2308308

The second object, .Random.seed, allows saving and restoring the random number generator (RNG) state. Under the hood .Random.seed is a simple atomic integer vector, the first element of which encodes the kind of RNG and the kind of normal generator. For instance, a first element of 207 stands for the "L'Ecuyer-CMRG" RNG method together with "Box-Muller" for the normal distribution. The remaining elements of .Random.seed store the current state of the generator.

I find this object particularly useful because the RNG state can be saved without explicitly setting a seed. That is, one does not need to come up with an integer for set.seed(), which might be annoying, but can simply save the current state:
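The encoding of the first element can be checked directly. The sketch below switches to the two generators mentioned above, reads the code, and restores the previous generators afterwards; the variable names are illustrative. It relies on the documented layout of the code: the two lowest decimal digits index the uniform generator and the hundreds index the normal generator, both counted from zero (on R 3.6.0 and later the ten thousands additionally encode the sampling method, so the full number may be larger than 207).

```r
# Remember the current generators so that they can be restored afterwards.
old_kind <- RNGkind("L'Ecuyer-CMRG", "Box-Muller")
set.seed(1991)

code <- .Random.seed[1]

uniform_index <- code %% 100            # 7: "L'Ecuyer-CMRG"
normal_index  <- (code %/% 100) %% 100  # 2: "Box-Muller"

current_kind <- RNGkind()               # human-readable names of the generators

# Restore the generators that were active before.
RNGkind(old_kind[1], old_kind[2])
```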

seed <- .Random.seed
runif(2)
# > [1] 0.5696378 0.3737989

runif(2)
# > [1] 0.7199003 0.5540470

runif(2)
# > [1] 0.4383970 0.6494643

.Random.seed <- seed
runif(2) # exactly the same random numbers as before
# > [1] 0.5696378 0.3737989

The object .Random.seed lives in the global environment and therefore should be set there. This can cause issues if you try to set .Random.seed inside a function without caring too much about environments: changing it in the function by simple assignment won’t change the seed (the value will be set in the execution environment). The following straightforward idea can be used for saving the current seed or setting a custom one inside a function:

reproducible_runif <- function(seed = NULL) {

    if(is.null(seed)) {
        seed <- .Random.seed
    } else {
        # either `.Random.seed <<- seed` (mind the double arrow, which assigns
        # outside the execution environment) or, more explicitly:
        assign(x = ".Random.seed", value = seed, envir = .GlobalEnv)
    }

    return(list(x = runif(1), seed = seed))

}

Then, this function will return a random number, that can be reproduced:

r1 <- reproducible_runif()
r1$x
# > [1] 0.4215304

runif(10)
# >  [1] 0.1862207 0.2660995 0.5863689 0.1063663 0.5530690 0.9392229 0.9710050
#    [8] 0.1265786 0.1526233 0.1713895


r2 <- reproducible_runif(seed = r1$seed) # use the seed from the initial call
r2$x # exactly the same as for r1
# > [1] 0.4215304


Seeds for parallel

The story is slightly different when using multiple cores (parallel execution). In this post I use the base package parallel on macOS, but the concept is pretty much the same for other packages and non-Unix systems. The idea here is to run independent simulations on each core.

Before stepping into details, let’s consider an illustrative example. We run a classical function parallel::mclapply() that returns a random uniform number for each iteration. This function supplies a vector of ten elements as X argument, a simple wrapper around runif(1) to ignore elements of X, the number of cores (in my case 2 physical cores), and also we set mc.set.seed = FALSE. We run this expression two times (unlist is used for a more compact representation):

library(parallel)

rn1 <- unlist(
    mclapply(X = 1:10,
             FUN = function(x) runif(1),
             mc.cores = 2,
             mc.set.seed = FALSE)
)

rn1
# > [1] 0.3495050 0.3495050 0.4159384 0.4159384 0.5376814 0.5376814 0.3279605
#   [8] 0.3279605 0.1527834 0.1527834

identical(rn1[seq(1, 10, by = 2)], rn1[seq(2, 10, by = 2)])
# > [1] TRUE

One can immediately notice a suspicious thing: every second element equals the previous one. The explanation is very simple: the same workspace is restored from the master process for each worker (or process). It means that .Random.seed is inherited from the parent process, and therefore the RNG state is the same for each worker. As a result, the same sequence of random numbers is generated by each worker.

Of course this behaviour is not desirable. The alternative is to have separate (distinct) seeds for each worker. The potential problem is that the generated streams might get in step (i.e., repeat each other periodically and therefore be correlated). To resolve this, the parallel package utilizes the "L'Ecuyer-CMRG" RNG, which has a very long period and a small seed, so that streams do not get in step easily. To set the RNG to "L'Ecuyer-CMRG" one runs RNGkind("L'Ecuyer-CMRG"), and also changes the argument mc.set.seed of mclapply to TRUE:

RNGkind("L'Ecuyer-CMRG")

rng1 <- unlist(
    mclapply(X = 1:10,
             FUN = function(x) runif(1),
             mc.cores = 2,
             mc.set.seed = TRUE)
)

rng1

# > [1] 0.67681994 0.54730337 0.05398847 0.19480448 0.94954659 0.35727778
#   [7] 0.17057359 0.83029494 0.37063552 0.24445617

The elements are now different, and with "L'Ecuyer-CMRG" the package uses nextRNGStream() to generate the next “uncorrelated” seed. The pseudo-code (taken from vignette("parallel")) explains this concept:

# > RNGkind("L'Ecuyer-CMRG")
# > set.seed(2002) # something
# > M <- 16 ## start M workers
# > s <- .Random.seed
# > for (i in 1:M) {
# +     s <- nextRNGStream(s)
# +     # send s to worker i as .Random.seed
# + }

Let’s run the same expression one more time:

rng2 <- unlist(
    mclapply(X = 1:10,
             FUN = function(x) runif(1),
             mc.cores = 2,
             mc.set.seed = TRUE)
)

rng2

# > [1] 0.67681994 0.54730337 0.05398847 0.19480448 0.94954659 0.35727778
#   [7] 0.17057359 0.83029494 0.37063552 0.24445617

identical(rng1, rng2)
# > [1] TRUE

The second result rng2 is absolutely identical to rng1 that was run before. Coincidence? Nope. The thing is that the .Random.seed of the master process is NOT affected by worker processes (see the pseudo-code). That is why we get the same numbers during a second, third, and any other run, unless .Random.seed is changed (e.g., by runif(1) in the master process).

Note that if mc.set.seed is TRUE but the RNG is not "L'Ecuyer-CMRG", using set.seed() won’t establish reproducibility.

In the end, the package parallel is a little bit vague when it comes to RNG, so I had to read vignette("parallel") (Section 6), dozens of cross-referenced help pages (the RNG section of ?mclapply refers to ?mcparallel, which in turn requires reading ?nextRNGStream), and finally dive deep into the source of a non-exported function via parallel:::mc.set.seed.
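The stream mechanism from the vignette's pseudo-code can also be tried by hand, without forking any process. The sketch below is only an illustration of the idea and not what mclapply() does internally: the helpers make_streams() and draw_from_stream() are hypothetical names, and the "workers" are simulated by loading each stream into .Random.seed of the current session in turn.

```r
library(parallel)

# Remember the current generator and switch to "L'Ecuyer-CMRG".
old_kind <- RNGkind("L'Ecuyer-CMRG")

# Derive one RNG stream per (simulated) worker from a single integer seed.
make_streams <- function(seed, n_workers) {
    set.seed(seed)
    s <- .Random.seed
    streams <- vector("list", n_workers)
    for (i in seq_len(n_workers)) {
        s <- nextRNGStream(s)
        streams[[i]] <- s
    }
    streams
}

# Load a stream as the current RNG state and draw one uniform number from it.
draw_from_stream <- function(stream) {
    assign(".Random.seed", stream, envir = .GlobalEnv)
    runif(1)
}

draws1 <- vapply(make_streams(2002, 4), draw_from_stream, numeric(1))
draws2 <- vapply(make_streams(2002, 4), draw_from_stream, numeric(1))

# Same seed, same streams, same numbers; different streams give different draws.
same_across_runs <- identical(draws1, draws2)

# Restore the generator that was active before.
RNGkind(old_kind[1])
```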

💾 [archived] Installing MySQL on MacOS (and using it with R)

2018-03-27T00:00:00+00:00

A couple of days ago I was asked to install MySQL on MacOS 10.13, and I was surprised that it was not a one-click installation, as in the case of R. Unfortunately, even for me the documentation was a bit confusing, and I think it might be useful to have a guide to the installation process.

1. Download .dmg file and install MySQL

One has to download .dmg file from here. The app should be installed like a regular Mac app, and the procedure is well covered here.

At the end of the installation, when one has reached the summary, a separate window will pop up with a temporary password (as in a screenshot below). This password should be kept somewhere.

2. Set aliases

In order to avoid changing directories all the time before invoking mysql we can set aliases for the mysql and mysqladmin commands. To do so one has to open Terminal and execute the following commands (assuming that MySQL was installed to the default folder):

alias mysql=/usr/local/mysql/bin/mysql
alias mysqladmin=/usr/local/mysql/bin/mysqladmin

3. Start MySQL server

Everything should go smoothly so far. Now we need to start our server. One can do it in Terminal:

cd /Library/LaunchDaemons
sudo launchctl load -F com.oracle.oss.mysql.mysqld.plist

or in System Preferences by clicking on “Start”.

4. Change the temporary password

Now we need to run MySQL to change a temporary password for a ‘root’ user. After calling the following command, Terminal will ask for a password which we saved when installing MySQL in the first step:

mysql -u root -p

If everything is done correctly, you should see something like this:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 24
Server version: 5.7.21

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

To change the password we simply call this command, where “MyNewPass” as you already guessed is a new password:

SET PASSWORD FOR 'root'@'localhost' = PASSWORD('MyNewPass');

And then quit MySQL:

QUIT

5. (Optional) Install Sequel Pro IDE for MySQL

I find Sequel Pro a quite useful and beautiful IDE for MySQL. To install it one has to download a .dmg file, open it, and drag & drop “Sequel Pro.app” to applications’ folder.

To connect to a local MySQL server one has to choose Socket in the menu and fill in a username (default “root”) and the password that we changed in the previous step.

6. Use MySQL in conjunction with R

RMySQL provides a full interface for connecting R to MySQL. There are dozens of tutorials on how to use this package, and one can easily google them. We just want to ensure that everything works smoothly. First off, MySQL Server should be launched (as in Step 3). Then, we install and load the package, and finally, using user/password pair connect to a certain database.

install.packages("RMySQL")
library(RMySQL)

con <- dbConnect(MySQL(),
                 user = "root", password = "MyNewPass",
                 dbname = "test", host = "localhost")

dbListTables(con)
# [1] "CalendarMonths"

dbDisconnect(con)
# [1] TRUE

Enjoy!

📈 [archived] Simulating Poisson process (part 2)

2018-03-13T00:00:00+00:00

In the previous post we discussed two common methods of Poisson process simulation. The reason why this trivial problem was of interest to me is that it is a simplification of a larger-scale problem of a classical ruin process.

Let me remind you that I focus on an extension of the Cramér–Lundberg model with positive jumps, that is:

\[X(t) = u + ct + \sum_{i = 1}^{N_1(t)}X_i - \sum_{j = 1}^{N_2(t)}Y_j,\]

where:

$u$ is the initial capital;
$c$ is the premium rate;
$N_1(t)$ and $N_2(t)$ are Poisson processes of capital injections and claims with rates $\lambda_1$ and $\lambda_2$, respectively;
$X_i$ and $Y_j$ are i.i.d. random variables modeling sizes of capital injections and claims, respectively.

We simplify this model to a bare minimum that still reflects the behaviour of the Poisson processes, that is, we set $u = 0$ and $c = 0$. Further, we assume deterministic unit jumps: $X_i = 1$ and $Y_j = 1$. As a result we have the following model:

\[X(t) = N_1(t) - N_2(t).\]

The nice thing about this model is that we know the exact distribution of $X(t)$. It is called the Skellam distribution, which is covered in the skellam package. Therefore, we can compare Monte-Carlo simulated estimates to their exact equivalents.
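For reference, the first two moments of this simplified model follow directly from the independence of $N_1(t)$ and $N_2(t)$ and from the fact that a Poisson count with rate $\lambda$ over $(0, t)$ has mean and variance $\lambda t$:

\[\mathbb{E}[X(t)] = \mathbb{E}[N_1(t)] - \mathbb{E}[N_2(t)] = (\lambda_1 - \lambda_2)\,t, \qquad \operatorname{Var}[X(t)] = \operatorname{Var}[N_1(t)] + \operatorname{Var}[N_2(t)] = (\lambda_1 + \lambda_2)\,t.\]

In particular, with $\lambda_1 = \lambda_2 = 1$ the simulated paths should average to zero at any $t$, with variance $2t$.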

Method 1

For the Gerber-Shiu function we need to simulate a path until ruin. It means that the time horizon is not known a priori. As a consequence, the algorithm that first simulates the number of jumps for a given time is not applicable. Therefore, the only possibility to simulate a path until ruin is to exploit the fact that interarrival times of jumps are exponentially distributed. In the case of only negative jumps the algorithm is quite simple: in a while loop we add jumps to a path until the process is ruined (or any other stopping condition is met, e.g., the maximum number of iterations is reached).

Things are slightly more complicated when the model includes positive jumps. If both positive and negative jumps’ times were known to us, then it would be possible to sort them, and add to a path in ascending order, as in an illustration below.

However, we do not know the arrival times of jumps, as they have to be simulated. It would also be naive to add a jump of one type at one iteration and a jump of the other type at the next, because the underlying Poisson processes can have different rates (and, therefore, there is no guarantee that a jump of one type follows a jump of the other type). The approach I propose here is very similar to the playground game “tag”. We generate an arrival time for a jump of each type. Then, for the type that occurred earlier (A), we need to catch up with the time of the opposite type (B), that is, generate more jumps of type (A) until the time of the later type (B) is passed.

Algorithm:

- generate the first arrival times of a negative and of a positive jump
- repeat
    * if (time to last positive jump > time to last negative jump)
        # add last negative jump to path
        # if (stopping condition) exit loop
        # repeat
            % generate time to negative jump
            % if (time to last positive jump < time to last negative jump)
                @ exit loop
            % add negative jump to path
            % if (stopping condition) exit loop
        # if (stopping condition) exit loop
    * else
        # add positive jump
        # if (stopping condition) exit loop
        # repeat
            % generate time to positive jump
            % if (time to last positive jump > time to last negative jump)
                @ exit loop
            % add positive jump to path
            % if (stopping condition) exit loop
        # if (stopping condition) exit loop

Let me illustrate a couple of iterations to give a feeling of the algorithm. We simulate positive and negative jumps’ arrival:

Positive jump occurs later, therefore, we need to catch up with negative jumps. We simulate next negative arrival…

And next negative arrival…

One more…

And finally the negative jump overtakes the positive one, so now we need to catch up with positive ones…

And so on and so forth. In the algorithm above the stopping condition could be anything, for instance, the maximum number of jumps is attained, the maximum number of iterations is attained, the maximum time span is attained, the path is ruined, etc. Below, I propose an implementation with a stopping condition that uses a maximum time span.

library(magrittr)
library(ggplot2)
library(skellam)

sim_p1 <- function(lambda_p = 1, lambda_n = 1, t = 100) {

    # utility function: get last element of a vector
    last <- function(x) ifelse(length(x) > 0, yes = x[length(x)], no = 0)

    # initialize process
    path <- matrix(NA, nrow = 1, ncol = 2)
    colnames(path) <- c("time", "X")
    path[1, ] <- c(0, 0)

    # function for adding negative jump to a path
    add_jump_n <- function() {

        # add a new time arrival to arrival times vector
        time_n <<- c(time_n, current_time_n)

        # add a negative jump to the path
        path <<- rbind(
            path,
            c(current_time_n, path[nrow(path), 2])
        )

        path <<- rbind(
            path,
            c(path[nrow(path), 1], path[nrow(path), 2] - 1)
        )
    }

    # function for adding positive jump to a path
    add_jump_p <- function() {

        # add a new time arrival to arrival times vector
        time_p <<- c(time_p, current_time_p)

        # add a positive jump to the path
        path <<- rbind(
            path,
            c(current_time_p, path[nrow(path), 2])
        )

        path <<- rbind(
            path,
            c(path[nrow(path), 1], path[nrow(path), 2] + 1)
        )
    }

    # check whether the path has reached the maximum time span
    is_max_time_span_attained <- function() path[nrow(path), 1] >= t

    time_n <- numeric() # time of negative jumps arrivals
    time_p <- numeric() # time of positive jumps arrivals

    current_time_n <- rexp(1, lambda_n) # current time arrival of the negative
    current_time_p <- rexp(1, lambda_p) # current time arrival of the positive

    repeat{

        if(current_time_p > current_time_n) {

            add_jump_n()

            if(is_max_time_span_attained()) break

            repeat {

                current_time_n <- last(time_n) + rexp(1, lambda_n)
                if(current_time_p < current_time_n) break

                add_jump_n()

                if(is_max_time_span_attained()) break
            }

            if(is_max_time_span_attained()) break


        } else {

            add_jump_p()

            if(is_max_time_span_attained()) break

            repeat {
                current_time_p <- last(time_p) + rexp(1, lambda_p)
                if(current_time_p > current_time_n) break

                add_jump_p()

                if(is_max_time_span_attained()) break
            }

            if(is_max_time_span_attained()) break

        }
    }

    # dropping last step to be before t
    indices <- path[, 1] <= t
    path <- path[indices, , drop = FALSE]
    path <- rbind(path,
                  c(t, path[nrow(path), 2]))

    rval <- list(
        path = path,
        time_p = time_p,
        time_n = time_n
    )

    return(rval)
}

Let’s compare the estimated expected value and variance of the interarrival times with the theoretical ones, for both positive and negative jumps. With default parameters $\lambda_1 = 1$ and $\lambda_2 = 1$, they all should be around one. This is confirmed in the code snippet below:

set.seed(2018)

p1 <- sim_p1(t = 1000)
mean(diff(p1$time_p)); var(diff(p1$time_p))
# [1] 0.9713893
# [1] 0.9399382
mean(diff(p1$time_n)); var(diff(p1$time_n))
# [1] 1.067944
# [1] 1.121641

Method 2

The second method is a bit less cumbersome. We separately simulate the number of positive and negative jumps in the interval $(0, t)$ from the Poisson distribution with the respective rates. Then, we generate the arrival times of jumps from the uniform distribution and sort them. Finally, the full path should be built, that is, positive and negative jumps are added in a loop. The idea is as before: we compare which kind of jump (negative vs positive) occurs earlier, and add the earliest one. Note that it is possible that several negative jumps occur before a positive one, and vice versa. It is also possible that only positive or only negative jumps are left, therefore, we need to account for this in the loop.

The implementation is presented below:

sim_p2 <- function(lambda_p = 1, lambda_n = 1, t = 100) {

    # simulate numbers of positive and negative jumps
    number_p_jumps <- rpois(n = 1, lambda = lambda_p * t)
    number_n_jumps <- rpois(n = 1, lambda = lambda_n * t)

    # simulate the time of jumps' arrivals
    p_jumps_arrival <- runif(n = number_p_jumps) %>% sort() %>% multiply_by(t)
    n_jumps_arrival <- runif(n = number_n_jumps) %>% sort() %>% multiply_by(t)

    # keep the time of jumps' arrivals in separate variables
    time_p <- p_jumps_arrival
    time_n <- n_jumps_arrival

    # initialize process
    path <- matrix(NA, nrow = 1, ncol = 2)
    colnames(path) <- c("time", "X")
    path[1, ] <- c(0, 0)


    while(length(p_jumps_arrival) != 0 || length(n_jumps_arrival) != 0) {

        if(length(p_jumps_arrival) != 0 && length(n_jumps_arrival) != 0) {
            if(p_jumps_arrival[1] < n_jumps_arrival[1]) {

                # add positive jump

                path <- rbind(
                    path,
                    c(p_jumps_arrival[1], path[nrow(path), 2])
                )

                path <- rbind(
                    path,
                    c(path[nrow(path), 1], path[nrow(path), 2] + 1)
                )

                p_jumps_arrival <- p_jumps_arrival[-1]

            } else {

                # add negative jump

                path <- rbind(
                    path,
                    c(n_jumps_arrival[1], path[nrow(path), 2])
                )

                path <- rbind(
                    path,
                    c(path[nrow(path), 1], path[nrow(path), 2] - 1)
                )

                n_jumps_arrival <- n_jumps_arrival[-1]

            }
        } else {
            if(length(p_jumps_arrival) != 0) {

                # add positive jump

                path <- rbind(
                    path,
                    c(p_jumps_arrival[1], path[nrow(path), 2])
                )

                path <- rbind(
                    path,
                    c(path[nrow(path), 1], path[nrow(path), 2] + 1)
                )

                p_jumps_arrival <- p_jumps_arrival[-1]

            }
            if(length(n_jumps_arrival) != 0) {

                # add negative jump

                path <- rbind(
                    path,
                    c(n_jumps_arrival[1], path[nrow(path), 2])
                )

                path <- rbind(
                    path,
                    c(path[nrow(path), 1], path[nrow(path), 2] - 1)
                )

                n_jumps_arrival <- n_jumps_arrival[-1]
            }
        }
    }

    # add last step
    path <- rbind(path,
                  c(t, path[nrow(path), 2]))

    rval <- list(
        path = path,
        time_p = time_p,
        time_n = time_n
    )

    return(rval)
}

Again, we check that the estimated means and variances of the interarrival times for positive and negative jumps are close to one:

p2 <- sim_p2(t = 1000)
mean(diff(p2$time_p)); var(diff(p2$time_p))
# [1] 1.006591
# [1] 1.054261
mean(diff(p2$time_n)); var(diff(p2$time_n))
# [1] 0.9923275
# [1] 0.8606385
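The merging loop of the second method can also be written without explicit iteration. The sketch below is an alternative formulation and not the implementation used in this post: sim_p2_compact() is a hypothetical name, and its path keeps one row per jump (the value right after the jump) instead of the two rows per jump that sim_p2() stores for plotting a staircase.

```r
sim_p2_compact <- function(lambda_p = 1, lambda_n = 1, t = 100) {

    # arrival times as in the second method: Poisson counts, sorted uniforms
    time_p <- sort(runif(rpois(1, lambda_p * t), min = 0, max = t))
    time_n <- sort(runif(rpois(1, lambda_n * t), min = 0, max = t))

    # pool both types of jumps and order them by arrival time
    times <- c(time_p, time_n)
    jumps <- c(rep(1, length(time_p)), rep(-1, length(time_n)))
    ord <- order(times)

    # value of the process after each jump, plus the start and the end point
    path <- cbind(
        time = c(0, times[ord], t),
        X = c(0, cumsum(jumps[ord]), sum(jumps))
    )

    list(path = path, time_p = time_p, time_n = time_n)
}

set.seed(2018)
p3 <- sim_p2_compact(t = 1000)
```

The terminal value p3$path[nrow(p3$path), "X"] is then simply the number of positive jumps minus the number of negative ones.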

Convergence

Finally, we want to check visually whether the estimators are unbiased, and how fast they converge (i.e., which method has a smaller variance). We focus on the expected value of a path, as well as on the probability of the path being below a certain value. For this we simulate a large number (1000) of paths using each of the methods. Then, we estimate the expected value as a function of the number of simulations by aggregating the first x paths.

n <- 1000

paths1 <- replicate(n = n, expr = sim_p1(), simplify = FALSE)
paths2 <- replicate(n = n, expr = sim_p2(), simplify = FALSE)

First, we look at the expected value, which should be zero given that both lambdas equal one:

means1 <- sapply(
    1:n,
    function(x) {
        mean(sapply(paths1[1:x], function(y) y$path[nrow(y$path), 2]))
    }
)

means2 <- sapply(
    1:n,
    function(x) {
        mean(sapply(paths2[1:x], function(y) y$path[nrow(y$path), 2]))
    }
)

means <- rbind(
    data.frame(n = 1:n, mean = means1, method = "1"),
    data.frame(n = 1:n, mean = means2, method = "2")
)

ggplot(means) +
    geom_line(aes(n, mean, color = method)) +
    geom_hline(yintercept = 0) +
    theme_bw() +
    theme(text = element_text(size = 24))

From the plot it seems that both methods converge to the correct value at the same speed. Hooray! How about probabilities?

For probabilities we use a threshold of 10 (chosen arbitrarily) and a true value calculated by the pskellam() function from the skellam package. For the default argument t = 10, the value of the process follows $X(10) \sim \text{Skellam}(\lambda_1 \cdot t, \lambda_2 \cdot t)$.

probs1 <- sapply(
    1:n,
    function(x) {
        mean(sapply(paths1[1:x], function(y) y$path[nrow(y$path), 2] <= 10))
    }
)

probs2 <- sapply(
    1:n,
    function(x) {
        mean(sapply(paths2[1:x], function(y) y$path[nrow(y$path), 2] <= 10))
    }
)

probs <- rbind(
    data.frame(n = 1:n, prob = probs1, method = "1"),
    data.frame(n = 1:n, prob = probs2, method = "2")
)

ggplot(probs) +
    geom_line(aes(n, prob, color = method)) +
    geom_hline(yintercept = pskellam(q = 10, lambda1 = 100, lambda2 = 100)) +
    theme_bw() +
    theme(text = element_text(size = 24))

Again, the probabilities converge to the correct value at approximately the same speed. It means that there is no bias in either method, and we can continue extending the function for simulating ruin processes.
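
The true value behind the horizontal line can be cross-checked without any extra package. Below is a minimal sketch in base R that computes the Skellam CDF by convolving two Poisson pmfs. The function names dskellam_conv and pskellam_conv and the truncation point k_max are assumptions of this sketch, and the parameter values simply mirror the pskellam() call in the plotting code.

```r
# Skellam pmf for X = N1 - N2 with independent N1 ~ Pois(lambda1), N2 ~ Pois(lambda2).
# The infinite convolution sum is truncated at k_max (an assumption of this sketch).
dskellam_conv <- function(x, lambda1, lambda2, k_max = 1000) {
    k <- 0:k_max
    sapply(x, function(xi) sum(dpois(k + xi, lambda1) * dpois(k, lambda2)))
}

# CDF obtained by summing the pmf over a truncated support (-k_max, ..., q).
pskellam_conv <- function(q, lambda1, lambda2, k_max = 1000) {
    support <- (-k_max):q
    sum(dskellam_conv(support, lambda1, lambda2, k_max))
}

p_true <- pskellam_conv(q = 10, lambda1 = 100, lambda2 = 100)
p_true
```

The truncation is harmless as long as k_max is far in the tail of both Poisson distributions. For much larger rates it would need to be increased.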

📈 [archived] Simulating Poisson process (part 1)

2018-03-09T00:00:00+00:00

A couple of weeks ago a colleague of mine asked me for help with estimating the Gerber-Shiu function by Monte Carlo methods. The function is used in ruin theory for risk processes. One can think of this function as an equivalent of a moment generating function: if the function is known, it is easy to derive certain measures of interest, for instance, the ruin probability. My colleague wants to estimate this function for an extension of the Cramér–Lundberg model that includes positive jumps (capital injections). At first glance it seems a trivial task, but when I started approaching it, the problem turned out to be not so easy to solve.

To estimate the Gerber-Shiu function, a large number of paths should be simulated. That’s why I started with a function that simulates a path of the process. Basically, the random part of the model consists of two independent Poisson processes. There are three ways to simulate a Poisson process. The first method simulates the interarrival times of jumps from the Exponential distribution. The second method simulates the number of jumps in the given time period from the Poisson distribution, and then the times of the jumps as Uniform random variables. The third method requires a certain grid. Typically, only the first two methods are used.
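
The third, grid-based method is not implemented in this post. As a rough illustration only, here is a minimal sketch under the assumption that "a certain grid" means splitting the horizon into small cells of width dt and letting each cell contain a jump with probability rate * dt. The function name sim_pp_grid and the default dt are hypothetical.

```r
# Grid-based approximation of a Poisson process on [0, t] (a sketch, not exact):
# each cell of width dt contains one jump with probability rate * dt.
sim_pp_grid <- function(t, rate, dt = 0.001) {
    stopifnot(rate * dt < 1)
    grid <- seq(from = dt, to = t, by = dt)
    has_jump <- rbinom(n = length(grid), size = 1, prob = rate * dt) == 1
    jumps_time <- grid[has_jump]
    list(jumps_time = jumps_time, n_jumps = length(jumps_time))
}

set.seed(1)
pp <- sim_pp_grid(t = 1000, rate = 1)
pp$n_jumps                  # should be in the vicinity of rate * t
mean(diff(pp$jumps_time))   # should be close to 1 / rate
```

The approximation improves as dt shrinks, at the cost of a longer grid, which is one reason the first two methods are usually preferred.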

During my first attempt I used the first method (i.e., simulating interarrival times by Exponential r.v.s). To check myself, I estimated ruin probabilities and compared them with values derived numerically in the literature. For some reason, the estimates from the simulated processes were not in line with the numerical ones. I tried the second method, which yielded values closer to the true ones. On the other hand, the numerical values might also be biased due to precision error. To find out which values are correct I simplified the process to have only deterministic unit jumps, but the measurements were still biased (this will be discussed in detail in the next post). Further simplification led to a simple Poisson process, which is the focus of this post.

The two methods of Poisson process simulation mentioned above are widely covered in simulation books. However, I have not found any information on which method is better, or at least on the speed of convergence. So I implemented my own versions of the algorithms (both algorithms can be found in references below). Note that my implementation is probably far from efficient, but my goal is rather to compare visually how fast these algorithms converge.

Method 1

This algorithm exploits the fact that interarrival times are exponentially distributed. We simulate the arrival times until the maximum time horizon is achieved.

sim_pp1 <- function(t, rate) {

    path <- matrix(0, nrow = 1, ncol = 2)

    jumps_time <- rexp(1, rate)

    while(jumps_time[length(jumps_time)] < t) {

        jump <- matrix(c(jumps_time[length(jumps_time)], path[nrow(path), 2],
                         jumps_time[length(jumps_time)], path[nrow(path), 2]  + 1),
                       nrow = 2, ncol = 2, byrow = TRUE)

        path <- rbind(path, jump)

        jumps_time <- c(jumps_time,
                        jumps_time[length(jumps_time)] + rexp(1, rate))
    }

    path <- rbind(path,
                  c(t, path[nrow(path), 2]))

    list(path, jumps_time)
}

Method 2

This method simulates the number of jumps as a Poisson random variable with rate equal to the product of the time horizon and the process’s rate. Then, to obtain the arrival times, uniformly distributed random variables are generated and sorted (again, these algorithms are well known and described in detail in the references).

sim_pp2 <- function(t, rate) {

    path <- matrix(0, nrow = 1, ncol = 2)

    jumps_number <- rpois(1, lambda = rate * t)
    jumps_time <- runif(n = jumps_number, min = 0, max = t) %>% sort()

    for(j in seq_along(jumps_time)) {
        jump <- matrix(c(jumps_time[j], path[nrow(path), 2],
                         jumps_time[j], path[nrow(path), 2]  + 1),
                       nrow = 2, ncol = 2, byrow = TRUE)
        path <- rbind(path, jump)
    }

    path <- rbind(path,
                  c(t, path[nrow(path), 2]))

    list(path, jumps_time)

}
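
Growing the path with rbind() inside a loop is easy to read but copies the matrix at every jump. As a hedged alternative, the same step-function matrix can be assembled in one go. The helper build_path and the wrapper sim_pp2_vec are hypothetical names introduced for this sketch; the layout of the result (time in column 1, value in column 2) follows the functions above.

```r
# Build the (time, value) matrix of a unit-jump path from sorted jump times.
# Each jump contributes two rows: the value just before and just after the jump.
build_path <- function(jumps_time, t) {
    n <- length(jumps_time)
    times <- c(0, rep(jumps_time, each = 2), t)
    values <- c(0, rep(seq_len(n), each = 2) - rep(c(1, 0), times = n), n)
    cbind(times, values)
}

# Method 2 without the explicit loop.
sim_pp2_vec <- function(t, rate) {
    jumps_number <- rpois(1, lambda = rate * t)
    jumps_time <- sort(runif(n = jumps_number, min = 0, max = t))
    list(build_path(jumps_time, t), jumps_time)
}

build_path(c(0.5, 1.2), t = 2)
```

The same helper could be reused for Method 1 by passing it the arrival times that fall before t.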

Validation

Now, let’s check a couple of things, such as the mean and variance of the interarrival times and their histogram for both methods.

For the first method:

library("ggplot2")
library("magrittr")

set.seed(1)

path1 <- sim_pp1(1000, 1)
mean(diff(path1[[2]])); var(diff(path1[[2]]))
# [1] 1.029312
# [1] 0.9722406

data.frame(it = diff(path1[[2]])) %>%
    ggplot() +
    geom_histogram(aes(it, y = ..density..)) +
    stat_function(fun = dexp) +
    theme_bw() +
    theme(text = element_text(size = 24))

And for the second:

path2 <- sim_pp2(1000, 1)
mean(diff(path2[[2]])); var(diff(path2[[2]]))
# [1] 1.006302
# [1] 1.066079

data.frame(it = diff(path2[[2]])) %>%
    ggplot() +
    geom_histogram(aes(it, y = ..density..)) +
    stat_function(fun = dexp) +
    theme_bw() +
    theme(text = element_text(size = 24))

It seems that all values are in line with theory: the expected value and variance of the interarrival times both equal one (given the unit rate of the Poisson process), and the histograms have the expected shape.
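
The visual check can be complemented by a formal one. The sketch below runs a Kolmogorov-Smirnov test against the Exponential distribution using stats::ks.test(). The vector interarrival is a stand-in generated with rexp(); in the post it would be diff(path1[[2]]) or diff(path2[[2]]).

```r
set.seed(1)
rate <- 1

# Stand-in for the simulated interarrival times (assumption of this sketch).
interarrival <- rexp(1000, rate = rate)

# One-sample KS test against Exp(rate); a large p-value gives no evidence
# against exponentially distributed interarrival times.
ks <- ks.test(interarrival, "pexp", rate = rate)
ks$statistic
ks$p.value
```

For the second method the interarrival times are only approximately independent, since they come from sorted uniforms on a fixed horizon, so the test should be read as a rough sanity check.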

Convergence

Mathematically, both methods should have more or less the same speed of convergence. To check this we simulate 2000 paths with both methods and then estimate the expected value of the process at time ten as a function of the number of simulations.

t <- 10
n <- 2000
rate = 1

paths1 <- replicate(n = n, expr = sim_pp1(t, rate), simplify = FALSE)
means1 <- sapply(1:n,
                 function(x) {
                     pathes <- paths1[1:x]
                     mean(sapply(pathes, function(y) y[[1]][nrow(y[[1]]), 2]))
                 })

paths2 <- replicate(n = n, expr = sim_pp2(t, rate), simplify = FALSE)
means2 <- sapply(1:n,
                 function(x) {
                     pathes <- paths2[1:x]
                     mean(sapply(pathes, function(y) y[[1]][nrow(y[[1]]), 2]))
                 })

rbind(data.frame(n = 1:n, mean = means1, method = "1"),
    data.frame(n = 1:n, mean = means2, method = "2")) %>%
    ggplot() +
    geom_line(aes(x = n, y = mean, color = method)) +
    geom_hline(yintercept = rate * t) +
    theme_bw() +
    theme(text = element_text(size = 24))

Indeed, visually the estimates of the expected value converge at approximately the same speed. However, I had problems with probabilities, so below I perform the same procedure for the probability that a path ends at or below ten.

paths1 <- replicate(n = 2000, expr = sim_pp1(10, 1), simplify = FALSE)
probs1 <- sapply(1:2000,
                 function(x) {
                     pathes <- paths1[1:x]
                     mean(sapply(pathes, function(y) y[[1]][nrow(y[[1]]), 2]) <= 10)
                 })

paths2 <- replicate(n = 2000, expr = sim_pp2(10, 1), simplify = FALSE)
probs2 <- sapply(1:2000,
                 function(x) {
                     pathes <- paths2[1:x]
                     mean(sapply(pathes, function(y) y[[1]][nrow(y[[1]]), 2]) <= 10)
                 })

rbind(data.frame(n = 1:n, prob = probs1, method = "1"),
      data.frame(n = 1:n, prob = probs2, method = "2")) %>%
    ggplot() +
    geom_line(aes(x = n, y = prob, color = method)) +
    geom_hline(yintercept = ppois(q = 10, lambda = t * rate)) +
    theme_bw() +
    theme(text = element_text(size = 24))

Again, the methods seem to have the same performance. This is a good sign, because now I can compare the methods for slightly more complicated models without being afraid that differences might be due to the Poisson process simulation algorithms.
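
The running estimates above recompute the mean of the first x paths for every x, which is quadratic in the number of simulations. A cheaper way, sketched here under the assumption that the terminal values have already been extracted into a vector, is a cumulative sum. The vector end_values is a stand-in drawn from the Poisson distribution that the terminal value follows for t = 10 and rate 1.

```r
set.seed(1)
n <- 2000

# Stand-in for the terminal values of n simulated paths (assumption of this sketch).
end_values <- rpois(n, lambda = 10)

# Running mean and running probability P(X(t) <= 10) after 1, 2, ..., n simulations.
running_mean <- cumsum(end_values) / seq_len(n)
running_prob <- cumsum(end_values <= 10) / seq_len(n)

tail(running_mean, 1)
tail(running_prob, 1)
```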

📊 [archived] Multinomial regression in R

2018-01-24T00:00:00+00:00

In my current project on long-term care, at some point we were required to use a regression model with multinomial responses. I was very surprised that, in contrast to the well-covered binomial GLM for the binary response case, the multinomial case is poorly described. Surely, there are half a dozen packages overlapping each other; however, there is no sound tutorial or vignette. Hopefully, my post will improve the current state.

We can distinguish two types of multinomial responses, namely nominal and ordinal. For a nominal response, a variable can take a value from a predefined finite set, and these values are not ordered. For instance, a variable color can be either green or blue. In machine learning the problem is often referred to as classification. In contrast to the nominal case, for an ordinal response variable the set of values has a relative ordering. For example, a variable size can be small < middle < large. Furthermore, depending on the link function we can have logit or probit models.
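
In R the distinction maps onto unordered and ordered factors. A small sketch with made-up values:

```r
# Nominal response: an unordered factor.
color <- factor(c("green", "blue", "green"), levels = c("blue", "green"))

# Ordinal response: an ordered factor with small < middle < large.
size <- factor(c("small", "large", "middle"),
               levels = c("small", "middle", "large"), ordered = TRUE)

is.ordered(color)
is.ordered(size)
size < "large"   # comparisons are meaningful only for ordered factors
```

Functions for ordinal models typically expect the response to be an ordered factor like size, so it is worth checking the levels before fitting.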

Nominal response models

According to Agresti (2002), the problem can be formulated by two similar approaches: through baseline-category logits or as a multivariate GLM. In general, these two approaches are equivalent and give identical maximum-likelihood estimates; the only difference is the formula representation.

Baseline-category logits (multinomial logit model)

The baseline-category logit model is implemented in three distinct packages, namely nnet::multinom() (referred to as a log-linear model), mlogit::mlogit(), and mnlogit::mnlogit() (which claims to be a more efficient implementation than mlogit; see the comparison of the performance of these packages).

Let $p_j = \mathbb{P}(Y = j \mid \boldsymbol{x})$ be the probability that the dependent variable $Y$ takes value $j$ given a vector of explanatory variables’ values $\boldsymbol{x}$. In total, there are $J$ categories, and obviously, due to the second axiom of probability, $\sum_j p_j = 1$. We fix a baseline category at level $J$ (or at any other level), and the model is as follows:

\[\log \frac{p_j}{p_J} = \alpha_j + \boldsymbol{\beta}'_j \boldsymbol{x}, \quad j = 1, ..., J - 1,\]

describing the effects of the explanatory variables $\boldsymbol{x}$ on the log odds of level $j$ versus the baseline level. Of course, using these $J-1$ equations and the second axiom it’s possible to come back to probabilities (which is a nice exercise, by the way):

\[p_j = \frac{\exp(\alpha_j + \boldsymbol{\beta}'_j \boldsymbol{x})}{1 + \sum_{h = 1}^{J-1}\exp(\alpha_h + \boldsymbol{\beta}'_h \boldsymbol{x})}\]
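
For completeness, the exercise can be sketched in a few lines. Writing $\eta_j = \alpha_j + \boldsymbol{\beta}'_j \boldsymbol{x}$ as shorthand (notation introduced here only for brevity), the model equations give

\[p_j = p_J \exp(\eta_j), \quad j = 1, ..., J - 1.\]

Summing over all categories and using $\sum_j p_j = 1$ yields

\[1 = p_J + \sum_{h = 1}^{J-1} p_J \exp(\eta_h) = p_J \left(1 + \sum_{h = 1}^{J-1}\exp(\eta_h)\right), \quad \text{so} \quad p_J = \frac{1}{1 + \sum_{h = 1}^{J-1}\exp(\eta_h)}.\]

Substituting $p_J$ back into $p_j = p_J \exp(\eta_j)$ gives the expression above for $j = 1, ..., J - 1$.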

For each group $j$ the parameters $\alpha_j$ and $\boldsymbol{\beta}_j$ are distinct. Let’s now estimate $\alpha_j, \boldsymbol{\beta}_j, \quad j = 1, ..., J - 1$ with different packages and make sure that the estimates are identical. I use the marital.nz data from the VGAM package.

# install.packages("VGAM")
library(VGAM)
data(marital.nz)
head(marital.nz)
#   age ethnicity            mstatus
# 1  29  European             Single
# 2  55  European  Married/Partnered
# 3  44  European  Married/Partnered
# 4  53  European Divorced/Separated
# 5  45  European  Married/Partnered
# 7  30  European             Single
unique(marital.nz$mstatus)
# [1] Single             Married/Partnered  Divorced/Separated Widowed           
# Levels: Divorced/Separated Married/Partnered Single Widowed

The data contains “marital data mainly from a large NZ company collected in the early 1990s”. The dependent variable mstatus has four unordered classes: Divorced/Separated, Married/Partnered, Single, and Widowed. We use age as the only explanatory variable.

Package nnet

library(nnet)
fit_nnet <- multinom(mstatus ~ age, marital.nz)
coef(fit_nnet)
#                   (Intercept)          age
# Married/Partnered    2.778686 -0.003538729
# Single               6.368064 -0.152745520
# Widowed             -6.753123  0.099333903
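
To see what these coefficients mean, they can be turned back into probabilities by hand with the formula from the previous section. The sketch below copies the estimates printed above; the age of 40 is an arbitrary, hypothetical value. The baseline category is Divorced/Separated, the first factor level, which does not appear among the coefficient rows.

```r
# Coefficients as printed by coef(fit_nnet) above.
coefs <- matrix(
    c( 2.778686, -0.003538729,
       6.368064, -0.152745520,
      -6.753123,  0.099333903),
    ncol = 2, byrow = TRUE,
    dimnames = list(c("Married/Partnered", "Single", "Widowed"),
                    c("(Intercept)", "age"))
)

age <- 40  # hypothetical value of the explanatory variable

# Linear predictors for the non-baseline categories.
eta <- coefs[, "(Intercept)"] + coefs[, "age"] * age

# Baseline gets exp(0) = 1 in the numerator; all share the same denominator.
probs <- c("Divorced/Separated" = 1, exp(eta)) / (1 + sum(exp(eta)))
round(probs, 3)
```

With the fitted object at hand, the same numbers should be obtainable from predict(fit_nnet, newdata = data.frame(age = 40), type = "probs").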

Package mlogit

library(mlogit)
fit_mlogit <- mlogit(mstatus ~ 0 | age, data = marital.nz, shape = "wide")
matrix(fit_mlogit$coefficients, ncol = 2)
#           [,1]         [,2]
# [1,]  2.778666 -0.003538297
# [2,]  6.368056 -0.152745424
# [3,] -6.753157  0.099334560

Package mnlogit

library(mnlogit)
marital.nz_long <- mlogit.data(data = marital.nz, choice = "mstatus")
fit_mnlogit <- mnlogit(mstatus ~ 1 | age | 1, marital.nz_long)
matrix(fit_mnlogit$coefficients, ncol = 2, byrow = TRUE)
#           [,1]         [,2]
# [1,]  2.778666 -0.003538297
# [2,]  6.368056 -0.152745424
# [3,] -6.753157  0.099334560

Even though the latter package is very efficient and customizable, there are several points I am not a big fan of. First off, mnlogit works only with long data instead of the wide format that is common and familiar for regression. That’s why we had to use mlogit.data to convert the data. Second, the formula syntax is rather confusing despite its customizability. Of course, the list is not exhaustive; other packages exist, e.g., brglm2.

Multinomial logit model as multivariate GLM

For this model, instead of treating the response variable as a scalar we set it to be a vector of $J-1$ elements (the $J$-th is redundant). Then, $\boldsymbol{y}_i = (y_{i,1}, ..., y_{i, J-1})'$ and $\boldsymbol{\mu}_i = (p_{i,1}, ..., p_{i, J-1})'$. Therefore,

\[g_j(\boldsymbol{\mu}_i) = \log \frac{\mu_{i,j}}{1 - (\mu_{i,1}+...+\mu_{i, J-1})}\]

and

\[\boldsymbol{g}(\boldsymbol{\mu}_i) = \boldsymbol{X}_i \boldsymbol{\beta}\]

where $\boldsymbol{g}$ is a vector of link functions.

The package VGAM deals exactly with multivariate GLMs and GAMs. Let’s compute the estimates for this model, which should coincide with the previously calculated ones:

library(VGAM)
fit_vgam <- vglm(mstatus ~ age, multinomial(refLevel = 1),
                 data = marital.nz)
matrix(fit_vgam@coefficients, ncol = 2)
#           [,1]         [,2]
# [1,]  2.778666 -0.003538297
# [2,]  6.368056 -0.152745424
# [3,] -6.753157  0.099334560

Ordinal response model: proportional odds model

For an ordinal response variable the model is slightly different. Let $Y$ be a categorical response variable with $J$ categories, which are ordered $1 < ... < J$. Therefore, it is possible to define cumulative probabilities as

\[\mathbb{P}(Y \leq j \mid \boldsymbol{x}) = p_1 + ... + p_j, \quad j = 1, ..., J\]

Then, cumulative logits are:

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x})) = \log\frac{\mathbb{P}(Y \leq j \mid \boldsymbol{x})}{1 - \mathbb{P}(Y \leq j \mid \boldsymbol{x})} = \log\frac{p_1 + ... + p_j}{p_{j+1} + ...+ p_J}, \quad j = 1, ..., J - 1\]

Let’s now link the cumulative logits to the explanatory variables $\boldsymbol{x}$:

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x})) = \alpha_j + \boldsymbol{\beta}' \boldsymbol{x}, \quad j = 1, ..., J-1\]

Note that $\boldsymbol{\beta}$ is the same for each logit. However, the intercepts can differ and are necessarily non-decreasing in $j$.

The model got its name from its property:

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x}_1)) - \text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x}_2)) = \log\frac{\mathbb{P}(Y \leq j \mid \boldsymbol{x}_1) / \mathbb{P}(Y > j \mid \boldsymbol{x}_1)}{\mathbb{P}(Y \leq j \mid \boldsymbol{x}_2) / \mathbb{P}(Y > j \mid \boldsymbol{x}_2)} = \boldsymbol{\beta}' (\boldsymbol{x}_1 - \boldsymbol{x}_2)\]
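
The property is easy to verify numerically. In the sketch below the intercepts alpha, the slope beta, and the two values x1 and x2 are all hypothetical numbers chosen for illustration; plogis() and qlogis() are the inverse logit and logit functions from base R.

```r
alpha <- c(-2, -0.5, 1)   # hypothetical non-decreasing intercepts, J = 4 categories
beta <- 0.3               # hypothetical common slope

# Cumulative probabilities P(Y <= j | x) for j = 1, ..., J - 1.
cum_prob <- function(x) plogis(alpha + beta * x)

x1 <- 5
x2 <- 2

# The difference of cumulative logits is the same for every j.
logit_diff <- qlogis(cum_prob(x1)) - qlogis(cum_prob(x2))
logit_diff
beta * (x1 - x2)

# Category probabilities follow by differencing the cumulative ones.
diff(c(0, cum_prob(x1), 1))
```

Every element of logit_diff equals beta * (x1 - x2): the log odds ratio does not depend on the cut point $j$, which is exactly the proportionality the model is named after.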

Again, there are at least four packages that fit the proportional odds model. Let’s quickly compare their estimates using Italian household data for 2006 (dataset ecb06.it from the VGAMdata package). We try to explain the ordinal variable education, which has 8 levels, by the numeric variable age.

# install.packages("VGAMdata")
library(VGAMdata)
data(ecb06.it)
# str(ecb06.it)
head(ecb06.it[, c("age", "education")])
#    age     education
# 1   58    highschool
# 4   81 primaryschool
# 5   52    highschool
# 9   67  middleschool
# 12  56  middleschool
# 16  72 primaryschool

Package MASS

Perhaps the most famous function is MASS::polr.

library(MASS)
fit_polr <- polr(formula = education ~ age, data = ecb06.it)
summary(fit_polr)$coefficients[, 1, drop = FALSE]
#                                  Value
# age                        -0.06417893
# none|primaryschool         -6.95688936
# primaryschool|middleschool -4.51869196
# middleschool|profschool    -3.06471919
# profschool|highschool      -2.73295822
# highschool|bachelors       -0.96907401
# bachelors|masters          -0.89517059
# masters|higherdegree        2.42815131

Package VGAM

fit_vglm <- vglm(formula = education ~ age, family = propodds, data = ecb06.it)
as.matrix(fit_vglm@coefficients)
#                      [,1]
# (Intercept):1  6.95576156
# (Intercept):2  4.51825182
# (Intercept):3  3.06430069
# (Intercept):4  2.73254206
# (Intercept):5  0.96867493
# (Intercept):6  0.89470432
# (Intercept):7 -2.42867591
# age           -0.06417086

Package ordinal

library(ordinal)
fit_clm <- clm(formula = education ~ age, data = ecb06.it)
as.matrix(fit_clm$coefficients)
#                                  [,1]
# none|primaryschool         -6.9557784
# primaryschool|middleschool -4.5182645
# middleschool|profschool    -3.0643131
# profschool|highschool      -2.7325541
# highschool|bachelors       -0.9686858
# bachelors|masters          -0.8947152
# masters|higherdegree        2.4286635
# age                        -0.0641711

A nice thing about this package is that it allows for different link functions, i.e., "logit", "probit", "cloglog", "loglog", and "cauchit". To my regret, I know only "logit" and "probit" from this list.

Package rms

library(rms)
fit_lrm <- lrm(formula = education ~ age, data = ecb06.it)
as.matrix(fit_lrm$coefficients)
#                        [,1]
# y>=primaryschool  6.9557784
# y>=middleschool   4.5182645
# y>=profschool     3.0643131
# y>=highschool     2.7325541
# y>=bachelors      0.9686858
# y>=masters        0.8947152
# y>=higherdegree  -2.4286635
# age              -0.0641711

This function was rather unstable. Adding more explanatory variables threw an error a couple of times.

The coefficients are consistent (differences in signs are explained by modelling $\mathbb{P}(Y \leq j)$ versus $\mathbb{P}(Y \geq j)$), which is good.
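
The agreement can also be checked numerically from the printed output. The sketch below copies the polr thresholds and the vglm intercepts from above and compares them after flipping the sign. As documented for MASS::polr, it models $\text{logit}(\mathbb{P}(Y \leq j)) = \zeta_j - \boldsymbol{\beta}' \boldsymbol{x}$, so a model for the upper tail with linear predictor $\alpha_j + \boldsymbol{\beta}' \boldsymbol{x}$ has $\alpha_j = -\zeta_j$ and the same slope. That VGAM's propodds family models the upper tail in this way is my reading of its output here, not something taken from the post.

```r
# Thresholds (zeta) printed by polr and intercepts printed by vglm above.
polr_zeta <- c(-6.95688936, -4.51869196, -3.06471919, -2.73295822,
               -0.96907401, -0.89517059, 2.42815131)
vglm_intercepts <- c(6.95576156, 4.51825182, 3.06430069, 2.73254206,
                     0.96867493, 0.89470432, -2.42867591)

# If the two parametrizations mirror each other, the sums should be near zero.
max(abs(polr_zeta + vglm_intercepts))

# The slopes for age should agree directly.
abs(-0.06417893 - (-0.06417086))
```

The remaining discrepancies are small and presumably come from different optimizers and convergence tolerances.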

Perhaps now you are wondering which package to use. Well, I do not know; just choose one and stick to it. I will probably use VGAM, since it covers various models and seems to be nicely documented.

References:

Agresti, A. (2002) Categorical Data Analysis, Second edition, Wiley
STAT504

🥕 [archived] Dortmund real estate market analysis: data preprocessing with caret

2017-10-25T00:00:00+00:00

This is a rather short note, which is related more to the amazing package caret than to our data set. The package allows for manipulating the model with less typing; for instance, cross-validation or data preprocessing can be done by just specifying a couple of arguments in train(), the key function of the package.

Perhaps I will leave median imputation and the $k$-nearest neighbors algorithm for better times, since the Dortmund real estate data set contains no missing values. Furthermore, caret makes it possible to apply various transformations of the data, e.g., centering, scaling, principal component analysis (PCA), independent component analysis (ICA), etc. As one may remember, the model with the smallest out-of-sample RMSE is the GAM with inverse Gaussian responses and a log-link function. A GAM applies smooth functions to the regressors, and thus centering and scaling won’t improve our metric. Besides, I am not a big fan of centering and scaling, since both procedures make the observations dependent. In other words, subtracting the sample mean from each observation makes this observation dependent on the whole sample.
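
To make the PCA transformation concrete, here is a small sketch with stats::prcomp() on simulated data. The columns rooms and area are hypothetical stand-ins generated to be correlated; they are not the Dortmund data. As far as I know, caret centers and scales the predictors whenever PCA is requested, which is why both options are switched on here.

```r
set.seed(1)

# Hypothetical, correlated regressors (not the real data set).
area <- rnorm(200, mean = 80, sd = 25)
rooms <- round(area / 30 + rnorm(200, sd = 0.5))
x <- cbind(rooms = rooms, area = area)

cor(x)[1, 2]                      # the original columns are strongly correlated

# PCA on centered and scaled columns.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

cor(pca$x)[1, 2]                  # the components are uncorrelated
summary(pca)$importance["Proportion of Variance", ]
```

With only two regressors the rotation removes the correlation but cannot reduce the dimension by much, so any gain for a smooth model such as a GAM is not guaranteed.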

On the other hand, PCA might be very useful, since rooms and area are correlated. Can this improve the model? Let’s experiment and see. As usual, we start from loading packages and data.

packages <- c("mgcv", "magrittr", "vtreat", "caret")
sapply(packages, library, character.only = TRUE, logical.return = TRUE)
# options(scipen=999)
rm(list = ls())

setwd("/Users/irudnyts/Documents/data/")
property <- read.csv("dortmund.csv")

Below our GAM with IG outcome and log-link function is rewritten in caret syntax yielding the same RMSE:

model <- train(price ~ rooms + area, data = property,
               method = "gam", family = inverse.gaussian(link = "log"))
pred <- predict(model, property[, c("area", "rooms")])
(pred - property[, "price"]) ^ 2 %>% mean() %>% sqrt()
# [1] 134.9973

I don’t utilize the power of the function trainControl(), which can be used for cross-validation, in order to obtain consistency with previous posts. The model with PCA transformation is shown below:

model_pca <- train(price ~ rooms + area, data = property,
                   method = "gam", family = inverse.gaussian(link = "log"),
                   preProcess = "pca")
pred <- predict(model_pca, property[, c("area", "rooms")])
(pred - property[, "price"]) ^ 2 %>% mean() %>% sqrt()
# [1] 146.4144

The in-sample RMSE is higher than that of our best model, which is not very encouraging. Even though it makes little sense to go further, we calculate the out-of-sample RMSE.

set.seed(3)
folds <- kWayCrossValidation(nRows = nrow(property), nSplits = 3)

pred <- rep(NA, nrow(property))
for(fold in folds) {
    model_pca <- train(price ~ rooms + area, data = property[fold$train, ],
                       method = "gam", family = inverse.gaussian(link = "log"),
                       preProcess = "pca")

    pred[fold$app] <- predict(model_pca, property[fold$app, ])
}

sqrt(mean((pred - property$price) ^ 2))
# [1] 151.7058

I can summarize in one short sentence: PCA is not helpful for this case.

🔬 [archived] Dortmund real estate market analysis: neural networks

2017-10-21T00:00:00+00:00

At every turn in a non-technical post about AI for a broader audience, an author deems it their duty to mention deep learning as a panacea for all woes. Well, it’s not. Deep learning is just one of various models, which might or might not perform better than other techniques. At the end of the day, in a nutshell, it’s just regular neural networks with multiple hidden layers between the input and output layers (well, it’s rather an oversimplification, but you get the idea). In this post I am curious whether it’s possible for a neural network approach to beat our best model so far (the GAM with an inverse Gaussian response distribution).

The nice thing about neural networks is that they allow for interactions between variables. Remember we included several interaction terms in our simple linear, GLM, and GAM models? A neural network builds much more complicated interactions by construction, so we do not have to worry about that. The more hidden layers we use, the more complex these interactions can be.

When I first tried TensorFlow and keras, I have to admit my guilt to R users: I did it in Python. Let me just quickly go over the code chunks, and then I will come back to R (the code of which is pretty similar).

First, libraries and data should be loaded. As an API to TensorFlow the package keras is used; further, I load pandas for its DataFrame, as well as numpy for its arrays. The regressors (area, rooms) and the outcome (price) are stored in separate variables.

import keras
from keras.layers import Dense
from keras.models import Sequential

import pandas as pd
import numpy as np

property = pd.read_csv('/Users/irudnyts/Documents/data/dortmund.csv')
x = property[["area", "rooms"]].values
y = property[["price"]].values

We need to initialize a Sequential model (layers are connected sequentially). We use a neural network with two hidden layers, each of 50 neurons, and a rectified linear unit (ReLU) activation function. The input layer has 2 neurons (area and rooms), and the output layer has only one neuron (price). We use the standard adam (Adaptive Moment Estimation) optimizer and the standard mean squared error as the loss (objective) function. Finally, we slightly increase the number of epochs to 15 for the fit.

model = Sequential()

model.add(Dense(50, activation = 'relu', input_shape = (2,)))
model.add(Dense(50, activation = 'relu'))
model.add(Dense(1))

model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(x = x, y = y, validation_split = 0.3, epochs = 15)
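
To make the architecture less of a black box, here is a base R sketch of what a forward pass through such a 2-50-50-1 network computes, and how many parameters it has. The weights are random and untrained, and the input values are hypothetical; the point is only the structure, not the predictions.

```r
relu <- function(z) pmax(z, 0)

layer_sizes <- c(2, 50, 50, 1)   # input, two hidden layers, output

set.seed(1)
# One weight matrix W and one bias vector b per pair of consecutive layers.
weights <- lapply(seq_len(length(layer_sizes) - 1), function(i) {
    list(W = matrix(rnorm(layer_sizes[i] * layer_sizes[i + 1], sd = 0.1),
                    nrow = layer_sizes[i], ncol = layer_sizes[i + 1]),
         b = rep(0, layer_sizes[i + 1]))
})

# Total number of trainable parameters: (2*50 + 50) + (50*50 + 50) + (50*1 + 1).
n_params <- sum(sapply(weights, function(l) length(l$W) + length(l$b)))
n_params

# Forward pass: affine map followed by ReLU, except for the linear output layer.
forward <- function(x, weights) {
    h <- x
    for (i in seq_along(weights)) {
        h <- sweep(h %*% weights[[i]]$W, 2, weights[[i]]$b, "+")
        if (i < length(weights)) h <- relu(h)
    }
    h
}

x_new <- matrix(c(80, 3, 120, 4), ncol = 2, byrow = TRUE)  # hypothetical area, rooms
forward(x_new, weights)
```

Training consists of adjusting W and b in every layer to minimize the mean squared error, which is what the optimizer does during fit.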

After the model is specified, compiled, and fitted we can predict and calculate in-sample RMSE.

predicted = model.predict(x = x)

np.sqrt(np.mean((predicted - y) ** 2))
# 166.2982065207604

For an in-sample RMSE the result is not bad. However, it was only a first quick and dirty try. At this point I decided to switch back to R, and here is the equivalent code:

packages <- c("ggplot2", "magrittr", "keras", "vtreat")
sapply(packages, library, character.only = TRUE, logical.return = TRUE)

property <- read.csv("/Users/irudnyts/Documents/data/dortmund.csv")
x <- property[, 2:3] %>% as.matrix()
y <- property[, 1]

model <- keras_model_sequential()

model %>%
    layer_dense(units = 50, activation = "relu", input_shape = 2) %>%
    layer_dense(units = 50, activation = "relu") %>%
    layer_dense(units = 1)

model %>% compile(
    loss = "mean_squared_error",
    optimizer = optimizer_adam()
)

model %>% fit(x = x, y = y, epochs = 15)

pred <- model %>% predict(x = x)

sqrt(mean((pred - y) ^ 2))
# 166.152

Looks pretty similar to the Python code, right? OK, let’s play around with the tuning parameters and empirically find the optimal ones. For this purpose, we define a function that calculates the RMSE for a neural network with a given number of layers and neurons (assuming each layer has the same number of neurons). Note that while fitting a model we use a custom stopping rule, i.e., if the mean squared error does not improve by more than min_delta = 0.01, then training stops at the current epoch.

get_rmse <- function(n_layers, n_neurons) {
    stopifnot(n_layers >= 2)
    model <- keras_model_sequential()
    model %>%
        layer_dense(units = n_neurons, activation = "relu", input_shape = 2)
    for(i in 2:n_layers) {
        model %>%
            layer_dense(units = n_neurons, activation = "relu")
    }
    model %>% layer_dense(units = 1)

    model %>% compile(
        loss = "mean_squared_error",
        optimizer = optimizer_adam()
    )

    model %>% fit(x = x, y = y, epochs = 15,
                  callbacks = callback_early_stopping(min_delta = 0.01,
                                                      monitor = 'loss'))

    pred <- model %>% predict(x = x)
    sqrt(mean((pred - y) ^ 2))
}
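
The stopping rule itself is simple enough to spell out. The function below is a simplified, hypothetical re-implementation of the idea for a given sequence of losses: training stops at the first epoch whose loss does not beat the best loss so far by more than min_delta. It is meant as an illustration of the rule described above, not as the code keras runs.

```r
# Return the epoch at which training would stop for a vector of per-epoch losses.
stop_epoch <- function(losses, min_delta = 0.01) {
    best <- Inf
    for (epoch in seq_along(losses)) {
        if (losses[epoch] < best - min_delta) {
            best <- losses[epoch]   # sufficient improvement, keep training
        } else {
            return(epoch)           # no sufficient improvement, stop here
        }
    }
    length(losses)                  # never triggered: all epochs were used
}

# The third epoch improves the loss by only 0.005, so training stops there.
stop_epoch(c(10, 5, 4.995, 3))
```

A smaller min_delta makes the rule more tolerant of slow progress, so the model trains for more epochs before stopping.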

Then, we train models with 50 neurons and several numbers of layers, namely from 2 to 5:

layers_summary <- data.frame(n_layers = 2:5,
                             rmse = sapply(2:5, get_rmse, n_neurons = 50))
# n_layers     rmse
#        2 166.7639
#        3 165.5580
#        4 165.2328
#        5 164.3651

It seems that 2 layers are more than enough. Let’s now determine the number of neurons for each layer:

neurons <- c(seq(from = 10, to = 20, by = 2), seq(from = 30, to = 100, by = 10))
neurons_summary <- data.frame(n_neurons = neurons,
                              rmse = sapply(neurons, get_rmse, n_layers = 2))
# n_neurons     rmse
#        10 296.4352
#        12 165.5779
#        14 167.7926
#        16 210.7701
#        18 166.0520
#        20 166.3228
#        30 165.0766
#        40 165.8220
#        50 165.6295
#        60 165.6339
#        70 165.5375
#        80 165.9305
#        90 164.7960
#       100 164.7582

The model with around 20 neurons looks stable. Thus, for our final model we use 2 layers with 20 neurons to calculate the out-of-sample RMSE. The model is allowed a slightly larger number of potential epochs, since we decrease min_delta to 0.0005 to let the model train a bit more.

set.seed(3)
folds <- kWayCrossValidation(nRows = nrow(property), nSplits = 3)

pred <- rep(NA, nrow(property))
for(fold in folds) {
    model <- keras_model_sequential()
    model %>%
        layer_dense(units = 20, activation = "relu", input_shape = 2) %>%
        layer_dense(units = 20, activation = "relu") %>%
        layer_dense(units = 1)
    model %>% compile(
        loss = "mean_squared_error",
        optimizer = optimizer_adam()
    )
    model %>% fit(x = x[fold$train, ], y = y[fold$train],
                  epochs = 30,
                  callbacks = callback_early_stopping(monitor = "loss",
                                                      min_delta = 0.0005))

    pred[fold$app] <- model %>% predict(x = x[fold$app, ])
    rm(model)
}

sqrt(mean((pred - y) ^ 2))
# 165.5461

Fortunately or unfortunately, the model has not outperformed our previous models, and the leader is still the GAM with IG outcome. On the other hand, we have used the simplest (and when I say simplest, I do mean simplest) neural networks. With this post I finish the cycle of Dortmund real estate data analysis.