<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://irudnyts.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="http://irudnyts.github.io//" rel="alternate" type="text/html" /><updated>2026-05-26T13:02:42+00:00</updated><id>http://irudnyts.github.io//feed.xml</id><title type="html">Iegor Rudnytskyi, PhD</title><subtitle>Outdated, please refer to LinkedIn page</subtitle><entry><title type="html">📦 [archived] Managing dependencies in packages</title><link href="http://irudnyts.github.io//managing-dependencies-in-packages/" rel="alternate" type="text/html" title="📦 [archived] Managing dependencies in packages" /><published>2019-12-03T00:00:00+00:00</published><updated>2019-12-03T00:00:00+00:00</updated><id>http://irudnyts.github.io//managing-dependencies-in-packages</id><content type="html" xml:base="http://irudnyts.github.io//managing-dependencies-in-packages/"><![CDATA[<p>Managing the usual dependencies of a package is clearly covered in <a href="http://r-pkgs.had.co.nz">R packages by Hadley Wickham</a>. Typically, that would be the end of a tutorial or a post. However, while recently teaching how to develop a package, I encountered a couple of super interesting and non-trivial questions that do not have a conventional solution. I guess this post is a perfect place to share my thoughts on that matter, as well as a nice excuse to restart blogging.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for backward compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<p><img src="https://irudnyts.github.io/images/posts/2019-12-03-managing-dependencies-in-packages/lego.png" alt="" /></p>

<h2 id="non-cran-packages">Non-CRAN packages</h2>

<p>When developing a package, the standard place to list dependencies (i.e., external packages that your package needs) is <code class="language-plaintext highlighter-rouge">Imports:</code> in <code class="language-plaintext highlighter-rouge">DESCRIPTION</code>. Full stop here. These packages must be installed for your package to work, and they will be installed automatically when your package is installed via <code class="language-plaintext highlighter-rouge">install.packages()</code> (see the default behavior of the <code class="language-plaintext highlighter-rouge">dependencies</code> argument). However, packages in the <code class="language-plaintext highlighter-rouge">Imports:</code> field are supposed to be published on CRAN. That could be an issue if your package uses functionality from packages that are not (yet) published on CRAN. This is the exact question I was asked by one of my students: where do I specify non-CRAN dependencies?</p>

<p>I was sure that there exists a common workflow to do it. After a minute of extensive research, I found out that CRAN policy explains it quite vaguely. Further, there were three Stack Overflow questions about it (see References below). The answer that I found was quite satisfactory: Dirk Eddelbuettel proposes listing the package in <code class="language-plaintext highlighter-rouge">Suggests:</code> and specifying the additional repository in the special free-form field <code class="language-plaintext highlighter-rouge">Additional_repositories:</code>. He also suggests using the <code class="language-plaintext highlighter-rouge">drat</code> package to create a CRAN-like R package repository, which in my view is a bit of an overkill. So my solution would be to list the name of the package in <code class="language-plaintext highlighter-rouge">Suggests:</code> and mention the link to its GitHub repo (almost surely the source is stored on GitHub) in <code class="language-plaintext highlighter-rouge">Additional_repositories:</code>.</p>

<p><strong>Update:</strong> As it was kindly pointed out by Sébastien Rochette, <code class="language-plaintext highlighter-rouge">devtools</code> supports a <code class="language-plaintext highlighter-rouge">Remotes:</code> field exactly for that purpose. Simply specify the repos in the format <code class="language-plaintext highlighter-rouge">username/reponame</code> separated by commas (one can also add the type of the source if it is not GitHub, e.g., <code class="language-plaintext highlighter-rouge">gitlab::username/reponame</code>). And that is it.</p>
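
<p>As a rough sketch, the relevant <code class="language-plaintext highlighter-rouge">DESCRIPTION</code> fields might look like the text below. The names <code class="language-plaintext highlighter-rouge">mypkg</code>, <code class="language-plaintext highlighter-rouge">nonCRANpkg</code>, and <code class="language-plaintext highlighter-rouge">username</code> are placeholders, not real packages or accounts. The snippet only parses the text with base R's <code class="language-plaintext highlighter-rouge">read.dcf()</code> to show how the fields are read; it does not install anything.</p>

```r
# Hypothetical DESCRIPTION excerpt; all names are placeholders.
description_text <- c(
    "Package: mypkg",
    "Version: 0.0.1",
    "Suggests: nonCRANpkg",
    "Remotes: username/nonCRANpkg"
)

# read.dcf() is the base R reader for the format used by DESCRIPTION files.
con <- textConnection(description_text)
fields <- read.dcf(con)
close(con)

# Each field becomes a column of a character matrix.
fields[1, "Suggests"]
fields[1, "Remotes"]
```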

<p>That would be a nice end to the story, but how would you let the end user know that this package needs to be pre-installed? The workaround I found is to raise a message from the function where this dependency is used and ask the user to install it, for example:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">my_function</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="s2">"nonCRANpkg"</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">installed.packages</span><span class="p">())))</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">message</span><span class="p">(</span><span class="s2">"Please install package nonCRANpkg."</span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
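
<p>A possible variant of the same check, sketched here under the assumption that failing early is acceptable: <code class="language-plaintext highlighter-rouge">requireNamespace()</code> tests a single package without building the full table that <code class="language-plaintext highlighter-rouge">installed.packages()</code> returns, and <code class="language-plaintext highlighter-rouge">stop()</code> replaces <code class="language-plaintext highlighter-rouge">message()</code> so that the function does not continue without its dependency. As before, <code class="language-plaintext highlighter-rouge">nonCRANpkg</code> and <code class="language-plaintext highlighter-rouge">my_function()</code> are placeholders.</p>

```r
# `nonCRANpkg` and `my_function()` are placeholders, as in the example above.
my_function <- function() {
    # requireNamespace() returns FALSE (instead of failing) when the package
    # cannot be loaded; quietly = TRUE suppresses its own messages.
    if (!requireNamespace("nonCRANpkg", quietly = TRUE)) {
        stop(
            "Package `nonCRANpkg` is required for my_function(); ",
            "please install it first.",
            call. = FALSE
        )
    }
    # ... code that relies on nonCRANpkg would go here
    invisible(TRUE)
}
```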

<p>The problem is that the user has to come back to the installation process at the point when they use <code class="language-plaintext highlighter-rouge">my_function()</code>. In addition, it probably affects the expected output of the function, or even worse, the function may be an internal one that is not exported from the namespace. That is why, in my personal view, the installation of all dependencies should be tackled well before the first call of <code class="language-plaintext highlighter-rouge">my_function()</code>. And here the function <code class="language-plaintext highlighter-rouge">.onAttach()</code> comes in handy. This function allows displaying messages when the package is attached. We simply need to inform the user that they need to install the dependency before using our package (mind the difference between <code class="language-plaintext highlighter-rouge">message()</code> and <code class="language-plaintext highlighter-rouge">packageStartupMessage()</code>):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">.onAttach</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">libname</span><span class="p">,</span><span class="w"> </span><span class="n">pkgname</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="s2">"nonCRANpkg"</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">installed.packages</span><span class="p">())))</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">packageStartupMessage</span><span class="p">(</span><span class="w">
            </span><span class="n">paste0</span><span class="p">(</span><span class="w">
                </span><span class="s2">"Please install `nonCRANpkg` by"</span><span class="p">,</span><span class="w">
                </span><span class="s2">" `devtools::install_github('username/nonCRANpkg')`"</span><span class="w">
            </span><span class="p">)</span><span class="w">
        </span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">

</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>To summarize in a nutshell: mention the package name in the field <code class="language-plaintext highlighter-rouge">Suggests:</code> of <code class="language-plaintext highlighter-rouge">DESCRIPTION</code>, link to its repo in <code class="language-plaintext highlighter-rouge">Remotes:</code> (in the same file), and write a simple <code class="language-plaintext highlighter-rouge">.onAttach()</code> function (it should be stored in the file <code class="language-plaintext highlighter-rouge">zzz.R</code>).</p>

<h2 id="shiny-demo-app">Shiny demo app</h2>

<p>It is always a cool idea to complement the package with a Shiny app so that a user has an interactive interface to play around with the functionality of the package. We typically store the scripts of those demo apps in <code class="language-plaintext highlighter-rouge">inst/shiny-examples/name_of_app</code> and add a function <code class="language-plaintext highlighter-rouge">runDemo()</code> to run them (see a wonderful post by Dean Attali in the references for details). Those apps are very likely to have their own dependencies, and they definitely require the <code class="language-plaintext highlighter-rouge">shiny</code> namespace to be loaded. That is why we see all these <code class="language-plaintext highlighter-rouge">library()</code> calls at the beginning of Shiny apps’ scripts.</p>

<p>Obviously, we want to (1) ensure that the user has all required packages installed, and (2) avoid using <code class="language-plaintext highlighter-rouge">library()</code> in the package’s scripts. The solution is very simple – specify all Shiny app dependencies in <code class="language-plaintext highlighter-rouge">Imports:</code> and use the usual <code class="language-plaintext highlighter-rouge">::</code> to access functions from the respective namespaces.</p>
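
<p>A minimal sketch of such a launcher is given below. It assumes that the app lives in <code class="language-plaintext highlighter-rouge">inst/shiny-examples/name_of_app</code> of a package called <code class="language-plaintext highlighter-rouge">dummypkg</code> and that <code class="language-plaintext highlighter-rouge">shiny</code> is listed in <code class="language-plaintext highlighter-rouge">Imports:</code>; the argument names are illustrative only.</p>

```r
# Sketch of a launcher; `dummypkg` and `name_of_app` are placeholders.
runDemo <- function(app = "name_of_app", package = "dummypkg") {
    # system.file() returns "" when the directory (or the package) is not found.
    app_dir <- system.file("shiny-examples", app, package = package)
    if (app_dir == "") {
        stop("Could not find the demo app `", app, "`.", call. = FALSE)
    }
    # shiny is accessed via `::`, so no library() call is needed.
    shiny::runApp(app_dir, display.mode = "normal")
}
```

<p>Inside the app scripts themselves, the same idea applies: calls such as <code class="language-plaintext highlighter-rouge">shiny::fluidPage()</code> replace a leading <code class="language-plaintext highlighter-rouge">library(shiny)</code>.</p>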

<p>To sum up all the previous take-home points, I created a <code class="language-plaintext highlighter-rouge">dummypkg</code> for illustration, which is stored in the GitHub repo <a href="https://github.com/irudnyts/dummypkg"><code class="language-plaintext highlighter-rouge">irudnyts/dummypkg</code></a>. It contains a bare-bones example of non-CRAN dependencies, as well as a tiny Shiny app with dependencies. Managing those dependencies is super important since we do not want our packages to look like jack-in-the-boxes.</p>

<p>Many thanks go to Ana Lucy Bejarano Montalvo, who inspired me by asking those questions, and to Sébastien Rochette for pointing out the <code class="language-plaintext highlighter-rouge">Remotes:</code> field.</p>

<h2 id="references">References</h2>
<ol>
  <li><a href="http://r-pkgs.had.co.nz">R packages by Hadley Wickham</a></li>
  <li><a href="https://cran.r-project.org/web/packages/devtools/vignettes/dependencies.html">Devtools dependencies</a></li>
  <li><a href="https://stackoverflow.com/questions/33335321/include-non-cran-package-in-cran-package">Include non-CRAN package in CRAN package</a></li>
  <li><a href="https://stackoverflow.com/questions/43773066/r-package-building-how-to-import-a-function-from-a-package-not-on-cran">R package building: How to import a function from a package not on CRAN</a></li>
  <li><a href="https://stackoverflow.com/questions/36105257/how-to-make-r-package-recommend-a-package-hosted-on-github?rq=1">How to make R package recommend a package hosted on GitHub?</a></li>
  <li><a href="https://stackoverflow.com/questions/29419776/r-package-dependencies-not-installed-from-additional-repositories">R package dependencies not installed from Additional_repositories</a></li>
  <li><a href="https://deanattali.com/2015/04/21/r-package-shiny-app/">Supplementing your R package with a Shiny app</a></li>
  <li><a href="http://pngimg.com/download/51459">Lego PNG image with transparent background</a></li>
  <li><a href="https://github.com/ThinkR-open/golem">A Framework for Building Robust Shiny Apps</a></li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[Managing the usual dependencies of a package is clearly covered in R packages by Hadley Wickham. Typically, that would be the end of a tutorial or a post. However, while recently teaching how to develop a package, I encountered a couple of super interesting and non-trivial questions that do not have a conventional solution. I guess this post is a perfect place to share my thoughts on that matter, as well as a nice excuse to restart blogging.]]></summary></entry><entry><title type="html">🖊 [archived] R Coding Style Guide</title><link href="http://irudnyts.github.io//r-coding-style-guide/" rel="alternate" type="text/html" title="🖊 [archived] R Coding Style Guide" /><published>2019-01-14T00:00:00+00:00</published><updated>2019-01-14T00:00:00+00:00</updated><id>http://irudnyts.github.io//r-coding-style-guide</id><content type="html" xml:base="http://irudnyts.github.io//r-coding-style-guide/"><![CDATA[<p>Language is a tool that allows human beings to interact and communicate with each other. The more clearly we express ourselves, the better the idea is transferred from our mind to another. The same applies to programming languages: concise, clear, and consistent code is easier to read and edit. This is especially important if you have collaborators who depend on your code. However, even if you don’t, keep in mind that at some point you might come back to your code, for example, to fix an error. And if you did not follow your coding style consistently, reviewing your code can take much longer than expected. In this context, taking care of your audience means making your code as readable as possible.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for backward compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<blockquote>
  <p>Good coding style is like using correct punctuation. You can manage without
it, but it sure makes things easier to read.
<cite> Hadley Wickham </cite></p>
</blockquote>

<p>There is no such thing as a “correct” coding style, as there is no such thing as the best color. At the end of the day, coding style is a set of developers’ preferences. If you are coding alone, sticking to your coding style and being consistent is more than enough. The story is a bit different if you are working in a team: it is crucial to agree on a convention beforehand and make sure that everyone follows it.</p>

<p><img src="https://irudnyts.github.io/images/posts/2019-01-14-r-coding-style-guide/assignment.jpg" alt="" /></p>

<p>Even though there is no official style guide, R is mature and steady enough to have an “unofficial” convention. In this post, you will learn these “unofficial” rules, their deviations, and most common styles.</p>

<h2 id="naming">Naming</h2>

<h3 id="naming-files">Naming files</h3>

<p>The convention actually depends on whether you develop a file for a package or as part of a data analysis process. There are, however, <strong>common rules</strong>:</p>

<ul>
  <li>
    <p>File names should use <code class="language-plaintext highlighter-rouge">.R</code> extension.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">read.R</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">read</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>File names should be meaningful.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">model.R</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">Untitled1.R</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>File names should not contain <code class="language-plaintext highlighter-rouge">/</code> or spaces. Instead, a dash (<code class="language-plaintext highlighter-rouge">-</code>) or underscore (<code class="language-plaintext highlighter-rouge">_</code>) should be used.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">fit_regression.R</span><span class="w">
  </span><span class="n">fit</span><span class="o">-</span><span class="n">regression.R</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">fit</span><span class="w"> </span><span class="n">regression.R</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>File names should use letters from <a href="https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)">Basic Latin</a>, and NOT from <a href="https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)">Latin-1 Supplement</a>.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">tidy.R</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">rangé.R</span><span class="w">
</span></code></pre></div>    </div>
  </li>
</ul>
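
<p>The mechanical part of these rules can be sketched as a small check. The helper name <code class="language-plaintext highlighter-rouge">is_valid_file_name()</code> is made up for this illustration, and whether a name is meaningful obviously cannot be tested this way.</p>

```r
# Hypothetical helper: TRUE when a file name uses only Basic Latin letters,
# digits, dashes or underscores, and ends with the .R extension.
is_valid_file_name <- function(file_name) {
    grepl("^[A-Za-z0-9_-]+\\.R$", file_name, perl = TRUE)
}

# One value per file name: spaces and a missing extension are rejected.
is_valid_file_name(c("fit-regression.R", "fit regression.R", "read"))
```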

<p>If the file is <strong>a part of data analysis</strong>, then the following recommendations make sense:</p>

<ul>
  <li>
    <p>There should be no files in the same folder that differ only by letter case, and file names should be lowercase. There is nothing wrong with capitalized names; just bear in mind the case sensitivity and case preservation of your system. Case sensitivity means that <code class="language-plaintext highlighter-rouge">test.R</code> and <code class="language-plaintext highlighter-rouge">Test.R</code> can coexist in the same folder. For instance, the macOS file system (APFS) is not case-sensitive by default.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">analyse.R</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">Analyse.R</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Use meaningful verbs for file names.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">validate</span><span class="o">-</span><span class="n">vbm.R</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">regression.R</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>If files should be run in a particular order, then use ascending names.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="m">01</span><span class="o">-</span><span class="n">read.R</span><span class="w">
  </span><span class="m">02</span><span class="o">-</span><span class="n">clean.R</span><span class="w">
  </span><span class="m">03</span><span class="o">-</span><span class="n">plot.R</span><span class="w">
</span></code></pre></div>    </div>
  </li>
</ul>
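
<p>One benefit of ascending names is that the scripts can be run in order programmatically. A minimal sketch follows, assuming that all numbered scripts sit in one folder; <code class="language-plaintext highlighter-rouge">run_in_order()</code> is a hypothetical helper, not an existing function.</p>

```r
# Hypothetical helper: source every numbered script in `dir` in ascending order.
run_in_order <- function(dir = ".") {
    scripts <- list.files(dir, pattern = "^[0-9]+-.*\\.R$", full.names = TRUE)
    # sort() makes the order explicit; zero-padded prefixes (01, 02, ..., 10)
    # keep the alphabetical order identical to the numeric one.
    scripts <- sort(scripts)
    for (script in scripts) {
        source(script)
    }
    # Return the names of the scripts in the order they were run.
    invisible(basename(scripts))
}
```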

<p>If the file is used <strong>in a package</strong>, then slightly different rules should be followed:</p>

<ul>
  <li>
    <p>Mind special names:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">AllClasses.R</code> (or <code class="language-plaintext highlighter-rouge">AllClass.R</code>), a file that stores all S4 class definitions.</li>
      <li><code class="language-plaintext highlighter-rouge">AllGenerics.R</code> (or <code class="language-plaintext highlighter-rouge">AllGeneric.R</code>), a file that stores all S4 generic functions.</li>
      <li><code class="language-plaintext highlighter-rouge">zzz.R</code>, a file that contains <code class="language-plaintext highlighter-rouge">.onLoad()</code> and friends.</li>
    </ul>
  </li>
  <li>
    <p>If the file contains only one function, name it by the function name.</p>
  </li>
  <li>
    <p>Use <code class="language-plaintext highlighter-rouge">methods-</code> prefix for S4 class methods.</p>
  </li>
</ul>

<h3 id="naming-variables">Naming variables</h3>

<ul>
  <li>
    <p>Generally, names should be nouns that are as short as possible while still being meaningful.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">fit_rt</span><span class="w">
  </span><span class="n">split_1</span><span class="w">
  </span><span class="n">imdb_page</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">fit_regression_tree</span><span class="w">
  </span><span class="n">cross_validation_split_one</span><span class="w">
  </span><span class="n">foo</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Variable names should be typically lowercase.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">event</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">Event</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>NEVER separate words within the name by <code class="language-plaintext highlighter-rouge">.</code> (reserved for S3 dispatch) or use CamelCase (reserved for S4 class definitions). Instead, use an underscore (<code class="language-plaintext highlighter-rouge">_</code>).</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">event_window</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">event.window</span><span class="w">
  </span><span class="n">EventWindow</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>DO NOT use names of existing functions and variables (especially built-in ones).</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="nb">T</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="c1"># T is a shortcut of TRUE in R</span><span class="w">
  </span><span class="n">c</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"constant"</span><span class="w">
</span></code></pre></div>    </div>
  </li>
</ul>
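
<p>To illustrate why reusing built-in names is risky, here is a small sketch: <code class="language-plaintext highlighter-rouge">T</code> is an ordinary variable that merely holds <code class="language-plaintext highlighter-rouge">TRUE</code>, so an assignment silently changes what later code sees. The variable names <code class="language-plaintext highlighter-rouge">masked</code> and <code class="language-plaintext highlighter-rouge">restored</code> are for illustration only.</p>

```r
# Overwrite the built-in shortcut with a user-defined value.
T <- 0
masked <- isTRUE(T)      # the check now sees the user-defined value

# Removing the user-defined variable makes the built-in one visible again.
rm(T)
restored <- isTRUE(T)
```

<p>Spelling out <code class="language-plaintext highlighter-rouge">TRUE</code> and <code class="language-plaintext highlighter-rouge">FALSE</code> avoids the problem, since these are reserved words and cannot be reassigned.</p>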

<h3 id="naming-functions">Naming functions</h3>

<p>Many of the rules for naming variables also apply to naming functions:</p>

<ul>
  <li>
    <p>Generally, function names should be verbs.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">add</span><span class="p">()</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">addition</span><span class="p">()</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Use <code class="language-plaintext highlighter-rouge">.</code> ONLY for dispatching S3 generics.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">bw_test</span><span class="p">()</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">bw.test</span><span class="p">()</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Add the underscore (<code class="language-plaintext highlighter-rouge">_</code>) suffix to the standard evaluation (SE) equivalent of a function (<code class="language-plaintext highlighter-rouge">summarize</code> vs <code class="language-plaintext highlighter-rouge">summarize_</code>).</p>
  </li>
</ul>
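
<p>The reason the dot is reserved becomes clearer with a small S3 sketch: R finds a method by pasting the generic name and the class name together with a dot. The generic <code class="language-plaintext highlighter-rouge">describe()</code> and the class <code class="language-plaintext highlighter-rouge">event</code> are invented for this example.</p>

```r
# Hypothetical generic and class names, for illustration only.
describe <- function(x, ...) {
    UseMethod("describe")
}

# The dot separates the generic (`describe`) from the class (`event`).
describe.event <- function(x, ...) {
    paste("Event on", x$date)
}

# Fallback used for objects without a more specific method.
describe.default <- function(x, ...) {
    "Not an event"
}

announcement <- structure(list(date = "2019-01-14"), class = "event")
describe(announcement)
describe(42)
```

<p>A function named <code class="language-plaintext highlighter-rouge">bw.test()</code> could therefore be mistaken for a method of a generic <code class="language-plaintext highlighter-rouge">bw()</code> for class <code class="language-plaintext highlighter-rouge">test</code>.</p>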

<h3 id="naming-s4-classes">Naming S4 classes</h3>

<p>Class names should be nouns in CamelCase with an initial capital letter.</p>
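
<p>A minimal sketch of this convention, with a made-up class <code class="language-plaintext highlighter-rouge">EventWindow</code> and made-up slots: the class name is in CamelCase, while the object that holds an instance follows the lowercase, underscore-separated rule for variables.</p>

```r
# Hypothetical class, for illustration only.
methods::setClass(
    "EventWindow",
    slots = c(start = "numeric", end = "numeric")
)

# The variable name is lowercase with underscores; the class name is CamelCase.
event_window <- methods::new("EventWindow", start = -5, end = 5)
methods::is(event_window, "EventWindow")
```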

<h2 id="syntax">Syntax</h2>

<h3 id="line-length">Line length</h3>

<p>The maximum line length is limited to 80 characters (a legacy of the IBM punch card).</p>

<p>It is possible to display the margin in RStudio Source editor:</p>

<ul>
  <li>Go to Tools -&gt; Global Options… -&gt; Code -&gt; Display</li>
  <li>Click on “Show margin”</li>
  <li>Set “Margin column” to 80</li>
</ul>

<p><img src="https://irudnyts.github.io/images/posts/2019-01-14-r-coding-style-guide/length.png" alt="" /></p>
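
<p>Outside of RStudio, the same limit can be checked with a few lines of base R. The helper below is only a sketch; <code class="language-plaintext highlighter-rouge">find_long_lines()</code> is a hypothetical name.</p>

```r
# Hypothetical helper: return the numbers of the lines longer than `limit`.
find_long_lines <- function(path, limit = 80) {
    lines <- readLines(path, warn = FALSE)
    which(nchar(lines, type = "chars") > limit)
}
```

<p>Calling it on a script returns an integer vector of offending line numbers, which is empty when every line fits within the limit.</p>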

<h3 id="spacing">Spacing</h3>

<ul>
  <li>
    <p>Put spaces around all infix binary operators (<code class="language-plaintext highlighter-rouge">=</code>, <code class="language-plaintext highlighter-rouge">+</code>, <code class="language-plaintext highlighter-rouge">*</code>, <code class="language-plaintext highlighter-rouge">==</code>, <code class="language-plaintext highlighter-rouge">&amp;&amp;</code>, <code class="language-plaintext highlighter-rouge">&lt;-</code>, <code class="language-plaintext highlighter-rouge">%*%</code>, etc.).</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">x</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">y</span><span class="w">
  </span><span class="n">a</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">x</span><span class="o">==</span><span class="n">y</span><span class="w">
  </span><span class="n">a</span><span class="o">&lt;-</span><span class="n">a</span><span class="o">^</span><span class="m">2+1</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Put spaces around “=” in function calls (except for Bioconductor).</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Do NOT place spaces around the operators for subsetting (<code class="language-plaintext highlighter-rouge">$</code> and <code class="language-plaintext highlighter-rouge">@</code>), namespace manipulation (<code class="language-plaintext highlighter-rouge">::</code> and <code class="language-plaintext highlighter-rouge">:::</code>), and sequence generation (<code class="language-plaintext highlighter-rouge">:</code>).</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">car</span><span class="o">$</span><span class="n">cyl</span><span class="w">
  </span><span class="n">dplyr</span><span class="o">::</span><span class="n">select</span><span class="w">
  </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">car</span><span class="w"> </span><span class="o">$</span><span class="n">cyl</span><span class="w">
  </span><span class="n">dplyr</span><span class="o">::</span><span class="w"> </span><span class="n">select</span><span class="w">
  </span><span class="m">1</span><span class="o">:</span><span class="w"> </span><span class="m">10</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Put a space after a comma.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">mtcars</span><span class="p">[,</span><span class="w"> </span><span class="s2">"cyl"</span><span class="p">]</span><span class="w">
  </span><span class="n">mtcars</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
  </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">mtcars</span><span class="p">[,</span><span class="s2">"cyl"</span><span class="p">]</span><span class="w">
  </span><span class="n">mtcars</span><span class="p">[</span><span class="m">1</span><span class="w"> </span><span class="p">,]</span><span class="w">
  </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Use a space before a left parenthesis, except in a function call.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">element</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">element_list</span><span class="p">)</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">grade</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">5.5</span><span class="p">)</span><span class="w">
  </span><span class="nf">sum</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="k">for</span><span class="p">(</span><span class="n">element</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">element_list</span><span class="p">)</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">grade</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">5.5</span><span class="p">)</span><span class="w">
  </span><span class="n">sum</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>No spacing around code in parentheses or square brackets (unless a comma is involved, in which case the comma rule above applies).</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">debug</span><span class="p">)</span><span class="w"> </span><span class="n">message</span><span class="p">(</span><span class="s2">"debug mode"</span><span class="p">)</span><span class="w">
  </span><span class="n">species</span><span class="p">[</span><span class="s2">"tiger"</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="w"> </span><span class="n">debug</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="n">message</span><span class="p">(</span><span class="s2">"debug mode"</span><span class="p">)</span><span class="w">
  </span><span class="n">species</span><span class="p">[</span><span class="w"> </span><span class="s2">"tiger"</span><span class="w"> </span><span class="p">,]</span><span class="w">
</span></code></pre></div>    </div>
  </li>
</ul>
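
<p>A quick way to convince yourself that these spacing rules exist for human readers only: the parser discards whitespace, so a “Good” and a “Bad” line produce exactly the same call. A minimal sketch in base R (the variable names are made up for illustration):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Parse the well-spaced and the badly spaced version of the same call.
good &lt;- parse(text = "mean(x = c(1, NA, 2), na.rm = TRUE)", keep.source = FALSE)
bad &lt;- parse(text = "mean(x = c(1, NA, 2),na.rm = TRUE)", keep.source = FALSE)

# Whitespace is not part of the parsed expression, so both are the same call.
same_call &lt;- identical(good, bad)

# Consequently, both evaluate to the same value.
same_value &lt;- identical(eval(good[[1]]), eval(bad[[1]]))

print(c(same_call = same_call, same_value = same_value))
</code></pre></div></div>

<p>Both comparisons should come back <code class="language-plaintext highlighter-rouge">TRUE</code>: the style only changes how easily a person can read the line.</p>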

<h3 id="curly-braces">Curly braces</h3>

<ul>
  <li>
    <p>An opening curly brace should NEVER go on its own line and should always be followed by a new line.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something else</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w">
  </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># do something }</span><span class="w">
  </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># do something else }</span><span class="w">

</span></code></pre></div>    </div>
  </li>
  <li>
    <p>A closing curly brace should always go on its own line, unless it’s followed by <code class="language-plaintext highlighter-rouge">else</code>.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something else</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something else</span><span class="w">
  </span><span class="p">}</span><span class="w">

</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Always indent the code inside curly braces (see next section).</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="c1"># do something</span><span class="w">
      </span><span class="c1"># and then something else</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="c1"># do something</span><span class="w">
  </span><span class="c1"># and then something else</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Curly braces and new lines can be omitted if the statement after <code class="language-plaintext highlighter-rouge">if</code> is very short.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">is_used</span><span class="p">)</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="n">rval</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
</ul>
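
<p>The rule about <code class="language-plaintext highlighter-rouge">else</code> is more than a matter of taste. At the top level of a script (outside any enclosing braces), R considers the <code class="language-plaintext highlighter-rouge">if</code> statement complete once the line ends after the closing brace, so an <code class="language-plaintext highlighter-rouge">else</code> on the next line is a syntax error. A small sketch in base R, where <code class="language-plaintext highlighter-rouge">parses()</code> is a hypothetical helper written only for this illustration:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: TRUE if `code` is syntactically valid R, FALSE otherwise.
parses &lt;- function(code) {
    tryCatch({
        parse(text = code)
        TRUE
    }, error = function(e) FALSE)
}

# `} else {` on one line.
good &lt;- "if (is_used) {\n    1\n} else {\n    2\n}"

# `else` on its own line, at the top level.
bad &lt;- "if (is_used) {\n    1\n}\nelse {\n    2\n}"

parses(good) # TRUE
parses(bad)  # FALSE: unexpected 'else'
</code></pre></div></div>

<p>Inside a function body or any other pair of braces the second form does parse, but keeping <code class="language-plaintext highlighter-rouge">} else {</code> on one line works everywhere.</p>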

<h3 id="indentation">Indentation</h3>

<p>ALWAYS indent your code!</p>

<ul>
  <li>
    <p>No tabs or mixes of tabs and spaces.</p>
  </li>
  <li>
    <p>There are two common indentation widths: two spaces (Hadley and others) and four spaces (Bioconductor). My own rule of thumb: I use four-space indentation for data analysis scripts, and two spaces when developing packages.</p>
  </li>
  <li>
    <p>Choose the number of spaces of indentation upfront and stick to it. Never mix different numbers of spaces in one project.</p>
  </li>
  <li>
    <p>To set the number of spaces in RStudio, go to Tools -&gt; Global Options… -&gt; Code -&gt; Editing. Check the following boxes: “Insert spaces for tab” (with “Tab width” equal to the chosen number), “Auto-indent code after paste”, and “Vertically align arguments in auto-indent”.</p>
  </li>
</ul>

<p><img src="https://irudnyts.github.io/images/posts/2019-01-14-r-coding-style-guide/indent.png" alt="" /></p>

<ul>
  <li>Magic shortcut: <code class="language-plaintext highlighter-rouge">Command+I</code> (<code class="language-plaintext highlighter-rouge">Ctrl+I</code> for Windows/Linux) will indent a selected chunk of code. Together with <code class="language-plaintext highlighter-rouge">Command+A</code> (select all) it is a very powerful tool, which saves time.</li>
</ul>

<p>Try a little exercise: paste the following code in your RStudio source editor, select it, and hit <code class="language-plaintext highlighter-rouge">Command+I</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="s2">"is even"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h3 id="new-line">New line</h3>

<ul>
  <li>
    <p>Very often a function definition does not fit on one line. In this case, the excess arguments should be moved to a new line and aligned with the first argument, right after the opening parenthesis.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="n">long_function_name</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">arg1</span><span class="p">,</span><span class="w"> </span><span class="n">arg2</span><span class="p">,</span><span class="w"> </span><span class="n">arg3</span><span class="p">,</span><span class="w"> </span><span class="n">arg4</span><span class="p">,</span><span class="w">
                                 </span><span class="n">long_argument_name1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>If the arguments span more than two lines, then each argument should be placed on a separate line.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="n">long_function_name</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">long_argument_name1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"value1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"value2"</span><span class="p">),</span><span class="w">
                                 </span><span class="n">long_argument_name2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
                                 </span><span class="n">long_argument_name3</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
                                 </span><span class="n">long_argument_name4</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>The same applies to a function call: if two lines are sufficient, the excess arguments should be aligned with the first argument, right after the opening parenthesis.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="n">plot</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">rpois</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">)),</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"h"</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w">
       </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"rpois(100, lambda = 5)"</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Otherwise, each argument can go on a separate line, starting on a new line after the opening parenthesis.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="nf">list</span><span class="p">(</span><span class="w">
      </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
      </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sd</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
      </span><span class="n">var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
      </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
      </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
      </span><span class="n">median</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>If the condition in an <code class="language-plaintext highlighter-rouge">if</code> statement spans several lines, then each line should end with a logical operator, NOT start with it.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">some_very_long_name_1</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
      </span><span class="n">some_very_long_name_2</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">||</span><span class="w">
      </span><span class="n">some_very_long_name_3</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">some_very_long_name_4</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">some_very_long_name_1</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="w">
      </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">some_very_long_name_2</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="w">
      </span><span class="o">||</span><span class="w"> </span><span class="n">some_very_long_name_3</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">some_very_long_name_4</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>

    <p>I know some people who are completely against it. See the next item for why I believe it is better.</p>
  </li>
  <li>
    <p>If a statement that contains operators spans several lines, then each line should end with an operator rather than begin with one. Sometimes, it makes sense to split a formula into meaningful chunks.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">normal_pdf</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nb">pi</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">d_sigma</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w">
      </span><span class="nf">exp</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">d_mean</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">d_sigma</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">normal_pdf</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nb">pi</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">d_sigma</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
      </span><span class="o">*</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">d_mean</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">d_sigma</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>

    <p>Not only is it ugly, it is also syntactically wrong. In the second case, R will treat these two lines as two distinct statements: the first line will assign the value of <code class="language-plaintext highlighter-rouge">1 / sqrt(2 * pi * d_sigma ^ 2)</code> to <code class="language-plaintext highlighter-rouge">normal_pdf</code>, and the second line will throw an error, since <code class="language-plaintext highlighter-rouge">*</code> is missing its left operand.</p>
  </li>
  <li>
    <p>Each grammar statement of <code class="language-plaintext highlighter-rouge">dplyr</code> (after <code class="language-plaintext highlighter-rouge">%&gt;%</code>) and <code class="language-plaintext highlighter-rouge">ggplot2</code> (after <code class="language-plaintext highlighter-rouge">+</code>) should start on a new line.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="n">mtcars</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">filter</span><span class="p">(</span><span class="n">cyl</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">group_by</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">summarize</span><span class="p">(</span><span class="n">avg_mpg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">mpg</span><span class="p">))</span><span class="w">

  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
      </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mpg</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">qsec</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">am</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
      </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mpg</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">qsec</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">am</span><span class="p">)))</span><span class="w">
</span></code></pre></div>    </div>
  </li>
</ul>
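
<p>The error described above is actually the lucky case. With operators that also have a unary form, such as <code class="language-plaintext highlighter-rouge">+</code> and <code class="language-plaintext highlighter-rouge">-</code>, a leading operator does not fail at all: R silently evaluates two separate statements. A minimal sketch with made-up variable names:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Operator at the end of the line: R knows the expression continues.
total_good &lt;- 1 +
    2

# Operator at the start of the line: these are two statements.
# `total_bad` gets 1, and `+ 2` is evaluated on its own and discarded.
total_bad &lt;- 1
    + 2

total_good # 3
total_bad  # 1
</code></pre></div></div>

<p>No error, no warning, just a wrong number, which is why I prefer to end lines with the operator.</p>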

<h2 id="comments">Comments</h2>

<ul>
  <li>
    <p>Comment your code. Always. Your collaborators and future-you will be very grateful. Comments start with <code class="language-plaintext highlighter-rouge">#</code> followed by a space and the text of the comment.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># This is a comment.</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Comments should explain the why, not the what. Comments should not replicate the code in plain language, but rather explain the overall intention of the command.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="c1"># define iterator</span><span class="w">
  </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="c1"># set i to 1</span><span class="w">
  </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Short comments can be placed on the same line as the code.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="n">plot</span><span class="p">(</span><span class="n">price</span><span class="p">,</span><span class="w"> </span><span class="n">weight</span><span class="p">)</span><span class="w"> </span><span class="c1"># plot a scatter chart of price and weight</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>To comment/uncomment a selected chunk, use <code class="language-plaintext highlighter-rouge">Command+Shift+C</code> (<code class="language-plaintext highlighter-rouge">Ctrl+Shift+C</code> for Windows/Linux).</p>
  </li>
  <li>
    <p>Use <code class="language-plaintext highlighter-rouge">roxygen2</code> comments (i.e., <code class="language-plaintext highlighter-rouge">#'</code>) to document functions when developing a package.</p>
  </li>
  <li>
    <p>It makes sense to split the source into logical chunks by <code class="language-plaintext highlighter-rouge">#</code> followed by <code class="language-plaintext highlighter-rouge">-</code> or <code class="language-plaintext highlighter-rouge">=</code>.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Read data</span><span class="w">
  </span><span class="c1">#---------------------------------------------------------------------------</span><span class="w">

  </span><span class="c1"># Tidy data</span><span class="w">
  </span><span class="c1">#---------------------------------------------------------------------------</span><span class="w">

</span></code></pre></div>    </div>
  </li>
</ul>
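
<p>As an illustration of the <code class="language-plaintext highlighter-rouge">roxygen2</code> item above, here is a sketch of a documented function. The function <code class="language-plaintext highlighter-rouge">raise_to_power()</code> and its arguments are hypothetical and serve only as an example; the tags themselves (<code class="language-plaintext highlighter-rouge">@param</code>, <code class="language-plaintext highlighter-rouge">@return</code>, <code class="language-plaintext highlighter-rouge">@export</code>, <code class="language-plaintext highlighter-rouge">@examples</code>) are standard <code class="language-plaintext highlighter-rouge">roxygen2</code> tags.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#' Raise a number to a power
#'
#' @param x A numeric vector.
#' @param power A single number used as the exponent.
#'
#' @return A numeric vector of the same length as `x`.
#' @export
#'
#' @examples
#' raise_to_power(2, power = 3)
raise_to_power &lt;- function(x, power = 2) {
    # `^` is vectorized, so this works for vectors as well as single numbers.
    x ^ power
}
</code></pre></div></div>

<p>To plain R these lines are ordinary comments, so the function runs as is; <code class="language-plaintext highlighter-rouge">roxygen2</code> turns them into help pages when the package documentation is built.</p>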

<h2 id="other-recommendations">Other recommendations</h2>

<ul>
  <li>
    <p>Use <code class="language-plaintext highlighter-rouge">&lt;-</code> for assignment, NOT <code class="language-plaintext highlighter-rouge">=</code>.</p>
  </li>
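  <li>Use <code class="language-plaintext highlighter-rouge">library()</code> instead of <code class="language-plaintext highlighter-rouge">require()</code>, unless it is a conscious choice: <code class="language-plaintext highlighter-rouge">library()</code> fails immediately when the package is missing, whereas <code class="language-plaintext highlighter-rouge">require()</code> only warns and returns <code class="language-plaintext highlighter-rouge">FALSE</code>. Package names should be character strings (avoid NSE, i.e., non-standard evaluation).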
    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">library</span><span class="p">(</span><span class="s2">"dplyr"</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">require</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>In a function call, arguments can be specified by position, complete name, or partial name. Never specify by partial name, and never put a named argument in front of positional ones.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="n">rnorm</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0.3</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">na</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="n">rnorm</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">0.3</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>While developing a package, specify arguments by name.</p>
  </li>
  <li>
    <p>The required (with no default value) arguments should be first, followed by optional arguments.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">raise_to_power</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">power</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.7</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">raise_to_power</span><span class="p">(</span><span class="n">power</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.7</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">...</code> argument should be placed either at the beginning or at the end.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">standardize</span><span class="p">(</span><span class="n">...</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="n">save_chart</span><span class="p">(</span><span class="n">chart</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">standardize</span><span class="p">(</span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">,</span><span class="w"> </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="n">save_chart</span><span class="p">(</span><span class="n">chart</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>A good practice is to set default values inside the function using the <code class="language-plaintext highlighter-rouge">NULL</code> idiom, and to avoid dependence between arguments:</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">histogram</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">is.null</span><span class="p">(</span><span class="n">bins</span><span class="p">))</span><span class="w"> </span><span class="n">bins</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">nclass.Sturges</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
      </span><span class="n">...</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">histogram</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nclass.Sturges</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="n">...</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Always validate arguments in a function.</p>
  </li>
  <li>
    <p>While developing a package, specify the namespace of each used function, unless it comes from the <code class="language-plaintext highlighter-rouge">base</code> package.</p>
  </li>
  <li>
    <p>Do NOT put more than one statement (command) per line. Do NOT use a semicolon to terminate a command.</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  # Good
  x &lt;- 1
  x &lt;- x + 1

  # Bad
  x &lt;- 1; x &lt;- x + 1
</code></pre></div>    </div>
  </li>
  <li>
    <p>Avoid using <code class="language-plaintext highlighter-rouge">setwd("/Users/irudnyts/path/that/only/I/have")</code>. Almost surely your collaborators will have different paths, which makes the project not portable. Instead, use the <code class="language-plaintext highlighter-rouge">here::here()</code> function from the <code class="language-plaintext highlighter-rouge">here</code> package.</p>
  </li>
  <li>Avoid using <code class="language-plaintext highlighter-rouge">rm(list = ls())</code>. This statement only deletes objects from the global environment and gives you an illusion of a fresh R start: attached packages and changed options stay as they are.</li>
</ul>
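
<p>As a rough illustration of the advice on validating arguments, here is a minimal sketch. The function <code class="language-plaintext highlighter-rouge">raise_to_power()</code> is hypothetical (only its name is borrowed from the examples above), and the particular checks are just one reasonable choice.</p>

```r
# Hypothetical function: the name is borrowed from the examples above,
# the body and the checks are assumptions made for illustration.
raise_to_power <- function(x, power = 2) {
  # Fail early with an informative message if the inputs are not usable.
  if (!is.numeric(x)) {
    stop("`x` must be numeric.", call. = FALSE)
  }
  if (!is.numeric(power) || length(power) != 1L || is.na(power)) {
    stop("`power` must be a single non-missing number.", call. = FALSE)
  }
  x^power
}

raise_to_power(1:3, power = 2)
```

<p>Checking the inputs at the top of the function keeps the error close to its cause, instead of letting a cryptic message surface somewhere deeper in the code.</p>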

<p>If you have read this far, you deserve a treat. In RStudio, there is a magic key combination <code class="language-plaintext highlighter-rouge">Command+Shift+A</code> that reformats the selected code: it adds spaces and fixes the indentation. Do not use it excessively though!</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="http://adv-r.had.co.nz/Style.html">Advanced R</a></li>
  <li><a href="https://google.github.io/styleguide/Rguide.xml">Google’s R Style Guide</a></li>
  <li><a href="http://bioconductor.org/developers/how-to/coding-style/">Bioconductor Coding Style</a></li>
  <li><a href="https://csgillespie.github.io/efficientR/coding-style.html">Efficient R programming</a></li>
  <li><a href="https://csgillespie.wordpress.com/2010/11/23/r-style-guide/">Colin Gillespie’s R style guide</a></li>
  <li><a href="https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf">The State of Naming Conventions in R</a></li>
  <li><a href="https://www.r-bloggers.com/consistent-naming-conventions-in-r/">Consistent naming conventions in R</a></li>
  <li><a href="https://www.tidyverse.org/articles/2017/12/workflow-vs-script/">Project-oriented workflow</a></li>
  <li>Picture is taken from <a href="https://www.facebook.com/Rmemes0/">R Memes For Statistical Fiends</a> Facebook page.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Language is a tool that allows human beings to interact and communicate with each other. The clearer we express ourselves, the better the idea is transferred from our mind to the other. The same applies to programming languages: concise, clear and consistent codes are easier to read and edit. It is especially important, if you have collaborators, which depend on your code. However, even if you don’t, keep in mind that at some point in time, you might come back to your code, for example, to fix an error. And if you did not follow consistently your coding style, reviewing your code can take much longer, than expected. In this context, taking care of your audience means to make your code as readable as possible.]]></summary></entry><entry><title type="html">📁 [archived] Project-oriented workflow</title><link href="http://irudnyts.github.io//project-oriented-workflow/" rel="alternate" type="text/html" title="📁 [archived] Project-oriented workflow" /><published>2019-01-07T00:00:00+00:00</published><updated>2019-01-07T00:00:00+00:00</updated><id>http://irudnyts.github.io//project-oriented-workflow</id><content type="html" xml:base="http://irudnyts.github.io//project-oriented-workflow/"><![CDATA[<p>Be honest with yourself, how many times have you wanted to restart an on-going project from scratch throwing away the current folder? Or how many times have you had to rename files and adjust folder structure to make your project simple and clear? Not to mention, all these thousands of versions of your scripts that are dangling around in your mail box. Tired of this? Then, get on board and read my comments on how to make your project <em>reproducible</em>, <em>portable</em>, and <em>self-contained</em>.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<h2 id="introduction">Introduction</h2>

<p>We start by building some intuition about these three key aspects rather than trying to grasp explicit technical definitions. In the data science context, <em>reproducibility</em> means that the whole analysis can be recreated (or repeated) from scratch: executing the scripts on the raw data must yield exactly the same results. It means, for instance, that if the analysis involves generating random numbers, then one has to set a seed (an initial state of the random number generator) to obtain, say, the same random split each time. Ideally, everyone should also have access to the data and software needed to replicate your analysis (this is not always the case, since data can be private), but this is already the domain of open science.</p>
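
<p>A minimal sketch of the point about seeds, assuming a made-up split of ten observations with seven of them going to a training set: resetting the same seed before sampling reproduces the same draw.</p>

```r
# Hypothetical example: a reproducible train/test split of 10 observations.
n <- 10

set.seed(42)                        # fix the state of the random generator
train_first <- sample(n, size = 7)

set.seed(42)                        # same seed, hence the same state
train_second <- sample(n, size = 7)

# Both draws select exactly the same observations.
identical(train_first, train_second)
```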

<p><em>Portability</em> means that, given minimal prerequisites, the project should work regardless of the operating system or the computer. For instance, if the project uses a particular package that works only on Windows, then it is not portable. The project is also not considered portable if it relies on machine-specific settings, such as absolute paths instead of paths relative to your project folder (e.g., when reading the data or saving plots to files). Normally, you should be able to run the code on your collaborator’s machine without changing a single line in the scripts.</p>
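
<p>To make the difference between absolute and relative paths concrete, here is a small sketch. The file name <code class="language-plaintext highlighter-rouge">calls.csv</code> is made up, and the code assumes that R is started from the project folder.</p>

```r
# Not portable: an absolute path that exists only on one machine.
# calls <- read.csv("/Users/irudnyts/path/that/only/I/have/calls.csv")

# Portable: a path relative to the project folder.
# The file name is hypothetical and used only for illustration.
raw_file <- file.path("data", "raw", "calls.csv")
raw_file

# calls <- read.csv(raw_file)
```

<p>Building the path with <code class="language-plaintext highlighter-rouge">file.path()</code> rather than pasting strings by hand also avoids mixing up separators.</p>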

<p>We call a project <em>self-contained</em> when you have everything you need at hand (i.e., in the folder of your project) and your project does not affect anything it did not create. It is a bad idea to use a function that is defined in another project of yours. Not only will anyone who does not have that second project suffer, but so will you, once the current project is moved to another machine. Furthermore, if you need, for instance, to save processed data, then it should be saved separately and must not overwrite the raw data. There is another term with a similar meaning, <em>isolated</em>, which relates to the dependencies of the project. This topic is covered in the section on the <strong>packrat</strong> dependency management system.</p>

<p>This post is an attempt to summarize the use of “sexy” tools and techniques that significantly improve the above-mentioned aspects of a project. Of course, one can immediately feel that these aspects are interrelated. As a consequence, the techniques and practices we consider below improve several aspects at a time, rather than focusing on a particular one. For instance, using a consistent folder structure will make your project reproducible and portable, while properly managed dependencies will ensure that the project is self-contained and portable. That is why the rest of the post is organized by tools rather than by stand-alone aspects. But do not be fooled, this is not yet another git / RStudio tutorial. There are dozens of tutorials, and I do not try to compete with them. Instead, I want to give an overview of useful things based entirely on my experience.</p>

<p>Now, you might ask yourself: why is it such a big deal? Well, first off, it gives more credibility to the research, because it can be verified and validated by a third party (your peers). Furthermore, keeping the flow of analysis reproducible, portable, and self-contained makes it easier to proceed and to extend the project. At first glance, it might look like you spend more time organizing your project than doing actual analysis. However, in the long run you will save much more time than you can anticipate.</p>

<blockquote>
  <p>It’s like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behavior a little, in the name of public safety.
<cite> Jenny Bryan </cite></p>
</blockquote>

<h2 id="version-control-system">Version control system</h2>

<p>If you are reading this post, I bet you have heard of (if not used) a version control system (VCS). It allows you to manage changes to files, especially the history of source code. Naming all advantages of a VCS would be a hard task, so I only wish to emphasize the main ones. First off, a VCS allows <em>storing the versions of files properly</em>. One can always revert to any previous version of any file of the project without keeping tons of copies of the same file. If you keep your project on a hosting service, then it also <em>backs up</em> your most important files. Furthermore, a distributed VCS makes it possible to <em>collaborate in a straightforward way</em>: your fellows have access to the latest version of any file of the project at any time. Let’s face the truth, sending files via email or Dropbox is too messy. It is not even dangerous to work on the same file at the same time, because a VCS can <em>merge</em> the changes afterwards. Finally, <em>branching</em> deserves a separate mention: it is the possibility to deviate from the main flow of the analysis by having an independent stream, which can be merged back afterwards. The most common VCS are <strong>git</strong>, SVN (Subversion), and Mercurial.</p>

<p><img src="https://irudnyts.github.io/images/posts/2019-01-07-project-oriented-workflow/names.jpg" alt="" /></p>

<p>Remember I mentioned that your collaborators always have access to files? Well, it is only true if your machine is plugged into a network. Surely that might not always be the case. To cope with this issue, hosting services are used, such as GitHub (works with git and SVN), GitLab (git), Bitbucket (git and Mercurial), SourceForge (git, SVN), etc. These guys host your repository (repo for short, a folder with all your project files), making it possible to share and publish it. While most VCS are command line tools, hosting services provide very convenient web-based interfaces in addition to their own sweet features.</p>

<p>A long time ago, when people mostly used Emacs and Eclipse as IDEs for R, SourceForge in conjunction with SVN dominated. Nowadays, most R projects are hosted on GitHub and use git. GitHub has many nice features, like Issues (that can be used for bug tracking, to-do lists, etc.), Pull requests, an integration with the Slack messenger, etc. Also, GitHub is very easy to integrate with RStudio.</p>

<p>There are quite a number of tutorials on this topic. I personally find <a href="http://r-pkgs.had.co.nz/git.html">Hadley’s chapter in R packages</a> a very concise yet explicit cookbook. It covers the main skills you might need, e.g., how to write good commits.</p>

<p>Speaking about <em>git</em>, it is hard to avoid the topic of collaboration workflows. In a nutshell, there are three main workflows: the centralized workflow, the feature branch workflow, and the forking workflow. Typically, a research project involves only a small number of collaborators who trust each other. If that’s the case, it makes sense to employ the centralized one, in which everyone pushes to the central <code class="language-plaintext highlighter-rouge">master</code> branch. For the details of the other workflows, please see <a href="https://www.atlassian.com/git/tutorials/comparing-workflows#centralized-workflow">the Bitbucket tutorial</a>.</p>

<p>Last but not least, it is very important to master <em>git</em> commands and use them via <code class="language-plaintext highlighter-rouge">shell</code>. For simple commands one can still use the built-in RStudio git interface. However, once you are ready to use git extensively, <code class="language-plaintext highlighter-rouge">shell</code> becomes essential.</p>

<h2 id="dependency-management-tool">Dependency management tool</h2>

<p>It is very likely that your data science project depends on non-base R packages. R provides a very convenient way of installing packages via <code class="language-plaintext highlighter-rouge">install.packages()</code>, which by default stores all packages in one global library. In most cases this is more than enough. However, sometimes different projects depend on different versions of the same package. For instance, the first project uses a function that has been deprecated in the current version. At the same time, the second project uses a function that appears only in a recent version of the package. A good example of such a package is <code class="language-plaintext highlighter-rouge">ggplot2</code>, which has evolved significantly over time and many functions of which have been deprecated.</p>

<p>The solution to this problem is to store the packages of specific versions in the local folder of the project, so that each project has its own <em>private package library</em>. If you have previously used Python or Ruby, similar tools are virtual environments and Bundler, respectively. In R we have several tools, such as <code class="language-plaintext highlighter-rouge">packrat</code>, <code class="language-plaintext highlighter-rouge">jetpack</code>, and <a href="https://community.rstudio.com/t/does-r-provides-a-dependency-management-tool/13378/2">others</a>. The <code class="language-plaintext highlighter-rouge">packrat</code> package is more common and stable, and below I briefly show how to use it.</p>

<p>Storing required packages in the folder of the project ensures the project is self-contained (meaning everything that the project needs is inside its folder), portable (you can move your project to another machine without worrying too much about dependencies), and reproducible (the same versions of packages yield the same result). On the official <code class="language-plaintext highlighter-rouge">packrat</code> web page, the term <em>self-contained</em> is replaced by <em>isolated</em>: indeed, this package manager not only makes sure that everything is at hand, but also ensures that things won’t be overwritten and that other projects won’t be affected (for instance, by installing a newer version of a dependency).</p>

<p>Installation of <code class="language-plaintext highlighter-rouge">packrat</code> is effortless: <code class="language-plaintext highlighter-rouge">install.packages("packrat")</code> should do the trick (on macOS it might require Command Line Tools to be installed first). There are two ways to start using <code class="language-plaintext highlighter-rouge">packrat</code>: if you use RStudio, simply initialize a new project with <code class="language-plaintext highlighter-rouge">packrat</code> as shown below, or call <code class="language-plaintext highlighter-rouge">packrat::init()</code> in an existing project (mind the argument <code class="language-plaintext highlighter-rouge">project</code>, which by default is the working directory of R).</p>

<p><img src="https://irudnyts.github.io/images/posts/2019-01-07-project-oriented-workflow/packrat1.png" alt="" /></p>

<p>From now on all your packages will be stored in the <code class="language-plaintext highlighter-rouge">packrat</code> folder. You should not modify anything by hand in this directory. I am not going to go over each component of this folder (one can read about it <a href="https://rstudio.github.io/packrat/walkthrough.html">here</a>), but several of them are worth mentioning. The files <code class="language-plaintext highlighter-rouge">packrat/packrat.lock</code> and <code class="language-plaintext highlighter-rouge">packrat/packrat.opts</code> contain the list of dependencies and the options of the tool, respectively. Then, <code class="language-plaintext highlighter-rouge">packrat/lib/</code> is the place where your <em>installed</em> packages live, the actual private package library. Finally, the sources of your bundled packages are located in <code class="language-plaintext highlighter-rouge">packrat/src/</code>.</p>

<p>There are two ways to configure <code class="language-plaintext highlighter-rouge">packrat</code>: either with <code class="language-plaintext highlighter-rouge">packrat::set_opts()</code> or via RStudio (Tools -&gt; Project Options… -&gt; Packrat). Both methods modify the <code class="language-plaintext highlighter-rouge">packrat/packrat.opts</code> file. We make only one change to the default options: check <em>Automatically snapshot local changes</em> in RStudio or call <code class="language-plaintext highlighter-rouge">packrat::set_opts(auto.snapshot = TRUE)</code>. We also leave <em>Git ignore packrat library</em> and <em>Git ignore packrat sources</em> as they are, that is, checked and unchecked, respectively. Installed packages in <code class="language-plaintext highlighter-rouge">packrat/lib/</code> are platform-specific, so carrying them over to other platforms does not make any sense. At the same time, they can be reinstalled from the bundled packages in <code class="language-plaintext highlighter-rouge">packrat/src/</code>, which are transferred together with the other files of the project.</p>

<p><img src="https://irudnyts.github.io/images/posts/2019-01-07-project-oriented-workflow/packrat2.png" alt="" /></p>

<p>The workflow of installing, removing, and updating packages is the same as in normal R, that is, via <code class="language-plaintext highlighter-rouge">install.packages()</code>, etc. Since we set <code class="language-plaintext highlighter-rouge">auto.snapshot</code> to <code class="language-plaintext highlighter-rouge">TRUE</code>, you do not need to take a snapshot each time with <code class="language-plaintext highlighter-rouge">packrat::snapshot()</code>: <code class="language-plaintext highlighter-rouge">packrat</code> will do it for you automatically.</p>

<p>The most amazing thing about <code class="language-plaintext highlighter-rouge">packrat</code> is that if you move the project to another computer, all you need to do is start R from the project directory: <code class="language-plaintext highlighter-rouge">packrat</code> will set up the private package library automatically.</p>
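
<p>For those who prefer the console to the RStudio dialogs, the setup described in this section could be wrapped roughly as follows. The helper <code class="language-plaintext highlighter-rouge">setup_packrat()</code> is hypothetical and only chains the two <code class="language-plaintext highlighter-rouge">packrat</code> calls mentioned above; it assumes that <code class="language-plaintext highlighter-rouge">packrat</code> is already installed.</p>

```r
# Hypothetical helper (not part of packrat): initializes packrat for the
# project in `project_dir` and switches on automatic snapshots.
setup_packrat <- function(project_dir = ".") {
  if (!requireNamespace("packrat", quietly = TRUE)) {
    stop("Package 'packrat' is not installed.", call. = FALSE)
  }
  # Create the private package library inside the project folder.
  packrat::init(project = project_dir)
  # Same effect as ticking "Automatically snapshot local changes" in RStudio.
  packrat::set_opts(auto.snapshot = TRUE, project = project_dir)
  invisible(project_dir)
}

# Intended to be called once, from the console, in the project folder:
# setup_packrat()
```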

<h2 id="project-folder-structure">Project folder structure</h2>

<p>A project tends to grow much faster than expected. A project that started as a harmless code snippet can easily snowball into hundreds of files with an unstructured folder tree. To avoid this, it is important to define the folder structure before stepping into the analysis. Depending on whether the project is a package or a case study, it should have a significantly different skeleton.</p>

<p>The folder structure of R packages is regulated by the community (CRAN and Bioconductor). It is well-defined and described in the <a href="http://r-pkgs.had.co.nz/package.html">R packages book</a>; therefore, I skip it in this post.</p>

<p>As opposed to R packages, there is no single right folder structure for analysis projects. Below, I present a simple yet extensible folder structure for a data analysis project, based on several references that cover this issue.</p>

<p>The parent folder that will contain all of the project’s subfolders should have the same name as your project. Pick a good one. Spending an extra 5 minutes will save you from regrets in the future. The name should be short, concise, written in lower case, and free of special symbols. One can apply similar <a href="http://r-pkgs.had.co.nz/package.html">strategies</a> as for naming packages.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name_of_project/
|-  data
|   |-  raw
|   |-  processed
|-  figures
|-  packrat
|-  reports
|-  results
|-  scripts
|   |- deprecated
|-  .gitignore
|-  name_of_project.Rproj
|-  README.md
</code></pre></div></div>
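
<p>If you would rather not create these folders by hand, a few lines of base R can do it. The helper <code class="language-plaintext highlighter-rouge">create_skeleton()</code> below is a hypothetical sketch; it deliberately skips <code class="language-plaintext highlighter-rouge">packrat</code>, <code class="language-plaintext highlighter-rouge">.gitignore</code>, and the <code class="language-plaintext highlighter-rouge">.Rproj</code> file, on the assumption that these are set up by the corresponding tools.</p>

```r
# Hypothetical helper that creates the folder skeleton shown above.
create_skeleton <- function(root) {
  folders <- c(
    file.path("data", "raw"),
    file.path("data", "processed"),
    "figures",
    "reports",
    "results",
    file.path("scripts", "deprecated")
  )
  for (folder in folders) {
    # recursive = TRUE also creates the parent folders (e.g., data/).
    dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
  }
  file.create(file.path(root, "README.md"))
  invisible(root)
}

# Try it in a temporary directory so that nothing outside of it is touched.
project <- file.path(tempdir(), "name_of_project")
create_skeleton(project)
list.files(project, recursive = TRUE, include.dirs = TRUE)
```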

<p>The folder <code class="language-plaintext highlighter-rouge">data</code> typically contains two subfolders, namely, <code class="language-plaintext highlighter-rouge">raw</code> and <code class="language-plaintext highlighter-rouge">processed</code>. The <code class="language-plaintext highlighter-rouge">raw</code> directory holds data files of any kind, such as <code class="language-plaintext highlighter-rouge">.csv</code>, SAS, Excel, text and database files, etc. The content of this folder is <em>read only</em>, so no script should change the original files or create new ones inside it. For this purpose the <code class="language-plaintext highlighter-rouge">processed</code> directory is used: all processed, cleaned, and tidied datasets are saved here. It is a good practice to save files in an R-specific format rather than in <code class="language-plaintext highlighter-rouge">.csv</code>, since <code class="language-plaintext highlighter-rouge">.csv</code> is a less efficient way of storing data (both in terms of space and of reading/writing time). Preference is given to <code class="language-plaintext highlighter-rouge">.rds</code> files over <code class="language-plaintext highlighter-rouge">.RData</code> (see why in the Content of R files section). Again, files should have representative names (<code class="language-plaintext highlighter-rouge">merged_calls.rds</code> vs <code class="language-plaintext highlighter-rouge">dataset_1.rds</code>). Note that it should be possible to regenerate those datasets from the raw data. In other words, if you remove all files from this folder, it must be possible to restore all of them by executing your scripts that use only the data from the <code class="language-plaintext highlighter-rouge">raw</code> directory.</p>
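
<p>Saving a processed dataset as <code class="language-plaintext highlighter-rouge">.rds</code> and reading it back could look like the sketch below. The dataset is made up, and a temporary folder stands in for <code class="language-plaintext highlighter-rouge">data/processed</code> so that the snippet can be run anywhere.</p>

```r
# A temporary folder stands in for data/processed in this sketch.
processed_dir <- file.path(tempdir(), "data", "processed")
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical processed dataset.
merged_calls <- data.frame(id = 1:3, duration = c(12.5, 3.0, 47.1))

rds_file <- file.path(processed_dir, "merged_calls.rds")
saveRDS(merged_calls, rds_file)

# readRDS() returns the object, so its name is chosen when reading.
calls <- readRDS(rds_file)
all.equal(calls, merged_calls)
```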

<p>The folder <code class="language-plaintext highlighter-rouge">figures</code> is the place where you may store plots, diagrams, and other figures. There is not much to say about it. Common extensions of such files are <code class="language-plaintext highlighter-rouge">.eps</code>, <code class="language-plaintext highlighter-rouge">.png</code>, <code class="language-plaintext highlighter-rouge">.pdf</code>, etc. Again, file names in this folder should be meaningful (the name <code class="language-plaintext highlighter-rouge">img1.png</code> does not represent anything).</p>

<p>All reports live in a directory with the corresponding name <code class="language-plaintext highlighter-rouge">reports</code>. These reports can be of any format, such as LaTeX, Markdown, R Markdown, Jupyter Notebooks, etc. Currently, more and more people prefer rich documents that combine text and executable code to LaTeX and the like.</p>

<p>Not all output objects of the analysis are data files. For example, you have calibrated and fitted your deep learning network to the data, which took about an hour. Of course, it would be painful to retrain the model each time you run the script, so you want to save this model. Then, it is reasonable to save it in <code class="language-plaintext highlighter-rouge">results</code> with the <code class="language-plaintext highlighter-rouge">.rds</code> extension.</p>
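
<p>One possible way to avoid retraining is to fit the model only when no saved copy exists yet. In the sketch below, a linear model on a built-in dataset stands in for the slow training step, and the file name <code class="language-plaintext highlighter-rouge">fitted_model.rds</code> as well as the temporary folder are assumptions made for illustration.</p>

```r
# A temporary folder stands in for results/ in this sketch.
results_dir <- file.path(tempdir(), "results")
dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)
model_file <- file.path(results_dir, "fitted_model.rds")

if (file.exists(model_file)) {
  # Reuse the model fitted in a previous run.
  model <- readRDS(model_file)
} else {
  # A linear model on a built-in dataset stands in for the slow training step.
  model <- lm(mpg ~ wt, data = mtcars)
  saveRDS(model, model_file)
}

coef(model)
```

<p>Keep in mind that such a cached file has to be deleted (or the check adapted) whenever the data or the model specification changes, otherwise a stale model is silently reused.</p>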

<p>Perhaps the most important folder is <code class="language-plaintext highlighter-rouge">scripts</code>. There you keep all your R scripts and code. This is exactly the place to use numeric prefixes if the files should be run in a particular order. If you have files in other scripting languages (e.g., Python), it is better to keep them in this folder as well. There is also an important subfolder called <code class="language-plaintext highlighter-rouge">deprecated</code>. Whenever you want to remove a script, it is a good idea to move it to <code class="language-plaintext highlighter-rouge">deprecated</code> first, and to delete it only later. The script you want to remove may contain functions or analyses used by other collaborators. Moving it to <code class="language-plaintext highlighter-rouge">deprecated</code> first gives everyone a chance to notice before the file is gone for good. It is not required, of course, because git keeps all versions, and it is always possible to revert. But from my experience, it is highly convenient.</p>

<p>There are three important files in the project folder: <code class="language-plaintext highlighter-rouge">.gitignore</code>, <code class="language-plaintext highlighter-rouge">name_of_project.Rproj</code>, and <code class="language-plaintext highlighter-rouge">README.md</code>. The file <code class="language-plaintext highlighter-rouge">.gitignore</code> lists files that will not be tracked by git: LaTeX or C build artifacts, system files, very large files, or files generated for particular cases (e.g., <code class="language-plaintext highlighter-rouge">packrat/lib</code>). The <code class="language-plaintext highlighter-rouge">name_of_project.Rproj</code> file contains options and meta-data of the project: encoding, the number of spaces used for indentation, whether or not to restore the workspace at launch, etc. The <code class="language-plaintext highlighter-rouge">README.md</code> file briefly describes all high-level information about the project.</p>

<p>The proposed folder structure is far from being exhaustive. You might need to introduce other folders and files, such as <code class="language-plaintext highlighter-rouge">paper</code> (where the <code class="language-plaintext highlighter-rouge">.tex</code> version of a paper lives), <code class="language-plaintext highlighter-rouge">sources</code> (a place for your compiled code, e.g., C++), <code class="language-plaintext highlighter-rouge">references</code>, <code class="language-plaintext highlighter-rouge">presentations</code>, <code class="language-plaintext highlighter-rouge">NEWS.md</code>, <code class="language-plaintext highlighter-rouge">TODO.md</code>, etc. At the same time, keeping empty folders can be misleading, and it is better to remove them (unless you are planning to store something in them in the future). Moreover, git does not track empty folders.</p>

<p>Several R packages, namely <a href="http://projecttemplate.net/architecture.html"><code class="language-plaintext highlighter-rouge">ProjectTemplate</code></a>, <a href="https://github.com/Pakillo/template"><code class="language-plaintext highlighter-rouge">template</code></a>, and  <a href="https://github.com/cboettig/template"><code class="language-plaintext highlighter-rouge">template</code></a> are dedicated to project structures. It is also possible to construct a project tree by forking <a href="https://github.com/jhollist/manuscriptPackage">manuscriptPackage</a> or <a href="http://www.statsravingmad.com/measure/sample-r-project-structure/">sample-r-project</a> repos. Using a package or forking a repo allows automated structure generation, but at the same time introduces many redundant and unnecessary folders and files.</p>

<p><img src="https://irudnyts.github.io/images/posts/2019-01-07-project-oriented-workflow/standards.png" alt="" /></p>

<p>Finally, some scientists believe that all R projects should take the shape of a package. Indeed, one can store data in <code class="language-plaintext highlighter-rouge">data/</code>, R scripts in <code class="language-plaintext highlighter-rouge">R/</code>, documentation in <code class="language-plaintext highlighter-rouge">man/</code>, and the paper in <code class="language-plaintext highlighter-rouge">vignettes/</code>. The nice thing about it is that anyone familiar with the R package structure can immediately grasp where each type of file is located. On the other hand, the structure of R packages is tailored to serve its purpose, which is to make a coherent <em>tool</em> for data scientists, not to produce a data product: there is no distinction between function definitions and applications, no proper place for reports, and finally there is no place for other scripting languages that you may use (e.g., Bash, Python, etc.).</p>

<h2 id="content-of-r-files">Content of R files</h2>

<p>While there are no rules on how to organize your R code, there are several dos and don’ts that most of the time are not taught explicitly. I list them below in no particular order:</p>

<ul>
  <li>
    <p>Do not use the function <code class="language-plaintext highlighter-rouge">install.packages()</code> inside your scripts. You are not supposed to (re)install packages each time you run your files. By default it is assumed that all packages that are used by a script are already installed. If you use <code class="language-plaintext highlighter-rouge">packrat</code>, packages will be installed automatically from bundles.</p>

    <p>If there are many packages to install and you do not use <code class="language-plaintext highlighter-rouge">packrat</code>, I suggest creating a file <code class="language-plaintext highlighter-rouge">configure.R</code> that installs all packages:</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="n">pkgs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"plyr"</span><span class="p">)</span><span class="w">
  </span><span class="n">install.packages</span><span class="p">(</span><span class="n">pkgs</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>

    <p>The snippet above takes advantage of the fact that <code class="language-plaintext highlighter-rouge">install.packages()</code> is a vectorized function. In any case, most of the time <code class="language-plaintext highlighter-rouge">install.packages()</code> is supposed to be called from the console, not from a script.</p>
  </li>
  <li>
    <p>Do not use the function <code class="language-plaintext highlighter-rouge">require()</code>, unless it is a conscious choice. In contrast to <code class="language-plaintext highlighter-rouge">library()</code>, <code class="language-plaintext highlighter-rouge">require()</code> does not throw an error (only a warning) if the package is not installed.</p>
  </li>
  <li>
    <p>Use a character representation of the package name.</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Load <em>only</em> those packages that are actually used in the script. Load packages at the beginning of the script.</p>
  </li>
  <li>
    <p>Do not use <code class="language-plaintext highlighter-rouge">rm(list = ls())</code>, which erases your global environment. First, it could accidentally delete an important object that takes a long time to build. Second, it gives the illusion of a fresh start of R: attached packages, options, and the working directory stay as they were.</p>
  </li>
  <li>
    <p>Do not use <code class="language-plaintext highlighter-rouge">setwd("/Users/irudnyts/path/that/only/I/have")</code>. It is very unlikely that anyone except you will have the same path to the project. Instead, use the <code class="language-plaintext highlighter-rouge">here</code> package and relative paths. The package <code class="language-plaintext highlighter-rouge">here</code> automatically recognizes the path to the project and builds paths starting from there:</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">library</span><span class="p">(</span><span class="s2">"here"</span><span class="p">)</span><span class="w">

  </span><span class="n">cars</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">here</span><span class="p">(</span><span class="s2">"data"</span><span class="p">,</span><span class="w"> </span><span class="s2">"raw"</span><span class="p">,</span><span class="w"> </span><span class="s2">"cars.csv"</span><span class="p">))</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">setwd</span><span class="p">(</span><span class="s2">"/Users/irudnyts/path/that/only/I/have/data/raw"</span><span class="p">)</span><span class="w">
  </span><span class="n">cars</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cars.csv"</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>If your script involves random number generation, then set a seed with the <code class="language-plaintext highlighter-rouge">set.seed()</code> function to get the same random numbers each time:</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Good</span><span class="w">
  </span><span class="n">set.seed</span><span class="p">(</span><span class="m">1991</span><span class="p">)</span><span class="w">
  </span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">)</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Do not repeat yourself (<em>DRY</em>). In the R context it means the following: if a piece of code is repeated more than two times, you had better wrap it into a function (the example is borrowed from Advanced R):</p>

    <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Better</span><span class="w">
  </span><span class="n">fix_missing</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="n">x</span><span class="p">[</span><span class="n">x</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">-99</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
      </span><span class="n">x</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="n">df</span><span class="p">[]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">fix_missing</span><span class="p">)</span><span class="w">

  </span><span class="c1"># Bad</span><span class="w">
  </span><span class="n">df</span><span class="o">$</span><span class="n">a</span><span class="p">[</span><span class="n">df</span><span class="o">$</span><span class="n">a</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">-99</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
  </span><span class="n">df</span><span class="o">$</span><span class="n">b</span><span class="p">[</span><span class="n">df</span><span class="o">$</span><span class="n">b</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">-99</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
  </span><span class="n">df</span><span class="o">$</span><span class="n">c</span><span class="p">[</span><span class="n">df</span><span class="o">$</span><span class="n">c</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">-98</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
  </span><span class="n">df</span><span class="o">$</span><span class="n">d</span><span class="p">[</span><span class="n">df</span><span class="o">$</span><span class="n">d</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">-99</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
  </span><span class="n">df</span><span class="o">$</span><span class="n">e</span><span class="p">[</span><span class="n">df</span><span class="o">$</span><span class="n">e</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">-99</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
  </span><span class="n">df</span><span class="o">$</span><span class="n">f</span><span class="p">[</span><span class="n">df</span><span class="o">$</span><span class="n">g</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">-99</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
</span></code></pre></div>    </div>
  </li>
  <li>
    <p>Separate function definitions from their applications. I typically keep a file <code class="language-plaintext highlighter-rouge">util.R</code>, where all my functions are defined.</p>
  </li>
  <li>
    <p>Use <code class="language-plaintext highlighter-rouge">saveRDS()</code> instead of <code class="language-plaintext highlighter-rouge">save()</code>:</p>
  </li>
</ul>

<blockquote>
  <ul>
    <li><code class="language-plaintext highlighter-rouge">save()</code> saves the objects and their names together in the same file; <code class="language-plaintext highlighter-rouge">saveRDS()</code> only saves the value of a single object (its name is dropped).</li>
    <li><code class="language-plaintext highlighter-rouge">load()</code> loads the file saved by <code class="language-plaintext highlighter-rouge">save()</code>, and creates the objects with the saved names silently (if you happen to have objects in your current environment with the same names, these objects will be overridden); <code class="language-plaintext highlighter-rouge">readRDS()</code> only loads the value, and you have to assign the value to a variable.
<cite> Yihui Xie </cite></li>
  </ul>
</blockquote>
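<p>The difference described in the quote can be seen in a minimal sketch. The object <code class="language-plaintext highlighter-rouge">model</code> and the temporary file paths are hypothetical and serve only as an illustration:</p>

```r
# Hypothetical object and temporary files, used only for illustration
model <- list(coef = c(1.5, -0.3), n = 100L)
rds_path <- tempfile(fileext = ".rds")
rdata_path <- tempfile(fileext = ".RData")

# saveRDS() stores the value only; readRDS() returns it,
# so the caller decides under which name it lives
saveRDS(model, file = rds_path)
fit <- readRDS(rds_path)

# save() stores the value together with its name; load() recreates
# the object under that name and silently overwrites an existing one
save(model, file = rdata_path)
model <- "something else"
restored <- load(rdata_path)   # returns the names of the restored objects
```

<p>After the last line, <code class="language-plaintext highlighter-rouge">model</code> is the original list again, although nothing in the script assigned to it explicitly. With <code class="language-plaintext highlighter-rouge">readRDS()</code> every assignment stays visible in the code.</p>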

<h2 id="initializing-a-new-data-analysis-project-in-rstudio-and-getting-your-things-together">Initializing a new data analysis project in RStudio and getting your things together</h2>

<p>Prerequisites:</p>

<ul>
  <li>Installed and configured git</li>
  <li>Installed R and RStudio</li>
  <li>Existing account in GitHub</li>
  <li>Installed and configured <code class="language-plaintext highlighter-rouge">packrat</code></li>
</ul>

<p>Steps:</p>

<ol>
  <li>
    <p>Pick a good name (e.g., <code class="language-plaintext highlighter-rouge">beer</code>).</p>
  </li>
  <li>
    <p>In RStudio create a project:</p>

    <ul>
      <li>Navigate to File -&gt; New project…</li>
      <li>Select New Directory</li>
      <li>Select New project</li>
      <li>Insert your picked name into Directory name</li>
      <li>Check Create a git repository and Use packrat with this project</li>
    </ul>

    <p>This creates a folder with the name of the project, initializes a git repo, generates an <code class="language-plaintext highlighter-rouge">.Rproj</code> file, initializes <code class="language-plaintext highlighter-rouge">packrat</code>, and creates a <code class="language-plaintext highlighter-rouge">.gitignore</code> file.</p>
  </li>
  <li>
    <p>Configure <code class="language-plaintext highlighter-rouge">packrat</code> as described above.</p>
  </li>
  <li>
    <p>Populate folders with files. Typically, at the beginning, it is only <code class="language-plaintext highlighter-rouge">data/raw</code>.</p>
  </li>
  <li>
    <p>Create a <code class="language-plaintext highlighter-rouge">README.md</code> file.</p>
  </li>
  <li>
    <p>Launch <code class="language-plaintext highlighter-rouge">Terminal</code> and navigate your working directory (of <code class="language-plaintext highlighter-rouge">Terminal</code>, not <code class="language-plaintext highlighter-rouge">R</code>) to your project folder by, for instance, <code class="language-plaintext highlighter-rouge">cd /Users/irudnyts/Documents/projects/beer</code>.</p>
  </li>
  <li>
    <p>Stage the changes with <code class="language-plaintext highlighter-rouge">git add --all</code> and commit them with <code class="language-plaintext highlighter-rouge">git commit -m "Initialize the project"</code>. Traditionally the message of the first commit is simply <code class="language-plaintext highlighter-rouge">"First commit"</code>, but I prefer to write something more meaningful. Now all your changes are recorded locally. Note also that git does not record empty folders.</p>
  </li>
  <li>
    <p>Create a <a href="https://github.com/new">new repo</a> in GitHub:</p>

    <ul>
      <li>Fill in <code class="language-plaintext highlighter-rouge">Repository name</code> with the same name as your project.</li>
      <li>Fill in <code class="language-plaintext highlighter-rouge">Description</code> with one line that briefly explains the intent of the project and ends with a full stop.</li>
      <li>Hit <code class="language-plaintext highlighter-rouge">Create repository</code>.</li>
    </ul>
  </li>
  <li>
    <p>Connect your local repo to your GitHub repo by</p>

    <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code> git remote add origin git@github.com:irudnyts/beer.git
 git push <span class="nt">-u</span> origin master
</code></pre></div>    </div>

    <p>Refresh the page in your browser to ensure that the changes appear in the GitHub repo.</p>
  </li>
</ol>
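<p>Steps 4 and 5 can also be done from the R console. The snippet below is only a sketch: it works in a temporary directory instead of a real project folder, and the project name <code class="language-plaintext highlighter-rouge">beer</code> and the README text are placeholders:</p>

```r
# Hypothetical project root inside a temporary directory; in practice this is
# the folder that RStudio created for the project
project <- file.path(tempdir(), "beer")

# Create data/raw, including the intermediate data folder
dir.create(file.path(project, "data", "raw"),
           recursive = TRUE, showWarnings = FALSE)

# Create a README.md with a title and a one-line description (placeholder text)
readme <- file.path(project, "README.md")
writeLines(c("# beer", "", "One line that explains the intent of the project."),
           con = readme)

# Inspect the resulting tree
list.files(project, recursive = TRUE, include.dirs = TRUE)
```

<p>Since git does not track empty folders, <code class="language-plaintext highlighter-rouge">data/raw</code> will show up in the first commit only once it contains at least one file.</p>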

<h2 id="outro--acknowledgement">Outro &amp; acknowledgement</h2>

<p>About a year ago I came across a <a href="https://www.tidyverse.org/articles/2017/12/workflow-vs-script/">brilliant post</a> by Jenny Bryan. I was amazed by how elegantly she formalized and summarized many simple tricks that make the life of a data scientist more pleasant. I was so inspired that I could not miss the opportunity to present these ideas in tutorials to students during the fall semester. The idea that I contributed to making projects more deliberate was very satisfying, and based on these tutorials I am starting this series of posts.</p>

<p>Many ideas and concepts are based on the works of Hadley Wickham and Jenny Bryan. Many thanks!</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="http://happygitwithr.com">Happy Git and GitHub for the useR</a></li>
  <li><a href="http://r-pkgs.had.co.nz/">R packages</a></li>
  <li><a href="https://www.tidyverse.org/articles/2017/12/workflow-vs-script/">Project-oriented workflow</a></li>
  <li><a href="https://yihui.name/en/2017/12/save-vs-saverds/">save() vs saveRDS()</a></li>
  <li><a href="https://www.datacamp.com/community/blog/jupyter-notebook-r#alternatives">Jupyter And R Markdown: Notebooks With R</a></li>
  <li><a href="http://www.statsravingmad.com/measure/sample-r-project-structure/">A sample R project structure</a></li>
  <li><a href="https://github.com/IronistM/sample-r-project">sample-r-project repo</a></li>
  <li><a href="http://rmflight.github.io/posts/2014/07/vignetteAnalysis.html">Creating an analysis as a package and vignette</a></li>
  <li><a href="http://rmflight.github.io/posts/2014/07/analyses_as_packages.html">Analyses as Packages</a></li>
  <li><a href="https://www.r-bloggers.com/packages-vs-projecttemplate/">Packages vs ProjectTemplate</a></li>
  <li><a href="https://nicercode.github.io/blog/2013-05-17-organising-my-project/">Organizing the project directory</a></li>
  <li><a href="https://nicercode.github.io/blog/2013-04-05-projects/">Designing projects</a></li>
  <li><a href="https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/">Project Management With RStudio</a></li>
  <li><a href="https://r-dir.com/blog/2013/11/folder-structure-for-data-analysis.html">Folder Structure for Data Analysis</a></li>
  <li><a href="https://github.com/AndersenLab/IBiS-Bootcamp/wiki/Organizing-files-for-data-analysis">Organizing files for data analysis</a></li>
  <li><a href="https://www.r-bloggers.com/a-meaningful-file-structure-for-r-projects/">A meaningful file structure for R projects</a></li>
  <li><a href="https://peerj.com/preprints/3192.pdf">Packaging data analytical work reproducibly using R (and friends)</a></li>
  <li><a href="https://thomasleeper.com/2015/05/open-science-language/">What’s in a Name? The Concepts and Language of Replication and Reproducibility</a></li>
  <li><a href="https://thomasleeper.com/2016/11/analysis-as-package/">Packaging Your Reproducible Analysis</a></li>
  <li><a href="http://kbroman.org/Tools4RR/assets/lectures/06_org_eda_withnotes.pdf">Tools for Reproducible Research</a></li>
  <li><a href="https://datacarpentry.org/R-ecology-lesson/00-before-we-start.html#r_code_is_great_for_reproducibility">Data Analysis and Visualization in R for Ecologists</a></li>
  <li><a href="https://gist.github.com/jennybc/362f52446fe1ebc4c49f">Stop the working directory insanity</a></li>
  <li><a href="https://github.com/jhollist/manuscriptPackage">manuscriptPackage</a></li>
  <li><a href="https://github.com/cboettig/template">cboettig/template</a></li>
  <li><a href="https://github.com/Pakillo/template">Pakillo/template</a></li>
  <li><a href="https://talesofr.wordpress.com/2017/12/12/a-minimal-project-tree-in-r/">A minimal Project Tree in R</a></li>
  <li><a href="http://projecttemplate.net">ProjectTemplate</a></li>
  <li><a href="https://blog.davisvaughan.com/post/writing-a-paper-with-rstudio/">Writing a paper with RStudio</a></li>
  <li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5778115/">Reproducibility vs. Replicability: A Brief History of a Confused Terminology</a></li>
  <li><a href="https://community.rstudio.com/t/does-r-provides-a-dependency-management-tool/13378">Does R provides a dependency management tool?</a></li>
  <li><a href="https://www.atlassian.com/git/tutorials/comparing-workflows#centralized-workflow">Comparing Workflows</a></li>
  <li><a href="https://whattheyforgot.org">What They Forgot to Teach You About R</a></li>
  <li>Pictures are taken from <a href="https://www.facebook.com/Rmemes0/">R Memes For Statistical Fiends</a> Facebook page and <a href="https://xkcd.com">xkcd</a></li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Be honest with yourself, how many times have you wanted to restart an on-going project from scratch throwing away the current folder? Or how many times have you had to rename files and adjust folder structure to make your project simple and clear? Not to mention, all these thousands of versions of your scripts that are dangling around in your mail box. Tired of this? Then, get on board and read my comments on how to make your project reproducible, portable, and self-contained.]]></summary></entry><entry><title type="html">🌱 [archived] Setting a seed in R, when using parallel simulation</title><link href="http://irudnyts.github.io//setting-a-seed-in-r-when-using-parallel-simulation/" rel="alternate" type="text/html" title="🌱 [archived] Setting a seed in R, when using parallel simulation" /><published>2018-07-12T00:00:00+00:00</published><updated>2018-07-12T00:00:00+00:00</updated><id>http://irudnyts.github.io//setting-a-seed-in-r-when-using-parallel-simulation</id><content type="html" xml:base="http://irudnyts.github.io//setting-a-seed-in-r-when-using-parallel-simulation/"><![CDATA[<p>Generally speaking, if the code does any simulations, it is a good practice to set a seed to make the code reproducible. Setting a seed ensures that the same (pseudo-)random numbers will be generated each time the script is executed. Surprisingly, I found really few posts dedicated to any convention, best practice, or routine of setting a seed in R. Further, when using multiple cores (parallelisation) for simulations, things can get slightly more complicated.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<h2 id="seeds-in-r">Seeds in R</h2>

<p>In base R there are two main objects to handle seeds: <code class="language-plaintext highlighter-rouge">set.seed()</code> and <code class="language-plaintext highlighter-rouge">.Random.seed</code>. For a vast number of problems it is enough to use <code class="language-plaintext highlighter-rouge">set.seed()</code>, which takes an integer as a seed. The workflow, then, is as simple as follows:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># &gt; [1] 0.8636221 0.1782020</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1991</span><span class="p">)</span><span class="w">
</span><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># &gt; [1] 0.1506231 0.2308308</span><span class="w">

</span><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># &gt; [1] 0.0134826 0.5340390</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1991</span><span class="p">)</span><span class="w">
</span><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="c1"># exactly the same random numbers as before</span><span class="w">
</span><span class="c1"># &gt; [1] 0.1506231 0.2308308</span><span class="w">
</span></code></pre></div></div>

<p>The second object, <code class="language-plaintext highlighter-rouge">.Random.seed</code>, allows saving and restoring the random number generator (RNG) state. Under the hood, <code class="language-plaintext highlighter-rouge">.Random.seed</code> is a simple atomic integer vector, the first element of which encodes the kind of RNG and the normal generator. For instance, a first element of <code class="language-plaintext highlighter-rouge">207</code> refers to the <code class="language-plaintext highlighter-rouge">"L'Ecuyer-CMRG"</code> RNG method and the <code class="language-plaintext highlighter-rouge">"Box-Muller"</code> method for the normal distribution. The remaining elements of <code class="language-plaintext highlighter-rouge">.Random.seed</code> store the current state of the generator.</p>
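<p>As a rough illustration of this encoding, the sketch below switches to these two generators, decodes the first element, and then restores the previous settings. The variable names are mine. According to the R documentation, the lowest two decimal digits code the RNG kind and the hundreds code the normal generator, both counted from zero:</p>

```r
# Remember the generators in use so that they can be restored afterwards
old_kinds <- RNGkind()

RNGkind(kind = "L'Ecuyer-CMRG", normal.kind = "Box-Muller")
set.seed(1991)

code <- .Random.seed[1]
rng_code    <- code %% 100             # 7 stands for "L'Ecuyer-CMRG"
normal_code <- (code %/% 100) %% 100   # 2 stands for "Box-Muller"
state       <- .Random.seed[-1]        # the RNG state itself

# Restore the generators that were active before
do.call(RNGkind, as.list(old_kinds))
```

<p>Recent versions of R also store the kind of discrete uniform sampler in the ten thousands, so the full first element can be larger than <code class="language-plaintext highlighter-rouge">207</code> even though the last three digits are the same.</p>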

<p>I find <code class="language-plaintext highlighter-rouge">.Random.seed</code> particularly useful because the RNG state can be saved without setting a seed explicitly. In other words, one does not need to provide an integer to <code class="language-plaintext highlighter-rouge">set.seed()</code>, which might be annoying, but can simply save the current seed:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">seed</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">.Random.seed</span><span class="w">
</span><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># &gt; [1] 0.5696378 0.3737989</span><span class="w">

</span><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># &gt; [1] 0.7199003 0.5540470</span><span class="w">

</span><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># &gt; [1] 0.4383970 0.6494643</span><span class="w">

</span><span class="n">.Random.seed</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seed</span><span class="w">
</span><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="c1"># exactly the same random numbers as before</span><span class="w">
</span><span class="c1"># &gt; [1] 0.5696378 0.3737989</span><span class="w">
</span></code></pre></div></div>

<p>The object <code class="language-plaintext highlighter-rouge">.Random.seed</code> lives in the global environment and therefore should be set there. This can cause issues if you are trying to set <code class="language-plaintext highlighter-rouge">.Random.seed</code> inside a function without paying attention to environments: a simple assignment inside the function won’t change the seed (the value will be set in the execution environment). The following straightforward idea can be used for saving the current seed or setting a custom one inside a function:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reproducible_runif</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

    </span><span class="k">if</span><span class="p">(</span><span class="nf">is.null</span><span class="p">(</span><span class="n">seed</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">seed</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">.Random.seed</span><span class="w">
    </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="c1"># Alternative: .Random.seed &lt;&lt;- seed</span><span class="w">
        </span><span class="c1"># (mind the double arrow, which assigns outside the function), or:</span><span class="w">
        </span><span class="n">assign</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".Random.seed"</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seed</span><span class="p">,</span><span class="w"> </span><span class="n">envir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.GlobalEnv</span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">

    </span><span class="nf">return</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seed</span><span class="p">))</span><span class="w">

</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Then, this function will return a random number that can be reproduced:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">r1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">reproducible_runif</span><span class="p">()</span><span class="w">
</span><span class="n">r1</span><span class="o">$</span><span class="n">x</span><span class="w">
</span><span class="c1"># &gt; [1] 0.4215304</span><span class="w">

</span><span class="n">runif</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="c1"># &gt;  [1] 0.1862207 0.2660995 0.5863689 0.1063663 0.5530690 0.9392229 0.9710050</span><span class="w">
</span><span class="c1">#    [8] 0.1265786 0.1526233 0.1713895</span><span class="w">


</span><span class="n">r2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">reproducible_runif</span><span class="p">(</span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r1</span><span class="o">$</span><span class="n">seed</span><span class="p">)</span><span class="w"> </span><span class="c1"># use the seed from the initial call</span><span class="w">
</span><span class="n">r2</span><span class="o">$</span><span class="n">x</span><span class="w"> </span><span class="c1"># exactly the same as for r1</span><span class="w">
</span><span class="c1"># &gt; [1] 0.4215304</span><span class="w">
</span></code></pre></div></div>
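<p>The scoping caveat can be checked directly. In the sketch below, the two helper functions <code class="language-plaintext highlighter-rouge">local_restore()</code> and <code class="language-plaintext highlighter-rouge">global_restore()</code> are hypothetical names introduced only for this comparison: the first uses a plain assignment, the second assigns in the global environment.</p>

```r
set.seed(1991)
saved <- .Random.seed   # RNG state right after seeding
first <- runif(1)

# A plain assignment only creates a local variable inside the function;
# the RNG state in the global environment is left untouched
local_restore <- function(seed) {
    .Random.seed <- seed
    invisible(NULL)
}

# assign() with envir = .GlobalEnv replaces the actual RNG state
global_restore <- function(seed) {
    assign(x = ".Random.seed", value = seed, envir = .GlobalEnv)
    invisible(NULL)
}

local_restore(saved)
after_local <- runif(1)    # continues the stream, differs from `first`

global_restore(saved)
after_global <- runif(1)   # replays the stream, equals `first`
```

<p>Only the second helper rewinds the generator, which is why <code class="language-plaintext highlighter-rouge">reproducible_runif()</code> relies on <code class="language-plaintext highlighter-rouge">assign()</code>.</p>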

<p>References:</p>

<ul>
  <li><a href="http://r.789695.n4.nabble.com/Best-way-to-reset-random-seed-when-using-set-seed-in-a-function-td918769.html">1</a></li>
  <li><a href="http://r.789695.n4.nabble.com/How-to-properly-re-set-a-saved-seed-I-ve-got-the-answer-but-no-explanation-td4270483.html">2</a></li>
  <li><a href="https://www.uni-muenster.de/ZIV.BennoSueselbeck/s-html/helpfiles/set.seed.html">3</a></li>
</ul>

<h2 id="seeds-for-parallel">Seeds for parallel</h2>

<p>The story is slightly different when using multiple cores (parallel execution). In this post I use the base package <code class="language-plaintext highlighter-rouge">parallel</code> and macOS, but the concept is pretty much the same for other packages and non-Unix systems. The idea here is to run independent simulations on each core.</p>

<p>Before stepping into details, let’s consider an illustrative example. We call <code class="language-plaintext highlighter-rouge">parallel::mclapply()</code> so that each iteration returns one random uniform number. We supply a vector of ten elements as the <code class="language-plaintext highlighter-rouge">X</code> argument, a simple wrapper around <code class="language-plaintext highlighter-rouge">runif(1)</code> that ignores the elements of <code class="language-plaintext highlighter-rouge">X</code>, and the number of cores (in my case 2 physical cores), and we also set <code class="language-plaintext highlighter-rouge">mc.set.seed = FALSE</code>. We run this expression two times (<code class="language-plaintext highlighter-rouge">unlist</code> is used for a more compact representation):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">parallel</span><span class="p">)</span><span class="w">

</span><span class="n">rn1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="w">
    </span><span class="n">mclapply</span><span class="p">(</span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w">
             </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w">
             </span><span class="n">mc.cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
             </span><span class="n">mc.set.seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">rn1</span><span class="w">
</span><span class="c1"># &gt; [1] 0.3495050 0.3495050 0.4159384 0.4159384 0.5376814 0.5376814 0.3279605</span><span class="w">
</span><span class="c1">#   [8] 0.3279605 0.1527834 0.1527834</span><span class="w">

</span><span class="n">identical</span><span class="p">(</span><span class="n">rn1</span><span class="p">[</span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)],</span><span class="w"> </span><span class="n">rn1</span><span class="p">[</span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)])</span><span class="w">
</span><span class="c1"># &gt; [1] TRUE</span><span class="w">
</span></code></pre></div></div>

<p>One can immediately notice a suspicious thing: every second element equals the previous one. The explanation is very simple: the same workspace is restored from the master process for each worker (or process). It means that <code class="language-plaintext highlighter-rouge">.Random.seed</code> is inherited from the parent process and, therefore, the RNG state is the same for each worker. As a result, the same sequence of random numbers is generated by each of the workers.</p>

<p>Of course, this is not desirable. The alternative is to have separate (distinct) seeds for each worker. The potential problem is that the generated streams might get into step (i.e. repeat periodically and, therefore, be correlated between streams). To resolve this, the <code class="language-plaintext highlighter-rouge">parallel</code> package utilizes the <code class="language-plaintext highlighter-rouge">"L'Ecuyer-CMRG"</code> RNG, which has quite a long period with a small seed, ensuring that streams do not get into step easily. To set the RNG to <code class="language-plaintext highlighter-rouge">"L'Ecuyer-CMRG"</code> one runs <code class="language-plaintext highlighter-rouge">RNGkind("L'Ecuyer-CMRG")</code>, also changing the argument <code class="language-plaintext highlighter-rouge">mc.set.seed</code> of <code class="language-plaintext highlighter-rouge">mclapply</code> to <code class="language-plaintext highlighter-rouge">TRUE</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RNGkind</span><span class="p">(</span><span class="s2">"L'Ecuyer-CMRG"</span><span class="p">)</span><span class="w">

</span><span class="n">rng1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="w">
    </span><span class="n">mclapply</span><span class="p">(</span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w">
             </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w">
             </span><span class="n">mc.cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
             </span><span class="n">mc.set.seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">rng1</span><span class="w">

</span><span class="c1"># &gt; [1] 0.67681994 0.54730337 0.05398847 0.19480448 0.94954659 0.35727778</span><span class="w">
</span><span class="c1">#   [7] 0.17057359 0.83029494 0.37063552 0.24445617</span><span class="w">
</span></code></pre></div></div>

<p>The elements are now different: with <code class="language-plaintext highlighter-rouge">"L'Ecuyer-CMRG"</code>, <code class="language-plaintext highlighter-rouge">parallel</code> uses <code class="language-plaintext highlighter-rouge">nextRNGStream()</code> to generate the next “uncorrelated” seed for each worker. The pseudo-code (taken from <code class="language-plaintext highlighter-rouge">vignette("parallel")</code>) explains this concept:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># &gt; RNGkind("L'Ecuyer-CMRG")</span><span class="w">
</span><span class="c1"># &gt; set.seed(2002) # something</span><span class="w">
</span><span class="c1"># &gt; M &lt;- 16 ## start M workers</span><span class="w">
</span><span class="c1"># &gt; s &lt;- .Random.seed</span><span class="w">
</span><span class="c1"># &gt; for (i in 1:M) {</span><span class="w">
</span><span class="c1"># +     s &lt;- nextRNGStream(s)</span><span class="w">
</span><span class="c1"># +     # send s to worker i as .Random.seed</span><span class="w">
</span><span class="c1"># + }</span><span class="w">
</span></code></pre></div></div>
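
<p>As a rough, runnable counterpart of the pseudo-code above, the sketch below derives one seed per worker with <code class="language-plaintext highlighter-rouge">nextRNGStream()</code>. The number of workers (<code class="language-plaintext highlighter-rouge">M = 4</code>) and the list name <code class="language-plaintext highlighter-rouge">streams</code> are arbitrary choices for illustration; the seeds are only collected here and are not sent to any worker:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(2002)

M &lt;- 4                        # hypothetical number of workers
s &lt;- .Random.seed             # current state of the master stream
streams &lt;- vector("list", M)  # hypothetical container for the worker seeds

for (i in seq_len(M)) {
    s &lt;- nextRNGStream(s)     # move to the start of the next stream
    streams[[i]] &lt;- s         # worker i would receive this as .Random.seed
}

lengths(streams)       # length of each seed vector
anyDuplicated(streams) # 0 means that all seeds are distinct
</code></pre></div></div>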

<p>Let’s run the same expression one more time:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rng2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="w">
    </span><span class="n">mclapply</span><span class="p">(</span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w">
             </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w">
             </span><span class="n">mc.cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
             </span><span class="n">mc.set.seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">rng2</span><span class="w">

</span><span class="c1"># &gt; [1] 0.67681994 0.54730337 0.05398847 0.19480448 0.94954659 0.35727778</span><span class="w">
</span><span class="c1">#   [7] 0.17057359 0.83029494 0.37063552 0.24445617</span><span class="w">

</span><span class="n">identical</span><span class="p">(</span><span class="n">rng1</span><span class="p">,</span><span class="w"> </span><span class="n">rng2</span><span class="p">)</span><span class="w">
</span><span class="c1"># &gt; [1] TRUE</span><span class="w">
</span></code></pre></div></div>

<p>The second run, <code class="language-plaintext highlighter-rouge">rng2</code>, is absolutely identical to <code class="language-plaintext highlighter-rouge">rng1</code> that was run before. Coincidence? Nope. The thing is that the <code class="language-plaintext highlighter-rouge">.Random.seed</code> of the master process is NOT affected by the worker processes (see the pseudo-code). That is why we will get the same numbers during a second, third, and any other run, unless <code class="language-plaintext highlighter-rouge">.Random.seed</code> is changed (e.g. by <code class="language-plaintext highlighter-rouge">runif(1)</code> in the master process).</p>

<p>Note that if <code class="language-plaintext highlighter-rouge">mc.set.seed</code> is <code class="language-plaintext highlighter-rouge">TRUE</code> but the RNG is different from <code class="language-plaintext highlighter-rouge">"L'Ecuyer-CMRG"</code>, then using <code class="language-plaintext highlighter-rouge">set.seed()</code> won’t establish reproducibility.</p>
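
<p>Since the worker streams are derived from the master’s <code class="language-plaintext highlighter-rouge">.Random.seed</code>, one way to control them is to call <code class="language-plaintext highlighter-rouge">set.seed()</code> in the master process right before <code class="language-plaintext highlighter-rouge">mclapply()</code>. The sketch below assumes a Unix-alike system (forking is not available on Windows) and the default prescheduling; the helper name <code class="language-plaintext highlighter-rouge">draw()</code> is made up for this illustration:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(parallel)

RNGkind("L'Ecuyer-CMRG")

# hypothetical helper: fix the master seed, then draw numbers in parallel
draw &lt;- function(seed) {
    set.seed(seed)
    unlist(
        mclapply(X = 1:10,
                 FUN = function(x) runif(1),
                 mc.cores = 2,
                 mc.set.seed = TRUE)
    )
}

identical(draw(1), draw(1)) # same master seed: expected to be TRUE
identical(draw(1), draw(2)) # different master seeds: expected to be FALSE
</code></pre></div></div>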

<p>In the end, the documentation of the <code class="language-plaintext highlighter-rouge">parallel</code> package is a little bit vague when it comes to RNG, so I had to read <code class="language-plaintext highlighter-rouge">vignette("parallel")</code> (Section 6), dozens of cross-referenced help pages (the RNG section of <code class="language-plaintext highlighter-rouge">?mclapply</code> refers to <code class="language-plaintext highlighter-rouge">?mcparallel</code>, which in turn requires reading <code class="language-plaintext highlighter-rouge">?nextRNGStream</code>), and finally dive deep into the source of the non-exported function <code class="language-plaintext highlighter-rouge">parallel:::mc.set.seed</code>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Generally speaking, if the code does any simulations, it is a good practice to set a seed to make the code reproducible. Setting a seed ensures that the same (pseudo-)random numbers will be generated each time the script is executed. Surprisingly, I found really few posts dedicated to any convention, best practice, or routine of setting a seed in R. Further, when using multiple cores (parallelisation) for simulations, things can get slightly more complicated.]]></summary></entry><entry><title type="html">💾 [archived] Installing MySQL on MacOS (and using it with R)</title><link href="http://irudnyts.github.io//installing-mysql-on-macos_and_using_it_with_r/" rel="alternate" type="text/html" title="💾 [archived] Installing MySQL on MacOS (and using it with R)" /><published>2018-03-27T00:00:00+00:00</published><updated>2018-03-27T00:00:00+00:00</updated><id>http://irudnyts.github.io//installing-mysql-on-macos_and_using_it_with_r</id><content type="html" xml:base="http://irudnyts.github.io//installing-mysql-on-macos_and_using_it_with_r/"><![CDATA[<p>A couple of days ago I was asked to install MySQL on MacOS 10.13, and I was surprised that it was not a one-click installation, as in case of R. 
Unfortunately, even for me the documentation was a bit confusing, and I think it might be useful to have a guide to the installation process.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<h2 id="1-download-dmg-file-and-install-mysql">1. Download .dmg file and install MySQL</h2>

<p>One has to download the .dmg file from <a href="https://dev.mysql.com/downloads/mysql/">here</a>. The app should be installed like a regular Mac app, and the procedure is well covered <a href="https://dev.mysql.com/doc/refman/5.6/en/osx-installation-pkg.html">here</a>.</p>

<p>At the end of the installation, when one has reached the summary, a separate window will pop up with a temporary password (as in the screenshot below). This password should be kept somewhere.</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-27-installing-mysql-on-macos/key.png" alt="" /></p>

<h2 id="2-set-aliases">2. Set aliases</h2>

<p>In order to avoid changing directories all the time before invoking <code class="language-plaintext highlighter-rouge">mysql</code>, we can set aliases for the <code class="language-plaintext highlighter-rouge">mysql</code> and <code class="language-plaintext highlighter-rouge">mysqladmin</code> commands. To do so, one has to open Terminal and execute the following commands (assuming that MySQL was installed to the default folder):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">alias </span><span class="nv">mysql</span><span class="o">=</span>/usr/local/mysql/bin/mysql
<span class="nb">alias </span><span class="nv">mysqladmin</span><span class="o">=</span>/usr/local/mysql/bin/mysqladmin
</code></pre></div></div>

<h2 id="3-start-mysql-sever">3. Start MySQL server</h2>

<p>Everything should go smoothly so far. Now we need to start our server. One can do it in Terminal:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /Library/LaunchDaemons
<span class="nb">sudo </span>launchctl load <span class="nt">-F</span> com.oracle.oss.mysql.mysqld.plist
</code></pre></div></div>

<p>or in System Preferences…</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-27-installing-mysql-on-macos/sys_pref.png" alt="" /></p>

<p>… by clicking on “Start”.</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-27-installing-mysql-on-macos/start.png" alt="" /></p>

<h2 id="4-change-the-temporary-password">4. Change the temporary password</h2>

<p>Now we need to run MySQL to change the temporary password for the ‘root’ user. After calling the following command, Terminal will ask for the password which we saved when installing MySQL in the first step:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mysql <span class="nt">-u</span> root <span class="nt">-p</span>
</code></pre></div></div>

<p>If everything is done correctly, you should see something like this:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Welcome to the MySQL monitor.  Commands end with <span class="p">;</span> or <span class="se">\g</span><span class="nb">.</span>
Your MySQL connection <span class="nb">id </span>is 24
Server version: 5.7.21

Copyright <span class="o">(</span>c<span class="o">)</span> 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type <span class="s1">'help;'</span> or <span class="s1">'\h'</span> <span class="k">for </span>help. Type <span class="s1">'\c'</span> to clear the current input statement.

mysql&gt;
</code></pre></div></div>

<p>To change the password we simply call this command, where “MyNewPass”, as you already guessed, is the new password:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SET</span> <span class="n">PASSWORD</span> <span class="k">FOR</span> <span class="s1">'root'</span><span class="o">@</span><span class="s1">'localhost'</span> <span class="o">=</span> <span class="n">PASSWORD</span><span class="p">(</span><span class="s1">'MyNewPass'</span><span class="p">);</span>
</code></pre></div></div>

<p>And then quit MySQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">QUIT</span>
</code></pre></div></div>

<h2 id="5-optional-install-sequel-pro-ide-for-mysql">5. (Optional) Install Sequel Pro IDE for MySQL</h2>

<p>I find Sequel Pro quite a useful and beautiful IDE for MySQL. To install it one has to download <a href="https://sequelpro.com/download#auto-start">a .dmg file</a>, open it, and drag &amp; drop “Sequel Pro.app” to the Applications folder.</p>

<p>To connect to a local MySQL one has to choose Socket in the menu and fill in a username (default “root”) and the password that we changed in the previous step.</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-27-installing-mysql-on-macos/sql_pro.png" alt="" /></p>

<h2 id="6-use-mysql-in-conjuntion-with-r">6. Use MySQL in conjunction with R</h2>

<p><a href="https://cran.r-project.org/web/packages/RMySQL/index.html">RMySQL</a> provides a full interface for connecting R to MySQL. There are dozens of tutorials on how to use this package, and one can easily google them. We just want to ensure that everything works smoothly. First off, MySQL Server should be launched (as in Step 3). Then, we install and load the package, and finally, using user/password pair connect to a certain database.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">install.packages</span><span class="p">(</span><span class="s2">"RMySQL"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">RMySQL</span><span class="p">)</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">MySQL</span><span class="p">(),</span><span class="w">
                 </span><span class="n">user</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"root"</span><span class="p">,</span><span class="w"> </span><span class="n">password</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"MyNewPass"</span><span class="p">,</span><span class="w">
                 </span><span class="n">dbname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w"> </span><span class="n">host</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"localhost"</span><span class="p">)</span><span class="w">

</span><span class="n">dbListTables</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] "CalendarMonths"</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
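
<p>For anything beyond a quick check, it may be more convenient to wrap the connection into a small function that always disconnects and does not hard-code the password. This is only a sketch: the function name <code class="language-plaintext highlighter-rouge">list_mysql_tables()</code> and the environment variable <code class="language-plaintext highlighter-rouge">MYSQL_ROOT_PASSWORD</code> are made up for illustration, and the call itself requires a running MySQL Server and an existing database:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical helper: list the tables of a database and always disconnect
list_mysql_tables &lt;- function(dbname,
                              user = "root",
                              password = Sys.getenv("MYSQL_ROOT_PASSWORD")) {
    con &lt;- DBI::dbConnect(RMySQL::MySQL(),
                          user = user, password = password,
                          dbname = dbname, host = "localhost")

    # close the connection even if the query below fails
    on.exit(DBI::dbDisconnect(con), add = TRUE)

    DBI::dbListTables(con)
}

# usage (assumes a database called "test" exists):
# list_mysql_tables("test")
</code></pre></div></div>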

<p>Enjoy!</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A couple of days ago I was asked to install MySQL on MacOS 10.13, and I was surprised that it was not a one-click installation, as in case of R. Unfortunately, even for me a documentation was a bit confusing, and I think it might be useful to have a guide of the installation process.]]></summary></entry><entry><title type="html">📈 [archived] Simulating Poisson process (part 2)</title><link href="http://irudnyts.github.io//simulating-poisson-process-part-2/" rel="alternate" type="text/html" title="📈 [archived] Simulating Poisson process (part 2)" /><published>2018-03-13T00:00:00+00:00</published><updated>2018-03-13T00:00:00+00:00</updated><id>http://irudnyts.github.io//simulating-poisson-process-part-2</id><content type="html" xml:base="http://irudnyts.github.io//simulating-poisson-process-part-2/"><![CDATA[<p>In <a href="https://irudnyts.github.io/simulating-poisson-process-part-1/">previous post</a> we discussed two common methods of Poisson process simulation. The reason why this trivial problem was of my interest is the fact that this is simplification of a larger scale problem of a classical ruin process.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<p>Let me remind you that I focus on an extension of the Cramér–Lundberg model with positive jumps, that is:</p>

\[X(t) = u + ct + \sum_{i = 1}^{N_1(t)}X_i - \sum_{j = 1}^{N_2(t)}Y_j,\]

<p>where:</p>

<ul>
  <li>$u$ is the initial capital;</li>
  <li>$c$ is the premium rate;</li>
  <li>$N_1(t)$ and $N_2(t)$ are Poisson processes of capital injections and claims with rates $\lambda_1$ and $\lambda_2$, respectively;</li>
  <li>$X_i$ and $Y_j$ are i.i.d. random variables modeling sizes of capital injections and claims, respectively.</li>
</ul>

<p>We simplify this model to a bare minimum, which reflects the behaviour of Poisson processes, that is, we set $u = 0$ and $c = 0$. Further, we assume deterministic unit jumps: $X_i = 1$ and $Y_j = 1$. As a result, we have the following model:</p>

\[X(t) = N_1(t) - N_2(t).\]

<p>A nice thing about this model is that we know the exact distribution of $X(t)$. It is called the <a href="https://en.wikipedia.org/wiki/Skellam_distribution">Skellam distribution</a>, which is covered in the <a href="https://cran.r-project.org/web/packages/skellam/index.html">skellam</a> package. Therefore, we can compare Monte Carlo simulated estimates to their exact equivalents.</p>
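
<p>As a quick sanity check of this idea, one can compare simulated moments of $X(t)$ at a fixed time with the known Skellam mean $(\lambda_1 - \lambda_2)t$ and variance $(\lambda_1 + \lambda_2)t$. The sketch below uses only base R (not the skellam package), and the rates, time point, and sample size are arbitrary values chosen for illustration:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(42)

lambda_1 &lt;- 1  # rate of positive jumps (arbitrary)
lambda_2 &lt;- 2  # rate of negative jumps (arbitrary)
t &lt;- 10        # fixed time point (arbitrary)
n &lt;- 1e5       # number of Monte Carlo replications

# X(t) = N_1(t) - N_2(t), where N_k(t) is Poisson with mean lambda_k * t
x &lt;- rpois(n, lambda_1 * t) - rpois(n, lambda_2 * t)

# simulated vs. exact mean
c(simulated = mean(x), exact = (lambda_1 - lambda_2) * t)

# simulated vs. exact variance
c(simulated = var(x), exact = (lambda_1 + lambda_2) * t)
</code></pre></div></div>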

<h2 id="method-1">Method 1</h2>

<p>For the Gerber-Shiu function we need to simulate a path until the ruin. It means that the time horizon is not known <em>a priori</em>. As a consequence, the algorithm that first simulates the number of jumps for a given time is not applicable. Therefore, the only possibility to simulate a path until the ruin is to exploit the fact that interarrival times of jumps are exponentially distributed. In the case of only negative jumps the algorithm is quite simple: in a while loop we add jumps to a path until the process is ruined (or any other stopping condition is met, e.g. the maximum number of iterations is reached).</p>
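
<p>A minimal sketch of the negative-jumps-only case could look as follows. It is not the implementation used later in this post: the function name <code class="language-plaintext highlighter-rouge">sim_until_ruin()</code>, the positive initial capital <code class="language-plaintext highlighter-rouge">u</code>, the definition of ruin as $X(t) &lt; 0$, and the cap <code class="language-plaintext highlighter-rouge">max_jumps</code> are all assumptions made for illustration:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical helper: unit negative jumps with exponential interarrival times
sim_until_ruin &lt;- function(u = 5, lambda_n = 1, max_jumps = 1000) {

    times &lt;- 0   # arrival times, starting at time zero
    values &lt;- u  # values of the process right after each jump

    # add jumps until the process is ruined or max_jumps is reached
    while (values[length(values)] &gt;= 0 &amp;&amp; length(values) &lt;= max_jumps) {
        times &lt;- c(times, times[length(times)] + rexp(1, rate = lambda_n))
        values &lt;- c(values, values[length(values)] - 1)
    }

    data.frame(time = times, X = values)
}

set.seed(1)
sim_until_ruin(u = 5)
</code></pre></div></div>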

<p>Things are slightly more complicated when the model includes positive jumps. If both positive and negative jumps’ times were known to us, then it would be possible to sort them, and add to a path in ascending order, as in an illustration below.</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-13-simulating-poisson-process-part-2/order.png" alt="" /></p>

<p>However, we do not know the arrival times of jumps, as they have to be simulated. It would also be naive to add a jump of one type at one iteration and then a jump of the other type at the next iteration, because the underlying Poisson processes can have different rates (and, therefore, there is no guarantee that a jump of one type follows a jump of the other type). The approach I propose here is very similar to the playground game <a href="https://en.wikipedia.org/wiki/Tag_(game)">“tag”</a>. We generate the arrival time of a jump for both types. Then, for the type that occurred earlier (A), we need to catch up with the time of the opposite type (B), that is, generate more jumps of type (A) until the time of the later type (B) is reached.</p>

<p>Algorithm:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- generate first time to negative jump and first time to positive jump
- repeat
    * if (time to last positive jump &gt; time to last negative jump)
        # add last negative jump to path
        # if (stopping condition) exit loop
        # repeat
            % generate time to negative jump
            % if (time to last positive jump &lt; time to last negative jump)
                @ exit loop
            % add negative jump to path
            % if (stopping condition) exit loop
        # if (stopping condition) exit loop
    * else
        # add positive jump
        # if (stopping condition) exit loop
        # repeat
            % generate time to positive jump
            % if (time to last positive jump &gt; time to last negative jump)
                @ exit loop
            % add positive jump to path
            % if (stopping condition) exit loop
        # if (stopping condition) exit loop
</code></pre></div></div>

<p>Let me illustrate a couple of iterations to give a feeling of the algorithm. We simulate positive and negative jumps’ arrival:</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-13-simulating-poisson-process-part-2/1.png" alt="" /></p>

<p>The positive jump occurs later; therefore, we need to catch up with negative jumps. We simulate the next negative arrival…</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-13-simulating-poisson-process-part-2/2.png" alt="" /></p>

<p>And next negative arrival…</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-13-simulating-poisson-process-part-2/3.png" alt="" /></p>

<p>One more…</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-13-simulating-poisson-process-part-2/4.png" alt="" /></p>

<p>And finally the negative jump overtakes the positive one, and now we need to catch up with positive ones…</p>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-13-simulating-poisson-process-part-2/5.png" alt="" /></p>

<p>And so forth and so on. In the algorithm above <code class="language-plaintext highlighter-rouge">stopping condition</code> could be anything, for instance, a maximum number of jumps is attained, the maximum number of iterations is attained, the maximum time span is attained, the path is ruined, etc. Below, I propose an implementation with a stopping condition that uses a maximum time span.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">skellam</span><span class="p">)</span><span class="w">

</span><span class="n">sim_p1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">lambda_p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

    </span><span class="c1"># utility function: get last element of a vector</span><span class="w">
    </span><span class="n">last</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">yes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">)],</span><span class="w"> </span><span class="n">no</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">

    </span><span class="c1"># initialize process</span><span class="w">
    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
    </span><span class="n">colnames</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"X"</span><span class="p">)</span><span class="w">
    </span><span class="n">path</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">

    </span><span class="c1"># function for adding negative jump to a path</span><span class="w">
    </span><span class="n">add_jump_n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">

        </span><span class="c1"># add a new time arrival to arrival times vector</span><span class="w">
        </span><span class="n">time_n</span><span class="w"> </span><span class="o">&lt;&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">time_n</span><span class="p">,</span><span class="w"> </span><span class="n">current_time_n</span><span class="p">)</span><span class="w">

        </span><span class="c1"># add a negative jump to the path</span><span class="w">
        </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
            </span><span class="n">path</span><span class="p">,</span><span class="w">
            </span><span class="nf">c</span><span class="p">(</span><span class="n">current_time_n</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w">
        </span><span class="p">)</span><span class="w">

        </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
            </span><span class="n">path</span><span class="p">,</span><span class="w">
            </span><span class="nf">c</span><span class="p">(</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
        </span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">

    </span><span class="c1"># function for adding positive jump to a path</span><span class="w">
    </span><span class="n">add_jump_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">

        </span><span class="c1"># add a new time arrival to arrival times vector</span><span class="w">
        </span><span class="n">time_p</span><span class="w"> </span><span class="o">&lt;&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">time_p</span><span class="p">,</span><span class="w"> </span><span class="n">current_time_p</span><span class="p">)</span><span class="w">

        </span><span class="c1"># add a positive jump to the path</span><span class="w">
        </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
            </span><span class="n">path</span><span class="p">,</span><span class="w">
            </span><span class="nf">c</span><span class="p">(</span><span class="n">current_time_p</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w">
        </span><span class="p">)</span><span class="w">

        </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
            </span><span class="n">path</span><span class="p">,</span><span class="w">
            </span><span class="nf">c</span><span class="p">(</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
        </span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">

    </span><span class="c1"># check whether the path has reached the maximum time span</span><span class="w">
    </span><span class="n">is_max_time_span_attained</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">t</span><span class="w">

    </span><span class="n">time_n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">numeric</span><span class="p">()</span><span class="w"> </span><span class="c1"># time of negative jumps arrivals</span><span class="w">
    </span><span class="n">time_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">numeric</span><span class="p">()</span><span class="w"> </span><span class="c1"># time of positive jumps arrivals</span><span class="w">

    </span><span class="n">current_time_n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rexp</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_n</span><span class="p">)</span><span class="w"> </span><span class="c1"># current arrival time of the negative jump</span><span class="w">
    </span><span class="n">current_time_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rexp</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_p</span><span class="p">)</span><span class="w"> </span><span class="c1"># current arrival time of the positive jump</span><span class="w">

    </span><span class="k">repeat</span><span class="p">{</span><span class="w">

        </span><span class="k">if</span><span class="p">(</span><span class="n">current_time_p</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">current_time_n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

            </span><span class="n">add_jump_n</span><span class="p">()</span><span class="w">

            </span><span class="k">if</span><span class="p">(</span><span class="n">is_max_time_span_attained</span><span class="p">())</span><span class="w"> </span><span class="k">break</span><span class="w">

            </span><span class="k">repeat</span><span class="w"> </span><span class="p">{</span><span class="w">

                </span><span class="n">current_time_n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">last</span><span class="p">(</span><span class="n">time_n</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rexp</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_n</span><span class="p">)</span><span class="w">
                </span><span class="k">if</span><span class="p">(</span><span class="n">current_time_p</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">current_time_n</span><span class="p">)</span><span class="w"> </span><span class="k">break</span><span class="w">

                </span><span class="n">add_jump_n</span><span class="p">()</span><span class="w">

                </span><span class="k">if</span><span class="p">(</span><span class="n">is_max_time_span_attained</span><span class="p">())</span><span class="w"> </span><span class="k">break</span><span class="w">
            </span><span class="p">}</span><span class="w">

            </span><span class="k">if</span><span class="p">(</span><span class="n">is_max_time_span_attained</span><span class="p">())</span><span class="w"> </span><span class="k">break</span><span class="w">


        </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">

            </span><span class="n">add_jump_p</span><span class="p">()</span><span class="w">

            </span><span class="k">if</span><span class="p">(</span><span class="n">is_max_time_span_attained</span><span class="p">())</span><span class="w"> </span><span class="k">break</span><span class="w">

            </span><span class="k">repeat</span><span class="w"> </span><span class="p">{</span><span class="w">
                </span><span class="n">current_time_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">last</span><span class="p">(</span><span class="n">time_p</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rexp</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_p</span><span class="p">)</span><span class="w">
                </span><span class="k">if</span><span class="p">(</span><span class="n">current_time_p</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">current_time_n</span><span class="p">)</span><span class="w"> </span><span class="k">break</span><span class="w">

                </span><span class="n">add_jump_p</span><span class="p">()</span><span class="w">

                </span><span class="k">if</span><span class="p">(</span><span class="n">is_max_time_span_attained</span><span class="p">())</span><span class="w"> </span><span class="k">break</span><span class="w">
            </span><span class="p">}</span><span class="w">

            </span><span class="k">if</span><span class="p">(</span><span class="n">is_max_time_span_attained</span><span class="p">())</span><span class="w"> </span><span class="k">break</span><span class="w">

        </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">

    </span><span class="c1"># drop the steps after t and end the path exactly at t</span><span class="w">
    </span><span class="n">indices</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">path</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">t</span><span class="w">
    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">indices</span><span class="p">,</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">drop</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">]</span><span class="w">
    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w">
                  </span><span class="nf">c</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">

    </span><span class="n">rval</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
        </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">path</span><span class="p">,</span><span class="w">
        </span><span class="n">time_p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time_p</span><span class="p">,</span><span class="w">
        </span><span class="n">time_n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time_n</span><span class="w">
    </span><span class="p">)</span><span class="w">

    </span><span class="nf">return</span><span class="p">(</span><span class="n">rval</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Let’s compare the estimated expected value and variance of the interarrival times with the theoretical ones, for both positive and negative jumps. With the default parameters $\lambda_1 = 1$ and $\lambda_2 = 1$, all of them should be close to one. This is confirmed by the code snippet below:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">2018</span><span class="p">)</span><span class="w">

</span><span class="n">p1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sim_p1</span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">p1</span><span class="o">$</span><span class="n">time_p</span><span class="p">));</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">p1</span><span class="o">$</span><span class="n">time_p</span><span class="p">))</span><span class="w">
</span><span class="c1"># [1] 0.9713893</span><span class="w">
</span><span class="c1"># [1] 0.9399382</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">p1</span><span class="o">$</span><span class="n">time_n</span><span class="p">));</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">p1</span><span class="o">$</span><span class="n">time_n</span><span class="p">))</span><span class="w">
</span><span class="c1"># [1] 1.067944</span><span class="w">
</span><span class="c1"># [1] 1.121641</span><span class="w">
</span></code></pre></div></div>
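<p>The theoretical values come from the exponential distribution of the interarrival times: for rate $\lambda$, the mean is $1 / \lambda$ and the variance is $1 / \lambda^2$. The snippet below is a minimal, self-contained sketch of the same check for a rate other than one. It samples the interarrival times directly with <code class="language-plaintext highlighter-rouge">rexp()</code> instead of calling <code class="language-plaintext highlighter-rouge">sim_p1()</code>, and the rate of 2 and the sample size are arbitrary choices made for illustration:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(2018)

lambda &lt;- 2                       # assumed rate, chosen only for illustration
gaps &lt;- rexp(1e5, rate = lambda)  # interarrival times of a Poisson process

# empirical moments next to the theoretical ones
c(estimate = mean(gaps), theory = 1 / lambda)
c(estimate = var(gaps), theory = 1 / lambda^2)
</code></pre></div></div>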

<h2 id="method-2">Method 2</h2>

<p>The second method is a bit less cumbersome. We separately simulate the numbers of positive and negative jumps in the interval $(0, t)$ from Poisson distributions whose means are the respective rates multiplied by $t$. Then, we generate the arrival times of the jumps from the uniform distribution and sort them. Finally, the full path is built by adding positive and negative jumps in a loop. The idea is the same as before: we check which kind of jump (negative or positive) occurs earlier and add the earliest one. Note that several negative jumps may occur before a positive one, and vice versa. It is also possible that only positive or only negative jumps are left, so the loop has to handle this case as well.</p>
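<p>The core of this method can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the names <code class="language-plaintext highlighter-rouge">sketch_p2</code>, <code class="language-plaintext highlighter-rouge">times</code>, and <code class="language-plaintext highlighter-rouge">signs</code> are hypothetical, and the sketch returns just the value of the process right after each jump rather than the full step-wise path. It also swaps the explicit loop for a single ordering of all arrival times followed by a cumulative sum:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sketch_p2 &lt;- function(lambda_p = 1, lambda_n = 1, t = 100) {

    # numbers of positive and negative jumps on (0, t)
    n_p &lt;- rpois(1, lambda_p * t)
    n_n &lt;- rpois(1, lambda_n * t)

    # arrival times are drawn uniformly on (0, t)
    times &lt;- c(runif(n_p, 0, t), runif(n_n, 0, t))
    signs &lt;- c(rep(1, n_p), rep(-1, n_n))

    # merge both kinds of jumps by ordering them in time
    ord &lt;- order(times)
    data.frame(time = times[ord], X = cumsum(signs[ord]))
}

set.seed(2018)
head(sketch_p2(t = 10))
</code></pre></div></div>

<p>Ordering all arrival times at once covers the cases discussed above (runs of jumps of the same kind, or only one kind of jump being left) without any special handling.</p>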

<p>The full implementation is presented below:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sim_p2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">lambda_p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

    </span><span class="c1"># simulate numbers of positive and negative jumps</span><span class="w">
    </span><span class="n">number_p_jumps</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rpois</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lambda_p</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">t</span><span class="p">)</span><span class="w">
    </span><span class="n">number_n_jumps</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rpois</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lambda_n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">t</span><span class="p">)</span><span class="w">

    </span><span class="c1"># simulate the time of jumps' arrivals</span><span class="w">
    </span><span class="n">p_jumps_arrival</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">number_p_jumps</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">sort</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">multiply_by</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="w">
    </span><span class="n">n_jumps_arrival</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">number_n_jumps</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">sort</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">multiply_by</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="w">

    </span><span class="c1"># keep the time of jumps' arrivals in separate variables</span><span class="w">
    </span><span class="n">time_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">p_jumps_arrival</span><span class="w">
    </span><span class="n">time_n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">n_jumps_arrival</span><span class="w">

    </span><span class="c1"># initialize process</span><span class="w">
    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
    </span><span class="n">colnames</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"X"</span><span class="p">)</span><span class="w">
    </span><span class="n">path</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">


    </span><span class="k">while</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">p_jumps_arrival</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">n_jumps_arrival</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

        </span><span class="k">if</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">p_jumps_arrival</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">n_jumps_arrival</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="k">if</span><span class="p">(</span><span class="n">p_jumps_arrival</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">n_jumps_arrival</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="p">{</span><span class="w">

                </span><span class="c1"># add positive jump</span><span class="w">

                </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
                    </span><span class="n">path</span><span class="p">,</span><span class="w">
                    </span><span class="nf">c</span><span class="p">(</span><span class="n">p_jumps_arrival</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w">
                </span><span class="p">)</span><span class="w">

                </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
                    </span><span class="n">path</span><span class="p">,</span><span class="w">
                    </span><span class="nf">c</span><span class="p">(</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
                </span><span class="p">)</span><span class="w">

                </span><span class="n">p_jumps_arrival</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">p_jumps_arrival</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">

            </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">

                </span><span class="c1"># add negative jump</span><span class="w">

                </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
                    </span><span class="n">path</span><span class="p">,</span><span class="w">
                    </span><span class="nf">c</span><span class="p">(</span><span class="n">n_jumps_arrival</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w">
                </span><span class="p">)</span><span class="w">

                </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
                    </span><span class="n">path</span><span class="p">,</span><span class="w">
                    </span><span class="nf">c</span><span class="p">(</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
                </span><span class="p">)</span><span class="w">

                </span><span class="n">n_jumps_arrival</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">n_jumps_arrival</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">

            </span><span class="p">}</span><span class="w">
        </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="k">if</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">p_jumps_arrival</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

                </span><span class="c1"># add positive jump</span><span class="w">

                </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
                    </span><span class="n">path</span><span class="p">,</span><span class="w">
                    </span><span class="nf">c</span><span class="p">(</span><span class="n">p_jumps_arrival</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w">
                </span><span class="p">)</span><span class="w">

                </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
                    </span><span class="n">path</span><span class="p">,</span><span class="w">
                    </span><span class="nf">c</span><span class="p">(</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
                </span><span class="p">)</span><span class="w">

                </span><span class="n">p_jumps_arrival</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">p_jumps_arrival</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">

            </span><span class="p">}</span><span class="w">
            </span><span class="k">if</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">n_jumps_arrival</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

                </span><span class="c1"># add negative jump</span><span class="w">

                </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
                    </span><span class="n">path</span><span class="p">,</span><span class="w">
                    </span><span class="nf">c</span><span class="p">(</span><span class="n">n_jumps_arrival</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w">
                </span><span class="p">)</span><span class="w">

                </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
                    </span><span class="n">path</span><span class="p">,</span><span class="w">
                    </span><span class="nf">c</span><span class="p">(</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
                </span><span class="p">)</span><span class="w">

                </span><span class="n">n_jumps_arrival</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">n_jumps_arrival</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">
            </span><span class="p">}</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">

    </span><span class="c1"># add last step</span><span class="w">
    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w">
                  </span><span class="nf">c</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">

    </span><span class="n">rval</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
        </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">path</span><span class="p">,</span><span class="w">
        </span><span class="n">time_p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time_p</span><span class="p">,</span><span class="w">
        </span><span class="n">time_n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time_n</span><span class="w">
    </span><span class="p">)</span><span class="w">

    </span><span class="nf">return</span><span class="p">(</span><span class="n">rval</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Again, we need to check that the estimated means and variances of the interarrival times, for both positive and negative jumps, are close to one:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sim_p2</span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">p2</span><span class="o">$</span><span class="n">time_p</span><span class="p">));</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">p2</span><span class="o">$</span><span class="n">time_p</span><span class="p">))</span><span class="w">
</span><span class="c1"># [1] 1.006591</span><span class="w">
</span><span class="c1"># [1] 1.054261</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">p2</span><span class="o">$</span><span class="n">time_n</span><span class="p">));</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">p2</span><span class="o">$</span><span class="n">time_n</span><span class="p">))</span><span class="w">
</span><span class="c1"># [1] 0.9923275</span><span class="w">
</span><span class="c1"># [1] 0.8606385</span><span class="w">
</span></code></pre></div></div>
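
<p>As a reminder of why one is the target: with unit rates (the post states below that both lambdas equal one), interarrival times are Exponential(1), whose mean and variance both equal one. The sketch below draws such times directly to show how much a sample of this size fluctuates. The seed and the sample size are arbitrary choices, not taken from the post.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumption: unit rate; seed and sample size are illustrative only
set.seed(42)
rate &lt;- 1
n_draws &lt;- 1000   # roughly the number of jumps expected for t = 1000

interarrival &lt;- rexp(n_draws, rate = rate)

# Theoretical mean and variance of an Exponential(rate) variable
c(mean = 1 / rate, var = 1 / rate^2)

# Sample estimates, which fluctuate around the theoretical values
c(mean = mean(interarrival), var = var(interarrival))

# Approximate standard error of the sample mean: sd / sqrt(n)
sd(interarrival) / sqrt(n_draws)
</code></pre></div></div>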

<h2 id="convergence">Convergence</h2>

<p>Finally, we want to check visually whether the estimators are unbiased and how fast they converge (i.e. which method has the smaller variance). We focus on the expected terminal value of a path, as well as on the probability that the path ends below a certain value. For this we simulate a large number (1000) of paths with each method. Then we estimate the expected value as a function of the number of simulations by aggregating the first <code class="language-plaintext highlighter-rouge">x</code> paths.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1000</span><span class="w">

</span><span class="n">paths1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_p1</span><span class="p">(),</span><span class="w"> </span><span class="n">simplify</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">paths2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_p2</span><span class="p">(),</span><span class="w"> </span><span class="n">simplify</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
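
<p>The running estimate described above (the mean of the first <code class="language-plaintext highlighter-rouge">x</code> simulated values, for every <code class="language-plaintext highlighter-rouge">x</code>) can also be computed in a single pass with <code class="language-plaintext highlighter-rouge">cumsum()</code>. This is a minimal sketch: the vector <code class="language-plaintext highlighter-rouge">x</code> is a hypothetical stand-in for the terminal values of the simulated paths, not the output of the functions in this post.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical stand-in for terminal path values:
# difference of two independent Poisson counts
set.seed(1)
x &lt;- rpois(1000, 10) - rpois(1000, 10)

# Running mean: element k is the mean of x[1:k]
running_mean &lt;- cumsum(x) / seq_along(x)

# Same quantity computed prefix by prefix, for comparison
slow &lt;- sapply(seq_along(x), function(k) mean(x[1:k]))
all.equal(running_mean, slow)
</code></pre></div></div>

<p>The prefix-by-prefix version does quadratic work in the number of paths, which is harmless for 1000 paths but becomes noticeable for much larger runs.</p>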

<p>First, we look at the expected value, which should be zero given that both lambdas equal one:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">means1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="w">
    </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w">
    </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">paths1</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="p">],</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="o">$</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">y</span><span class="o">$</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">means2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="w">
    </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w">
    </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">paths2</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="p">],</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="o">$</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">y</span><span class="o">$</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">means</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
    </span><span class="n">data.frame</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">means1</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"1"</span><span class="p">),</span><span class="w">
    </span><span class="n">data.frame</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">means2</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">means</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">method</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">24</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-13-simulating-poisson-process-part-2/mean.png" alt="" /></p>

<p>From the plot it seems that both methods converge to the correct value at the same speed. Hooray! How about probabilities?</p>

<p>For probabilities we used a threshold of 10 (arbitrarily chosen), and the true value is calculated with the <code class="language-plaintext highlighter-rouge">pskellam</code> function. For the default argument <code class="language-plaintext highlighter-rouge">t = 10</code>, the distribution is $X(10) \sim Skellam(\lambda_1 \cdot t, \lambda_2 \cdot t)$.</p>
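
<p>For readers who want to see where the reference value comes from, the Skellam distribution function can be sketched in base R by conditioning on the negative component: if $N_1$ and $N_2$ are independent Poisson counts, then $P(N_1 - N_2 \le q) = \sum_k P(N_2 = k) \, P(N_1 \le q + k)$. The helper name <code class="language-plaintext highlighter-rouge">skellam_cdf</code> is hypothetical and the truncation tolerance is an arbitrary choice. The parameters 100 and 100 mirror the values passed to <code class="language-plaintext highlighter-rouge">pskellam</code> in the plotting code of this post.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper, not part of any package:
# P(N1 - N2 &lt;= q) for independent N1 ~ Poisson(mu1), N2 ~ Poisson(mu2)
skellam_cdf &lt;- function(q, mu1, mu2) {
    # Truncate the sum where the Poisson(mu2) tail is negligible
    k &lt;- 0:qpois(1 - 1e-12, mu2)
    # Condition on N2 = k: P(N1 &lt;= q + k), weighted by P(N2 = k)
    sum(dpois(k, mu2) * ppois(q + k, mu1))
}

skellam_cdf(q = 10, mu1 = 100, mu2 = 100)

# Monte Carlo cross-check (seed and sample size are arbitrary)
set.seed(123)
mean(rpois(1e5, 100) - rpois(1e5, 100) &lt;= 10)
</code></pre></div></div>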

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">probs1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="w">
    </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w">
    </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">paths1</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="p">],</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="o">$</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">y</span><span class="o">$</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="m">10</span><span class="p">))</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">probs2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="w">
    </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w">
    </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">paths2</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="p">],</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="o">$</span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">y</span><span class="o">$</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="m">10</span><span class="p">))</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">probs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w">
    </span><span class="n">data.frame</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">probs1</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"1"</span><span class="p">),</span><span class="w">
    </span><span class="n">data.frame</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">probs2</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">probs</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">method</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pskellam</span><span class="p">(</span><span class="n">q</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">lambda1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">24</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-13-simulating-poisson-process-part-2/prob.png" alt="" /></p>
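
<p>To judge whether the remaining gap between an estimated probability and the reference line is within sampling noise, one can compare it with the standard error of a proportion, $\sqrt{p(1 - p)/n}$. The sketch below builds a pointwise 95% band under that assumption. The value of <code class="language-plaintext highlighter-rouge">p</code> is a placeholder, not a number taken from this post.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumptions: p is the true probability (placeholder value),
# n is the number of independent simulated paths
p &lt;- 0.75
n &lt;- 1:1000

# Standard error of a proportion estimated from n independent paths
se &lt;- sqrt(p * (1 - p) / n)

# Pointwise approximate 95% band around the true value
band &lt;- data.frame(n = n, lower = p - 1.96 * se, upper = p + 1.96 * se)
tail(band, 1)
</code></pre></div></div>

<p>Since the running estimates along one sequence of paths are correlated, the band is only a pointwise guide and not a simultaneous one.</p>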

<p>Again, the probabilities converge to the correct value at approximately the same speed. This suggests that neither method is biased, and we can continue extending the function for simulating ruin processes.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In the previous post we discussed two common methods of Poisson process simulation. The reason this trivial problem interested me is that it is a simplification of a larger-scale problem of a classical ruin process.]]></summary></entry><entry><title type="html">📈 [archived] Simulating Poisson process (part 1)</title><link href="http://irudnyts.github.io//simulating-poisson-process-part-1/" rel="alternate" type="text/html" title="📈 [archived] Simulating Poisson process (part 1)" /><published>2018-03-09T00:00:00+00:00</published><updated>2018-03-09T00:00:00+00:00</updated><id>http://irudnyts.github.io//simulating-poisson-process-part-1</id><content type="html" xml:base="http://irudnyts.github.io//simulating-poisson-process-part-1/"><![CDATA[<p>A couple of weeks ago a colleague of mine asked me for help estimating the Gerber-Shiu function by Monte Carlo methods. The function is used in ruin theory for risk processes. One can think of this function as an analogue of a moment generating function: if the function is known, it is easy to derive certain quantities of interest, for instance, a ruin probability. My colleague wants to estimate this function for an extension of the <a href="https://en.wikipedia.org/wiki/Ruin_theory">Cramér–Lundberg model</a> that includes positive jumps (capital injections). At first glance it seems a trivial task, but when I started working on it, the problem turned out to be not so easy to solve.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<p>To estimate the Gerber-Shiu function, a large number of paths has to be simulated. That’s why I started with a function that simulates a single path of the process. Basically, the random part of the model consists of two independent Poisson processes. There are three ways to simulate a Poisson process. The first method simulates the interarrival times of jumps from the Exponential distribution. The second method simulates the number of jumps in the given time period from the Poisson distribution, and then the jump times as Uniform random variables. The third method requires a certain grid. Typically, only the first two methods are used.</p>
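
<p>The post does not spell out the grid-based method, so the following is only one possible reading of it, stated as an assumption: split the time horizon into small cells of width <code class="language-plaintext highlighter-rouge">dt</code> and let each cell contain a jump with probability <code class="language-plaintext highlighter-rouge">rate * dt</code>. The function name <code class="language-plaintext highlighter-rouge">sim_pp_grid</code> and the default cell width are hypothetical. The result is an approximation that improves as <code class="language-plaintext highlighter-rouge">dt</code> shrinks, because at most one jump per cell is allowed.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch of a grid-based approximation (an assumption,
# not the method used later in this post)
sim_pp_grid &lt;- function(t, rate, dt = 0.001) {
    stopifnot(rate * dt &lt; 1)       # jump probability per cell must be valid
    grid &lt;- seq(dt, t, by = dt)    # right endpoints of the cells
    # At most one jump per cell, with probability rate * dt
    jumped &lt;- runif(length(grid)) &lt; rate * dt
    grid[jumped]                   # approximate arrival times
}

set.seed(7)
arrivals &lt;- sim_pp_grid(t = 10, rate = 1)
length(arrivals)                   # near rate * t on average
</code></pre></div></div>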

<p>In my first attempt I used the first method (i.e. simulating interarrival times by Exponential r.v.s). To check myself, I estimated ruin probabilities and compared them with values derived numerically in the literature. For some reason, the estimates from the simulated processes were not in line with the numerical ones. I tried the second method, which yielded values closer to the true ones. On the other hand, the numerical values might also be biased due to precision error. To find out which values are correct I simplified the process to have only deterministic unit jumps, but the measurements were still biased (this will be discussed in detail in the next post). Further simplification led to a simple Poisson process, which is the focus of this post.</p>

<p>The two methods of Poisson process simulation mentioned above are widely covered in simulation books. However, I have not found any information on which method is better, or even on the speed of convergence. So I implemented my own versions of the algorithms (both can be found in the references below). Note that my implementation is probably far from efficient, but my goal is rather to compare visually how these algorithms converge.</p>

<h3 id="method-1">Method 1</h3>

<p>This algorithm exploits the fact that interarrival times are exponentially distributed. We simulate the arrival times until the time horizon is reached.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sim_pp1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="n">rate</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">

    </span><span class="n">jumps_time</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rexp</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">rate</span><span class="p">)</span><span class="w">

    </span><span class="k">while</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">[</span><span class="nf">length</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">)]</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

        </span><span class="n">jump</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">[</span><span class="nf">length</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">)],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w">
                         </span><span class="n">jumps_time</span><span class="p">[</span><span class="nf">length</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">)],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">  </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
                       </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

        </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w"> </span><span class="n">jump</span><span class="p">)</span><span class="w">

        </span><span class="n">jumps_time</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">,</span><span class="w">
                        </span><span class="n">jumps_time</span><span class="p">[</span><span class="nf">length</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">)]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rexp</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">rate</span><span class="p">))</span><span class="w">
    </span><span class="p">}</span><span class="w">

    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w">
                  </span><span class="nf">c</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">

    </span><span class="nf">list</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w"> </span><span class="n">jumps_time</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
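
<p>When only the arrival times are needed, the same idea can be written more compactly by drawing interarrival times in batches and accumulating them with <code class="language-plaintext highlighter-rouge">cumsum()</code>. This is a sketch under assumptions: the helper name <code class="language-plaintext highlighter-rouge">sim_arrivals_exp</code> and the batch size are hypothetical, and unlike the list returned above, arrivals beyond the horizon <code class="language-plaintext highlighter-rouge">t</code> are dropped.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: arrival times only, no path matrix
sim_arrivals_exp &lt;- function(t, rate) {
    arrivals &lt;- numeric(0)
    last &lt;- 0
    # Batch size based on the expected number of jumps (arbitrary margin)
    batch &lt;- max(10, ceiling(1.5 * rate * t))
    while (last &lt; t) {
        # Arrival times are cumulative sums of Exponential interarrival times
        new &lt;- last + cumsum(rexp(batch, rate))
        arrivals &lt;- c(arrivals, new)
        last &lt;- new[batch]
    }
    arrivals[arrivals &lt; t]        # drop arrivals past the horizon
}

set.seed(2018)
length(sim_arrivals_exp(t = 10, rate = 1))
</code></pre></div></div>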

<h3 id="method-2">Method 2</h3>

<p>This method simulates the number of jumps as a Poisson random variable with rate equal to the product of the time horizon and the process’s rate. Then, to obtain the arrival times, uniformly distributed random variables are generated and sorted (again, these algorithms are well known and described in detail in the references).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sim_pp2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="n">rate</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">

    </span><span class="n">jumps_number</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rpois</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rate</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">t</span><span class="p">)</span><span class="w">
    </span><span class="n">jumps_time</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">jumps_number</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">sort</span><span class="p">()</span><span class="w">

    </span><span class="k">for</span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_along</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">jump</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">jumps_time</span><span class="p">[</span><span class="n">j</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w">
                         </span><span class="n">jumps_time</span><span class="p">[</span><span class="n">j</span><span class="p">],</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">  </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
                       </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
        </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w"> </span><span class="n">jump</span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">

    </span><span class="n">path</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w">
                  </span><span class="nf">c</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">

    </span><span class="nf">list</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w"> </span><span class="n">jumps_time</span><span class="p">)</span><span class="w">

</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h3 id="validation">Validation</h3>

<p>Now, let’s check a couple of things, such as the mean and variance of interarrival times and their histogram, for both methods.</p>

<p>For the first method:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"magrittr"</span><span class="p">)</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="n">path1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sim_pp1</span><span class="p">(</span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">path1</span><span class="p">[[</span><span class="m">2</span><span class="p">]]));</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">path1</span><span class="p">[[</span><span class="m">2</span><span class="p">]]))</span><span class="w">
</span><span class="c1"># [1] 1.029312</span><span class="w">
</span><span class="c1"># [1] 0.9722406</span><span class="w">

</span><span class="n">data.frame</span><span class="p">(</span><span class="n">it</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">diff</span><span class="p">(</span><span class="n">path1</span><span class="p">[[</span><span class="m">2</span><span class="p">]]))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">it</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">..density..</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">stat_function</span><span class="p">(</span><span class="n">fun</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dexp</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">24</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p><img src="https://irudnyts.github.io/images/posts/2018-03-09-simulating-poisson-process-part-1/h1.png" alt="" /></p>

<p>And for the second:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">path2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sim_pp2</span><span class="p">(</span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">path2</span><span class="p">[[</span><span class="m">2</span><span class="p">]]));</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">path2</span><span class="p">[[</span><span class="m">2</span><span class="p">]]))</span><span class="w">
</span><span class="c1"># [1] 1.006302</span><span class="w">
</span><span class="c1"># [1] 1.066079</span><span class="w">

</span><span class="n">data.frame</span><span class="p">(</span><span class="n">it</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">diff</span><span class="p">(</span><span class="n">path2</span><span class="p">[[</span><span class="m">2</span><span class="p">]]))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">it</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">..density..</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">stat_function</span><span class="p">(</span><span class="n">fun</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dexp</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">24</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-09-simulating-poisson-process-part-1/h2.png" alt="" /></p>

<p>All values seem to be in line with theory: the expected value and the variance of interarrival times are both close to one (given the unit rate of the Poisson process), and the histograms follow the shape of the exponential density.</p>
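
<p>The visual check can be complemented by a formal goodness-of-fit test. The following is only a minimal sketch and not part of the original analysis: to stay self-contained it draws unit-rate interarrival times directly with <code class="language-plaintext highlighter-rouge">rexp()</code> as a stand-in for <code class="language-plaintext highlighter-rouge">diff(path1[[2]])</code>, and applies the one-sample Kolmogorov-Smirnov test from base R against the exponential distribution.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)

# Stand-in for diff(path1[[2]]): 1000 interarrival times of a unit-rate process
interarrivals &lt;- rexp(n = 1000, rate = 1)

# H0: interarrival times follow an exponential distribution with rate 1
ks &lt;- ks.test(interarrivals, "pexp", rate = 1)
ks$p.value
</code></pre></div></div>

<p>A large p-value means the test finds no evidence against the exponential distribution; it does not prove that the simulation algorithm is correct.</p>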

<h3 id="convergence">Convergence</h3>

<p>Mathematically, both methods should have more or less the same speed of convergence. To check this, we simulate 2000 paths with both methods and then estimate the expected value of the process at time ten as a function of the number of simulations.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">t</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">2000</span><span class="w">
</span><span class="n">rate</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">

</span><span class="n">paths1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_pp1</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="n">rate</span><span class="p">),</span><span class="w"> </span><span class="n">simplify</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">means1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w">
                 </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
                     </span><span class="n">pathes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paths1</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="p">]</span><span class="w">
                     </span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">pathes</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="p">[[</span><span class="m">1</span><span class="p">]][</span><span class="n">nrow</span><span class="p">(</span><span class="n">y</span><span class="p">[[</span><span class="m">1</span><span class="p">]]),</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
                 </span><span class="p">})</span><span class="w">

</span><span class="n">paths2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_pp2</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="w"> </span><span class="n">rate</span><span class="p">),</span><span class="w"> </span><span class="n">simplify</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">means2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w">
                 </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
                     </span><span class="n">pathes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paths2</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="p">]</span><span class="w">
                     </span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">pathes</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="p">[[</span><span class="m">1</span><span class="p">]][</span><span class="n">nrow</span><span class="p">(</span><span class="n">y</span><span class="p">[[</span><span class="m">1</span><span class="p">]]),</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
                 </span><span class="p">})</span><span class="w">

</span><span class="n">rbind</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">means1</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"1"</span><span class="p">),</span><span class="w">
    </span><span class="n">data.frame</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">means2</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">method</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rate</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">24</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://irudnyts.github.io/images/posts/2018-03-09-simulating-poisson-process-part-1/c1.png" alt="" /></p>
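
<p>The running means above are recomputed from scratch for every <code class="language-plaintext highlighter-rouge">x</code>, which costs quadratic time in the number of paths. A possible shortcut, sketched here under the assumption that the terminal values are first collected into a plain numeric vector, is a cumulative sum. To keep the snippet self-contained, the terminal values are drawn with <code class="language-plaintext highlighter-rouge">rpois()</code> rather than taken from <code class="language-plaintext highlighter-rouge">paths1</code>; the names <code class="language-plaintext highlighter-rouge">terminal</code>, <code class="language-plaintext highlighter-rouge">running_mean</code>, and <code class="language-plaintext highlighter-rouge">naive</code> are made up for illustration.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
t &lt;- 10
n &lt;- 2000
rate &lt;- 1

# Stand-in for sapply(paths1, function(y) y[[1]][nrow(y[[1]]), 2])
terminal &lt;- rpois(n, lambda = rate * t)

# Running mean of the first x terminal values, for x = 1, ..., n
running_mean &lt;- cumsum(terminal) / seq_along(terminal)

# The explicit loop over prefixes, kept only for comparison
naive &lt;- sapply(seq_len(n), function(x) mean(terminal[1:x]))
all.equal(running_mean, naive)
</code></pre></div></div>

<p>Both versions should agree up to floating-point error, so the cumulative-sum form can replace the nested <code class="language-plaintext highlighter-rouge">sapply()</code> calls when the number of paths grows.</p>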

<p>Indeed, visually the estimates of the expected value converge at approximately the same speed. However, I had problems with probabilities, so below I perform the same procedure for the probability that a path is at or below ten at time ten.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">paths1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2000</span><span class="p">,</span><span class="w"> </span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_pp1</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">simplify</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">probs1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2000</span><span class="p">,</span><span class="w">
                 </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
                     </span><span class="n">pathes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paths1</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="p">]</span><span class="w">
                     </span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">pathes</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="p">[[</span><span class="m">1</span><span class="p">]][</span><span class="n">nrow</span><span class="p">(</span><span class="n">y</span><span class="p">[[</span><span class="m">1</span><span class="p">]]),</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
                 </span><span class="p">})</span><span class="w">

</span><span class="n">paths2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2000</span><span class="p">,</span><span class="w"> </span><span class="n">expr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_pp2</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">simplify</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">probs2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2000</span><span class="p">,</span><span class="w">
                 </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
                     </span><span class="n">pathes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paths2</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="p">]</span><span class="w">
                     </span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">pathes</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="p">[[</span><span class="m">1</span><span class="p">]][</span><span class="n">nrow</span><span class="p">(</span><span class="n">y</span><span class="p">[[</span><span class="m">1</span><span class="p">]]),</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
                 </span><span class="p">})</span><span class="w">

</span><span class="n">rbind</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">probs1</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"1"</span><span class="p">),</span><span class="w">
      </span><span class="n">data.frame</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">probs2</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">prob</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">method</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ppois</span><span class="p">(</span><span class="n">q</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">rate</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">24</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p><img src="https://irudnyts.github.io/images/posts/2018-03-09-simulating-poisson-process-part-1/c2.png" alt="" /></p>
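
<p>How far the curves may wander from the horizontal line can be quantified. As a rough sketch (this is the textbook standard error of a sample proportion, not something computed in the original analysis), an estimate based on $n$ independent paths has standard deviation $\sqrt{p (1 - p) / n}$, where $p$ is the true probability. The names <code class="language-plaintext highlighter-rouge">p</code>, <code class="language-plaintext highlighter-rouge">n_paths</code>, and <code class="language-plaintext highlighter-rouge">se</code> are chosen here for illustration.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># True P(N(10) &lt;= 10) for a unit-rate Poisson process
p &lt;- ppois(q = 10, lambda = 10 * 1)
n_paths &lt;- 2000

# Standard error of the Monte Carlo estimate of p based on n_paths paths
se &lt;- sqrt(p * (1 - p) / n_paths)

# Approximate 95% band around the true value
c(lower = p - 1.96 * se, upper = p + 1.96 * se)
</code></pre></div></div>

<p>If both curves end up inside a band of this width, the remaining gap between the methods is compatible with pure sampling noise.</p>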

<p>Again, the methods seem to have the same performance. This is a good sign, because now I can compare methods for slightly more complicated models without being afraid that differences might be due to the Poisson process simulation algorithms.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A couple of weeks ago a colleague of mine asked me for help estimating the Gerber-Shiu function by Monte Carlo methods. The function is used in ruin theory for risk processes. One can think about this function as an equivalent of a moment generating function. That is, if the function is known, it is easy to derive certain measures of interest, for instance, a ruin probability. My colleague wants to estimate this function for an extension of the Cramér–Lundberg model that includes positive jumps (capital injections). At first glance it seems a trivial task, but when I started approaching it, this problem turned out to be not so easy to solve.]]></summary></entry><entry><title type="html">📊 [archived] Multinomial regression in R</title><link href="http://irudnyts.github.io//multinomial-regression/" rel="alternate" type="text/html" title="📊 [archived] Multinomial regression in R" /><published>2018-01-24T00:00:00+00:00</published><updated>2018-01-24T00:00:00+00:00</updated><id>http://irudnyts.github.io//multinomial-regression</id><content type="html" xml:base="http://irudnyts.github.io//multinomial-regression/"><![CDATA[<p>In my current project on <a href="https://www.youtube.com/watch?v=kLf6SVEMd94">Long-term care</a> at some point we were required to use a regression model with multinomial responses. I was very surprised that, in contrast to the well-covered binomial GLM for the binary response case, the multinomial case is poorly described. Surely, there are half a dozen packages overlapping each other; however, there is no sound tutorial or vignette. Hopefully, my post will improve the current state.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<p>We can distinguish two types of multinomial responses, namely nominal and ordinal. For a nominal response, a variable takes a value from a predefined finite set, and these values are not ordered. For instance, a variable <code class="language-plaintext highlighter-rouge">color</code> can be either <code class="language-plaintext highlighter-rouge">red</code>, <code class="language-plaintext highlighter-rouge">green</code>, or <code class="language-plaintext highlighter-rouge">blue</code>. In machine learning the problem is often referred to as classification. In contrast to the nominal case, for an ordinal response variable the set of values has a relative ordering. For example, a variable <code class="language-plaintext highlighter-rouge">size</code> can be <code class="language-plaintext highlighter-rouge">small &lt; middle &lt; large</code>. Furthermore, depending on the link function we can have logit or probit models.</p>

<h2 id="nominal-response-models">Nominal response models</h2>

<p>According to Agresti (2002), the problem can be formulated by two similar approaches: through baseline-category logits or a multivariate GLM. In general, these two approaches are equivalent and give identical maximum-likelihood estimates; the only difference is the formula representation.</p>

<h3 id="baseline-category-logits-multinomial-logit-model">Baseline-category logits (multinomial logit model)</h3>

<p>The baseline-category logit model is implemented in three distinct packages, namely <code class="language-plaintext highlighter-rouge">nnet::multinom()</code> (referred to there as a log-linear model), <code class="language-plaintext highlighter-rouge">mlogit::mlogit()</code>, and <code class="language-plaintext highlighter-rouge">mnlogit::mnlogit()</code> (which claims to be a more efficient implementation than <code class="language-plaintext highlighter-rouge">mlogit</code>, see <a href="https://www.r-bloggers.com/comparing-mnlogit-and-mlogit-for-discrete-choice-models/">a comparison of the performance of these packages</a>).</p>

<p>Let $p_j = \mathbb{P}(Y = j \mid \boldsymbol{x})$ be the probability that the dependent variable $Y$ takes value $j$ given a vector $\boldsymbol{x}$ of explanatory variables’ values. In total, there are $J$ categories, and, by the second axiom of probability, $\sum_j p_j = 1$. We fix a baseline category at level $J$ (or at any other level), and the model is as follows:</p>

\[\log \frac{p_j}{p_J} = \alpha_j + \boldsymbol{\beta}'_j \boldsymbol{x}, \quad j = 1, ..., J - 1,\]

<p>describing the effects of the explanatory variables $\boldsymbol{x}$ on the log odds of level $j$ versus the baseline level. Of course, using these $J-1$ equations and the second axiom it’s possible to come back to probabilities (which is a nice exercise, by the way):</p>

\[p_j = \frac{\exp(\alpha_j + \boldsymbol{\beta}'_j \boldsymbol{x})}{1 + \sum_{h = 1}^{J-1}\exp(\alpha_h + \boldsymbol{\beta}'_h \boldsymbol{x})}\]
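
<p>A sketch of that exercise: exponentiating the model gives $p_j = p_J \exp(\alpha_j + \boldsymbol{\beta}'_j \boldsymbol{x})$ for $j = 1, \dots, J - 1$. Summing over all categories and using $\sum_j p_j = 1$ yields</p>

\[1 = p_J \left(1 + \sum_{h = 1}^{J-1}\exp(\alpha_h + \boldsymbol{\beta}'_h \boldsymbol{x})\right) \quad \Longrightarrow \quad p_J = \frac{1}{1 + \sum_{h = 1}^{J-1}\exp(\alpha_h + \boldsymbol{\beta}'_h \boldsymbol{x})},\]

<p>and substituting $p_J$ back into the first identity gives the expression above for $j = 1, \dots, J - 1$.</p>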

<p>For each category $j$ the parameters $\alpha_j$ and $\boldsymbol{\beta}_j$ are distinct. Let’s now estimate $\alpha_j$ and $\boldsymbol{\beta}_j$, $j = 1, \dots, J - 1$, with different packages and make sure that the estimates are identical. I use the <code class="language-plaintext highlighter-rouge">marital.nz</code> data from the <code class="language-plaintext highlighter-rouge">VGAM</code> package.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># install.packages("VGAM")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">VGAM</span><span class="p">)</span><span class="w">
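</span><span class="n">data</span><span class="p">(</span><span class="n">marital.nz</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">marital.nz</span><span class="p">)</span><span class="w">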
</span><span class="c1">#   age ethnicity            mstatus</span><span class="w">
</span><span class="c1"># 1  29  European             Single</span><span class="w">
</span><span class="c1"># 2  55  European  Married/Partnered</span><span class="w">
</span><span class="c1"># 3  44  European  Married/Partnered</span><span class="w">
</span><span class="c1"># 4  53  European Divorced/Separated</span><span class="w">
</span><span class="c1"># 5  45  European  Married/Partnered</span><span class="w">
</span><span class="c1"># 7  30  European             Single</span><span class="w">
</span><span class="n">unique</span><span class="p">(</span><span class="n">marital.nz</span><span class="o">$</span><span class="n">mstatus</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] Single             Married/Partnered  Divorced/Separated Widowed           </span><span class="w">
</span><span class="c1"># Levels: Divorced/Separated Married/Partnered Single Widowed</span><span class="w">
</span></code></pre></div></div>

<p>The data contains “marital data mainly from a large NZ company collected in the early 1990s”. The dependent variable <code class="language-plaintext highlighter-rouge">mstatus</code> has four unordered classes: <code class="language-plaintext highlighter-rouge">Divorced/Separated</code>, <code class="language-plaintext highlighter-rouge">Married/Partnered</code>, <code class="language-plaintext highlighter-rouge">Single</code>, and <code class="language-plaintext highlighter-rouge">Widowed</code>. We use <code class="language-plaintext highlighter-rouge">age</code> as the only explanatory variable.</p>

<ul>
  <li>Package <code class="language-plaintext highlighter-rouge">nnet</code></li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">nnet</span><span class="p">)</span><span class="w">
</span><span class="n">fit_nnet</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">multinom</span><span class="p">(</span><span class="n">mstatus</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">marital.nz</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">fit_nnet</span><span class="p">)</span><span class="w">
</span><span class="c1">#                   (Intercept)          age</span><span class="w">
</span><span class="c1"># Married/Partnered    2.778686 -0.003538729</span><span class="w">
</span><span class="c1"># Single               6.368064 -0.152745520</span><span class="w">
</span><span class="c1"># Widowed             -6.753123  0.099333903</span><span class="w">
</span></code></pre></div></div>
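
<p>To connect these coefficients with the probability formula above, here is a minimal sketch that rebuilds the fitted probabilities by hand. It refits the same model so that it can be run on its own. The age of 40 and the names <code class="language-plaintext highlighter-rouge">beta</code>, <code class="language-plaintext highlighter-rouge">eta</code>, <code class="language-plaintext highlighter-rouge">p_manual</code>, and <code class="language-plaintext highlighter-rouge">p_pred</code> are arbitrary choices made for illustration.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(nnet)
data("marital.nz", package = "VGAM")

fit_nnet &lt;- multinom(mstatus ~ age, marital.nz, trace = FALSE)

# Linear predictors of the three non-baseline categories at an illustrative age
beta &lt;- coef(fit_nnet)   # 3 x 2 matrix with columns (Intercept) and age
eta &lt;- beta[, "(Intercept)"] + beta[, "age"] * 40

# The baseline category (first factor level) has linear predictor 0, i.e. exp(0) = 1
p_manual &lt;- c(1, exp(eta)) / (1 + sum(exp(eta)))

# Compare with the probabilities returned by predict()
p_pred &lt;- predict(fit_nnet, newdata = data.frame(age = 40), type = "probs")
all.equal(unname(p_manual), unname(p_pred))
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">multinom()</code> takes the first factor level (here <code class="language-plaintext highlighter-rouge">Divorced/Separated</code>) as the baseline, so the role of level $J$ in the formulas is played by the first level in the output.</p>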

<ul>
  <li>Package <code class="language-plaintext highlighter-rouge">mlogit</code></li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">mlogit</span><span class="p">)</span><span class="w">
</span><span class="n">fit_mlogit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mlogit</span><span class="p">(</span><span class="n">mstatus</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">marital.nz</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wide"</span><span class="p">)</span><span class="w">
</span><span class="n">matrix</span><span class="p">(</span><span class="n">fit_mlogit</span><span class="o">$</span><span class="n">coefficients</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1">#           [,1]         [,2]</span><span class="w">
</span><span class="c1"># [1,]  2.778666 -0.003538297</span><span class="w">
</span><span class="c1"># [2,]  6.368056 -0.152745424</span><span class="w">
</span><span class="c1"># [3,] -6.753157  0.099334560</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Package <code class="language-plaintext highlighter-rouge">mnlogit</code></li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">mnlogit</span><span class="p">)</span><span class="w">
</span><span class="n">marital.nz_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mlogit.data</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">marital.nz</span><span class="p">,</span><span class="w"> </span><span class="n">choice</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mstatus"</span><span class="p">)</span><span class="w">
</span><span class="n">fit_mnlogit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mnlogit</span><span class="p">(</span><span class="n">mstatus</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">marital.nz_long</span><span class="p">)</span><span class="w">
</span><span class="n">matrix</span><span class="p">(</span><span class="n">fit_mnlogit</span><span class="o">$</span><span class="n">coefficients</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">#           [,1]         [,2]</span><span class="w">
</span><span class="c1"># [1,]  2.778666 -0.003538297</span><span class="w">
</span><span class="c1"># [2,]  6.368056 -0.152745424</span><span class="w">
</span><span class="c1"># [3,] -6.753157  0.099334560</span><span class="w">
</span></code></pre></div></div>

<p>Even though the latter package is very efficient and customizable, there are several points I am not a big fan of. First off, <code class="language-plaintext highlighter-rouge">mnlogit</code> works <em>only</em> with data in long format rather than the wide format that is common and familiar in regression. That’s why we had to use <code class="language-plaintext highlighter-rouge">mlogit.data</code> to convert the data. Second, the formula’s syntax is too confusing despite its customizability. Of course, the list is not exhaustive; other packages exist, e.g. <a href="https://cran.r-project.org/web/packages/brglm2/vignettes/multinomial.html">brglm2</a>.</p>

<h3 id="multinomial-logit-model-as-multivariate-glm">Multinomial logit model as multivariate GLM</h3>

<p>For this model, instead of treating the response variable as a scalar, we set it to be a vector of $J-1$ elements (the $J$-th is redundant). Then, $\boldsymbol{y}_i = (y_{i,1}, \dots, y_{i, J-1})'$ and $\boldsymbol{\mu}_i = (p_{i,1}, \dots, p_{i, J-1})'$. Therefore,</p>

\[g_j(\boldsymbol{\mu}_i) = \log \frac{\mu_{i,j}}{1 - (\mu_{i,1}+...+\mu_{i, J-1})}\]

<p>and</p>

\[\boldsymbol{g}(\boldsymbol{\mu}_i) = \boldsymbol{X}_i \boldsymbol{\beta}\]

<p>where $\boldsymbol{g}$ is a vector of link functions.</p>
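
<p>As a small illustration, take $J = 4$ and a single covariate $x_i$, as in the <code class="language-plaintext highlighter-rouge">marital.nz</code> example. One possible way of stacking the parameters (intercepts first, slopes second; this ordering is an assumption of the sketch, not a requirement of the model) is</p>

\[\boldsymbol{X}_i = \begin{pmatrix} 1 &amp; 0 &amp; 0 &amp; x_i &amp; 0 &amp; 0 \\ 0 &amp; 1 &amp; 0 &amp; 0 &amp; x_i &amp; 0 \\ 0 &amp; 0 &amp; 1 &amp; 0 &amp; 0 &amp; x_i \end{pmatrix}, \quad \boldsymbol{\beta} = (\alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \beta_3)',\]

<p>so that the $j$-th element of $\boldsymbol{X}_i \boldsymbol{\beta}$ is $\alpha_j + \beta_j x_i$, i.e. the linear predictor of the $j$-th baseline-category logit.</p>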

<p>The package <code class="language-plaintext highlighter-rouge">VGAM</code> deals exactly with cases of multivariate GLM and GAM. Let’s compute estimates for this model, which should coincide with the previously calculated ones:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">VGAM</span><span class="p">)</span><span class="w">
</span><span class="n">fit_vgam</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">vglm</span><span class="p">(</span><span class="n">mstatus</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">multinomial</span><span class="p">(</span><span class="n">refLevel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
                 </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">marital.nz</span><span class="p">)</span><span class="w">
</span><span class="n">matrix</span><span class="p">(</span><span class="n">fit_vgam</span><span class="o">@</span><span class="n">coefficients</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1">#           [,1]         [,2]</span><span class="w">
</span><span class="c1"># [1,]  2.778666 -0.003538297</span><span class="w">
</span><span class="c1"># [2,]  6.368056 -0.152745424</span><span class="w">
</span><span class="c1"># [3,] -6.753157  0.099334560</span><span class="w">
</span></code></pre></div></div>

<h2 id="ordinal-response-model-proportional-odds-model">Ordinal response model: proportional odds model</h2>

<p>For an ordinal response variable the model is slightly different. Let $Y$ be a categorical response variable with $J$ categories which are ordered $1&lt;\dots&lt;J$. Therefore, it is possible to define cumulative probabilities as</p>

\[\mathbb{P}(Y \leq j \mid \boldsymbol{x}) = p_1 + ... + p_j, \quad j = 1, ..., J\]

<p>Then, cumulative logits are:</p>

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x})) = \log\frac{\mathbb{P}(Y \leq j \mid \boldsymbol{x})}{1 - \mathbb{P}(Y \leq j \mid \boldsymbol{x})} = \log\frac{p_1 + ... + p_j}{p_{j+1} + ...+ p_J}, \quad j = 1, ..., J - 1\]

<p>Let’s now link the cumulative logits to the explanatory variables $\boldsymbol{x}$:</p>

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x})) = \alpha_j + \boldsymbol{\beta}' \boldsymbol{x}, \quad j = 1, ..., J-1\]

<p>Note that $\boldsymbol{\beta}$ is the same for each logit. However, the intercepts $\alpha_j$ can differ and are necessarily non-decreasing in $j$, because $\mathbb{P}(Y \leq j \mid \boldsymbol{x})$ is non-decreasing in $j$.</p>

<p>The model got its name from its property:</p>

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x}_1)) - \text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x}_2)) = \log\frac{\mathbb{P}(Y \leq j \mid \boldsymbol{x}_1) / \mathbb{P}(Y &gt; j \mid \boldsymbol{x}_1)}{\mathbb{P}(Y \leq j \mid \boldsymbol{x}_2) / \mathbb{P}(Y &gt; j \mid \boldsymbol{x}_2)} = \boldsymbol{\beta}' (\boldsymbol{x}_1 - \boldsymbol{x}_2)\]
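
<p>To spell the property out, one can exponentiate both sides, which gives the ratio of cumulative odds:</p>

\[\frac{\mathbb{P}(Y \leq j \mid \boldsymbol{x}_1) / \mathbb{P}(Y &gt; j \mid \boldsymbol{x}_1)}{\mathbb{P}(Y \leq j \mid \boldsymbol{x}_2) / \mathbb{P}(Y &gt; j \mid \boldsymbol{x}_2)} = \exp\left(\boldsymbol{\beta}' (\boldsymbol{x}_1 - \boldsymbol{x}_2)\right), \quad j = 1, ..., J - 1\]

<p>The right-hand side does not depend on $j$: the cumulative odds at $\boldsymbol{x}_1$ are proportional to those at $\boldsymbol{x}_2$ with the same constant for every cut point. For instance, with a single covariate, a one-unit increase multiplies all $J - 1$ cumulative odds by the same factor $\exp(\beta)$.</p>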

<p>Again, there are at least four packages that fit the proportional odds model. Let’s quickly compare their estimates using the Italian household data for 2006, the dataset <code class="language-plaintext highlighter-rouge">ecb06.it</code> from the <code class="language-plaintext highlighter-rouge">VGAMdata</code> package. We try to explain the ordinal variable <code class="language-plaintext highlighter-rouge">education</code> with 8 levels by the numeric variable <code class="language-plaintext highlighter-rouge">age</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># install.packages("VGAMdata")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">VGAMdata</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">ecb06.it</span><span class="p">)</span><span class="w">
</span><span class="c1"># str(ecb06.it)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">ecb06.it</span><span class="p">[,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"age"</span><span class="p">,</span><span class="w"> </span><span class="s2">"education"</span><span class="p">)])</span><span class="w">
</span><span class="c1">#    age     education</span><span class="w">
</span><span class="c1"># 1   58    highschool</span><span class="w">
</span><span class="c1"># 4   81 primaryschool</span><span class="w">
</span><span class="c1"># 5   52    highschool</span><span class="w">
</span><span class="c1"># 9   67  middleschool</span><span class="w">
</span><span class="c1"># 12  56  middleschool</span><span class="w">
</span><span class="c1"># 16  72 primaryschool</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Package <code class="language-plaintext highlighter-rouge">MASS</code></li>
</ul>

<p>Perhaps the most famous function is <code class="language-plaintext highlighter-rouge">MASS::polr</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">MASS</span><span class="p">)</span><span class="w">
</span><span class="n">fit_polr</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">polr</span><span class="p">(</span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">education</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ecb06.it</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">fit_polr</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">drop</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">]</span><span class="w">
</span><span class="c1">#                                  Value</span><span class="w">
</span><span class="c1"># age                        -0.06417893</span><span class="w">
</span><span class="c1"># none|primaryschool         -6.95688936</span><span class="w">
</span><span class="c1"># primaryschool|middleschool -4.51869196</span><span class="w">
</span><span class="c1"># middleschool|profschool    -3.06471919</span><span class="w">
</span><span class="c1"># profschool|highschool      -2.73295822</span><span class="w">
</span><span class="c1"># highschool|bachelors       -0.96907401</span><span class="w">
</span><span class="c1"># bachelors|masters          -0.89517059</span><span class="w">
</span><span class="c1"># masters|higherdegree        2.42815131</span><span class="w">
</span></code></pre></div></div>
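
<p>To see how these numbers map onto the cumulative logits, here is a minimal sketch that rebuilds the category probabilities by hand. It relies on the parameterization documented for <code class="language-plaintext highlighter-rouge">polr</code>, $\text{logit}(\mathbb{P}(Y \leq j \mid x)) = \zeta_j - \beta x$, and refits the model so that it can be run on its own. The age of 50 and the names <code class="language-plaintext highlighter-rouge">age_new</code>, <code class="language-plaintext highlighter-rouge">cum_prob</code>, <code class="language-plaintext highlighter-rouge">p_manual</code>, and <code class="language-plaintext highlighter-rouge">p_pred</code> are arbitrary choices made for illustration.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(MASS)
data("ecb06.it", package = "VGAMdata")

fit_polr &lt;- polr(education ~ age, data = ecb06.it)

# Cumulative probabilities P(Y &lt;= j | age) for j = 1, ..., J - 1
age_new &lt;- 50
cum_prob &lt;- plogis(fit_polr$zeta - coef(fit_polr)["age"] * age_new)

# Category probabilities are differences of consecutive cumulative probabilities
p_manual &lt;- diff(c(0, cum_prob, 1))

# Compare with the probabilities returned by predict()
p_pred &lt;- predict(fit_polr, newdata = data.frame(age = age_new), type = "probs")
all.equal(unname(p_manual), unname(p_pred))
</code></pre></div></div>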

<ul>
  <li>Package <code class="language-plaintext highlighter-rouge">VGAM</code></li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fit_vglm</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">vglm</span><span class="p">(</span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">education</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">propodds</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ecb06.it</span><span class="p">)</span><span class="w">
</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">fit_vglm</span><span class="o">@</span><span class="n">coefficients</span><span class="p">)</span><span class="w">
</span><span class="c1">#                      [,1]</span><span class="w">
</span><span class="c1"># (Intercept):1  6.95576156</span><span class="w">
</span><span class="c1"># (Intercept):2  4.51825182</span><span class="w">
</span><span class="c1"># (Intercept):3  3.06430069</span><span class="w">
</span><span class="c1"># (Intercept):4  2.73254206</span><span class="w">
</span><span class="c1"># (Intercept):5  0.96867493</span><span class="w">
</span><span class="c1"># (Intercept):6  0.89470432</span><span class="w">
</span><span class="c1"># (Intercept):7 -2.42867591</span><span class="w">
</span><span class="c1"># age           -0.06417086</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Package <code class="language-plaintext highlighter-rouge">ordinal</code></li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ordinal</span><span class="p">)</span><span class="w">
</span><span class="n">fit_clm</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">clm</span><span class="p">(</span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">education</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ecb06.it</span><span class="p">)</span><span class="w">
</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">fit_clm</span><span class="o">$</span><span class="n">coefficients</span><span class="p">)</span><span class="w">
</span><span class="c1">#                                  [,1]</span><span class="w">
</span><span class="c1"># none|primaryschool         -6.9557784</span><span class="w">
</span><span class="c1"># primaryschool|middleschool -4.5182645</span><span class="w">
</span><span class="c1"># middleschool|profschool    -3.0643131</span><span class="w">
</span><span class="c1"># profschool|highschool      -2.7325541</span><span class="w">
</span><span class="c1"># highschool|bachelors       -0.9686858</span><span class="w">
</span><span class="c1"># bachelors|masters          -0.8947152</span><span class="w">
</span><span class="c1"># masters|higherdegree        2.4286635</span><span class="w">
</span><span class="c1"># age                        -0.0641711</span><span class="w">
</span></code></pre></div></div>

<p>A nice thing about this package is that it allows for using different link functions, namely <code class="language-plaintext highlighter-rouge">"logit"</code>, <code class="language-plaintext highlighter-rouge">"probit"</code>, <code class="language-plaintext highlighter-rouge">"cloglog"</code>, <code class="language-plaintext highlighter-rouge">"loglog"</code>, and <code class="language-plaintext highlighter-rouge">"cauchit"</code>. To my regret, I know only <code class="language-plaintext highlighter-rouge">"logit"</code> and <code class="language-plaintext highlighter-rouge">"probit"</code> from this list.</p>

<ul>
  <li>Package <code class="language-plaintext highlighter-rouge">rms</code></li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">rms</span><span class="p">)</span><span class="w">
</span><span class="n">fit_lrm</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lrm</span><span class="p">(</span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">education</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ecb06.it</span><span class="p">)</span><span class="w">
</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">fit_lrm</span><span class="o">$</span><span class="n">coefficients</span><span class="p">)</span><span class="w">
</span><span class="c1">#                        [,1]</span><span class="w">
</span><span class="c1"># y&gt;=primaryschool  6.9557784</span><span class="w">
</span><span class="c1"># y&gt;=middleschool   4.5182645</span><span class="w">
</span><span class="c1"># y&gt;=profschool     3.0643131</span><span class="w">
</span><span class="c1"># y&gt;=highschool     2.7325541</span><span class="w">
</span><span class="c1"># y&gt;=bachelors      0.9686858</span><span class="w">
</span><span class="c1"># y&gt;=masters        0.8947152</span><span class="w">
</span><span class="c1"># y&gt;=higherdegree  -2.4286635</span><span class="w">
</span><span class="c1"># age              -0.0641711</span><span class="w">
</span></code></pre></div></div>

<p>This function was rather unstable. Adding more explanatory variables threw an error a couple of times.</p>

<p>The coefficients are consistent across packages (the differences in the signs of the intercepts arise because some functions model $\mathbb{P}(Y \leq j)$ and others $\mathbb{P}(Y \geq j)$), which is good.</p>
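
<p>The sign flip can be checked directly. The sketch below refits two of the models and puts the <code class="language-plaintext highlighter-rouge">polr</code> thresholds next to the negated <code class="language-plaintext highlighter-rouge">VGAM</code> intercepts. It assumes, as far as I understand the documentation, that <code class="language-plaintext highlighter-rouge">polr</code> works with $\mathbb{P}(Y \leq j)$ while <code class="language-plaintext highlighter-rouge">propodds</code> works with the reversed cumulative probabilities. The variable <code class="language-plaintext highlighter-rouge">is_intercept</code> is a helper introduced only here.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(MASS)
library(VGAM)
data("ecb06.it", package = "VGAMdata")

fit_polr &lt;- polr(education ~ age, data = ecb06.it)
fit_vglm &lt;- vglm(education ~ age, family = propodds, data = ecb06.it)

# Pick the intercepts of the VGAM fit by name rather than by position
is_intercept &lt;- grepl("Intercept", names(coef(fit_vglm)))

# The two columns should agree up to small numerical differences
cbind(polr = unname(fit_polr$zeta),
      vglm_negated = -unname(coef(fit_vglm)[is_intercept]))
</code></pre></div></div>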

<p>Perhaps you now have a question: which package should you use? Well, I do not know, just choose one and stick to it. I will probably use <code class="language-plaintext highlighter-rouge">VGAM</code>, since it covers various models and seems to be nicely documented.</p>

<p>References:</p>

<ul>
  <li>Agresti, A. (2002) Categorical Data Analysis, Second edition, Wiley</li>
  <li><a href="https://onlinecourses.science.psu.edu/stat504/node/176">STAT504</a></li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[In my current project on Long-term care at some point we were required to use a regression model with multinomial responses. I was very surprised that in contrast to well-covered binomial GLM for binary response case, multinomial case is poorly described. Surely, there are half-dozen packages overlapping each other, however, there is no sound tutorial or vignette. Hopefully, my post will improve the current state.]]></summary></entry><entry><title type="html">🥕 [archived] Dortmund real estate market analysis: data preprocessing with caret</title><link href="http://irudnyts.github.io//dortmund-real-estate-market-analysis-data-preprocessing-with-caret/" rel="alternate" type="text/html" title="🥕 [archived] Dortmund real estate market analysis: data preprocessing with caret" /><published>2017-10-25T00:00:00+00:00</published><updated>2017-10-25T00:00:00+00:00</updated><id>http://irudnyts.github.io//dortmund-real-estate-market-analysis-data-preprocessing-with-caret</id><content type="html" xml:base="http://irudnyts.github.io//dortmund-real-estate-market-analysis-data-preprocessing-with-caret/"><![CDATA[<p>This is rather a short note, which is more related to an amazing package <code class="language-plaintext highlighter-rouge">caret</code>, than to our data set. The package allows for manipulating the model with less typing, for instance cross-validation or data preprocessing can be done by just specifying a couple of arguments in the key function of package <code class="language-plaintext highlighter-rouge">train</code>.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<p>Perhaps I will leave median imputation and the $k$-nearest neighbors algorithm for better times, since the Dortmund real estate data set contains no missing values. Furthermore, <code class="language-plaintext highlighter-rouge">caret</code> makes it possible to apply various transformations of the data, e.g. centering, scaling, principal component analysis (PCA), independent component analysis (ICA), etc. As one can remember, the model with the smallest out-of-sample RMSE is the GAM with inverse Gaussian responses and log-link function. A GAM applies smooth functions to the regressors, and thus centering and scaling won’t improve our metric. Besides, I am not a big fan of centering and scaling, since both make the observations in the sample dependent. In other words, subtracting the sample mean from each observation makes this observation dependent on the whole sample.</p>

<p>On the other hand, PCA might be very useful, since <code class="language-plaintext highlighter-rouge">rooms</code> and <code class="language-plaintext highlighter-rouge">area</code> are correlated. Can this improve the model? Let’s experiment and see. As usual, we start by loading packages and data.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">packages</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">,</span><span class="w"> </span><span class="s2">"magrittr"</span><span class="p">,</span><span class="w"> </span><span class="s2">"vtreat"</span><span class="p">,</span><span class="w"> </span><span class="s2">"caret"</span><span class="p">)</span><span class="w">
</span><span class="n">sapply</span><span class="p">(</span><span class="n">packages</span><span class="p">,</span><span class="w"> </span><span class="n">library</span><span class="p">,</span><span class="w"> </span><span class="n">character.only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">logical.return</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1"># options(scipen=999)</span><span class="w">
</span><span class="n">rm</span><span class="p">(</span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ls</span><span class="p">())</span><span class="w">

</span><span class="n">setwd</span><span class="p">(</span><span class="s2">"/Users/irudnyts/Documents/data/"</span><span class="p">)</span><span class="w">
</span><span class="n">property</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s2">"dortmund.csv"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
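
<p>Before fitting anything, here is a tiny sketch of what PCA does to two correlated regressors. The data are synthetic and made up purely for illustration (they are not the Dortmund data), and base R’s <code class="language-plaintext highlighter-rouge">prcomp()</code> is used instead of <code class="language-plaintext highlighter-rouge">caret</code> to keep the example minimal.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)

# Synthetic, correlated "area" and "rooms" (illustrative values only)
area_toy &lt;- rnorm(100, mean = 80, sd = 20)
rooms_toy &lt;- round(area_toy / 25 + rnorm(100, sd = 0.5))
toy &lt;- data.frame(rooms = rooms_toy, area = area_toy)

# The original regressors are clearly correlated
cor(toy$rooms, toy$area)

# PCA on centered and scaled data
pc &lt;- prcomp(toy, center = TRUE, scale. = TRUE)

# The principal component scores are uncorrelated
# (off-diagonal entries are zero up to rounding)
round(cor(pc$x), 10)
</code></pre></div></div>

<p>So PCA replaces the correlated pair with uncorrelated components. Whether that helps a GAM is a separate, empirical question.</p>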

<p>Below, our GAM with inverse Gaussian (IG) outcome and log-link function is rewritten in <code class="language-plaintext highlighter-rouge">caret</code> syntax, yielding the same RMSE:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">train</span><span class="p">(</span><span class="n">price</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">rooms</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">area</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">property</span><span class="p">,</span><span class="w">
               </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inverse.gaussian</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"log"</span><span class="p">))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">property</span><span class="p">[,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"area"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rooms"</span><span class="p">)])</span><span class="w">
</span><span class="p">(</span><span class="n">pred</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">property</span><span class="p">[,</span><span class="w"> </span><span class="s2">"price"</span><span class="p">])</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mean</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">()</span><span class="w">
</span><span class="c1"># [1] 134.9973</span><span class="w">
</span></code></pre></div></div>
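
<p>The last line of the block above relies on <code class="language-plaintext highlighter-rouge">^</code> binding more tightly than <code class="language-plaintext highlighter-rouge">%&gt;%</code>, so the squared errors (not the number 2) are piped into <code class="language-plaintext highlighter-rouge">mean()</code>. A small sketch with made-up toy vectors and a hypothetical helper <code class="language-plaintext highlighter-rouge">rmse()</code> makes this explicit:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(magrittr)

# Root mean squared error of predictions against observed values
rmse &lt;- function(pred, obs) sqrt(mean((pred - obs)^2))

# Toy vectors, made up for illustration only
pred_toy &lt;- c(1, 2, 3)
obs_toy &lt;- c(1, 2, 5)

rmse(pred_toy, obs_toy)
# [1] 1.154701

# The piped version gives the same value, since ^ is evaluated before %&gt;%
all.equal((pred_toy - obs_toy) ^ 2 %&gt;% mean() %&gt;% sqrt(),
          rmse(pred_toy, obs_toy))
</code></pre></div></div>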

<p>To stay consistent with previous posts, I don’t utilize the power of the function <code class="language-plaintext highlighter-rouge">trainControl()</code>, which can be used for cross-validation. The model with PCA transformation is shown below:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_pca</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">train</span><span class="p">(</span><span class="n">price</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">rooms</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">area</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">property</span><span class="p">,</span><span class="w">
                   </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inverse.gaussian</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"log"</span><span class="p">),</span><span class="w">
                   </span><span class="n">preProcess</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pca"</span><span class="p">)</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model_pca</span><span class="p">,</span><span class="w"> </span><span class="n">property</span><span class="p">[,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"area"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rooms"</span><span class="p">)])</span><span class="w">
</span><span class="p">(</span><span class="n">pred</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">property</span><span class="p">[,</span><span class="w"> </span><span class="s2">"price"</span><span class="p">])</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mean</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">()</span><span class="w">
</span><span class="c1"># [1] 146.4144</span><span class="w">
</span></code></pre></div></div>

<p>The in-sample RMSE is higher than that of our best model, which is not very encouraging. Even though it makes little sense to go further, we calculate the out-of-sample RMSE.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">folds</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kWayCrossValidation</span><span class="p">(</span><span class="n">nRows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">property</span><span class="p">),</span><span class="w"> </span><span class="n">nSplits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">

</span><span class="n">pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">property</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">fold</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">folds</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">model_pca</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">train</span><span class="p">(</span><span class="n">price</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">rooms</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">area</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">property</span><span class="p">[</span><span class="n">fold</span><span class="o">$</span><span class="n">train</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w">
                       </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inverse.gaussian</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"log"</span><span class="p">),</span><span class="w">
                       </span><span class="n">preProcess</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pca"</span><span class="p">)</span><span class="w">

    </span><span class="n">pred</span><span class="p">[</span><span class="n">fold</span><span class="o">$</span><span class="n">app</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model_pca</span><span class="p">,</span><span class="w"> </span><span class="n">property</span><span class="p">[</span><span class="n">fold</span><span class="o">$</span><span class="n">app</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">((</span><span class="n">pred</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">property</span><span class="o">$</span><span class="n">price</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># [1] 151.7058</span><span class="w">
</span></code></pre></div></div>
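<p>For readers who want to see what the fold loop above does, here is a minimal sketch in plain Python. The helper names <code class="language-plaintext highlighter-rouge">k_way_folds()</code> and <code class="language-plaintext highlighter-rouge">cross_validated_rmse()</code> are hypothetical, the prices are made up, and the "model" is just the training mean. The sketch only mimics the <code class="language-plaintext highlighter-rouge">train</code>/<code class="language-plaintext highlighter-rouge">app</code> structure used in the loop and is not how <code class="language-plaintext highlighter-rouge">kWayCrossValidation()</code> is actually implemented.</p>

```python
import math
import random


def k_way_folds(n_rows, n_splits, seed=3):
    """Split row indices 0..n_rows-1 into n_splits disjoint hold-out sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = []
    for k in range(n_splits):
        app = sorted(indices[k::n_splits])  # rows to predict (held out)
        app_set = set(app)
        train = [i for i in range(n_rows) if i not in app_set]  # rows to fit on
        folds.append({"train": train, "app": app})
    return folds


def cross_validated_rmse(y, n_splits=3):
    """Pooled out-of-sample RMSE of a toy model that predicts the training mean."""
    pred = [None] * len(y)
    for fold in k_way_folds(len(y), n_splits):
        train_mean = sum(y[i] for i in fold["train"]) / len(fold["train"])
        for i in fold["app"]:
            pred[i] = train_mean  # every row is predicted by a model that never saw it
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y))


prices = [100.0, 120.0, 90.0, 150.0, 110.0, 130.0]  # made-up values, not the Dortmund data
folds = k_way_folds(len(prices), 3)
cv_rmse = cross_validated_rmse(prices)
```

<p>The point of the design is that each observation is predicted exactly once, by a model fitted without it, and the errors are pooled into a single RMSE at the end.</p>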

<p>I can summarize in one short sentence: PCA is not helpful for this case.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This is rather a short note, which is more related to an amazing package caret, than to our data set. The package allows for manipulating the model with less typing, for instance cross-validation or data preprocessing can be done by just specifying a couple of arguments in the key function of package train.]]></summary></entry><entry><title type="html">🔬 [archived] Dortmund real estate market analysis: neural networks</title><link href="http://irudnyts.github.io//dortmund-real-estate-market-analysis-neural-networks/" rel="alternate" type="text/html" title="🔬 [archived] Dortmund real estate market analysis: neural networks" /><published>2017-10-21T00:00:00+00:00</published><updated>2017-10-21T00:00:00+00:00</updated><id>http://irudnyts.github.io//dortmund-real-estate-market-analysis-neural-networks</id><content type="html" xml:base="http://irudnyts.github.io//dortmund-real-estate-market-analysis-neural-networks/"><![CDATA[<p>At every turn, the author of a non-technical post about AI for a broader audience deems it their duty to mention deep learning as a panacea for all woes. Well, it’s not. Deep learning is just one of various models, which might or might not perform better than other techniques. At the end of the day, in a nutshell, it’s just regular neural networks with multiple hidden layers between the input and output layers (well, it’s rather an oversimplification, but you get the idea). In this post I am curious whether a neural network approach can beat our best model so far (a GAM with an inverse Gaussian response distribution).</p>

<blockquote>
  <p><strong>Disclaimer:</strong> This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.</p>
</blockquote>

<p><img src="https://irudnyts.github.io/images/posts/2017-10-21-dortmund-real-estate-market-analysis-neural-networks/neuron.png" alt="" /></p>

<p>The nice thing about neural networks is that they allow for interactions between variables. Remember how we included several interaction terms in our simple linear, GLM, and GAM models? A neural network can represent much more complicated interactions by construction, so we do not have to specify them by hand. The more hidden layers we use, the more complex these interactions can be.</p>
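<p>To make the interaction claim a bit more tangible, here is a small sketch in plain Python. The weights are hand-picked for illustration (nothing is trained), and the function names are hypothetical. It compares a purely additive function with a single ReLU neuron using a simple contrast that is zero whenever the two inputs do not interact.</p>

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def hidden_unit(area, rooms):
    """One ReLU neuron with hand-picked (not trained) weights and bias."""
    return relu(1.0 * area + 1.0 * rooms - 1.0)


def additive(area, rooms):
    """A model without interaction: each input contributes separately."""
    return 2.0 * area + 3.0 * rooms


def interaction_contrast(f):
    """Zero for any additive f(a, b) = g(a) + h(b); non-zero signals an interaction."""
    return f(1, 1) - f(1, 0) - f(0, 1) + f(0, 0)


additive_contrast = interaction_contrast(additive)   # 5 - 2 - 3 + 0
neuron_contrast = interaction_contrast(hidden_unit)  # 1 - 0 - 0 + 0
```

<p>The additive function gives a contrast of zero, while the single neuron does not: the effect of one input depends on the value of the other, without any interaction term being written down explicitly.</p>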

<p>I admit my guilt to R users: when I first tried TensorFlow and <code class="language-plaintext highlighter-rouge">keras</code>, I did it in Python. Let me quickly go over the Python code chunks, and then I will come back to R (the code for which is pretty similar).</p>

<p>First, libraries and data should be loaded. The package <code class="language-plaintext highlighter-rouge">keras</code> is used as an API to TensorFlow; I also load <code class="language-plaintext highlighter-rouge">pandas</code> for its <code class="language-plaintext highlighter-rouge">DataFrame</code> and <code class="language-plaintext highlighter-rouge">numpy</code> for its arrays. The regressors (<code class="language-plaintext highlighter-rouge">area</code>, <code class="language-plaintext highlighter-rouge">rooms</code>) and the outcome (<code class="language-plaintext highlighter-rouge">price</code>) are stored in separate variables.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="nb">property</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'/Users/irudnyts/Documents/data/dortmund.csv'</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="nb">property</span><span class="p">[[</span><span class="s">"area"</span><span class="p">,</span> <span class="s">"rooms"</span><span class="p">]].</span><span class="n">values</span>
<span class="n">y</span> <span class="o">=</span> <span class="nb">property</span><span class="p">[[</span><span class="s">"price"</span><span class="p">]].</span><span class="n">values</span>
</code></pre></div></div>

<p>We need to initialize a <code class="language-plaintext highlighter-rouge">Sequential</code> model (layers are connected sequentially). We use a neural network with two hidden layers, each with 50 neurons and a rectified linear unit (ReLU) activation function. The input layer has 2 neurons (<code class="language-plaintext highlighter-rouge">area</code> and <code class="language-plaintext highlighter-rouge">rooms</code>), and the output layer has a single neuron (<code class="language-plaintext highlighter-rouge">price</code>). We use the standard <code class="language-plaintext highlighter-rouge">adam</code> (Adaptive Moment Estimation) optimizer and the standard mean squared error as the loss (objective) function. Finally, we slightly increase the number of epochs to 15 for the fit.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>

<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">50</span><span class="p">,</span> <span class="n">activation</span> <span class="o">=</span> <span class="s">'relu'</span><span class="p">,</span> <span class="n">input_shape</span> <span class="o">=</span> <span class="p">(</span><span class="mi">2</span><span class="p">,)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">50</span><span class="p">,</span> <span class="n">activation</span> <span class="o">=</span> <span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>

<span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span> <span class="o">=</span> <span class="s">'adam'</span><span class="p">,</span> <span class="n">loss</span> <span class="o">=</span> <span class="s">'mean_squared_error'</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">,</span> <span class="n">validation_split</span> <span class="o">=</span> <span class="mf">0.3</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">=</span> <span class="mi">15</span><span class="p">)</span>
</code></pre></div></div>
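<p>As a quick sanity check on the size of this architecture, the number of trainable parameters can be counted by hand. The sketch below is plain Python with a hypothetical helper, <code class="language-plaintext highlighter-rouge">dense_param_count()</code>; it assumes that every dense layer has one weight per input-unit pair plus one bias per unit.</p>

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


layer_sizes = [2, 50, 50, 1]  # input, two hidden layers, output
params_per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)  # 150 + 2550 + 51 = 2751
```

<p>So even this small network has 2751 parameters for two regressors, which is worth keeping in mind when comparing it with the much smaller GLM and GAM models.</p>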
<p>After the model is specified, compiled, and fitted, we can predict and calculate the in-sample RMSE.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predicted</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">)</span>

<span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">((</span><span class="n">predicted</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span>
<span class="c1"># 166.2982065207604
</span></code></pre></div></div>

<p>For an in-sample RMSE the result is not bad. However, it was only a first quick and dirty try. At this point I decided to switch back to R, and here is the equivalent code:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">packages</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"magrittr"</span><span class="p">,</span><span class="w"> </span><span class="s2">"keras"</span><span class="p">,</span><span class="w"> </span><span class="s2">"vtreat"</span><span class="p">)</span><span class="w">
</span><span class="n">sapply</span><span class="p">(</span><span class="n">packages</span><span class="p">,</span><span class="w"> </span><span class="n">library</span><span class="p">,</span><span class="w"> </span><span class="n">character.only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">logical.return</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="n">property</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s2">"/Users/irudnyts/Documents/data/dortmund.csv"</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">property</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">property</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">

</span><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">keras_model_sequential</span><span class="p">()</span><span class="w">

</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"relu"</span><span class="p">,</span><span class="w"> </span><span class="n">input_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"relu"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">compile</span><span class="p">(</span><span class="w">
    </span><span class="n">loss</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mean_squared_error"</span><span class="p">,</span><span class="w">
    </span><span class="n">optimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">optimizer_adam</span><span class="p">()</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">fit</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">epochs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">)</span><span class="w">

</span><span class="n">pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">

</span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">((</span><span class="n">pred</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># 166.152</span><span class="w">
</span></code></pre></div></div>

<p>Looks pretty similar to the Python code, right? OK, let’s play around with the tuning parameters and empirically find the optimal ones. For this purpose, we define a function that calculates the RMSE for a neural network with a given number of layers and neurons (assuming each layer has the same number of neurons). Note that while fitting the model we use early stopping, i.e. if the mean squared error does not improve by more than <code class="language-plaintext highlighter-rouge">min_delta = 0.01</code>, training stops at the current epoch.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">get_rmse</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n_layers</span><span class="p">,</span><span class="w"> </span><span class="n">n_neurons</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">stopifnot</span><span class="p">(</span><span class="n">n_layers</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
    </span><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">keras_model_sequential</span><span class="p">()</span><span class="w">
    </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
        </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_neurons</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"relu"</span><span class="p">,</span><span class="w"> </span><span class="n">input_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
    </span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="n">n_layers</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
            </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_neurons</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"relu"</span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">
    </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

    </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">compile</span><span class="p">(</span><span class="w">
        </span><span class="n">loss</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mean_squared_error"</span><span class="p">,</span><span class="w">
        </span><span class="n">optimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">optimizer_adam</span><span class="p">()</span><span class="w">
    </span><span class="p">)</span><span class="w">

    </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">fit</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">epochs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">,</span><span class="w">
                  </span><span class="n">callbacks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">callback_early_stopping</span><span class="p">(</span><span class="n">min_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w">
                                                      </span><span class="n">monitor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'loss'</span><span class="p">))</span><span class="w">

    </span><span class="n">pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
    </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">((</span><span class="n">pred</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
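<p>The stopping rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the actual implementation of the Keras callback: the function name <code class="language-plaintext highlighter-rouge">epochs_run()</code> and the loss values are made up, and training is assumed to stop at the first epoch that fails to improve.</p>

```python
def epochs_run(losses, min_delta=0.01):
    """Return the epoch at which training stops under a simple early-stopping rule.

    An epoch counts as an improvement only if the loss drops by more than
    min_delta below the best loss seen so far; the first epoch that fails
    to improve ends the training.
    """
    best = float("inf")
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            best = loss  # real improvement, keep training
        else:
            return epoch  # no sufficient improvement, stop here
    return len(losses)  # all epochs were used


toy_losses = [30000.0, 28000.0, 27500.0, 27499.995, 27000.0]  # made-up loss per epoch
stopped_at = epochs_run(toy_losses)                        # stops at epoch 4
stopped_at_small_delta = epochs_run(toy_losses, 0.0005)    # runs all 5 epochs
```

<p>With <code class="language-plaintext highlighter-rouge">min_delta = 0.01</code> the tiny improvement in the fourth epoch is not enough and training stops; with a smaller threshold the same sequence runs to the end. In other words, a smaller <code class="language-plaintext highlighter-rouge">min_delta</code> lets the model train longer.</p>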

<p>Then, we train models with 50 neurons per layer and a varying number of hidden layers, namely from 2 to 5:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">layers_summary</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">n_layers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">5</span><span class="p">,</span><span class="w">
                             </span><span class="n">rmse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="m">2</span><span class="o">:</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">get_rmse</span><span class="p">,</span><span class="w"> </span><span class="n">n_neurons</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">))</span><span class="w">
</span><span class="c1"># n_layers     rmse</span><span class="w">
</span><span class="c1">#        2 166.7639</span><span class="w">
</span><span class="c1">#        3 165.5580</span><span class="w">
</span><span class="c1">#        4 165.2328</span><span class="w">
</span><span class="c1">#        5 164.3651</span><span class="w">
</span></code></pre></div></div>
<p>The RMSE barely changes with additional layers, so 2 layers seem to be enough. Let’s now choose the number of neurons for each layer:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">neurons</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">))</span><span class="w">
</span><span class="n">neurons_summary</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">n_neurons</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">neurons</span><span class="p">,</span><span class="w">
                              </span><span class="n">rmse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">neurons</span><span class="p">,</span><span class="w"> </span><span class="n">get_rmse</span><span class="p">,</span><span class="w"> </span><span class="n">n_layers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># n_neurons     rmse</span><span class="w">
</span><span class="c1">#        10 296.4352</span><span class="w">
</span><span class="c1">#        12 165.5779</span><span class="w">
</span><span class="c1">#        14 167.7926</span><span class="w">
</span><span class="c1">#        16 210.7701</span><span class="w">
</span><span class="c1">#        18 166.0520</span><span class="w">
</span><span class="c1">#        20 166.3228</span><span class="w">
</span><span class="c1">#        30 165.0766</span><span class="w">
</span><span class="c1">#        40 165.8220</span><span class="w">
</span><span class="c1">#        50 165.6295</span><span class="w">
</span><span class="c1">#        60 165.6339</span><span class="w">
</span><span class="c1">#        70 165.5375</span><span class="w">
</span><span class="c1">#        80 165.9305</span><span class="w">
</span><span class="c1">#        90 164.7960</span><span class="w">
</span><span class="c1">#       100 164.7582</span><span class="w">
</span></code></pre></div></div>
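<p>One rough way to read the table above is to flag the layer sizes whose RMSE sits far from the rest. The sketch below, in plain Python, copies the numbers from the table and compares each RMSE with the median; the tolerance of 10 is an arbitrary choice made only for illustration, and a single run per size is of course a noisy basis for any conclusion.</p>

```python
import statistics

# (n_neurons, in-sample RMSE) pairs copied from the table above
results = [
    (10, 296.4352), (12, 165.5779), (14, 167.7926), (16, 210.7701),
    (18, 166.0520), (20, 166.3228), (30, 165.0766), (40, 165.8220),
    (50, 165.6295), (60, 165.6339), (70, 165.5375), (80, 165.9305),
    (90, 164.7960), (100, 164.7582),
]

median_rmse = statistics.median(rmse for _, rmse in results)
tolerance = 10.0  # arbitrary threshold, chosen only for illustration
unstable = [n for n, rmse in results if rmse - median_rmse > tolerance]
```

<p>Under these assumptions only the runs with 10 and 16 neurons stand out, while the sizes from 18 upwards stay within a couple of units of the median.</p>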

<p>The model with around 20 neurons looks stable. Thus, for our final model we use 2 layers with 20 neurons each to calculate the out-of-sample RMSE. The model will use a slightly larger number of potential epochs, since we decrease <code class="language-plaintext highlighter-rouge">min_delta</code> to 0.0005 to let the model train a bit more.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">folds</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kWayCrossValidation</span><span class="p">(</span><span class="n">nRows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">property</span><span class="p">),</span><span class="w"> </span><span class="n">nSplits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">

</span><span class="n">pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">property</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">fold</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">folds</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">keras_model_sequential</span><span class="p">()</span><span class="w">
    </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
        </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"relu"</span><span class="p">,</span><span class="w"> </span><span class="n">input_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
        </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"relu"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
        </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
    </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">compile</span><span class="p">(</span><span class="w">
        </span><span class="n">loss</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mean_squared_error"</span><span class="p">,</span><span class="w">
        </span><span class="n">optimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">optimizer_adam</span><span class="p">()</span><span class="w">
    </span><span class="p">)</span><span class="w">
    </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">fit</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">fold</span><span class="o">$</span><span class="n">train</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">[</span><span class="n">fold</span><span class="o">$</span><span class="n">train</span><span class="p">],</span><span class="w">
                  </span><span class="n">epochs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
                  </span><span class="n">callbacks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">callback_early_stopping</span><span class="p">(</span><span class="n">monitor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"loss"</span><span class="p">,</span><span class="w">
                                                      </span><span class="n">min_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.0005</span><span class="p">))</span><span class="w">

    </span><span class="n">pred</span><span class="p">[</span><span class="n">fold</span><span class="o">$</span><span class="n">app</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">fold</span><span class="o">$</span><span class="n">app</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w">
    </span><span class="n">rm</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">((</span><span class="n">pred</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># 165.5461</span><span class="w">
</span></code></pre></div></div>
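
<p>The loop above fills <code class="language-plaintext highlighter-rouge">pred</code> only with predictions from models that never saw the corresponding rows during training. A minimal sketch of the same out-of-fold pattern with base R only is shown below. The simulated <code class="language-plaintext highlighter-rouge">toy</code> data, the manual fold assignment, and <code class="language-plaintext highlighter-rouge">lm()</code> as a stand-in for the network are all assumptions made for illustration.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(3)

# Simulated stand-in data (not the Dortmund data set).
n &lt;- 90
toy &lt;- data.frame(x1 = runif(n), x2 = runif(n))
toy$y &lt;- 100 + 50 * toy$x1 - 20 * toy$x2 + rnorm(n, sd = 5)

# Randomly assign each row to one of 3 folds of equal size.
fold_id &lt;- sample(rep(1:3, length.out = n))

pred &lt;- rep(NA_real_, n)
for (k in 1:3) {
    app &lt;- which(fold_id == k)      # rows held out for prediction
    train &lt;- which(fold_id != k)    # rows used for fitting
    fit &lt;- lm(y ~ x1 + x2, data = toy[train, ])
    pred[app] &lt;- predict(fit, newdata = toy[app, ])
}

# Every row received exactly one out-of-fold prediction.
stopifnot(!anyNA(pred))

# Out-of-sample RMSE, computed the same way as above.
sqrt(mean((pred - toy$y) ^ 2))
</code></pre></div></div>

<p>Because each prediction comes from a model fitted without that row, the resulting RMSE estimates out-of-sample rather than in-sample error.</p>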

<p>Fortunately or unfortunately, the neural network has not outperformed our previous models, and the leader is still the GAM with an inverse Gaussian (IG) outcome. On the other hand, we have used the simplest (and when I say simplest, I do mean simplest) neural networks. With this post I finish the cycle of Dortmund real estate data analysis.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[At every turn in a non-technical post about AI for a broader audience, an author deems it their duty to mention deep learning as a panacea for all woes. Well, it’s not. Deep learning is just one of various models, which might or might not perform better than other techniques. At the end of the day, in a nutshell, it’s just regular neural networks with multiple hidden layers between the input and output layers (well, it’s rather an oversimplification, but you got it right). In this post I am curious whether it’s possible for a neural network approach to beat our best model so far (GAM with an inverse Gaussian response distribution).]]></summary></entry></feed>