📊 [archived] Multinomial regression in R

In my current project on Long-term care at some point we were required to use a regression model with multinomial responses. I was very surprised that in contrast to well-covered binomial GLM for binary response case, multinomial case is poorly described. Surely, there are half-dozen packages overlapping each other, however, there is no sound tutorial or vignette. Hopefully, my post will improve the current state.

Disclaimer: This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.

We can distinguish two types of multinominal responses, namely nominal and ordinal. For nominal response a variable can possess a value from predefined finite set and these values are not ordered. For instance a variable color can be either green or blue or green. In machine learning the problem is often referred to as a classification. In contrast to nominal case, for ordinal repose variable the set of values has the relative ordering. For example, a variable size can be small < middle < large. Furthermore, depending on a link function we can have logit or probit models.

Nominal response models

According to Agresti (2002) we can the problem can be formulated by two similar approaches: through baseline-category logits or multivariate GLM. In general, these two approaches are equivalent with identical maximum-likelihood estimates, the only thing which is different is the formula representation.

Baseline-category logits (multinomial logit model)

The baseline-category logits is implemented as a function in three distinct packages, namely nnet::multinom() (referred as to log-linear model), mlogit::mlogit, mnlogit::mnlogit (claims to be more efficient implementation than mlogit, see comparison of perfomances of these packages).

Let $p_j = \mathbb{P}(Y = j \mid \boldsymbol{x})$ is a probability of dependent variable $Y$ to have value $j$ given a vector of explanatory variables’ values $\boldsymbol{x}$. In total, there are $J$ categories, and obviously, due to second axiom of probability $\sum_j p_j = 1$. We fix a baseline category at level $J$ (or at any other level), and the model is as follows:

\[\log \frac{p_j}{p_J} = \alpha_j + \boldsymbol{\beta}'_j \boldsymbol{x}, \quad j = 1, ..., J - 1,\]

describing the effects of explanatory $\boldsymbol{x}$ on logits of odds between a level $j$ and baseline level. Of course, using these $J-1$ equations and the second axiom it’s possible to come back to probabilities (which is a nice exercise, by the way):

\[p_j = \frac{\exp(\alpha_j + \boldsymbol{\beta}'_j \boldsymbol{x})}{1 + \sum_{h = 1}^{J-1}\exp(\alpha_h + \boldsymbol{\beta}'_h \boldsymbol{x})}\]

For each group $j$ the set of parameters $\alpha_j$ and $\boldsymbol{\beta}_j$ are distinct. Let’s now estimate those $\alpha_j, \quad \boldsymbol{\beta}_j, \quad j = 1, …, J - 1$ by different packages and make sure that estimates are identical. I use marital.nz data from VGAM package.

# install.packages("VGAM")
library(VGAM)
data(marital.nz)
#   age ethnicity            mstatus
# 1  29  European             Single
# 2  55  European  Married/Partnered
# 3  44  European  Married/Partnered
# 4  53  European Divorced/Separated
# 5  45  European  Married/Partnered
# 7  30  European             Single
unique(marital.nz$mstatus)
# [1] Single             Married/Partnered  Divorced/Separated Widowed           
# Levels: Divorced/Separated Married/Partnered Single Widowed

The data contains “marital data mainly from a large NZ company collected in the early 1990s”. Dependent variable mstatus has four unordered classes Divorced/Separated, Married/Partnered, Single, and Widowed. We use age as the only exploratory variable.

Package nnet

library(nnet)
fit_nnet <- multinom(mstatus ~ age, marital.nz)
coef(fit_nnet)
#                   (Intercept)          age
# Married/Partnered    2.778686 -0.003538729
# Single               6.368064 -0.152745520
# Widowed             -6.753123  0.099333903

Package mlogit

library(mlogit)
fit_mlogit <- mlogit(mstatus ~ 0 | age, data = marital.nz, shape = "wide")
matrix(fit_mlogit$coefficients, ncol = 2)
#           [,1]         [,2]
# [1,]  2.778666 -0.003538297
# [2,]  6.368056 -0.152745424
# [3,] -6.753157  0.099334560

Package mnlogit

library(mnlogit)
marital.nz_long <- mlogit.data(data = marital.nz, choice = "mstatus")
fit_mnlogit <- mnlogit(mstatus ~ 1 | age | 1, marital.nz_long)
matrix(fit_mnlogit$coefficients, ncol = 2, byrow = TRUE)
#           [,1]         [,2]
# [1,]  2.778666 -0.003538297
# [2,]  6.368056 -0.152745424
# [3,] -6.753157  0.099334560

Even though the latter package is very efficient and customizable, there are several points I am not a big fan of. First off, mnlogit works only with long data instead of common and familiar for regression wide. That’s why we had to use mlogit.data to convert the data. Second, the formula’s syntax is too confusing despite its customizability. Of course, the list is not exhaustive, other packages exists, e.g. brglm2.

Multinomial logit model as multivariate GLM

For this model instead of treating the response variable as a scalar we set to be a vector of $J-1$ elements ($J$-th is redundant). Then, $\boldsymbol{y_i} = (y_{i,1}, …, y_{i, J-1})’$ and $\boldsymbol{\mu_i} = (p_{i,1}, …, p_{i, J-1})’$. Therefore,

\[g_j(\boldsymbol{\mu}_i) = \log \frac{\mu_{i,j}}{1 - (\mu_{i,1}+...+\mu_{i, J-1})}\]

and

\[\boldsymbol{g}(\boldsymbol{\mu}_i) = \boldsymbol{X}_i \boldsymbol{\beta}\]

where $\boldsymbol{g}$ is a vector of link functions.

The package vgam deals exactly with cases of multivariate GLM and GAM. Let’s compute estimates for this model, which should coincide with previously calculated ones:

library(VGAM)
fit_vgam <- vglm(mstatus ~ age, multinomial(refLevel = 1),
                 data = marital.nz)
matrix(fit_vgam@coefficients, ncol = 2)
#           [,1]         [,2]
# [1,]  2.778666 -0.003538297
# [2,]  6.368056 -0.152745424
# [3,] -6.753157  0.099334560

Ordinal response model: proportional odds model

For ordinal response variable the model is slightly different. Let $Y$ be a categorical response variable with $J$ categories which are ordered $1<…<J$. Therefore, it is possible to define cumulative probabilities as

\[\mathbb{P}(Y \leq j \mid \boldsymbol{x}) = p_1 + ... + p_j, \quad j = 1, ..., J\]

Then, cumulative logits are:

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x})) = \log\frac{\mathbb{P}(Y \leq j \mid \boldsymbol{x})}{1 - \mathbb{P}(Y \leq j \mid \boldsymbol{x})} = \log\frac{p_1 + ... + p_j}{p_{j+1} + ...+ p_J}, \quad j = 1, ..., J - 1\]

Let’s now define the cumulative logits and exploratory variables $\boldsymbol{x}$:

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x})) = \alpha_j + \boldsymbol{\beta}' \boldsymbol{x}, \quad j = 1, ..., J-1\]

Note that $\boldsymbol{\beta}$ are the same for each logit. However, intercepts can be different and necessarily are non-decreasing.

The model got its name from its property:

\[\text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x}_1)) - \text{logit}(\mathbb{P}(Y \leq j \mid \boldsymbol{x}_2)) = \log\frac{\mathbb{P}(Y \leq j \mid \boldsymbol{x}_1) / \mathbb{P}(Y \geq j \mid \boldsymbol{x}_1)}{\mathbb{P}(Y \leq j \mid \boldsymbol{x}_2) / \mathbb{P}(Y \geq j \mid \boldsymbol{x}_2)} = \boldsymbol{\beta}' (\boldsymbol{x}_1 - \boldsymbol{x}_2)\]

Again, there are at least four packages, which calibrate the proportional odds model. Let’s quickly compare those estimates using Italian household data for 2006 dataset ecb06it from VGAMdata package. We try to explain ordinal variable education of 8 levels by numeric age.

# install.packages("VGAMdata")
library(VGAMdata)
data(ecb06it)
# str(ecb06.it)
head(ecb06.it[, c("age", "education")])
#    age     education
# 1   58    highschool
# 4   81 primaryschool
# 5   52    highschool
# 9   67  middleschool
# 12  56  middleschool
# 16  72 primaryschool

Package MASS

Perhaps the most famous function is MASS::polr.

library(MASS)
fit_polr <- polr(formula = education ~ age, data = ecb06.it)
summary(fit_polr)$coefficients[, 1, drop = FALSE]
#                                  Value
# age                        -0.06417893
# none|primaryschool         -6.95688936
# primaryschool|middleschool -4.51869196
# middleschool|profschool    -3.06471919
# profschool|highschool      -2.73295822
# highschool|bachelors       -0.96907401
# bachelors|masters          -0.89517059
# masters|higherdegree        2.42815131

Package VGAM

fit_vglm <- vglm(formula = education ~ age, family = propodds, data = ecb06.it)
as.matrix(fit_vglm@coefficients)
#                      [,1]
# (Intercept):1  6.95576156
# (Intercept):2  4.51825182
# (Intercept):3  3.06430069
# (Intercept):4  2.73254206
# (Intercept):5  0.96867493
# (Intercept):6  0.89470432
# (Intercept):7 -2.42867591
# age           -0.06417086

Package ordinal

library(ordinal)
fit_clm <- clm(formula = education ~ age, data = ecb06.it)
as.matrix(fit_clm$coefficients)
#                                  [,1]
# none|primaryschool         -6.9557784
# primaryschool|middleschool -4.5182645
# middleschool|profschool    -3.0643131
# profschool|highschool      -2.7325541
# highschool|bachelors       -0.9686858
# bachelors|masters          -0.8947152
# masters|higherdegree        2.4286635
# age                        -0.0641711

Nice thing about this package is that it allows for using different link functions, i.e. "logit", "probit", "cloglog", "loglog", and "cauchit". To my regret I know only "logit" and "probit" from this list.

Package rms

library(rms)
fit_lrm <- lrm(formula = education ~ age, data = ecb06.it)
as.matrix(fit_lrm$coefficients)
#                        [,1]
# y>=primaryschool  6.9557784
# y>=middleschool   4.5182645
# y>=profschool     3.0643131
# y>=highschool     2.7325541
# y>=bachelors      0.9686858
# y>=masters        0.8947152
# y>=higherdegree  -2.4286635
# age              -0.0641711

This function was rather unstable. Adding more exploratory variable have thrown an error a couple of times.

Coefficients are consistent (difference in signs are explained by $\mathbb{P}(Y \leq j)$ and $\mathbb{P}(Y \geq j)$), which is good.

Perhaps, now you have a question which package to use? Well, I do not know, just choose one and stick to it. I will use probably VGAM, as long as it covers various models and seems like nicely documented.

References:

Agresti, A. (2002) Categorical Data, Second edition, Wiley
STAT504