🌳 [archived] Dortmund real estate market analysis: tree-based methods

In pervious posts traditional regression models were fitted to real estate data. In this post tree-based models, namely random forests and gradient boosting, are trained to predict prices of the rent. These methods typically outperform traditional regression models yielding smaller errors. Furthermore, tree-based methods are much more robust to overfitting, which makes them superior in terms of prediction. However, the main disadvantage (and the reason why there is no love in insurance industry) is difficulties with interpretability.

Disclaimer: This post is outdated and was archived for back compatibility: please use with care! This post does not reflect the author’s current point of view and might deviate from the current best practices.

Random forests

Originally random forests are implemented in randomForest package. There is also a faster implementation in ranger package, which is used further. As usual, we start with some preliminary code to clean out the memory, load packages and data.

packages <- c("ggplot2", "magrittr", "vtreat", "ranger", "xgboost", "caret")
sapply(packages, library, character.only = TRUE, logical.return = TRUE)

theme_set(theme_bw())
theme_update(text = element_text(size = 24))

rm(list = ls())

setwd("/Users/irudnyts/Documents/data/")
property <- read.csv("dortmund.csv")
set.seed(1)

For random forests there is a hyperparameter mtry determining the number of columns to split in each node. As long as we have only two columns, we set mtry = 2 (the model with default mtry = 1 yields higher RMSE). We also use default number of trees (500 trees). The main function to run the regression is ranger. The function predict(.ranger) behaves a bit differently from other methods. Instead of returning a vector, it returns a list of vectors of class ranger.prediction. Thus, we need to extract predictions form that class.

rf <- ranger(formula = price ~ area + rooms, data = property, mtry = 2)
predicted <- predict(rf, property)
(predicted$predictions - property$price) ^ 2 %>% mean() %>% sqrt()
# [1] 93.07542

Random forests model outperform all traditional regression models in previous posts in terms of in-sample RMSE. Let’s now cross-validate model by looking at out-of-sample RMSE:

property$pred_rf <- NULL

folds <- kWayCrossValidation(nRows = nrow(property), nSplits = 3)

for(fold in folds) {
    rf <- ranger(formula = price ~ area + rooms,
                 data = property[fold$train, ],
                 mtry = 2)
    property[fold$app, "pred_rf"] <-
        predict(rf, property[fold$app, ])$predictions
}

(property$price - property$pred_rf) ^ 2 %>% mean() %>% sqrt()
# [1] 152.3637

The out-of-sample RMSE is comparable with one returned by linear model, but larger than for GAM IG.

Gradient boosting

This methods iteratively optimize the RMSE (or other accuracy measure) on training data. Thus, gradient boosting is more exposed to overfit. In order to get the optimal ‘nrounds’ (the maximum number of iterations) we need to use cross-validation technique, implemented in xgb.cv, which also calculates out-of-sample RMSE. The number of folds we keep equals to $3$, to be consistent with pervious analysis. The learning rate parameter is set to be $0.1$, a little bit less than default ($0.3$), implying robustness to overfit, but also slower speed.

xg <- xgb.cv(data = as.matrix(property[, 2:3]),
             label = property[, 1],
             nfold = 3,
             nrounds = 500,
             metrics = "rmse",
             eta = 0.1)
which.min(xg$evaluation_log$test_rmse_mean)
# [1] 32
ggplot(data = xg$evaluation_log) + geom_line(aes(x = iter, y = test_rmse_mean))

The minimum out-of-sample RMSE is showed by the model with nrounds = 32. This value is approximate and depends on a seed. We use nrounds = 40. Let’s calculate RMSE for the optimal model and then out-of sample RMSE:

xg <- xgboost(data = as.matrix(property[, 2:3]),
              label = property[, 1],
              nfold = 3,
              nrounds = 34,
              eta = 0.1)

property$pred_xg <- predict(xg, as.matrix(property[, 2:3]))

(property$price - property$pred_xg) ^ 2 %>% mean() %>% sqrt()
# [1] 104.8407

property$pred_xg <- NULL
set.seed(1)
folds <- kWayCrossValidation(nRows = nrow(property), nSplits = 3)

for(fold in folds) {
    xg <- xgboost(data = as.matrix(property[fold$train, 2:3]),
                  label = property[fold$train, 1],
                  nfold = 3,
                  nrounds = 40,
                  eta = 0.1)
    property[fold$app, "pred_xg"] <-
        predict(xg, as.matrix(property[fold$app, 2:3]))
}

(property$price - property$pred_xg) ^ 2 %>% mean() %>% sqrt()
# [1] 145.9076

Unfortunately, XGboost model does not achieve smaller RMSE than previous models.

As summary, tree based methods are not always better solution for prediction. As we see, the out-of-sample RMSE is similar to one for linear model. At the same time, tree-based methods lose precious interpretability.

Note: Both packages have built-in cross-validation functions. However, for proper comparison one has to train models on the same training data set applying, then, to the same test set.