Bootstrapping Time Series Data

Introduction

This demonstration is part of a requirement for my statistical consulting class at UT Austin. I will go through the basics of bootstrapping time series data using three different resampling methods.

  1. Fixed Block Sampling
  2. Stationary Block Sampling
  3. Model-based resampling

Packages Used

For this demonstration I will use the following packages: the boot package is the workhorse behind the bootstrapping methods, while the forecast package handles the time series modeling. tidyverse, cowplot, and lubridate are used for cleaning the data and presenting the results at the end, so they are less essential depending on your preferred methods.

library(boot)
library(forecast)
library(tidyverse)
library(cowplot)
library(lubridate)

Question of interest

We will be investigating the Rio Negro river level in the manaus dataset (included with the boot package). We are interested in developing an AR(p) model to describe the river level through time. However, we are not interested in selecting the single best p for these data, but rather in understanding the full distribution of p, which is where the bootstrapping will come into play.

Always a good idea to see what your data look like first:

plot(manaus)

Creating a function for obtaining a test statistic

The boot package requires a function that returns your statistic of interest for any data supplied. We will create a function that finds the best-fitting autoregressive order (p), as judged by AIC, for a time series that is supplied as tsb. We return the best-fit order, the mean of the time series, and then the whole time series itself, as a means to extract this information later. Depending on your goals, you would really only need to return the statistic itself.

manaus_fun <- function(tsb) {
    ar.fit <- auto.arima(tsb, max.p = 25, max.d = 0, max.q = 0, max.P = 0, max.Q = 0, 
        max.D = 0, ic = "aic", max.order = 25, seasonal = FALSE)
    c(ar.fit$arma[1], mean(tsb), tsb)
}

Bootstrapping the statistic

For time series data we will use the tsboot function, which has several methods for resampling the time series data. We will show three of them here; I would refer the reader to the tsboot documentation for a further explanation of the differences between the methods. I'll first set that we want to obtain a statistic for 1000 resamplings of our data, and that we want our blocks for block sampling to be the length of the data raised to the (1/3) power, rounded to the nearest integer.

num_resamples <- 1000
block_length <- round(length(manaus)^(1/3))

Fixed Block Sampling

For fixed block sampling, you first draw a random starting point, and then take observations from that point forward to form a block of length block_length. You then draw a new starting point, select a new block, and append it to your resampled series. You repeat the process until the resampled series is as long as the original. Finally, you calculate the statistic for each resampled series and aggregate the results.
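The steps above can be sketched in base R. This is an illustration only — tsboot() implements it for you — and fixed_block_resample is a hypothetical helper name, not part of any package:

```r
# Minimal sketch of fixed block resampling (illustration only)
fixed_block_resample <- function(x, l) {
    n <- length(x)
    n_blocks <- ceiling(n / l)
    # draw random starting points so each block of length l stays inside x
    starts <- sample(seq_len(n - l + 1), n_blocks, replace = TRUE)
    # concatenate the blocks and trim back to the original length
    out <- unlist(lapply(starts, function(s) x[s:(s + l - 1)]))
    out[seq_len(n)]
}

set.seed(1)
resamp <- fixed_block_resample(rnorm(100), l = 5)
length(resamp)  # 100
```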

The tsboot() function takes the original time series, your time series function that calculates the statistic (manaus_fun), the number of samples you want (R), the block length (l), and the simulation type desired (sim). In this case we will set sim="fixed" for fixed block sampling.

# the fixed block bootstrap with block length block_length
manaus_fixed <- tsboot(manaus, manaus_fun, R = num_resamples, l = block_length, 
    sim = "fixed")

Stationary Block Sampling

For stationary block sampling, you again draw a random starting point, but the block length is itself random, drawn from a geometric distribution with mean block_length. You save that block of data, then draw a new starting point and block length, select the new block, and append it to your series. You repeat the process until the resampled series is as long as the original. Just as in fixed block sampling, you repeat that process until you have all of your resampled series, and aggregate the statistic from each one.
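A base-R sketch of the same idea, with geometric block lengths; again this is an illustration only (tsboot() does this for you), and stationary_block_resample is a hypothetical name:

```r
# Minimal sketch of stationary block resampling (illustration only)
stationary_block_resample <- function(x, l) {
    n <- length(x)
    out <- numeric(0)
    while (length(out) < n) {
        start <- sample(n, 1)
        # rgeom() counts failures, so add 1 to get a block length with mean l
        len <- rgeom(1, prob = 1 / l) + 1
        # wrap around the end of the series, as the stationary bootstrap does
        idx <- ((start + seq_len(len) - 2) %% n) + 1
        out <- c(out, x[idx])
    }
    out[seq_len(n)]
}

set.seed(1)
length(stationary_block_resample(rnorm(100), l = 10))  # 100
```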

In this case we will set sim="geom" for stationary block sampling.

# the stationary bootstrap with mean block length block_length
manaus_geom <- tsboot(manaus, manaus_fun, R = num_resamples, l = block_length, 
    sim = "geom")

Model-based Sampling

For model-based resampling, the trick is to use your fitted model to generate new time series, adding noise by resampling the residuals from the fit. This method is particularly risky if you are not entirely sure about the underlying model for your data.

We first have to fit the model and extract its residuals. Then we create a function that simulates ARIMA series according to our fitted model. The rand.gen argument to the arima.sim function is what controls the resampling of the residuals, which serve as the new innovations for the simulated series.

## First fit the model
manaus_ar <- auto.arima(manaus, max.p = 25, max.d = 0, max.q = 0, max.P = 0, 
    max.Q = 0, max.D = 0, ic = "aic", max.order = 25, seasonal = FALSE)

## Create list containing the components of the fitted model (keep only
## the AR coefficients; coef() may also include a mean/intercept term)
manaus_mod <- list(order = c(manaus_ar$arma[1], 0, 0), 
    ar = coef(manaus_ar)[seq_len(manaus_ar$arma[1])])

## Extract the residuals
manaus_res <- resid(manaus_ar) - mean(resid(manaus_ar))

## Function for simulating new series for the bootstrap
manaus_sim <- function(res, n.sim, ran.args) {
    # random generation of replicate series using arima.sim
    rg1 <- function(n, res) sample(res, n, replace = TRUE)
    ts.orig <- ran.args$ts
    ts.mod <- ran.args$model
    mean(ts.orig) + ts(arima.sim(model = ts.mod, n = n.sim, rand.gen = rg1, 
        res = as.vector(res)))
}

Once we’ve set up the time-series simulation framework, we can once again use the tsboot function with a few added arguments.

manaus_model <- tsboot(manaus_res, manaus_fun, R = num_resamples, sim = "model", 
    n.sim = length(manaus), orig.t = FALSE, ran.gen = manaus_sim, ran.args = list(ts = manaus, 
        model = manaus_mod))

Analyzing the results

So we’ve bootstrapped the manaus data three different ways to get an idea of the variability in AR model order (p). Let’s take a look at some of our simulated data first.

recording_dates <- expand.grid(seq(1, 12), seq(1903, 1992))
recording_dates <- paste0("1/", recording_dates$Var1, "/", recording_dates$Var2)

samp_data <- tibble(time = dmy(recording_dates), true = c(manaus), fixed = manaus_fixed$t[1, 
    -c(1, 2)], stationary = manaus_geom$t[1, -c(1, 2)], model_based = manaus_model$t[1, 
    -c(1, 2)])

samp_data %>% gather(key, value, true:model_based) %>% ggplot(aes(time, value)) + 
    geom_line() + facet_wrap(~key)

The data look very similar to the true data, and if we didn’t know which was true, we probably wouldn’t be able to tell the difference between them, which is exactly what we’d like. Now let’s take a look at the distributions for the order parameter of the AR model.

p_dists <- tibble(fixed = manaus_fixed$t[, 1], stationary = manaus_geom$t[, 
    1], model_based = manaus_model$t[, 1])

p_dists %>% gather(key, value, fixed:model_based) %>% ggplot(aes(value)) + geom_histogram(bins = 15) + 
    facet_wrap(~key, nrow = 3)

It looks like the stationary and fixed methods give similar distributions for the order of the autoregressive model, but the model-based distribution is shifted slightly higher. The best-fit model to the original data has an order of 4, so you can see that the model-based method for resampling maintains that level of dependence in the data. However, the stationary and fixed methods slightly reduce the dependence between observations (by combining chunks of the time series at random), so the best-fit order for the AR models is reduced. Overall, the model-based method appears to work best in this scenario (though it carries an increased risk of model misspecification), and we get a confidence interval from 3 to 5 for the order of the model.
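One way to compute such an interval is the percentile method: take empirical quantiles of the bootstrapped orders, i.e. of manaus_model$t[, 1]. A minimal sketch, where boot_orders is a hypothetical stand-in vector for those replicates (for illustration only):

```r
# Percentile bootstrap interval for the AR order (sketch with stand-in data;
# in practice you would use manaus_model$t[, 1] here)
boot_orders <- c(rep(3, 5), rep(4, 90), rep(5, 5))
quantile(boot_orders, probs = c(0.025, 0.975))
# 2.5% -> 3, 97.5% -> 5
```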