Brandon Harris bio photo

Brandon Harris

Cloud + Data Engineering + Analytics

Twitter LinkedIn Instagram Github Photography

Note - If you’ve already read this, and just want to jump to the forecast vs actual table, click here.

I realized long ago that when I’m actually interested in a topic, I learn it much more quickly and easily than topics I have difficulty relating to, or have no interest in. I recently had some exposure to time series analysis and while I think it’s a fascinating field, it can very quickly become overwhelming with the mathematical nuances and numerous modeling techniques. In order to start building up some more experience around time series data I started thinking about how I might apply the related concepts in a manner that would interest me. My wife works in the television industry and had a great suggestion around forecasting television viewership. I thought it would be a neat experiment, as the results are tangible and can be easily measured, so I started digging into it.

After reviewing some options for TV shows I decided to try my hand at forecasting viewership for ‘The Walking Dead’. There were a couple of reasons for selecting that show (though I’ll come clean here and admit I’ve never made it past the first couple of episodes myself). Firstly, the viewership is pretty high and for the last 5 seasons or so, the viewership numbers definitely exhibited a positive trend year-over-year. Secondly, each season is made up of 16 episodes, but those seasons are broken into two 8-episode chunks that air in different parts of the year. I thought the trend and seasonality components of the viewership numbers would make the analysis interesting and definitely applicable to a wider set of real-world issues I might run into in the future.

I’m going to walk through my research and forecasting methodology using R in the steps below. One thing that I think is worth mentioning here is that the season is airing as we speak. So I honestly do not know how well the model will do for the season (and keep in mind that this is my research into learning more about Time Series data, I’m far from an expert at this). I plan on updating the results after the viewership numbers get published each week, but if things go horribly wrong (and that is certainly likely) you may want to think twice before referring to the code and techniques below for any professional projects!

The first step in any analysis is to get the data, clean it up and check it out. Thankfully the lovely contributors at Wikipedia have made this relatively straightforward. The fans of the show are constantly updating the page with viewership numbers. You can find the data at the bottom of the WP page, but I’ll also host this on GitHub (along with the source code) for easy reference. You’ll want the GitHub data as I’ve inlcuded our regressors as well, which will not be on the WP page.

Let’s get our data loaded and all cleaned up.

library(xts)
library(forecast)
library(Matrix)

setwd('/Users/brandon/')

# Read data file (obtained from https://en.wikipedia.org/wiki/The_Walking_Dead_%28TV_series%29 )
> project_data <- read.csv("walking_dead_viewers_wikipedia.csv",stringsAsFactors=FALSE)

> head(project_data)
Airdate Type Viewers sunday_night_football
————– ———- ——— ———————–
1 2010-10-31 Premiere 5.35 Y
2 2010-11-07 New 4.71 Y
3 2010-11-14 New 5.07 Y
4 2010-11-21 New 4.75 Y
5 2010-11-28 New 5.56 Y
6 2010-12-05 Finale 5.97 Y

One thing you always want to be clear about are your units of measurement. What we’re predicting here are viewers from the 18-49 demographic, in millions of viewers. Since this is a time series, we can’t use this data as-is, we’ll have to convert it to a time series object. We’ve got the airdate in our first column, so we’ll go ahead and use that as our time index. While we’re at it, we’re going to make sure the two cateogrical regressors are represented as such in R. I haven’t discussed our regressors much until now as they’re pretty straight-forward. They’re two categorical variables, the type of a show (being a premiere, regular new episode or finale) as well as a binary variable indicating if Sunday Night Football was airing against the given episode (this only ocurrs in the first of the two mini-seasons).

# Convert AirDate column from character to date object
> project_data$Airdate <- as.Date(project_data$Airdate, "%m/%d/%Y")


# Convert the type of show variable and sunday night football variables
# to be categorical variables instead of characters.
> project_data$Type <- as.factor(project_data$Type)
> project_data$sunday_night_football <- as.factor(project_data$sunday_night_football)

# create time series object out of viewers and the episode airdates
> ts_data <- xts(project_data$Viewers, order.by = project_data$Airdate)

Now that we’ve formatted the data to a usable fashion, let’s plot the time series and see what we’re dealing with.

> plot(ts_data)

The first thing that probably jumps out to you is that this is definitely not a regluar (meaning observations at consistent intervals) time series! For most irregular time-series you would definitely want to experiement here with regularizing it (referring to making the observation time consistent, not to be confused with L1 / L2 regularization). Since the series pretty much airs consistently throughout multiple years, and the only irregular part of the time series are the breaks between seasons (the actual observations ARE regular) we don’t need to do anything here. You can certainly feel free to use your own time index and delete the ‘dead space’ (see what I did there.. Walking Dead.. dead space..) but you’ll get the same results as we’ll see below.

R has some amazing packages and one of the great ones, IMO, is the forecast package by Prof. Hyndman and company (Hyndman, if you’re new to time series, is pretty much ‘the man’ in this field). The auto.arima function is a fantastic place to start for determining what types of models to use when dealing with time series data. It will review your data for you and suggest values for you indicating if ARIMA, ARMA, or just AR / MA models would be a good fit. Let’s take a look at what we get with auto.arima and our data.

> auto.arima(ts_data)

Series: ts_data 
ARIMA(2,1,2) with drift         

Coefficients:
         ar1      ar2      ma1     ma2   drift
      1.4351  -0.5903  -1.8987  0.9802  0.1076
s.e.  0.1201   0.1163   0.1817  0.1920  0.0664

sigma^2 estimated as 1.21:  log likelihood=-113.31
AIC=238.61   AICc=239.85   BIC=252.52

Right away we’re given our values and the and coefficients, as well as model diagnostics. Our AICc value of 239.85 is something we’ll use as a benchmark going forward.

Now, as you might have suspected from the title and cateogrical variables, we’re going to use regression to help us nail down the trend, and then apply ARIMA to the residuals. We could do this manually with R’s lm() or glm() fuction, but the wonderful forecast package allows us to save some time here. Once we’ve got our variables into the proper format, we can pass a matrix of regressors directly to auto.arima itself. First though, auto.arima doesn’t seem to deal well with cateogrical variables, so we’ll need to encode these variables in order to proceed.

# turn our factor level variables into a sparse matrix to be used with regression
# esentially this is "One hot" encoding the variables.
> encoded_regressors <- sparse.model.matrix(Viewers~Type+sunday_night_football, data = project_data)

# create a time series out of of encoded categorical variables
> regressors <- xts(encoded_regressors[,-1], order.by = project_data$Airdate)

Now that we’ve formatted our regression data, let’s see what auto.arima thinks.

> auto.arima(ts_data,xreg=regressors)

Series: ts_data 
ARIMA(0,1,1) with drift         

Coefficients:
          ma1   drift  TypeNew  TypePremiere  sunday_night_footballY
      -0.2823  0.1146  -0.9293        1.1366                  0.1996
s.e.   0.1460  0.0703   0.2411        0.2667                  0.2764

sigma^2 estimated as 0.7108:  log likelihood=-91.92
AIC=195.83   AICc=197.07   BIC=209.74

Note that we’ve gone from a suggested model to an model, and our AIC score has dropped a bit, indicating that our regressors have definitely added some valuable information to the model. The fact that our AR component has dropped to zero, and we see a decrease in our MA component from 2 to 1 bodes well.

At this point we’ve got a good baseline for our model, but we haven’t added in our seasonal components yet . If it’s been a while since you’ve used time series data, remember that the capital P,D,Q refer to the seasonal componenets of the individual AR, differencing and MA components, and m refers to our relative period of time. auto.arima unfortunately will not help us here, we need to take what we know about the actual data and make some educated guesses for those values. If you’re unsure of your seasonality components, you can alway take a look at the auto-correlation function (ACF) and partial-auto-correlation function (PACF) plots of your model residuals in R and that should give you a starting point to work with as far as potential values for your seasonality period as well as AR/MA components. Professor Nau at Duke’s Fuqua School of Business has some great slides on this topic if you’re interested in learning more.

Since we know that there are two mini-seasons within each 16-episode season of ‘The Walking Dead’, we can be pretty sure our ‘period’ parameter of the seasonal ARIMA() function should be 8. We’ll use our suggested model and try out some seasonal component variations for . I always like to start with myself, and R makes it trivial to evaluate other values for and compare via AICc scores, so let’s look at what that gets us, as well as try a few quick alternatives out.

> Arima(ts_data, xreg=regressors, order = c(0,1,1), seasonal = list(order = c(0,1,1), period = 8))
Series: ts_data 
ARIMA(0,1,1)(0,1,1)[8]                    

Coefficients:
          ma1     sma1  TypeNew  TypePremiere  sunday_night_footballY
      -0.3149  -0.4267  -0.2471        1.2920                  0.1127
s.e.   0.1511   0.1370   0.3679        0.4563                  0.2396

sigma^2 estimated as 0.9261:  log likelihood=-76.63
AIC=165.26   AICc=166.66   BIC=178.49


> Arima(ts_data, xreg=regressors, order = c(0,1,1), seasonal = list(order = c(1,1,0), period = 8))
Series: ts_data 
ARIMA(0,1,1)(1,1,0)[8]                    

Coefficients:
          ma1     sar1  TypeNew  TypePremiere  sunday_night_footballY
      -0.3654  -0.4465  -0.0850        1.4344                  0.1251
s.e.   0.1610   0.1187   0.3826        0.4877                  0.2675

sigma^2 estimated as 0.8964:  log likelihood=-75.78
AIC=163.55   AICc=164.95   BIC=176.78


> Arima(ts_data, xreg=regressors, order = c(0,1,1), seasonal = list(order = c(1,1,1), period = 8))
Series: ts_data 
ARIMA(0,1,1)(1,1,1)[8]                    

Coefficients:
          ma1     sar1    sma1  TypeNew  TypePremiere  sunday_night_footballY
      -0.3909  -0.5804  0.1631  -0.0390        1.4472                  0.1477
s.e.   0.1697   0.2796  0.3345   0.3919        0.4899                  0.2794

sigma^2 estimated as 0.8909:  log likelihood=-75.67
AIC=165.33   AICc=167.23   BIC=180.76


> Arima(ts_data, xreg=regressors, order = c(0,1,1), seasonal = list(order = c(1,0,1), period = 8))
Series: ts_data 
ARIMA(0,1,1)(1,0,1)[8]                    

Coefficients:
          ma1    sar1     sma1  TypeNew  TypePremiere  sunday_night_footballY
      -0.2844  0.5053  -0.1831  -0.6558        1.0881                  0.1431
s.e.   0.1308  0.2236   0.2214   0.3021        0.3169                  0.2213

sigma^2 estimated as 0.6619:  log likelihood=-89.81
AIC=193.63   AICc=195.3   BIC=209.85

Judging from the AICc scores, it looks like our best model is going to be (lowest AICc score of all the models tested). Now at this point you would really want to do some cross-validation here and not simply rely on AIC scores, but since 1) we don’t have a lot of data and 2) I know that the viewership for the last full season has declined for the first time in the show’s history, it probably is not going to be 100% accurate to cross-validate on prior data. We’re just going to have to fly a bit blind here and hope for the best.

Also, if you’re a very detail-oriented reader, you’ll notice that there is one single data point in the data set that exists for the season we want to predict (Season 6B - B indicating that it’s the 2nd half of season 6). That’s just a function of when I started this project, and nothing more. However, since we really do want to forecast the season in it’s entirely, we’re going to remove it from our data set. It’ll also serve as a good test to see how close we actually get when forecasting since we can compare at least one real data point.

# kill the last data point
> train_data <- head(ts_data,-1)

# kill the last regressor data points
> train_regressors <- head(regressors,-1)

Let’s re-fit our model, minus the single data point, just to make sure nothing changed too much.

> Arima(train_data, xreg=train_regressors, order = c(0,1,1), seasonal = list(order = c(1,1,0), period = 8))
Series: train_data 
ARIMA(0,1,1)(1,1,0)[8]                    

Coefficients:
          ma1     sar1  TypeNew  TypePremiere  sunday_night_footballY
      -0.3656  -0.4496  -0.0828        1.4373                  0.1149
s.e.   0.1625   0.1235   0.3867        0.4933                  0.2833

sigma^2 estimated as 0.9115:  log likelihood=-74.9
AIC=161.8   AICc=163.23   BIC=174.94

Indeed, the model still looks like it’s performing well, and our AICc score actually dropped a bit.

Since we have the dates of the upcoming episodes (Wikipedia to the rescue, once again) and we also know our future regressor variable values (if a show is on against Sunday Night Football and the ‘Type’ of show) I’ve created a .csv file we can quickly load into R (check Github for this guy as well). Let’s fit our model at this time (again, using the data we had previously, dropping our last data point for the first episode of season 6B).

# When we predict the new season, we need the regressor information for the upcoming episodes
# so I just created a file with these values as it was easier than making the data in R directly.

> season_6_regressors <- read.csv("season_6_regressors.csv",stringsAsFactors=FALSE)

# We convert the episode Airdate column to a time object - These are the future episode airdates

> season_6_regressors$Airdate <- as.Date(season_6_regressors$Airdate, "%m/%d/%Y")

# We grab the columns needed for the regression and create a time series out of it
# using the airdate for future episodes

> test_regressors <- xts(season_6_regressors[,2:4], order.by = season_6_regressors$Airdate)

> model_fit <- Arima(train_data, xreg=train_regressors, order = c(0,1,1), seasonal = list(order = c(1,1,0), period = 8))

Now that we’ve built our model, and we’ve formatted the regressor data for our future dates, let’s take call the predict() function to forecast our viewership.

# We apply our model from seasons 1-5 and ask it to predict the next 8 episodes of Season 6
# using our new regessor data

> our_forecast <- predict(model_fit, newxreg=test_regressors, n.ahead = 8)

# we get output of our forecast and store it as a new time series
> forecasted_viewers <- xts(our_forecast$pred,season_6_regressors$Airdate)

# plot the actual (including our single data point from Season 6B)
# as well as our predictions

> plot.xts(ts_data,main="Viewership (millions) - Actual vs Predicted (Red)")
> lines(forecasted_viewers, col=c("red"))

# get an output of the actual counts we predicted / forecasted.
> forecasted_viewers

Below we see a table of our viewership forecast (in millions of viewers) as well as a plot of the historical data (including our single data point from Season 6B episode 1) and our forecast.

2016-02-14 13.86824
2016-02-21 11.00194
2016-02-28 12.05680
2016-03-06 12.61066
2016-03-13 12.16258
2016-03-20 12.11633
2016-03-27 12.24480
2016-04-03 13.57275

Our forecast for the first episode of Season 6B is 13.868M viewers, and we are actually very close to the real number, which was ~13.742M viewers. Not too bad! I think it’s also neat that we can see from the plot how the viewership patterns are repeating, both historically and in our forecast. This is a good indication that our model was succesful in capturing the trend and seasonality components. One thing that is interesting, is that the model does result in a forecast that’s trending downward. We’re flying blind here without any knowledge of what will happen in the future, but the model certainly thinks viwewership is due to drop off this season, when looking at a year-to-year perspective.

That’s pretty much it for this foray into TV viewership forecasting. I don’t do this for a living, so if you do happen to have real-world experience with forecasting in the media industry I’d love to hear your comments and feedback. Below you’ll find a table with our forecast and the viewership numbers that I’ll do my best to keep udpated throughout the spring as The Walking Dead keeps airing and we’ll see just how accurate the forecast turns out to be. The viewership numbers were looking for don’t usually get published until 3-4 days after each episode airs, so we’ll have to wait a bit after each episode aired to see how we did.

Thanks for reading!

Airdate Forecasted Viewership (millions) Actual Viewership (millions) Difference (millions)
———– ———————————- —————————— ———————–
2/14/2016 13.86824 13.742 0.12624
2/21/2016 11.00194 13.483 2.481 (ouch..)
2/28/2016 12.0568 12.794 0.7372
3/6/2016 12.61066 12.812 0.2014
3/13/2016 12.16258 12.530 0.3675
3/20/2016 12.11633 12.686 0.5697
3/27/2016 12.2448 12.384 0.1392
4/3/2016 13.57275 14.193 0.62025