After following the fantastic R tutorial “Titanic: Getting Stated with R”, by Trevor Stephens on the Kaggle.com Titanic challenge, I felt confident to strike out on my own and apply my new knowledge on another Kaggle challenge. Initially I tried to tackle the African Soil Properties challenge, but with over 6,000+ independent variables, it proved quite a bit more than I was ready for. I ran across the Bike Sharing Demand challenge and it was a much better fit. My first few attempts did OK, landing me in the top 500. I’ve since developed a more accurate model (last I checked, it was still somewhere around 185th place), but I wanted to share an attempt at a model for this problem that turned out surpisingly well given how simple it was. If you wish to do so, just repeating the process here should easily land you in the top half of contestants, if not higher (you don’t need to copy and paste from this page, you can find the entire source at GitHub).
I should take a moment and mention that all of the following code is for R. While I’m learning Python at the moment as well, I still fall back on R when it comes to the challenges as I’m more familiar with it. Hopefully as I gain more experience with IPython I’ll be able to post those notebooks for review.
The biggest takeaway I got from researching the Titanic challenge was the concept of feature engineering. While there are many people who can explain it much better than I can (one example), it basically boils down to turning your variables into more meaningful information.
Let’s take a look at what we have to work with as far as data set.
#set working directory setwd("/users/Brandon/Dropbox/Kaggle/Bike Sharing/") #read in train/test train <- read.csv("train.csv") test <- read.csv("test.csv") str(train) 'data.frame': 10886 obs. of 12 variables: $ datetime : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ... $ season : int 1 1 1 1 1 1 1 1 1 1 ... $ holiday : int 0 0 0 0 0 0 0 0 0 0 ... $ workingday: int 0 0 0 0 0 0 0 0 0 0 ... $ weather : int 1 1 1 1 1 2 1 1 1 1 ... $ temp : num 9.84 9.02 9.02 9.84 9.84 ... $ atemp : num 14.4 13.6 13.6 14.4 14.4 ... $ humidity : int 81 80 80 75 75 75 80 86 75 76 ... $ windspeed : num 0 0 0 0 0 ... $ casual : int 3 8 5 3 0 0 2 1 1 8 ... $ registered: int 13 32 27 10 1 1 0 2 7 6 ... $ count : int 16 40 32 13 1 1 2 3 8 14 ...
We’ve got a date/time stamp, and some coded variables such as season, holiday, workingday, weather, etc..
Pro Tip - When you see a variable as a factor that has the same number of levels as your observations, it probaly shouldn’t be a factor. We’re not going to use datetime as is, so we’re going to leave it alone for this example.
You’ll notice those coded variables aren’t factorized, which we’ll want to do to make sure our model knows they’re not just integer values. Let’s do that now, using the factor() function.
#factorize training set train_factor <- train train_factor$weather <- factor(train$weather) train_factor$holiday <- factor(train$holiday) train_factor$workingday <- factor(train$workingday) train_factor$season <- factor(train$season)
Remember, since we’re ultimately going to run our model against the test data, it also needs to be in the same exact structure. Whatever we do to this training data, we also need to do to the test data, so let’s factor the test data now.
#factorize test set test_factor <- test test_factor$weather <- factor(test$weather) test_factor$holiday <- factor(test$holiday) test_factor$workingday <- factor(test$workingday) test_factor$season <- factor(test$season)
The first variable we can perform some feature engineering on is the date/time stamp. Since it’s an actual combination of the date and time, we can break out the time portion into a new variable. This makes sense, as people are probably more likely to rent a bike at 2PM as opposed to 2AM. We want our model to consider this factor so we create a new variable for it. Since we saw that ‘datetime’ is a string (“2011-01-01 00:00:00”), we can use substring() on it to return just the portion of the string that we want.
Pro Tip - In R, you can use the qestion mark to find out more about a given command. For more information on substring, just do ?substring.
Again, we do this twice, once for training data, once for testing data.
#create time column by stripping out timestamp train_factor$time <- substring(train$datetime,12,20) test_factor$time <- substring(test$datetime,12,20)
Now let’s take a look at our data frame.
'data.frame': 10886 obs. of 13 variables: $ datetime : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ... $ season : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ... $ holiday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... $ workingday: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... $ weather : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 2 1 1 1 1 ... $ temp : num 9.84 9.02 9.02 9.84 9.84 ... $ atemp : num 14.4 13.6 13.6 14.4 14.4 ... $ humidity : int 81 80 80 75 75 75 80 86 75 76 ... $ windspeed : num 0 0 0 0 0 ... $ casual : int 3 8 5 3 0 0 2 1 1 8 ... $ registered: int 13 32 27 10 1 1 0 2 7 6 ... $ count : int 16 40 32 13 1 1 2 3 8 14 ... $ time : chr "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
We see our new time variable, but it’s just a bunch of strings. Since we’re only dealing with 24 different values (24 hours in a day) it makes sense to turn these into levels of a factor variable as well.
#factorize new timestamp column train_factor$time <- factor(train_factor$time) test_factor$time <- factor(test_factor$time)
Now that we’ve handled the hour portion of the ‘datetime’ variable, what about the day? If we repeated the same process, we’d end up with a new variable with a bunch of strings like ‘01/01/2013’ and ‘01/02/2013’. Not too useful for our model to work with. It makes much more sense to turn these values into their actual names, such as Monday or Tuesday. This will give us a variable that we can factorize with just seven levels. To do this, we can use a combination of R’s weekdays() and as.Date() functions.
Again, repeat for both training and test sets.
#create day of week column train_factor$day <- weekdays(as.Date(train_factor$datetime)) train_factor$day <- as.factor(train_factor$day) test_factor$day <- weekdays(as.Date(test_factor$datetime)) test_factor$day <- as.factor(test_factor$day)
At this point I felt pretty good about our new variables, but I wanted to see what other features we might be able to add. Let’s use our new day column, and see if there might be any ideas around certain days being more or less popular for bike rentals. The aggregate command is helpful here.
aggregate(train_factor[,"count"],list(train_factor$day),mean) Group.1 x 1 Friday 197.8443 2 Monday 190.3907 3 Saturday 196.6654 4 Sunday 180.8398 5 Thursday 197.2962 6 Tuesday 189.7238 7 Wednesday 188.4113
Sunday certainly sticks out as the day of the week where bikes are least frequently rented. Why don’t we create a ‘sunday’ variable that tells our model that this may be an important predictor (and we’ll also make it a factor).
#create Sunday variable train_factor$sunday[train_factor$day == "Sunday"] <- "1" train_factor$sunday[train_factor$day != "1"] <- "0" test_factor$sunday[test_factor$day == "Sunday"] <- "1" test_factor$sunday[test_factor$day != "1"] <- "0" #convert to factor train_factor$sunday <- as.factor(train_factor$sunday) test_factor$sunday <- as.factor(test_factor$sunday)
The final thought I had was around the previously created time variable. Firstly, representing 1AM as “00:00:00” seems a bit odd, even it works fine. A value of “0” would probably make more sense. I also thought that maybe instead of looking at each hour individually, it may make sense to group them together, as parts of the day (i.e. morning, afteroon, etc..) as certain parts of the day may lend themselves to more or less frequent bike rentals rather than just a single hour. I decided on 4, 6 hour blocks.
The process for doing this is below.
- Use substring() to strip out the actual hour values from the previous time variable, and use as.numberic() to convert it to a number. Store in the dataframe as ‘hour’.
- Create our new daypart column in the dataframe. (We populate it as 4 to save time, we could have used 1,2 or 3 as well.)
- Evaluate the new ‘hour’ column, and assign daypart to one of our four dayparts (i.e. 4AM-10AM = 1, 11AM-3PM = 2, etc..)
- Convert the daypart variable into a factor, since we only have 4 levels.
- Convert hour back to a factor (we converted it to a number in step 1).
#convert time and create $hour as integer to evaluate for daypart train_factor$hour<- as.numeric(substr(train_factor$time,1,2)) test_factor$hour<- as.numeric(substr(test_factor$time,1,2)) #create daypart column, default to 4 to make things easier for ourselves train_factor$daypart <- "4" test_factor$daypart <- "4" #4AM - 10AM = 1 train_factor$daypart[(train_factor$hour < 10) & (train_factor$hour > 3)] <- 1 test_factor$daypart[(test_factor$hour < 10) & (test_factor$hour > 3)] <- 1 #11AM - 3PM = 2 train_factor$daypart[(train_factor$hour < 16) & (train_factor$hour > 9)] <- 2 test_factor$daypart[(test_factor$hour < 16) & (test_factor$hour > 9)] <- 2 #4PM - 9PM = 3 train_factor$daypart[(train_factor$hour < 22) & (train_factor$hour > 15)] <- 3 test_factor$daypart[(test_factor$hour < 22) & (test_factor$hour > 15)] <- 3 #convert daypart to factor train_factor$daypart <- as.factor(train_factor$daypart) test_factor$daypart <- as.factor(test_factor$daypart) #convert hour back to factor train_factor$hour <- as.factor(train_factor$hour) test_factor$hour <- as.factor(test_factor$hour)
Now we have our new hour variable (with values, 0-23) which looks much better than our old value (00:00:00 - 23:00:00), and we’ve also grouped the hours into 4 different dayparts (6 hours blocks each).
With our feature engineering work out of the way, let’s think about creating a model with our training data set.
The main difference between the Bike Sharing content and the Titanic contest is that we’re not predicting a binomial outcome. With Titanic, it was 1 or 0 (survived vs died). This is not a classification problem, it’s a regression problem, and we need to adjust our approach accordingly.
My initial thought was that since we have just a handful of independent variables, a basic decision tree might do decently well here. Initial tests with rpart seemed to provide OK results (I believe initial value returned from Kaggle for a basic rprart method was around 0.6000), but since we’re not doing any strict classification, we have a few more options available to use. One interesting package we can utilize is party. Party provides an implementation of conditional inference trees. This is still recursive partitioning, similar to rpart, but there are a few important differences.
rpart() tends to choose variables that will provide many possible splits when building the tree. This initial selection bias can work well sometimes, but in other cases can cause a loss of accuracy in the model. A conditional inference tree avoids this initial selection bias by using a permutation test framework to calculate statistics on the covariates, and then determine which our of independent variables might be the most appropriate for building our decision / regression tree. Plainly put, conditional inference trees are a bit more methodical when it comes to deciding which covariates are most important to our outcome.
Let’s go ahead and load the ‘party’ package and see how the ctree command works.
install.packages('party') library('party') ?ctree Description Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework. Usage ctree(formula, data, subset = NULL, weights = NULL, controls = ctree_control(), xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)
Pretty straightforward, and if you’re at all familiar with similar functions like lm() or rpart(), the basics are all the same.
Pro Tip - It’s always a good idea to assign our formulas to a variable. This just helps when moving between packages and models, and makes things easily repeatable. It’s a simple step, but one that can save you a lot of typing.
#create our formula formula <- count ~ season + holiday + workingday + weather + temp + atemp + humidity + hour + daypart + sunday
Let’s fit our model using our formula and training data set.
#build our model fit.ctree <- ctree(formula, data=train_factor)
Pretty simple, right? In order to evaluate the tree and see which covariates were determined to be of the most importance, we do the following (note, this tree is rather large, so I’m not printing the output. Feel free to review it on your own. I would also suggest that one not attempt to plot this tree unless you have a very beefy computer).
#examine model for variable importance fit.ctree
If you did this, you’ll notice that our “hour” variable was determined as the most improtant predictor to bike rental count. Yay for feature engineering.
I will give you a hint here, and say that not every feature we created was useful. It’s always a good idea to go back and review your model to see where there are areas for improvement. Removing those variables which have little or no impact on predictive ability is always a good step towards simplifying your model. It’s also a good idea to read the documentation on the function you’re using, there might be other options (control, comes to mind) that can help improve your model dramatically.
Now that we have our model, let’s go ahead and run it against our test dataset, and write the output to .csv so we can submit it and get our score from Kaggle.
#run model against test data set predict.ctree <- predict(fit.ctree, test_factor) #build a dataframe with our results submit.ctree <- data.frame(datetime = test$datetime, count=predict.ctree) #write results to .csv for submission write.csv(submit.ctree, file="submit_ctree_v1.csv",row.names=FALSE)
Now that we have the results of our model written to a file, all that’s left is to submit that file to Kaggle. Doing so should (hopefully) give you results similar to these.
Notes and further reading
Some readers may notice that I’ve ignored a few columns in the training data set, such as casual and registered users. This was done for the sake of simplicity, and to underscore the point that it’s possible to create a solid model without using every single variable avialable to you. You could certainly use that data to create additional submodels and merge the results to create a (possibly) more accurate model.
There was also discussion on the challenge forum about “soft” rules regarding using only data form a certain range for building your model. Kaggle ultimately tests the model regardless of which data you used to create it, so in the name of brevity (in an already rather long post) I chose to ignore those stipulations.
Artem Fedosov has code (GitHub) for splitting the training data into it’s own training and testing sets. This is useful for getting a baseline for your model, without wasting your valuable Kaggle daily submissions.
Root mean squared log error is the statstic used to evaluate this Kaggle competition. It is available via the R Metrics package. Example usage would be rmsle(train_data$count, abs(prediction_result)). We use abs() as some more advanced models (Support Vector Machines come to mind) may result in negative values for count, and we want to use the absolute value, not the negative value.