If you’re not familiar with TrueSkill, it’s a rating system developed by Microsoft to assist with player matchmaking on its Xbox Live service. It’s a pretty cool and flexible system that allows for unbalanced teams, ties, partial play, etc… I originally learned about it, as I do with most sports-related math theory, from Scott Turner and his website.

The general idea is that a player’s skill (or, in the case of the NFL, a team’s skill) is described by a normal distribution $\mathcal{N}(\mu , \sigma^{2})$ . Mu is essentially the skill rating, and sigma represents the uncertainty of that skill rating. If player/team A were to beat player/team B, A’s mu value would increase by an amount that grows with A’s sigma value and with how unexpected the win was, given the difference between the winner’s and loser’s mu values. A similar decrease would happen to team B, as the loser. Since these post-game adjustments incorporate both parties’ prior ratings, a 10-0 team (with a high rating) that lost to a 0-10 team (with a low rating) would take a severe penalty, much more so than had it lost to a 9-1 team (and accordingly, the 0-10 team, now 1-10, would receive a well-deserved increase in its rating).
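For the curious, the win/loss case of that adjustment can be written down compactly. This is a sketch of the standard two-player TrueSkill update, simplified by assuming no tie margin and no $\tau$ inflation; $\beta$ is a scale parameter described with the other settings further down, and $\phi$ and $\Phi$ are the standard normal density and cumulative distribution. With A as the winner:

$$
\begin{aligned}
c &= \sqrt{2\beta^{2} + \sigma_A^{2} + \sigma_B^{2}}, \qquad t = \frac{\mu_A - \mu_B}{c}, \\
v(t) &= \frac{\phi(t)}{\Phi(t)}, \qquad w(t) = v(t)\,\bigl(v(t) + t\bigr), \\
\mu_A &\leftarrow \mu_A + \frac{\sigma_A^{2}}{c}\,v(t), \qquad \mu_B \leftarrow \mu_B - \frac{\sigma_B^{2}}{c}\,v(t), \\
\sigma_A^{2} &\leftarrow \sigma_A^{2}\left[1 - \frac{\sigma_A^{2}}{c^{2}}\,w(t)\right], \qquad \sigma_B^{2} \leftarrow \sigma_B^{2}\left[1 - \frac{\sigma_B^{2}}{c^{2}}\,w(t)\right].
\end{aligned}
$$

Because $v(t)$ grows as $t$ becomes more negative, an upset (a winner rated well below the loser) moves both ratings much more than an expected result does, and a larger $\sigma$ scales the move up.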

After Week 3 of the NFL season is completed, I’ll be compiling a weekly list of all NFL results and applying the TrueSkill algorithm to each team to generate team ratings. The results will have the team’s rating (mu), along with the rating’s uncertainty (sigma, which is a standard deviation rather than a variance) and a 95% confidence interval for mu. This is a slightly different process than XBOX Live currently uses, as their player rating is a more conservative value calculated as $\mu - k\sigma$ where k is a constant chosen per video game. I’ll also be using the TrueSkill ratings to predict the winners of the following week’s matches, alongside two other prediction methods: choosing the home team for every game, as well as the Vegas favorite (straight up favorite, no spread, referenced at the time of posting). The idea is to see what (if any) improvement in accuracy we might see over simply choosing the home team for our winner (I don’t expect TrueSkill to even come close to Vegas accuracy; it’s more for reference). Unfortunately, there aren’t a lot of games played in an NFL season (another reason I prefer to primarily focus on NCAA basketball), so the results from this experiment will be far from statistically sound. Don’t take your iPad and head off to Las Vegas just yet.
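To make the two ways of reporting a rating concrete, here is a minimal Python sketch. The function names, the example team values, and the choice of k = 3 are all assumptions for illustration (k = 3 is a commonly cited default, but the post only says k varies per game).

```python
from statistics import NormalDist


def interval_95(mu, sigma):
    """Return the (low, high) bounds of a 95% interval around mu."""
    z = NormalDist().inv_cdf(0.975)  # two-sided 95% => about 1.96
    return (mu - z * sigma, mu + z * sigma)


def conservative_rating(mu, sigma, k=3.0):
    """Conservative rating mu - k*sigma; k=3 is an assumed value."""
    return mu - k * sigma


# Hypothetical team a few weeks into the season (made-up numbers).
example_mu, example_sigma = 2150.0, 400.0
low, high = interval_95(example_mu, example_sigma)
print(f"mu = {example_mu:.0f}, 95% interval = ({low:.0f}, {high:.0f})")
print(f"conservative rating = {conservative_rating(example_mu, example_sigma):.0f}")
```

The interval keeps mu as the headline number and shows the uncertainty around it, while the conservative rating folds the uncertainty into a single, deliberately pessimistic value.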

For the detail-oriented folks out there, the following decisions have been made ahead of time:

For neutral site games, a coin flip will determine the home team / away team.
Predictions for the game outcomes are based on rating. The higher rating is predicted to win.
If two teams happen to have an exact match with respect to their ratings (and this will probably happen due to the limited number of games), no prediction will be made using TrueSkill, though it will be noted. Accordingly, those games will be dropped when calculating prediction accuracy.
Ties will be incorporated into the TrueSkill ratings (which makes sense, as our 10-0 team tying against a 0-10 team should definitely impact their ratings). The probability of a tie is a factor when calculating ratings, and this value is set to 1/256 (in the prior two seasons, there have been 2 ties out of 512 regular-season games).
The metric for accuracy is essentially a sensitivity measure, where a true positive means that the winner was correctly predicted.
All teams have their initial distribution created equally, with the parameters $\mu=2000$ and $\sigma=666.666$ , per Microsoft’s recommendation of $\sigma=\mu / 3$ . I’m not giving any bonus points out for last year’s performances.
Additional parameters relevant to TrueSkill calculations are $\beta$ and $\tau$ . You can think of $\beta$ as the number of rating/skill points that team A needs to have over team B, in order to be roughly 76% sure (a figure often rounded up to 80%) that team A will beat team B. Here $\beta = 333.333$ (or $\sigma / 2$ ). $\tau$ is important, as it allows a player’s standard deviation to expand, as well as shrink. Without $\tau$ , it would always shrink, indicating that TrueSkill is more and more confident about a player/team. Here $\tau = 6.666$ (or $\sigma / 100$ ).
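The prediction and scoring rules in this list are simple enough to sketch in a few lines of Python. Everything named here (`win_probability`, `predict`, `accuracy`, the example teams and ratings) is hypothetical, and the win-probability formula is the usual simplified one that ignores each team’s own sigma and the tie margin. How games that actually end in a tie should be scored isn’t covered, so the sketch assumes every game has a winner.

```python
from math import sqrt
from statistics import NormalDist

BETA = 2000 / 6  # initial sigma / 2, about 333.333


def win_probability(mu_a, mu_b, beta=BETA):
    """Chance that A outperforms B, treating both mu values as exact."""
    return NormalDist().cdf((mu_a - mu_b) / (sqrt(2) * beta))


def predict(home, away, ratings):
    """Pick the higher-rated team; return None when ratings match exactly."""
    if ratings[home] == ratings[away]:
        return None
    return home if ratings[home] > ratings[away] else away


def accuracy(games, ratings):
    """Share of correct picks, dropping games where no pick was made.

    games is a list of (home, away, actual_winner) tuples.
    """
    picks = [(predict(home, away, ratings), winner) for home, away, winner in games]
    scored = [(pick, winner) for pick, winner in picks if pick is not None]
    if not scored:
        return None
    return sum(pick == winner for pick, winner in scored) / len(scored)


# Made-up ratings and results for three hypothetical teams.
example_ratings = {"A": 2100.0, "B": 1900.0, "C": 1900.0}
example_games = [("A", "B", "A"), ("B", "C", "B"), ("A", "C", "C")]

print(win_probability(2000 + BETA, 2000))  # a lead of exactly beta points
print(accuracy(example_games, example_ratings))
```

Under this simplified formula, a lead of exactly $\beta$ points works out to a win probability of about 0.76. In the example, the B vs. C game is dropped because the two ratings match, so accuracy is computed over the remaining two games.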

At this point, some of you might be thinking “Why even bother with all this? Isn’t this really just picking the team with the better record to win?” Sometimes that might turn out to be the case, but TrueSkill actually gives us more information than looking at win/loss records. One of the benefits of this distribution-based rating system is that it’s incorporating the strength of the prior opposing teams into a team’s current rating.

As an example, if Chicago and Minnesota are both 3-3, and we are just looking at win/loss records, who would we pick to win? Hard to say, right? However, if we know that Minnesota is 3-3 because they played some very difficult teams over the last 6 weeks, and Chicago is 3-3 even though they’ve played 6 teams that aren’t very strong, we’re likely to pick Minnesota. This is precisely what TrueSkill is capturing.
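A toy simulation illustrates the point. This is a sketch only: it uses a simplified win/loss update with no tie margin and no $\tau$, the opponents’ ratings (2400 and 1600, each with sigma 200) are made up, every opponent is treated as a fresh team with that same rating, and the win/loss order is arbitrary.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()
BETA = 2000 / 6  # initial sigma / 2


def update(winner, loser, beta=BETA):
    """Simplified win/loss update on (mu, sigma) pairs; no ties, no tau."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = sqrt(2 * beta**2 + sig_w**2 + sig_l**2)
    t = (mu_w - mu_l) / c
    v = STD.pdf(t) / STD.cdf(t)  # larger when the win was a surprise
    w = v * (v + t)              # always between 0 and 1
    new_winner = (mu_w + sig_w**2 / c * v, sig_w * sqrt(1 - sig_w**2 / c**2 * w))
    new_loser = (mu_l - sig_l**2 / c * v, sig_l * sqrt(1 - sig_l**2 / c**2 * w))
    return new_winner, new_loser


def play_season(start, opponent, results):
    """Apply a sequence of wins (True) and losses (False) against one opponent profile."""
    team = start
    for won in results:
        if won:
            team, _ = update(team, opponent)
        else:
            _, team = update(opponent, team)
    return team


start = (2000.0, 2000 / 3)
three_and_three = [True, False, True, False, True, False]

minnesota = play_season(start, (2400.0, 200.0), three_and_three)  # tough schedule
chicago = play_season(start, (1600.0, 200.0), three_and_three)    # soft schedule

print(f"Minnesota: mu = {minnesota[0]:.0f}, sigma = {minnesota[1]:.0f}")
print(f"Chicago:   mu = {chicago[0]:.0f}, sigma = {chicago[1]:.0f}")
```

Both teams finish 3-3, but the team that went 3-3 against strong opponents ends up with the higher mu, which is the schedule-strength effect described above.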

Clearly the factors that determine the winner and loser of a football game are far more complex than just a rating value. Coaches and individual players make a huge difference, along with player injuries, positional matchups, etc… Weather and even the stadium and crowd can have an impact on an outcome, to say nothing of pure luck.

The goal of this exercise is not to present TrueSkill as a univariate solution for all your office pick-em needs (especially with just 13 or so games per team for a rating value); rather, it’s a simple investigation into the rating system to see what kind of signal, if any, we can get. My expectation is that using TrueSkill in this manner will provide us with some improvement over choosing the home team to win, though you can never be sure. As Bert Bell was fond of saying, “On any given Sunday…”

Here’s to hoping it’s a fun 2015 NFL season and that we end up with some interesting results!