What is my model?
My predictive rating model has gone through so many iterations and tweaks over the years that it's difficult to know where to start in explaining what goes into it. My original inspiration was an offseason project from when I was a GA at UW-Platteville, after I discovered Ken Pomeroy's basketball ratings. I recreated KenPom's rankings, which at that time were essentially just recency-weighted, opponent-adjusted Pythagorean win percentage[1], for DIII football. The goal was simple: to predict the scoring margin of future games as accurately as possible.

After calculating a couple of seasons' worth of data, and then posting in-season ratings into the void on a blogspot site for a season, I came to a couple of realizations:
1. Since Pythagorean win percentage is essentially a team's ratio of points scored to the sum of points scored and points allowed, teams with great defenses were getting overrated. In most leagues, sports, and competition levels, this wouldn't be an issue, because every team allows a decent number of points. In DIII football, though, it's not uncommon for a handful of teams each year to average single-digit points allowed over an entire season.
2. The KenPom method of adjusting for opponents (AdjEff = ActualEff + OpponentsEff - AverageEff, iterated until the ratings come into balance) was severely overfitting, which hurt predictiveness. The first season for which I calculated ratings using this method was 2013, and I think the whole ASC was ranked in my top 25. There are not enough games linking teams together in DIII football to opponent-adjust the way most sports do.
Because of these realizations, my model experienced its first major overhaul. Instead of Pythagorean win percentage, I switched to scoring margin for my main efficiency metric. And instead of iterating opponent adjustments repeatedly until they stabilized, I switched to a Bayesian updating model, with FiveThirtyEight's Elo ratings as my inspiration. The catch was that Elo only gives an overall team rating, while my original ratings had separate offensive and defensive ratings, and I didn't want to switch to a model that gave me less granular information, so I had to create my own original ranking system[2].
In 2024 I decided to incorporate Elo ratings into my model as well, utilizing two different Elo ratings for each team. The first is the traditional Elo that does not consider margin of victory. The second, which I call Elo+, is a blatant copy of FiveThirtyEight's model, but without the quarterback adjustment.
Here's the bare-bones explanation of how my model works:

1. Create preseason ratings for each team's offense (AdjO) and defense (AdjD) based on:
   a. Recent team success,
   b. Returning production, and
   c. Coaching changes.
2. Calculate overall ratings from AdjO, AdjD, AdjEff, Elo, and Elo+.
3. Calculate game predictions using AdjEff, Elo, and Elo+.
4. After games are played, adjust team ratings proportional to how much they over- or under-performed their expectation.
5. Repeat weekly.
Preseason Ratings
Preseason ratings
are the lifeblood of my model. Since I'm doing what I can to avoid overfitting
within a season, having accurate priors is extremely important. My preseason
ratings are calculated from a five-year weighted average of each team's
end-of-season ratings, with adjustments for returning production and coaching
changes.
I don't have
returning production or player-level statistics for seasons before 2009, so
there are no returning production adjustments for those seasons, and the linear
weights for the 5-year average are different. I've tried adding some regression to the
mean every time I've tweaked my preseason weights, and it always
makes predictions less accurate, so I don't use it. I have a constant I add to
offensive and defensive ratings to account for the general increase in scoring
each year and to balance the overall national average rating.
| | AdjHFA | Pre-2009 AdjO | Pre-2009 AdjD | 2009 & Later AdjO | 2009 & Later AdjD | Elo | Elo+ |
|---|---|---|---|---|---|---|---|
| Year N-1 | 1.000 | 0.800 | 0.770 | 0.860 | 0.830 | 0.850 | 0.850 |
| Year N-2 | 0.000 | 0.070 | 0.100 | 0.080 | 0.065 | 0.070 | 0.065 |
| Year N-3 | 0.000 | 0.025 | 0.025 | 0.020 | 0.015 | 0.025 | 0.020 |
| Year N-4 | 0.000 | 0.065 | 0.060 | 0.025 | 0.080 | 0.045 | 0.040 |
| Year N-5 | 0.000 | 0.040 | 0.045 | 0.015 | 0.010 | 0.010 | 0.025 |
| Average | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Constant | 0.000 | 0.310 | 0.015 | 0.365 | 0.345 | 1.950 | 1.450 |
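For the code-minded, the base preseason calculation is just a weighted sum of past ratings plus a constant. Here's a minimal sketch using the 2009-and-later AdjO column from the table (function and variable names are illustrative, not pulled from my actual spreadsheet; the returning production and coaching adjustments described below get layered on top of this):

```python
# Rough sketch of the base preseason AdjO calculation (2009 & later weights).
ADJ_O_WEIGHTS = [0.860, 0.080, 0.020, 0.025, 0.015]  # Year N-1 ... Year N-5
ADJ_O_CONSTANT = 0.365  # scoring-inflation / rebalancing constant from the table

def preseason_adj_o(past_adj_o: list[float]) -> float:
    """Weighted sum of the last five end-of-season AdjO ratings, plus a constant.

    `past_adj_o` is ordered most recent first: [Year N-1, ..., Year N-5].
    """
    return sum(w * r for w, r in zip(ADJ_O_WEIGHTS, past_adj_o)) + ADJ_O_CONSTANT
```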
My historic ratings
go back to 1997[3]. For those early seasons,
I use Massey Ratings' offensive and defensive ratings as my priors[4].
I also use Massey for preseason ratings for teams reclassifying from other
divisions, since my model is exclusive to DIII teams[5].
I adjust the numbers just to get them on the right scale, and I'm sure my
preseason weights differ from Massey's, but otherwise, our preseason ratings
for reclassifying teams should be nearly identical.
For startup programs, I don't have Massey Ratings to set my prior[6], so I originally used the historic average of startups, an AdjEff of -32, as the preseason rating:

AdjO = (National Average Offense) - 16.0
AdjD = (National Average Defense) + 16.0
That worked well enough for a while, but I wanted to set a slightly better prior[7], so I tried a method from the now-defunct Atomic Ratings[8] and used teams' schedule strength to adjust my preseason expectation:

AdjO = (National_Avg_AdjO) - 13.5 - 0.32*Opp_Avg_AdjEff
AdjD = (National_Avg_AdjD) + 13.5 + 0.32*Opp_Avg_AdjEff
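In code, that startup prior is a one-liner per side of the ball. A sketch (names are mine; the constants are the ones above):

```python
def startup_prior(nat_avg_adj_o: float, nat_avg_adj_d: float,
                  opp_avg_adj_eff: float) -> tuple[float, float]:
    """Preseason AdjO/AdjD for a brand-new program, nudged by the average
    AdjEff of the opponents it scheduled (relative to the national average)."""
    adj_o = nat_avg_adj_o - 13.5 - 0.32 * opp_avg_adj_eff
    adj_d = nat_avg_adj_d + 13.5 + 0.32 * opp_avg_adj_eff
    return adj_o, adj_d
```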
The next piece of the preseason ratings is the returning production numbers. I used to use teams' self-reported number of starters returning in D3Football's Kickoff magazine[9], but with Kickoff's demise in the COVID 2020 season (RIP in peace), I had to figure something else out.

Looking at the primary offensive and defensive statistics for each team, I calculated the percent of returning production in each, and then used a multivariate regression to determine how much each statistic contributed to a team's over- or under-performance relative to my model's naive preseason expectations. To determine who was returning or not, I just assumed players listed as seniors were leaving, and everyone else was staying. From 2009-2019, this assumption was probably well-founded, but after 2020, who knows? I kept doing it anyways, but manually checked the rosters for the top 50 or so teams[10].
| Offensive Returning Production | Weight | Defensive Returning Production | Weight |
|---|---|---|---|
| Pass Att | 0.150 | Tackles | 0.121 |
| Pass Yds | 0.067 | TFLs | 0.513 |
| Rush Att | 0.000 | Sacks | 0.000 |
| Rush Yds | 0.154 | PBU | 0.219 |
| Receptions | 0.349 | INTs | 0.147 |
| Rec Yds | 0.041 | Def GP | 0.000 |
| OL GP[11] | 0.239 | | |
The equations for how much a team's returning production affects its ratings are:

AdjO = AdjO + 12.5*(Returning_Off - Avg_Returning_Off)
AdjD = AdjD - 10.0*(Returning_Def - Avg_Returning_Def)

As you can see, the coefficient for returning offensive production is slightly higher than the one for returning defensive production, but the variation among teams' returning defensive production is higher than the variation in offensive production[12], so the effect of offensive vs. defensive production is relatively equal.
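Putting the weight table and the adjustment together, here's a rough sketch of how the returning production piece works (dictionary keys and function names are illustrative; the weights and the 12.5/10.0 coefficients are the ones above):

```python
OFF_WEIGHTS = {"pass_att": 0.150, "pass_yds": 0.067, "rush_att": 0.000,
               "rush_yds": 0.154, "receptions": 0.349, "rec_yds": 0.041,
               "ol_gp": 0.239}
DEF_WEIGHTS = {"tackles": 0.121, "tfls": 0.513, "sacks": 0.000,
               "pbu": 0.219, "ints": 0.147, "def_gp": 0.000}

def returning_pct(returning_share: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted share of last season's production that returns (0.0 to 1.0)."""
    return sum(weights[k] * returning_share.get(k, 0.0) for k in weights)

def apply_returning_production(adj_o: float, adj_d: float,
                               ret_off: float, ret_def: float,
                               avg_ret_off: float, avg_ret_def: float) -> tuple[float, float]:
    """Nudge preseason ratings by how much more (or less) a team returns than average."""
    adj_o = adj_o + 12.5 * (ret_off - avg_ret_off)
    adj_d = adj_d - 10.0 * (ret_def - avg_ret_def)  # lower AdjD = better defense
    return adj_o, adj_d
```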
The last piece of the preseason ratings is a coaching adjustment. The NCAA.org statistics page keeps great records of teams' head coaches over the years, so it's easy to track when teams have coaching changes. Any team that experienced a head coaching change since the start of the previous season has its ratings adjusted[13]. Those teams get a slight rating decline (2.22 points) and regression to the mean (2%).

AdjO = 0.98*AdjO + 0.02*Avg_AdjO - 1.11
AdjD = 0.98*AdjD + 0.02*Avg_AdjD + 1.11
In-Season Predictions and Ratings
Now that I've calculated preseason ratings, I can start predicting games. New in 2024, I'm making four different predictions for each game. The first prediction is the same one I've always used, which is based on teams' AdjEff ratings. Then I make one prediction apiece using the two Elo models, and the final, official prediction is an ensemble[14] of those three predictions[15]. The final prediction uses linear weights, with the AdjEff prediction carrying the major heft and the Elo models serving as a slight corrective if the AdjEff model is overconfident[16].

Spread = 0.928*AdjEff_Spread + 0.045*Elo_Spread + 0.028*Elo+_Spread
Before calculating game spreads, I first calculate the estimated home-field advantage (EstHFA) for each matchup, which is the sum of each team's AdjHFA. This EstHFA is used in each of the three models' predictions. My version of AdjHFA estimates the difference between how well a team plays at home and how well it plays on the road[17].
Calculating the
estimated margin of victory (EstMOV) and win
probability (EstWP) for each model is probably
the simplest part of this whole enterprise. The spread for the Elo models is
calculated the same way FiveThirtyEight does it, but with my version of EstHFA instead:
EstMOV = (Elo - OppElo + EstHFA)/25
EstWP = 1 / (10^(-25*EstMOV/400) + 1)
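As a sketch, the Elo-side prediction looks something like this (assuming the standard Elo win-probability form written above; the function name is illustrative):

```python
def elo_prediction(elo: float, opp_elo: float, est_hfa: float) -> tuple[float, float]:
    """Return (EstMOV, EstWP) for the home team under one of the Elo models."""
    est_mov = (elo - opp_elo + est_hfa) / 25.0
    est_wp = 1.0 / (10.0 ** (-25.0 * est_mov / 400.0) + 1.0)
    return est_mov, est_wp
```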
In the AdjEff prediction, I'm explicitly calculating the estimated score for each team instead of just the spread, so that's only mildly more complicated:
EstOff = AdjO + Opp_AdjD - Avg_AdjD + EstHFA/2
EstDef = AdjD + Opp_AdjO - Avg_AdjO - EstHFA/2
EstMOV = EstOff - EstDef
EstWP = normdist(EstMOV, 0, 15.75, TRUE)[18]
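Here's a sketch of the AdjEff prediction and the ensemble spread. The normdist() call is just the normal CDF, so I've stood it in with math.erf; function names are mine, while the constants and weights are the ones quoted above:

```python
import math

def norm_cdf(x: float, mu: float = 0.0, sigma: float = 15.75) -> float:
    """Equivalent of normdist(x, 0, 15.75, TRUE)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def adjeff_prediction(adj_o, adj_d, opp_adj_o, opp_adj_d,
                      avg_adj_o, avg_adj_d, est_hfa):
    """Estimated score, margin, and win probability for the home team."""
    est_off = adj_o + opp_adj_d - avg_adj_d + est_hfa / 2.0
    est_def = adj_d + opp_adj_o - avg_adj_o - est_hfa / 2.0
    est_mov = est_off - est_def
    return est_off, est_def, est_mov, norm_cdf(est_mov)

def ensemble_spread(adjeff_spread: float, elo_spread: float, eloplus_spread: float) -> float:
    """Official prediction: linear-weighted blend of the three model spreads."""
    return 0.928 * adjeff_spread + 0.045 * elo_spread + 0.028 * eloplus_spread
```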
After making game predictions, the games are played (who cares about that part?), and then the ratings get updated. As you might have guessed, the Elo ratings use the same update formula as FiveThirtyEight, but with different K-Factors[19]. The non-MOV Elo model uses a K-Factor of 146 and the MOV-inclusive Elo+ model uses a K-Factor of 57. The difference between the two K-Factors is due to the latter model's inclusion of an MOV multiplier. AdjHFA is updated with a very similar process:

AdjHFA = Pregame_AdjHFA + K*(ActualMOV - EstMOV)

The K-Factor for AdjHFA is 0.0028, which essentially means that a little over a quarter of one percent of the prediction error belongs to each team's AdjHFA, or about half of one percent combined.
Updating the AdjO & AdjD ratings might be the most convoluted part of what I do. After I made the switch from the Pythagorean model to the current version, I initially used the following equations to update AdjO and AdjD:
AdjO = Pregame_AdjO + K*(ActualOff - EstOff)
AdjD = Pregame_AdjD + K*(ActualDef - EstDef)
That process worked well enough for a few years, but then I started to get frustrated with one team in particular running up scores in the regular season and then underperforming my model's expectations in the playoffs[20], so I decided that my K-Factor, which had been a constant of 0.1645, should vary depending on the pregame win probability and a simple post-game second-order win probability (PostWP).

Below is the first version of my variable K-Factor:

PostWP = normdist(ActualMOV, 0, 15.75, TRUE)
Var_K = K*(0.5 + 2*0.5*min(EstWP+PostWP, 1-EstWP + 1-PostWP))
AdjO = Pregame_AdjO + Var_K*(ActualOff - EstOff)
AdjD = Pregame_AdjD + Var_K*(ActualDef - EstDef)
What the variable K-Factor does is decrease the magnitude of movement by about half for games that were supposed to be blowouts and ended up being blowouts. If a game is supposed to be moderately close and ends up being moderately close, the ratings move about the same as they would have with a constant K-Factor. The big jumps happen in upsets: games where the result is particularly surprising, where the underdog wins by a lot or a big underdog pulls off the upset, end up with the largest K-Factor.
After using this
K-Factor for a few years, I decided to do some more thorough testing to see
what proportion of constant K to variable K yielded the best predictive
accuracy. I started off with 50-50 (the 0.5 terms in the Var_K
equation, above), and the best results settled in at 80% constant/20% variable.
Before the 2023
season I made two other changes to my K-Factor. First, I added a weekly
modifier, so that games earlier in the year moved ratings more than games later
in the year. I made this change because I noticed that my ratings were always
slightly less predictive early in the year, which is to be expected, but by
allowing ratings to jump around more in the early weeks, I was able to get even
better predictive accuracy by mid-season.
| Week of Season | K-Factor Magnitude |
|---|---|
| Week 1 | 1.14 |
| Week 2 | 1.10 |
| Week 3 | 1.06 |
| Week 4 | 1.02 |
| Week 5 | 0.98 |
| Week 6 | 0.94 |
| Weeks 7-11 | 0.90 |
| Playoffs | 1.00 |
The second major
change before the 2023 season was using different K-Factors for AdjO and
AdjD. I was able to improve accuracy by using
a slightly higher offensive K-Factor (0.185) and a slightly lower defensive
K-Factor (0.160).
K_Off = 0.185*(0.8 + 2*0.2*min(EstWP+PostWP, 1-EstWP + 1-PostWP))
K_Def = 0.160*(0.8 + 2*0.2*min(EstWP+PostWP, 1-EstWP + 1-PostWP))
AdjO = Pregame_AdjO + K_Off*(ActualOff - EstOff)
AdjD = Pregame_AdjD + K_Def*(ActualDef - EstDef)
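Putting the whole update step in one place, here's a sketch of how a single game moves AdjO and AdjD under the current setup. Treating the weekly modifier as a straight multiplier on K is my shorthand for the table above; the other constants are as quoted:

```python
import math

def norm_cdf(x: float, mu: float = 0.0, sigma: float = 15.75) -> float:
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

WEEK_MULT = {1: 1.14, 2: 1.10, 3: 1.06, 4: 1.02, 5: 0.98, 6: 0.94, "playoffs": 1.00}

def week_multiplier(week) -> float:
    """Weeks 7-11 default to 0.90; the playoffs go back to 1.00."""
    return WEEK_MULT.get(week, 0.90)

def update_ratings(adj_o, adj_d, actual_off, actual_def,
                   est_off, est_def, est_wp, actual_mov, week):
    post_wp = norm_cdf(actual_mov)  # post-game second-order win probability
    surprise = min(est_wp + post_wp, (1 - est_wp) + (1 - post_wp))
    mult = week_multiplier(week)
    k_off = 0.185 * (0.8 + 2 * 0.2 * surprise) * mult
    k_def = 0.160 * (0.8 + 2 * 0.2 * surprise) * mult
    return (adj_o + k_off * (actual_off - est_off),
            adj_d + k_def * (actual_def - est_def))
```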
Considering all the various K-Factors, I can estimate how much of the model's prediction error is due to HFA, offense, defense, and random variation/noise.

Let's use an example. Say a game was predicted to be 35-25 in favor of the home team (74%/26% win probability), and it ended up being 35-32 (58%/42% postgame win expectancy).
| Factor | Points | Percentage |
|---|---|---|
| Error | 7.00 | 100% |
| Home HFA | 0.02 | 0.3% |
| Away HFA | 0.02 | 0.3% |
| Home Off | 0.00 | 0% |
| Away Def | 0.00 | 0% |
| Home Def | 1.20 | 17% |
| Away Off | 1.39 | 20% |
| Noise | 4.36 | 62% |
In this hypothetical, 62% of the prediction error is due to noise, 0.6% to home-field advantage, 20% to offense, and 17% to defense. Because most of the error is attributed to noise, if the game were replayed immediately, the subsequent prediction would still be closer to the previous prediction than to the actual outcome[21].
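If you want to check my arithmetic, here's that example worked through in code (a sketch; I'm assuming a weekly modifier of 1.0 and attributing each row using the K-Factors described above):

```python
import math

def norm_cdf(x, mu=0.0, sigma=15.75):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

est_home, est_away = 35, 25     # predicted score
act_home, act_away = 35, 32     # actual score
error = abs((act_home - act_away) - (est_home - est_away))   # 7.00 points

est_wp = norm_cdf(est_home - est_away)     # ~0.74 pregame
post_wp = norm_cdf(act_home - act_away)    # ~0.58 postgame
surprise = min(est_wp + post_wp, (1 - est_wp) + (1 - post_wp))
k_off = 0.185 * (0.8 + 2 * 0.2 * surprise)
k_def = 0.160 * (0.8 + 2 * 0.2 * surprise)

home_hfa = away_hfa = 0.0028 * error                 # ~0.02 points apiece
home_off = k_off * abs(act_home - est_home)          # 0.00 (home scored exactly 35)
away_def = k_def * abs(act_home - est_home)          # 0.00
home_def = k_def * abs(act_away - est_away)          # ~1.20
away_off = k_off * abs(act_away - est_away)          # ~1.39
noise = error - home_hfa - away_hfa - home_off - away_def - home_def - away_off

print(round(noise, 2), round(noise / error, 2))      # ~4.36 points, ~62% of the error
```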
To make the already opaque process of updating ratings after each game even more complicated, new in 2024 I'm adding what I'm referring to as an Island Adjustment[22]. The schools on the DIII islands on the West Coast and in Texas rarely get to play games outside of their geographic proximity, so I wanted to make sure any persistent non-conference trends were accounted for in team ratings. Add in the fact that the PAC didn't play any non-conference games in 2023, and that each of their teams that played in the postseason vastly outperformed their expectation, and I decided this was the right direction to go.

The adjustment is simple. It takes the sum of the predictive error for a conference's teams in a given week's non-conference games, and then adds 0.6% of that error to each team's offensive and defensive ratings. If a conference plays eight non-conference games, and each team performs 7 points better than predicted, the ratings will update as usual, and then about a sixth of a point (7*8*0.6%/2 = 0.17) gets added to and subtracted from each team's AdjO and AdjD, respectively[23]. If the conference underperforms, team ratings get worse. This adjustment hardly improves predictive accuracy[24], but it feels right to me, so I'm keeping it.
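A sketch of the bookkeeping, with the conference's weekly non-conference errors passed in as a plain list (that data structure is just for illustration):

```python
def island_nudge(nonconf_errors: list[float]) -> float:
    """Per-team rating nudge from one week of a conference's non-conference results.

    `nonconf_errors` are (actual - predicted) margins from the conference's perspective.
    """
    return 0.006 * sum(nonconf_errors) / 2.0   # half goes to AdjO, half to AdjD

# The example above: eight non-conference games, each 7 points better than predicted.
nudge = island_nudge([7.0] * 8)    # ~0.17
# Each team in the conference gets AdjO + nudge and AdjD - nudge
# (subtracting from AdjD rates the defense better).
```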
One Last Thing
For a long time, I
completely excluded games outside of the division. I assumed there were so few
of those games that including them would just add noise to my predictions, and
I wanted my model to be exclusive to DIII. Only DIII games. Only DIII teams. Calibrated
for accuracy in DIII.
After 2020 it seemed to me that some of the top teams in the division were playing more games out of the division than usual[25], so I wanted some way to give those teams credit in my resume rankings and to inform my Top 25 vote. In the seasons from 2021 to 2023, I manually input the score predictions for out-of-division games from Massey into my model. I didn't track whether including those games improved my model's accuracy at all, because I was mostly concerned with including them for the resume rankings.
This offseason, while I was going back through previous seasons to calculate Elo ratings and calibrate the Island Adjustment, I noticed how much more frequently teams used to play out of the division than they do now[26]. Because of that, I decided to figure out some way to include results from those historical games in my ratings. Massey is great for historic ratings, but as far as I know, there's no way to go back in time and see team ratings mid-season to calculate estimated scores.
Begrudgingly, I decided to use Massey's end-of-season ratings to craft my predictions[27]. The first step in creating a game prediction was crafting a rating for each non-DIII team. To do this, I calculated the average and standard deviation of the offensive & defensive ratings for teams in my model, then took the non-DIII teams' Massey ratings and re-scaled them so they would have the same z-score in both models.
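In sketch form, that re-scaling is a plain z-score conversion (standard library only; function and variable names are mine):

```python
import statistics

def rescale_to_model(massey_value: float, massey_values: list[float],
                     model_values: list[float]) -> float:
    """Map a Massey rating onto my model's scale so its z-score is unchanged."""
    z = (massey_value - statistics.mean(massey_values)) / statistics.pstdev(massey_values)
    return statistics.mean(model_values) + z * statistics.pstdev(model_values)

# e.g., a non-DIII team's offensive rating:
# adj_o = rescale_to_model(massey_off, all_massey_off_ratings, all_my_adj_o_ratings)
```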
After plugging those teams' ratings into my model and calculating game predictions the same way I do for regular games, my model's accuracy improved so drastically that I thought I had introduced a coding error. Admittedly, I was cheating a bit because I was using end-of-season ratings for the non-DIII teams, but my model's average absolute error improved by about 1%. I know this doesn't seem like a "wow, you definitely made a mistake" level of improvement, but for context, I looked up all the other tweaks I've made since 2015:
- Returning production adjustments (0.50%)
- Weighted K-Factors by game week (0.21%)
- Variable K-Factor (0.13%)
- Team-specific AdjHFA (0.09%)
- Different linear weights for preseason AdjO & AdjD (0.07%)
- Ensemble ratings with Elo & Elo+ (0.07%)
- Separate K-Factors for AdjO & AdjD (0.04%)
- Coaching adjustment (0.04%)
- SOS-dependent startup program ratings (0.04%)
- Island adjustment (0.01%)
Including non-DIII games provided nearly as much of a lift as all of those combined. The moral of the story? A) Including more data (games) in your model is going to do way more for your accuracy than anything else you can do, and B) calibrating a model for a specific level and sport sets a pretty high baseline for accuracy. That latter point is why my answer to a frequent question I get, "Why don't you use this for sports betting?", is usually "Because this model would be terrible at predicting anything but DIII football."
Conclusion
If you've somehow made it this far, congratulations! You now have all the information you would need to perfectly recreate my ratings.
And you're a
complete degenerate[28].
[1] As you'll see, I eventually changed my ratings to use scoring margins instead of scoring ratios, a change KenPom made as well, the season after I did. I would like to take credit and say I influenced him, but in all likelihood, his change was made independently from mine.
[2] I do not recommend this, if you can avoid it.
[3] Why 1997? Because that's as far back as the tabulated team results go in the NCAA database. Game results before 1997 are only stored as handwritten scans. I'm hopeful that some sort of AI/machine learning algorithm will help me solve the problem of turning all those scans into sortable tables, but I haven't bothered to dig in too much yet, although I am determined to eventually get ratings calculated for the four-peat Augustana teams.
[4] I've reached out to Ken Massey about getting his game files for seasons pre-1997, but he said that he's not confident in his data fidelity for those earlier seasons, so he's unwilling to share them.
[5] With some minor exceptions. I calculate Apprentice's ratings every year, but don't rank them. They've played a majority of their games against DIII opponents every year except one since 1997. I also calculated Trinity Bible's ratings for a few seasons while they were in the UMAC, and I calculated Emory & Henry's ratings their first year as a DII, since they still played a full ODAC schedule. I also think I have some other schools included in my historic ratings prematurely, before they were officially classified as DIII, such as Crown and Northwestern Minn., but they were playing in a DIII conference, so that's fine with me.
[6] For some, I technically could. Starting in the mid-aughts, Massey started calculating ratings for club teams, so if a DIII startup played a club season before they played a varsity season, they would have a Massey Rating, but I don't use those at all.
[7] It was Calvin's announcement that they were adding football that gave me the motivation.
[8] Atomic Ratings were pretty good. Back when I used to track every DIII computer rating's predictions, they were consistently in the same ballpark as my ratings in terms of absolute error in the latter half of the season.
[9] Amazingly, this method worked better than using any sort of percentage of games started or games played returning. I think a lot of coaches were responding to "How many good players do you have returning?" instead of the actual question asked. The average number of starters returning from 2013-2019 was 6.75, and each returning starter above or below average was worth somewhere between 1.0 and 1.5 points per game.
[10] And then I assumed the percentage of seniors coming back held constant for teams I didn't check manually.
[11] I would have liked to use offensive line games started as well, but there are two problems: seasons before 2012 don't have games started in the NCAA statistics database, and games started data is extremely scattershot.
[12] With a slight caveat: returning offensive production tends to be bimodal, with one mode centered around 80% and the other much, much lower. You might be able to figure out why. When both your passing and rushing offense run through one player, the quarterback, losing that one player can be catastrophic to your returning production numbers.
[13] By the NCAA's count, there have only been 13 mid-season coaching changes since 1997.
[14] Ensemble is just a fancy word for weighted average.
[15] Back-testing this ensemble over 26 seasons, it was more accurate than the AdjEff spread in 20 seasons, but only more accurate in one season since COVID. The lost season did a number on the Elo ratings' accuracy.
[16] I decided to do an ensemble rating after I created the Elo ratings for a different project and noticed that they were much more accurate at predicting the Stagg Bowl. This is almost definitely because of small sample sizes, but I figured, why not? The Elo models don't usually move the prediction much, usually only a half point, if that, but they improve the overall accuracy of the model.
[17] A lot of very smart people have spent a lot of time calculating how different aspects of sport (rest, travel distance, changing time zones, etc.) affect HFA. All of that would probably make my predictions a little better, but to be honest, it sounds like way more work than it's worth.
[18] I'm sure most of you recognize this as the pro-football-reference.com win probability equation. I honestly don't care much at all for doing anything more detailed for win probability, because most people see WP>50% and internalize it as WP=100%. I only care about the predicted margin.
[19] The K-Factor determines how much the ratings move after each game.
[20] [cough] St. Thomas [cough]
[21] This is hard for people to accept, and I think it's where 99% of the critiques of my model come from, but if you want a model that strictly ranks teams ahead of the teams they've beaten, it would be a terrible model for predicting future games.
[22] After posting about it on Twitter, Bill Connelly of ESPN, formerly of SBNation, said he does something similar, but he calls his a Real Estate Adjustment.
[23] Subtracting from AdjD means their defense is rated better.
[24] Absolute error is 0.01% better than if I didn't use it.
[25] Historically, this really wasn't the case at all.
[26] In 1997 & 1998, DIII teams averaged nearly one out of division game per season. In 2023 the whole division only played 24 games outside the division.
[27] It's really not fair to call these "predictions" since they're created from ratings that have information from the future, but when I'm tracking my model's accuracy, I'm excluding these games anyways. Games after 2020 use non-DIII teams' pre-game Massey Ratings.
[28] In other words, one of my people.