

What is my model?

My predictive rating model has gone through so many iterations and tweaks over the years that it's difficult to know where to start in explaining what goes into it. My original inspiration was an offseason project from my time as a GA at UW-Platteville, after I discovered Ken Pomeroy's basketball ratings. I recreated KenPom's rankings, which at that time were essentially just recency-weighted, opponent-adjusted Pythagorean win percentage[1], for DIII football. The goal was simple: to predict the scoring margin of future games as accurately as possible.

After calculating a couple of seasons' worth of data, and then posting in-season ratings into the void on a Blogspot site for a season, I came to a couple of realizations:

1. Since Pythagorean win percentage is essentially the ratio of a team's points scored to the sum of its points scored and points allowed, teams with great defenses were getting overrated. In most leagues, sports, and competition levels, this wouldn't be an issue, because every team allows a decent number of points. In DIII football, though, it's not uncommon for a handful of teams each year to average single-digit points allowed over an entire season.

2. The KenPom method of adjusting for opponents, AdjEff = ActualEff + OpponentsEff - AverageEff, iterated until the ratings come to balance, was severely overfitting, which hurt predictiveness. (A minimal sketch of that iteration follows this list.) The first season for which I calculated ratings using this method was 2013, and I think the whole ASC was ranked in my top 25. There are simply not enough games linking teams together in DIII football to opponent-adjust the way most sports do.
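For anyone who hasn't seen that style of adjustment, here's a minimal sketch of the kind of iteration described above (the approach I eventually abandoned). The team efficiencies and schedules are placeholders, not real data.

```python
# Minimal sketch of the KenPom-style iterative opponent adjustment I moved
# away from. All inputs are hypothetical.
def iterate_opponent_adjustment(raw_eff, schedules, n_iters=100):
    """raw_eff: {team: raw per-game efficiency}; schedules: {team: [opponents]}."""
    league_avg = sum(raw_eff.values()) / len(raw_eff)
    adj = dict(raw_eff)  # start from the unadjusted numbers
    for _ in range(n_iters):
        adj = {
            # AdjEff = ActualEff + OpponentsEff - AverageEff
            team: raw_eff[team]
            + sum(adj[opp] for opp in opps) / len(opps)
            - league_avg
            for team, opps in schedules.items()
        }
    return adj
```

With only ten or so games per team, most of them inside one conference, a loop like this ends up crediting or punishing whole conferences together, which is exactly the over-fitting problem described above.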

Because of these realizations, my model got its first major overhaul. Instead of Pythagorean win percentage, I switched to scoring margin as my main efficiency metric. And instead of iterating opponent adjustments repeatedly until they stabilized, I switched to a Bayesian updating model, with FiveThirtyEight's Elo ratings as my inspiration. The catch was that Elo only gives an overall team rating, while my original ratings had separate offensive and defensive ratings, and I didn't want to switch to a model that gave me less granular information, so I had to create my own original ranking system[2].

In 2024 I decided to incorporate Elo ratings into my model as well, utilizing two different Elo ratings for each team. The first is the traditional Elo rating that does not consider margin of victory. The second, which I call Elo+, is a blatant copy of FiveThirtyEight's model, but without the quarterback adjustment.

Here's the bare bones explanation for how my model works:

1. Create preseason ratings for each team's offense (AdjO) & defense (AdjD) based on

   a. Recent team success,

   b. Returning production, and

   c. Coaching changes.

2. Calculate overall ratings from AdjO, AdjD, AdjEff, Elo, and Elo+.

3. Calculate game predictions using AdjEff, Elo, and Elo+.

4. After games are played, adjust team ratings proportional to how much they over- or under-performed their expectation.

5. Repeat weekly.

Preseason Ratings

Preseason ratings are the lifeblood of my model. Since I'm doing what I can to avoid overfitting within a season, having accurate priors is extremely important. My preseason ratings are calculated from a five-year weighted average of each team's end-of-season ratings, with adjustments for returning production and coaching changes.

I don't have returning production or player-level statistics for seasons before 2009, so there are no returning production adjustments for those seasons, and the linear weights for the 5-year average are different. I've tried adding some regression to the mean every time I've tweaked my preseason weights, and it always makes predictions less accurate, so I don't use it. I have a constant I add to offensive and defensive ratings to account for the general increase in scoring each year and to balance the overall national average rating.

 

 

| Linear Weights | AdjHFA | Pre-2009 AdjO | Pre-2009 AdjD | 2009 & Later AdjO | 2009 & Later AdjD | Elo | Elo+ |
|---|---|---|---|---|---|---|---|
| Year N-1 | 1.000 | 0.800 | 0.770 | 0.860 | 0.830 | 0.850 | 0.850 |
| Year N-2 | 0.000 | 0.070 | 0.100 | 0.080 | 0.065 | 0.070 | 0.065 |
| Year N-3 | 0.000 | 0.025 | 0.025 | 0.020 | 0.015 | 0.025 | 0.020 |
| Year N-4 | 0.000 | 0.065 | 0.060 | 0.025 | 0.080 | 0.045 | 0.040 |
| Year N-5 | 0.000 | 0.040 | 0.045 | 0.015 | 0.010 | 0.010 | 0.025 |
| Average | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Constant | 0.000 | 0.310 | 0.015 | 0.365 | 0.345 | 1.950 | 1.450 |
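To make the table concrete, here's a minimal sketch of how I'd apply the 2009-and-later AdjO weights to one team's last five end-of-season ratings. The ratings in the example are made up, and the returning-production and coaching pieces described below aren't included here.

```python
# Hypothetical sketch: preseason AdjO from the 2009-and-later linear weights.
ADJ_O_WEIGHTS = (0.860, 0.080, 0.020, 0.025, 0.015)  # years N-1 through N-5
ADJ_O_CONSTANT = 0.365  # scoring-inflation / rebalancing constant

def preseason_adj_o(last_five_adj_o):
    """last_five_adj_o: end-of-season AdjO for years N-1..N-5 (made-up numbers)."""
    return sum(w * r for w, r in zip(ADJ_O_WEIGHTS, last_five_adj_o)) + ADJ_O_CONSTANT

print(round(preseason_adj_o([38.2, 35.0, 31.4, 33.7, 29.9]), 2))  # ~37.94
```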

 

My historic ratings go back to 1997[3]. For those early seasons, I use Massey Ratings' offensive and defensive ratings as my priors[4]. I also use Massey for preseason ratings for teams reclassifying from other divisions, since my model is exclusive to DIII teams[5]. I adjust the numbers just to get them on the right scale, and I'm sure my preseason weights differ from Massey's, but otherwise our preseason ratings for reclassifying teams should be nearly identical.

For startup programs, I don't have Massey Ratings to set my prior[6], so I originally used the historic average of startups, an AdjEff of -32, as the preseason rating:

AdjO = (National Average Offense) - 16.0
AdjD = (National Average Defense) + 16.0

That worked well enough for a while, but I wanted to set a slightly better prior[7], so I tried a method from the now-defunct Atomic Ratings[8] and used teams' schedule strength to adjust my preseason expectation:

AdjO = (National_Avg_AdjO) - 13.5 - 0.32*Opp_Avg_AdjEff
AdjD = (National_Avg_AdjD) + 13.5 + 0.32*Opp_Avg_AdjEff
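Here's a tiny sketch of that startup prior, just to make the plumbing explicit; the national averages and the opponent strength in the example are made-up numbers.

```python
def startup_prior(nat_avg_adj_o, nat_avg_adj_d, opp_avg_adj_eff):
    """SOS-dependent preseason prior for a startup program."""
    adj_o = nat_avg_adj_o - 13.5 - 0.32 * opp_avg_adj_eff
    adj_d = nat_avg_adj_d + 13.5 + 0.32 * opp_avg_adj_eff
    return adj_o, adj_d

# Made-up national averages (27.0 / 27.0) and a slightly tough schedule (+4.0).
print(startup_prior(27.0, 27.0, 4.0))  # (12.22, 41.78)
```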

The next piece of the preseason ratings is the returning production numbers. I used to use teams' self-reported number of starters returning in D3football's Kickoff magazine[9], but with Kickoff's demise in the COVID 2020 season (RIP in peace), I had to figure something else out.

Looking at the primary offensive and defensive statistics for each team, I calculated the percent of returning production in each, and then used a multivariate regression to determine how much each statistic contributed to a team's over- or under-performance relative to my model's naive preseason expectations. To determine who was returning or not, I just assumed players listed as seniors were leaving and everyone else was staying. From 2009-2019, this assumption was probably well-founded, but after 2020, who knows? I kept doing it anyways, but manually checked the rosters for the top 50 or so teams[10].


| Offensive Returning Production | Weight | Defensive Returning Production | Weight |
|---|---|---|---|
| Pass Att | 0.150 | Tackles | 0.121 |
| Pass Yds | 0.067 | TFLs | 0.513 |
| Rush Att | 0.000 | Sacks | 0.000 |
| Rush Yds | 0.154 | PBU | 0.219 |
| Receptions | 0.349 | INTs | 0.147 |
| Rec Yds | 0.041 | Def GP | 0.000 |
| OL GP[11] | 0.239 | | |

The equations for how much a team's returning production affects its ratings are:

AdjO = AdjO + 12.5*(Returning_Off - Avg_Returning_Off)
AdjD = AdjD - 10.0*(Returning_Def - Avg_Returning_Def)

As you can see, the coefficient for returning offensive production is slightly higher than the one for returning defensive production, but the variation among teams' returning defensive production is higher than the variation in offensive production[12], so the overall effect of offensive vs. defensive production is relatively equal.
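Here's a minimal sketch of how the pieces fit together: the stat weights come from the table above, but the team's returning shares and the national averages you'd pass in are placeholders.

```python
# Hypothetical sketch of the returning-production adjustment (offense shown;
# the defensive side works the same way with the defensive weights).
OFF_WEIGHTS = {"Pass Att": 0.150, "Pass Yds": 0.067, "Rush Att": 0.000,
               "Rush Yds": 0.154, "Receptions": 0.349, "Rec Yds": 0.041,
               "OL GP": 0.239}

def returning_production(shares, weights):
    """shares: {stat: fraction of last year's production that returns}."""
    return sum(w * shares[stat] for stat, w in weights.items())

def apply_returning_adjustment(adj_o, adj_d, ret_off, ret_def,
                               avg_ret_off, avg_ret_def):
    adj_o += 12.5 * (ret_off - avg_ret_off)   # more returning offense helps AdjO
    adj_d -= 10.0 * (ret_def - avg_ret_def)   # lower AdjD is a better defense
    return adj_o, adj_d
```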

The last piece of the preseason ratings is a coaching adjustment. The NCAA.org statistics page keeps great records of teams' head coaches over the years, so it's easy to track when teams have coaching changes. Any team that has had a head coaching change since the start of the previous season gets its ratings adjusted[13]: a slight rating decline (2.22 points) and some regression to the mean (2%).

AdjO = 0.98*AdjO + 0.02*Avg_AdjO - 1.11
AdjD = 0.98*AdjD + 0.02*Avg_AdjD + 1.11

In-Season Predictions and Ratings

Now that I've calculated preseason ratings, I can start predicting games. New in 2024, I'm making four different predictions for each game. The first prediction is the same one I've always used, based on teams' AdjEff ratings. Then I make one prediction apiece using the two Elo models, and the final, official prediction is an ensemble[14] of those three predictions[15]. The final prediction uses linear weights, with the AdjEff prediction carrying most of the heft and the Elo models serving as a slight corrective when the AdjEff model is overconfident[16].

Spread = 0.928*AdjEff_Spread + 0.045*Elo_Spread + 0.028*Elo+_Spread

Before calculating game spreads, I first calculate the estimated home-field advantage (EstHFA) for each matchup, which is the sum of each team's AdjHFA. This EstHFA is used in each of the three different models' predictions. My version of AdjHFA estimates the difference between how well a team plays at home and how well they play on the road[17].

Calculating the estimated margin of victory (EstMOV) and win probability (EstWP) for each model is probably the simplest part of this whole enterprise. The spread for the Elo models is calculated the same way FiveThirtyEight does it, but with my version of EstHFA instead:

EstMOV = (Elo - OppElo + EstHFA)/25
EstWP = 1/(10^(-25*EstMOV/400) + 1)

In the AdjEff prediction, I'm explicitly calculating the estimated score for each team instead of just the spread, so that's only mildly more complicated:

EstOff = AdjO + Opp_AdjD - Avg_AdjD + EstHFA/2
EstDef = AdjD + Opp_AdjO - Avg_AdjO - EstHFA/2
EstMOV = EstOff - EstDef
EstWP = normdist(EstMOV, 0, 15.75, TRUE)[18]
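Pulling the last few formulas together, here's a minimal sketch of the three game predictions and the ensemble spread. The normdist call becomes a normal CDF, and the team ratings you'd pass in are placeholders.

```python
from math import erf, sqrt

def norm_cdf(x, sigma=15.75):
    """Equivalent of normdist(x, 0, 15.75, TRUE)."""
    return 0.5 * (1 + erf(x / (sigma * sqrt(2))))

def elo_prediction(elo, opp_elo, est_hfa):
    est_mov = (elo - opp_elo + est_hfa) / 25
    est_wp = 1 / (10 ** (-25 * est_mov / 400) + 1)
    return est_mov, est_wp

def adjeff_prediction(adj_o, adj_d, opp_adj_o, opp_adj_d,
                      avg_adj_o, avg_adj_d, est_hfa):
    est_off = adj_o + opp_adj_d - avg_adj_d + est_hfa / 2
    est_def = adj_d + opp_adj_o - avg_adj_o - est_hfa / 2
    est_mov = est_off - est_def
    return est_off, est_def, est_mov, norm_cdf(est_mov)

def ensemble_spread(adjeff_spread, elo_spread, elo_plus_spread):
    """Official spread: AdjEff does the heavy lifting, Elo models nudge it."""
    return 0.928 * adjeff_spread + 0.045 * elo_spread + 0.028 * elo_plus_spread
```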

After making game predictions, the games are played (who cares about that part?), and then the ratings get updated. As you might have guessed, the Elo ratings use the same update formula as FiveThirtyEight, but with different K-Factors[19]. The non-MOV Elo model uses a K-Factor of 146 and the MOV-inclusive Elo+ model uses a K-Factor of 57. The difference between the two K-Factors is due to the latter model's inclusion of a MOV multiplier. AdjHFA is updated with a very similar process:

AdjHFA = Pregame_AdjHFA + K*(ActualMOV - EstMOV)

The K-Factor for AdjHFA is 0.0028, which essentially means that about half of one percent of the prediction error belongs to home-field advantage, split between the two teams' AdjHFA.
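Here's a minimal sketch of those updates. The traditional Elo update is the standard formula with K = 146; for Elo+ I've written in FiveThirtyEight's published margin-of-victory multiplier, which is my assumption about the exact form, since the text above only says the formula is borrowed from them.

```python
from math import log

def expected_wp(elo_diff):
    return 1 / (10 ** (-elo_diff / 400) + 1)

def elo_update(elo, opp_elo, est_hfa, margin, k=146):
    """Traditional Elo: only the win/loss result matters, K-Factor 146."""
    result = 1.0 if margin > 0 else 0.5 if margin == 0 else 0.0
    return elo + k * (result - expected_wp(elo - opp_elo + est_hfa))

def elo_plus_update(elo, opp_elo, est_hfa, margin, k=57):
    """Elo+ with a MOV multiplier (assumed to be FiveThirtyEight's version)."""
    elo_diff = elo - opp_elo + est_hfa
    result = 1.0 if margin > 0 else 0.5 if margin == 0 else 0.0
    winner_diff = elo_diff if margin > 0 else -elo_diff
    mov_mult = log(abs(margin) + 1) * (2.2 / (winner_diff * 0.001 + 2.2))
    return elo + k * mov_mult * (result - expected_wp(elo_diff))

def hfa_update(adj_hfa, actual_mov, est_mov, k=0.0028):
    """AdjHFA update from the formula above."""
    return adj_hfa + k * (actual_mov - est_mov)
```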

Updating the AdjO & AdjD ratings might be the most convoluted part of what I do. After I made the switch from the Pythagorean model to the current version, I initially used the following equations to update AdjO and AdjD:

AdjO = Pregame_AdjO + K*(ActualOff - EstOff)
AdjD = Pregame_AdjD + K*(ActualDef - EstDef)

That process worked well enough for a few years, but then I started to get frustrated with one team in particular running up scores in the regular season and then underperforming my model's expectations in the playoffs[20]. So I decided that my K-Factor, which had been a constant 0.1645, should vary depending on the pregame win probability and a simple post-game second-order win probability (PostWP).

Below is the first version of my variable K-Factor:

PostWP = normdist(ActualMOV, 0, 15.75, TRUE)
Var_K = K*(0.5 + 2*0.5*min(EstWP + PostWP, 1 - EstWP + 1 - PostWP))
AdjO = Pregame_AdjO + Var_K*(ActualOff - EstOff)
AdjD = Pregame_AdjD + Var_K*(ActualDef - EstDef)

What the variable K-Factor does is decrease the magnitude of movement by about half for games that were supposed to be blowouts and ended up being blowouts. If a game is supposed to be moderately close and ends up moderately close, the ratings move about the same as they would have with a constant K-Factor. The big jumps happen in upsets: games where the result is particularly surprising, where an underdog wins by a lot or a big underdog pulls off the upset, end up with the largest K-Factor.
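To put rough numbers on that behavior, here's a tiny sketch of the original 50/50 version with three made-up scenarios.

```python
def var_k(est_wp, post_wp, k=0.1645):
    """Original 50% constant / 50% variable K-Factor."""
    return k * (0.5 + 2 * 0.5 * min(est_wp + post_wp, (1 - est_wp) + (1 - post_wp)))

print(var_k(0.95, 0.95) / 0.1645)  # expected blowout stays a blowout: ~0.60x
print(var_k(0.75, 0.75) / 0.1645)  # solid favorite wins like one: ~1.00x
print(var_k(0.90, 0.10) / 0.1645)  # big upset: ~1.50x, the largest movement
```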

After using this K-Factor for a few years, I decided to do some more thorough testing to see what proportion of constant K to variable K yielded the best predictive accuracy. I started off with 50-50 (the 0.5 terms in the Var_K equation, above), and the best results settled in at 80% constant/20% variable.

Before the 2023 season I made two other changes to my K-Factor. First, I added a weekly modifier, so that games earlier in the year moved ratings more than games later in the year. I made this change because I noticed that my ratings were always slightly less predictive early in the year, which is to be expected, but by allowing ratings to jump around more in the early weeks, I was able to get even better predictive accuracy by mid-season.

| Week of Season | K-Factor Magnitude |
|---|---|
| Week 1 | 1.14 |
| Week 2 | 1.10 |
| Week 3 | 1.06 |
| Week 4 | 1.02 |
| Week 5 | 0.98 |
| Week 6 | 0.94 |
| Weeks 7-11 | 0.90 |
| Playoffs | 1.00 |

 

The second major change before the 2023 season was using different K-Factors for AdjO and AdjD. I was able to improve accuracy by using a slightly higher offensive K-Factor (0.185) and a slightly lower defensive K-Factor (0.160).

K_Off = 0.185*(0.8 + 2*0.2*min(EstWP + PostWP, 1 - EstWP + 1 - PostWP))
K_Def = 0.160*(0.8 + 2*0.2*min(EstWP + PostWP, 1 - EstWP + 1 - PostWP))
AdjO = Pregame_AdjO + K_Off*(ActualOff - EstOff)
AdjD = Pregame_AdjD + K_Def*(ActualDef - EstDef)

Considering all the various K-Factors, I can estimate how much of the model's prediction error is due to HFA, offense, defense, and random variation/noise.

Let's use an example. Say a game was predicted to be 35-25 in favor of the home team (a 74%/26% win probability), and it ended up 35-32 (a 58%/42% post-game win expectancy). The error gets attributed as follows (a short sketch that reproduces these numbers follows the table):

| Factor | Points | Percentage |
|---|---|---|
| Error | 7.00 | 100% |
| Home HFA | 0.02 | 0.3% |
| Away HFA | 0.02 | 0.3% |
| Home Off | 0.00 | 0% |
| Away Def | 0.00 | 0% |
| Home Def | 1.20 | 17% |
| Away Off | 1.39 | 20% |
| Noise | 4.36 | 62% |
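For anyone who wants to check the arithmetic, here's a minimal sketch that reproduces the table from the formulas in this section. I'm assuming a weekly K-Factor magnitude of 1.00 for this example; everything else comes straight from the numbers above.

```python
from math import erf, sqrt

def wp(margin, sigma=15.75):
    """normdist(margin, 0, 15.75, TRUE)"""
    return 0.5 * (1 + erf(margin / (sigma * sqrt(2))))

est_off, est_def = 35, 25    # predicted score (home perspective)
act_off, act_def = 35, 32    # actual score
error = abs((act_off - act_def) - (est_off - est_def))                  # 7.00

est_wp, post_wp = wp(est_off - est_def), wp(act_off - act_def)          # ~0.74, ~0.58
variable = 0.8 + 2 * 0.2 * min(est_wp + post_wp, 2 - est_wp - post_wp)  # ~1.07

home_hfa = away_hfa = 0.0028 * error                   # ~0.02 each
home_off = 0.185 * variable * abs(act_off - est_off)   # 0.00, the offense hit its number
away_def = 0.160 * variable * abs(act_off - est_off)   # 0.00
home_def = 0.160 * variable * abs(act_def - est_def)   # ~1.20
away_off = 0.185 * variable * abs(act_def - est_def)   # ~1.39
noise = error - (home_hfa + away_hfa + home_off + away_def + home_def + away_off)  # ~4.36
```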

 

In this hypothetical, 62% of the prediction error is attributed to noise, 0.6% to home-field advantage, 20% to offense, and 17% to defense. Because most of the error is attributed to noise, if the game were replayed immediately, the subsequent prediction would still be closer to the previous prediction than to the actual outcome[21].

To make the opaque process of updating ratings after each game even more complicated, new in 2024 I'm adding what I'm calling an Island Adjustment[22]. The schools on the DIII islands of the West Coast and Texas rarely get to play games outside their geographic proximity, so I wanted to make sure any persistent non-conference trends were accounted for in team ratings. Add in the fact that the PAC didn't play any non-conference games in 2023, and that each of their teams that played in the postseason vastly outperformed its expectation, and I decided this was the right direction to go.

The adjustment is simple. It takes the sum of predictive error for teams in each conference in a given week, and then adds 0.6% of that error to each team's offensive and defensive ratings. If a conference plays eight non-conference games, and each team performs 7 points better than predicted, the ratings update as usual, and then about a sixth of a point (7*8*0.6%/2 = 0.17) gets added to and subtracted from each team's AdjO and AdjD, respectively[23]. If the conference underperforms, team ratings get worse. This adjustment hardly improves predictive accuracy[24], but it feels right to me, so I'm keeping it.
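Here's a minimal sketch of that bookkeeping; the shape of the inputs is my own invention, but the 0.6% rate and the AdjO/AdjD split match the description above.

```python
from collections import defaultdict

def island_adjustment(nonconf_results, conference_of, rate=0.006):
    """nonconf_results: (team, actual_mov, est_mov) for one week's
    non-conference games. Returns the (AdjO, AdjD) bump per conference."""
    conf_error = defaultdict(float)
    for team, actual_mov, est_mov in nonconf_results:
        conf_error[conference_of[team]] += actual_mov - est_mov
    # Every team in the conference gets rate*error, split between AdjO (added)
    # and AdjD (subtracted, which makes the defense look better).
    return {conf: (rate * err / 2, -rate * err / 2)
            for conf, err in conf_error.items()}
```

Running the eight-game, seven-points-better example through this returns roughly (+0.17, -0.17) per team, matching the arithmetic above.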

One Last Thing

For a long time, I completely excluded games outside of the division. I assumed there were so few of those games that including them would just add noise to my predictions, and I wanted my model to be exclusive to DIII. Only DIII games. Only DIII teams. Calibrated for accuracy in DIII.

After 2020 it seemed to me that some of the top teams in the division were playing more games out of the division than usual[25], so I wanted some way to give those teams credit in my resume rankings and to inform my Top 25 vote. In the seasons from 2021 to 2023, I manually input Massey's score predictions for out-of-division games into my model. I didn't track whether including those games improved my model's accuracy at all, because I was mostly concerned with including them for the resume rankings.

This offseason, while I was going back through previous seasons to calculate Elo ratings and calibrate the Island Adjustment, I noticed how much more frequently teams used to play out of the division than they do now[26]. Because of that, I decided to figure out some way to include results from those historical games in my ratings. Massey is great for historic ratings, but as far as I know, there's no way to go back in time and see mid-season team ratings to calculate estimated scores.

Begrudgingly, I decided to use Massey's end-of-season ratings to craft my "predictions"[27]. The first step in creating a game prediction was crafting a rating for each non-DIII team. To do this, I calculated the average and standard deviation of the offensive & defensive ratings for teams in my model, then took those teams' Massey ratings and re-scaled them so they would have the same z-score in both systems.
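The rescaling itself is just z-score matching; here's a minimal sketch, with the rating lists standing in for my model's and Massey's populations.

```python
from statistics import mean, stdev

def rescale_to_my_scale(massey_rating, massey_population, my_population):
    """Map a non-DIII team's Massey rating onto my model's scale by matching
    z-scores. The same thing is done separately for offense and defense."""
    z = (massey_rating - mean(massey_population)) / stdev(massey_population)
    return mean(my_population) + z * stdev(my_population)
```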

After plugging those teams' ratings into my model and calculating game predictions the same way I do for regular games, my model's accuracy improved so drastically that I thought I had introduced a coding error. Admittedly, I was cheating a bit because I was using end-of-season ratings for the non-DIII teams, but my model's average absolute error improved by about 1%. I know that doesn't seem like a "wow, you definitely made a mistake" level of improvement, but for context, here are all the other tweaks I've made since 2015:

- Returning production adjustments (0.50%)

- Weighted K-Factors by game week (0.21%)

- Variable K-Factor (0.13%)

- Team-specific AdjHFA (0.09%)

- Different linear weights for preseason AdjO & AdjD (0.07%)

- Ensemble ratings with Elo & Elo+ (0.07%)

- Separate K-Factors for AdjO & AdjD (0.04%)

- Coaching adjustment (0.04%)

- SOS-dependent startup program ratings (0.04%)

- Island adjustment (0.01%)

Including non-DIII games provided nearly as much of a lift as all of those combined. The moral of the story? A) Including more data (games) in your model is going to do way more for your accuracy than anything else you can do, and B) calibrating a model for a specific level and sport sets a pretty high baseline for accuracy. That latter point is why my answer to a frequent question, "Why don't you use this for sports betting?", is usually "Because this model would be terrible at predicting anything but DIII football."

Conclusion

If you've somehow made it this far, congratulations! You now have all the information you would need to perfectly recreate my ratings.

And you're a complete degenerate[28].



[1] As you'll see, I eventually changed my ratings to use scoring margins instead of scoring ratios, a change KenPom made as well, the season after I did. I would like to take credit and say I influenced him, but in all likelihood, his change was made independently from mine.

[2] I do not recommend this, if you can avoid it.

[3] Why 1997? Because that's as far back as the tabulated team results go in the NCAA database. Game results before 1997 are only stored as handwritten scans. I'm hopeful that some sort of AI/machine learning algorithm will help me solve the problem of turning all these scans into sortable tables, but I haven't bothered to dig in too much yet, although I am determined to eventually get ratings calculated for the four-peat Augustana teams.

[4] I've reached out to Ken Massey about getting his game files for seasons pre-1997, but he said that he's not confident in his data fidelity for those earlier seasons, so he's unwilling to share them.

[5] With some minor exceptions. I calculate Apprentice's ratings every year, but don't rank them. They've played a majority of their games against DIII opponents every year except one since 1997. I also calculated Trinity Bible's ratings for a few seasons while they were in the UMAC, and I calculated Emory & Henry's ratings their first year as a DII, since they still played a full ODAC schedule. I also think I have some other schools included in my historic ratings prematurely, before they were officially classified as DIII, such as Crown and Northwestern Minn., but they were playing in a DIII conference, so that's fine with me.

[6] For some, I technically could. Starting in the mid-aughts, Massey started calculating ratings for club teams, so if a DIII startup played a club season before they played a varsity season, they would have a Massey Rating, but I don't use those at all.

[7] It was Calvin's announcement that they were adding football that gave me the motivation.

[8] Atomic Ratings were pretty good. Back when I used to track every DIII computer rating's predictions, they were consistently in the same ballpark as my ratings in terms of absolute error in the latter half of the season.

[9] Amazingly, this method worked better than using any sort of percentage of games started or games played returning. I think a lot of coaches were responding to "How many good players do you have returning?" instead of the actual question asked. The average number of starters returning from 2013-2019 was 6.75, and each returning starter above or below that average was worth somewhere between 1.0 and 1.5 points per game.

[10] And then I assumed the percentage of seniors coming back held constant for teams I didn't check manually.

[11] I would have liked to use offensive line games started as well, but there are two problems: seasons before 2012 don't have games started in the NCAA statistics database, and games-started data is extremely scattershot.

[12] With a slight caveat: returning offensive production tends to be bimodal, with one mode centered around 80% and the other much, much lower. You might be able to figure out why. When both your passing and rushing offense run through one player, the quarterback, losing that one player can be catastrophic to your returning production numbers.

[13] By the NCAA's count, there have only been 13 mid-season coaching changes since 1997.

[14] Ensemble is just a fancy word for weighted average.

[15] Back-testing this ensemble over 26 seasons, it was more accurate than the AdjEff spread in 20 seasons, but only more accurate in one season since COVID. The lost season did a number on the Elo ratings' accuracy.

[16] I decided to do an ensemble rating after I created the Elo ratings for a different project and noticed that they were much more accurate at predicting the Stagg Bowl. This is almost definitely because of small sample sizes, but I figured, why not? The Elo models don't usually move the prediction much, usually only a half point, if that, but they improve the overall accuracy of the model.

[17] A lot of very smart people have spent a lot of time calculating how different aspects of sport (rest, travel distance, changing time zones, etc.) affect HFA. All of that would probably make my predictions a little better, but to be honest, it sounds like way more work than it's worth.

[18] I'm sure most of you recognize this as the pro-football-reference.com win probability equation. I honestly don't care much at all for doing anything more detailed for win probability, because most people see WP>50% and internalize it as WP=100%. I only care about the predicted margin.

[19] The K-Factor determines how much the ratings move after each game.

[20] [cough] St. Thomas [cough]

[21] This is hard for people to accept, and I think it's where 99% of the critiques of my model come from, but if you want a model that strictly ranks teams ahead of teams they've beaten, it would be a terrible model for predicting future games.

[22] After posting about it on Twitter, Bill Connelly of ESPN, formerly of SBNation, said he does something similar, but he calls his a Real Estate Adjustment.

[23] Subtracting from AdjD means their defense is rated better.

[24] Absolute error is 0.01% better than if I didn't use it.

[25] Historically, this really wasn't the case at all.

[26] In 1997 & 1998, DIII teams averaged nearly one out-of-division game per season. In 2023, the whole division only played 24 games outside the division.

[27] It's really not fair to call these "predictions" since they're created from ratings that have information from the future, but when I'm tracking my model's accuracy, I'm excluding these games anyways. Games after 2020 use non-DIII teams' pre-game Massey Ratings.

[28] In other words, one of my people.