In my next few posts I’m going to cover the topic of regression to the mean, and how it applies to basketball statistics. This is a complex issue and this first post is pretty heavy on the math, but I plan on following it up with more practical examples showing how you can do the calculations in Excel and looking at what the results tell us about different areas of the game.
Most of the equations in this post are not my original work but instead were taken from various sources. I’ve tried to compile them all into one place and in a fairly logical order that can benefit both a newcomer to the topic as well as those with more advanced knowledge looking for a refresher. The main sources are various posts and comments by Tangotiger, MGL, and others on The Book blog, Andy Dolphin’s appendix to The Book, and the Social Research Methods site. Throughout this post I will link to several specific pages that are of relevance. I would also recommend two excellent introductions to regression to the mean by Ed Küpfer and Sal Baxamusa.
True Score Theory
Regression to the mean is rooted in true score theory (aka classical test theory). The basic idea is that a player’s observed performance over some period of time (as measured by a statistic like field-goal percentage) is a function of the player’s true ability or talent in that area and a random error component. It should not be forgotten that this is a simplified model, and it leaves a lot out (team context, for one).
Observed measure = true ability + random error
A player’s true ability can never be known; it can only be estimated. A player’s observed rate is the typical estimate that is used (i.e. we assume a player with a 40% three-point percentage is a “40% three-point shooter”), but by using regression to the mean we can get a better estimate. This is done by combining what we know about how the individual fares in a particular metric with what we know about how players generally fare in that metric.
The first step is to convert the true score model from the individual level to the group level by looking at the spread (or variance) of the distribution of many players’ stats:
var(obs) = var(true + rand)
...but since the errors are by definition random, they aren't correlated with true ability, so...
var(obs) = var(true) + var(rand)
If you look at the field-goal percentages of a group of players, some of the variation would be from the differing shooting abilities among the players, and some would come from the differing amounts of random luck each player had. As the equation shows, the overall variance (the standard deviation squared) of players’ observed rates is equal to the sum of the variance of their true rates and the variance of the random errors.
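To make the variance decomposition concrete, here is a minimal simulation sketch in Python (it is not taken from any of the sources above). The function name `simulate_variances`, the normal distribution of true talent, and every default number (2,000 players, 200 attempts each, a 45% mean with a 3-point spread) are assumptions chosen purely for illustration.

```python
import random
import statistics


def simulate_variances(n_players=2000, opps=200, mean_rate=0.45,
                       sd_true=0.03, seed=1):
    """Simulate binomial shooters; return (var_obs, var_true, var_rand)."""
    rng = random.Random(seed)
    # Assumed talent distribution: normal, clamped to the valid range [0, 1].
    true_rates = [min(max(rng.gauss(mean_rate, sd_true), 0.0), 1.0)
                  for _ in range(n_players)]
    # Each attempt succeeds with probability equal to the player's true rate.
    obs_rates = [sum(rng.random() < p for _ in range(opps)) / opps
                 for p in true_rates]
    # Random error is whatever separates the observed rate from the true rate.
    errors = [o - t for o, t in zip(obs_rates, true_rates)]
    return (statistics.pvariance(obs_rates),
            statistics.pvariance(true_rates),
            statistics.pvariance(errors))


var_obs, var_true, var_rand = simulate_variances()
print("var(obs):              ", var_obs)
print("var(true) + var(rand): ", var_true + var_rand)
```

With enough simulated players the two printed numbers should land close to each other. The small leftover is the chance correlation between talent and luck that shows up in any finite sample.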
The next important concept is reliability. The more reliable a stat is, the less affected it is by randomness, and the better it captures a player’s true ability. If a metric stays consistent when looking at one sample of a player’s performance and comparing that to another sample from the same player (for instance, one season compared to the next), that metric is reliable (assuming that the player’s true ability was the same for both samples). Below I will use the correlation coefficient symbol “r” to represent reliability because what we’re really looking at are correlations. Reliability is a specific type of correlation though - the correlation between a measure and itself.
There are many different ways of calculating reliability. Here are some, divided into empirical and mathematical:
Empirical:
- Year-to-year correlation
- Split-half correlation
- Cronbach’s alpha
- Intraclass correlation
Mathematical:
- Derivation from var(rand) and var(obs)
- Tango’s regression equation method
- Tango’s z-scores method
- Andy Dolphin’s mathematical derivation
Year-to-year correlations (YTY r’s) are the most basic and widespread method. The idea is simple - for some group of players, find the correlation between how those players performed in that stat in one season with how they performed the next season. Assuming players’ abilities don’t change much from year to year, and that luck isn’t correlated from year to year (that is, that a player having good luck one season doesn’t make him more likely to also be lucky the next year), then if a stat has a low year-to-year correlation this must be because it is picking up more of the fluctuating randomness and less of the consistent true ability, and vice versa. This is the basis of how correlating a metric to itself gives a measure of reliability.
There are some issues with year-to-year correlations. For one thing, the assumption that player abilities are stable from one season to the next is surely false. Player improvement and decline (as well as other context changes between seasons) are added sources of season-to-season fluctuation, and thus YTY r’s may underestimate a stat’s reliability. To deal with this one can instead use split-half correlations (like this). Instead of correlating one season to the next, these split a season in half and correlate one half to the other. This could be all games from the first half of the season compared to all games from the second half, or better yet, performance in odd-numbered games compared to even-numbered games (or even odd-numbered shot attempts vs. even-numbered ones). The same idea can be used on a larger scale, comparing performance in odd-numbered years to even-numbered years. Cronbach’s alpha is basically split-half correlations to the extreme - instead of just comparing one subset of 41 games (like odd-numbered ones) to the other 41, it looks at every possible subset of half the season compared to the counterpart half, and averages all these correlations. It’s similar to intraclass correlation, a technique that Pizza Cutter often uses on his blog.
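The odd/even split described above can be sketched in a few lines of Python. This is only an illustration: the data layout (each player as a list of per-game `(makes, attempts)` tuples in game order) and the function names `pearson_r` and `split_half_r` are hypothetical, and the correlation is written out by hand so the sketch does not depend on any particular library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def split_half_r(players):
    """Correlate each player's rate in odd-numbered games with even-numbered games.

    players: list of game logs, each a list of (makes, attempts) tuples.
    """
    odd_rates, even_rates = [], []
    for games in players:
        odd = games[0::2]   # games 1, 3, 5, ...
        even = games[1::2]  # games 2, 4, 6, ...
        odd_att = sum(att for _, att in odd)
        even_att = sum(att for _, att in even)
        if odd_att == 0 or even_att == 0:
            continue  # no rate can be computed for an empty half
        odd_rates.append(sum(made for made, _ in odd) / odd_att)
        even_rates.append(sum(made for made, _ in even) / even_att)
    return pearson_r(odd_rates, even_rates)
```

Calling `split_half_r` on a list of real game logs returns a single correlation for the whole group of players.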
To this point I’ve been sidestepping a major issue in calculating reliability. This is the fact that reliability is dependent on sample size (here meaning the number of opportunities or attempts each player had for the given stat). The year-to-year correlation for a stat where players have only a few dozen opportunities per season will likely be much lower than the correlation for a stat where players have thousands of opportunities in a season. To see why this is so it is useful to look at the mathematical calculation of reliability.
As discussed above, reliability measures how much a metric captures true ability vs. random error. It is really just a measure of the percentage of the total observed variance that comes from variance in true ability. (Again I’ll use r to represent reliability, though here it should be noted that we’re not really dealing with a correlation.)
r = var(true)/var(obs)
r = var(true)/(var(true) + var(rand))
Here we can clearly see that reliability depends on var(true) and var(rand). var(true) is the spread of true talent in the metric’s skill area in the population. The more variation in skill, the greater the reliability. The more random error, the smaller the reliability. But we can delve deeper than this and calculate the value of var(rand). If the stat we’re looking at is a binomial opportunity rate, meaning that each opportunity/trial/attempt has two possible outcomes - success or failure, then we can use the binomial distribution to rather easily calculate var(rand) for a group of players if we know the average rate of the group and the average number of opportunities for each player (this can also be done through a more complicated process for multinomial stats - see Andy Dolphin’s appendix to The Book or this post by Ed Küpfer). Here is the formula for binomials:
var(rand) = PopMeanRate*(1 - PopMeanRate)/PopMeanOpps
The basic idea (on an individual player level) behind binomial randomness is that on each opportunity we are guaranteed an error of either PlayerTrueRate or (1 - PlayerTrueRate). For each shot attempt, a true 40% shooter will either make it (100% FG%, error of .6) or miss it (0% FG%, error of .4). After two attempts, the player with 40% skill must either be shooting 0%, 50% or 100%. As the number of opportunities increases, the observed rate will approach the true rate as the effect of randomness decreases (for a full derivation of this equation see this paper). So now we can see how sample size affects reliability. Fewer opportunities means a higher var(rand), and a higher var(rand) means a lower reliability.
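Here is a small Python sketch of the binomial formula and of how sample size feeds through to reliability. The 40% rate, the three opportunity levels, and especially `ASSUMED_VAR_TRUE` (a talent spread with a standard deviation of 3 percentage points) are made-up numbers for illustration, not estimates for any real stat.

```python
def binomial_var_rand(pop_mean_rate, pop_mean_opps):
    """Random (binomial) variance of an observed rate: p*(1 - p)/n."""
    return pop_mean_rate * (1 - pop_mean_rate) / pop_mean_opps


def reliability(var_true, var_rand):
    """r = var(true)/(var(true) + var(rand))."""
    return var_true / (var_true + var_rand)


# Hypothetical spread of true talent, held fixed across the comparison.
ASSUMED_VAR_TRUE = 0.03 ** 2

for opps in (50, 200, 1000):
    var_rand = binomial_var_rand(0.40, opps)
    print(opps, var_rand, reliability(ASSUMED_VAR_TRUE, var_rand))
```

Holding the talent spread fixed, each step up in opportunities shrinks var(rand) and pushes r closer to 1.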
(I’m actually unsure about what the exact formula for var(rand) should be. If every player in the population has the same number of opportunities, the one listed above is fine. But once there is some variation in opportunity levels between players, it’s unclear to me whether one should use the arithmetic mean of each player’s opportunities or the harmonic mean (as discussed here), or something else. I’ve tried different things and none seem to work perfectly. This isn’t a big deal when the players have had a similar number of opportunities.)
So when using empirical correlations to measure reliability, one must be cognizant of the number of opportunities each player has had for the given stat (this important point has been emphasized by Tango and MGL on their blog). Ideally, the group of players looked at should all have had a similar number of opportunities, as the reliability of a stat for players with 100 opportunities will be different from the reliability for that same stat for players with 1000 opportunities.
Going back to where we left off above, we know that reliability is the ratio of var(true) to var(obs), but we can’t calculate var(true) directly, so some manipulation is needed.
r = var(true)/var(obs)
r = (var(obs) - var(rand))/var(obs)
This gives us a way to get reliability from var(rand), which we’ve already seen can be calculated from the population mean rate and mean opportunity level, and var(obs), which is simply the variance of the observed rates of the players in the population. Again we have to be careful that the players we look at have a similar number of opportunities, but the advantage of this method is that we don’t have to worry about dealing with multiple seasons (as in year-to-year correlations) or splitting data up by game (as in split-half correlations).
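As a rough Python sketch of this calculation: the function name `reliability_from_observed` is hypothetical, and two choices in it are my assumptions rather than anything the sources specify. It uses the population variance (dividing by N) for var(obs), and it uses the simple average of the players’ rates as PopMeanRate, which is only reasonable when everyone has a similar number of opportunities.

```python
import statistics


def reliability_from_observed(obs_rates, mean_opps):
    """r = (var(obs) - var(rand))/var(obs) for players with similar opportunities."""
    pop_mean_rate = statistics.fmean(obs_rates)            # assumed: unweighted mean
    var_obs = statistics.pvariance(obs_rates)              # assumed: divide by N
    var_rand = pop_mean_rate * (1 - pop_mean_rate) / mean_opps
    return (var_obs - var_rand) / var_obs


# Three made-up players, each with 100 opportunities.
print(reliability_from_observed([0.30, 0.40, 0.50], 100))
```

If the observed spread is no bigger than what randomness alone would produce, the result comes out at or below zero, which is a sign that the data show no detectable spread in true ability at that opportunity level.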
Regression to the Mean
Once we know the reliability of a statistic, we can use it to estimate the true ability of individual players in that area through regression to the mean. The more reliable a metric is, the more confidence we have that a player’s true talent level is near the observed rate that he produced. But if reliability is low, there is a greater chance that the observed rate is a poor representation of true ability. So we can put something like a confidence interval around observed rates, with rates accumulated from more opportunities (lower var(rand), higher reliability) having narrower intervals and rates accumulated from fewer opportunities (higher var(rand), lower reliability) having wider intervals. But regression to the mean takes this a step further by taking into account the other half of reliability, var(true), the spread of true talent in the population.
This is where things get Bayesian. We don’t just know that player X shot 40% on 100 three-point attempts. We know more about player X - he’s an NBA player, he’s a shooting guard, he’s a starter, etc. And we know more about three-point shooting - it doesn’t vary that much among starting SG’s in the NBA, starting SG’s on average shoot Y% from three, etc.
Once we pick the population to regress toward (more on that later), we have two estimates of the player’s true ability - the player’s observed rate, and the mean rate of the population. How do we weight each of these estimates to arrive at one best guess? By using the metric’s reliability. As discussed above, if the player’s observed rate was produced from a small number of opportunities (high var(rand), low reliability), we put less weight on it. But now we have an additional piece of information to use as well - the spread of talent in the population. If players vary little in the stat, large deviations from the mean are more likely to be random flukes, and thus the player’s observed rate should be given less weight. On the other hand, if there is large skill variation between players, then an extreme observed rate deserves less skepticism, and should be given more weight. As we’ve seen, the calculation of reliability takes into account both of these factors - it increases as var(true) increases and it decreases as var(rand) increases. So we weight the player’s observed rate by the stat’s reliability, and the population mean by one minus the reliability. Plugging this into the formula for a weighted average (WeightedAvg = (Measure1*Weight1 + Measure2*Weight2)/(Weight1 + Weight2)), we get the following:
Regressed rate = (PlayerObsRate*r + PopMeanRate*(1 - r))/(r + 1 - r)
Regressed rate = PopMeanRate + r*(PlayerObsRate - PopMeanRate)
In other words, we regress (1 - r) percent of the way to the mean (treating r as a percentage rather than a decimal). If the reliability is .8, then we regress 20% of the way from the player’s observed rate to the population mean to arrive at our estimate of true ability. A 50% shooter from a population with a mean of 40% would be regressed to 48% (40 + .8*(50 - 40)).
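The same calculation as a short Python sketch, using the numbers from the example above. The function names are hypothetical; the two functions are just the two forms of the regressed rate equation, and they should always agree.

```python
def regress_weighted_average(player_obs_rate, pop_mean_rate, r):
    """Weighted average form: weight r on the player, (1 - r) on the population."""
    return (player_obs_rate * r + pop_mean_rate * (1 - r)) / (r + (1 - r))


def regress_to_mean(player_obs_rate, pop_mean_rate, r):
    """Simplified form: start at the mean and move r of the way toward the player."""
    return pop_mean_rate + r * (player_obs_rate - pop_mean_rate)


# A 50% shooter, a 40% population mean, and a reliability of .8.
print(regress_to_mean(0.50, 0.40, 0.8))
print(regress_weighted_average(0.50, 0.40, 0.8))
```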
Tango’s regression equation
I’m not sure whether Tango came up with this or whether it was Andy Dolphin or MGL. I know Andy wrote the math-heavy appendix to The Book which outlines the full method of regressing to the mean (discussed below), but I don’t remember this shortcut equation from The Book and I know I’ve seen it a lot in Tango’s blog posts (like this one).
As mentioned above, it’s not the case that there is one reliability figure for each metric. Reliability is metric specific and opportunity specific - the r for 3PT% for players with 50 three-point attempts is different from the r for 3PT% for players with 200 three-point attempts (even if we’re regressing to the same population). But when calculating reliability by the methods previously described (whether empirical or mathematical), one must use a specific opportunity level and thus arrive at a result that is specific to that level. This formula allows one to generalize from such a result and use that one r to generate other r’s for different opportunity levels for that statistic.
The first step is to calculate reliability for a specific opportunity level (it doesn’t matter the method). Call this opportunity level KnownOpps and the reliability KnownR. These are used to calculate a constant value specific to that metric that can then be put into a general formula allowing one to calculate r for any opportunity level.
constant = (1 - KnownR)*KnownOpps/KnownR
constant = KnownOpps*var(rand)/(var(obs) - var(rand))
General r = opps/(constant + opps)
General (1 - r) = constant/(constant + opps)
New r’s calculated from this equation can then be plugged into the regressed rate equation from above to regress the stat for players with different opportunity levels. The result is identical to taking a weighted average of a player’s observed rate and the population mean using opps for the player weight and constant for the population weight.
Regressed rate = (PlayerObsRate*opps + PopMean*constant)/(opps + constant)
Another way to think about this is in terms of adding a number of opportunities equal to the constant, all at the population mean rate, to the player’s observed record. If the player’s observed rate was .5 in 100 opportunities (50/100), the population mean is .4, and the constant for that metric was determined to be 60, then add 60 attempts at 40% (24/60) to the player’s 50/100 to arrive at the regressed rate of (50 + 24)/(100 + 60) = .4625.
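Here is a minimal Python sketch of the constant, the general r, and the "added opportunities" form of the regressed rate. The function names are hypothetical. The KnownR of .625 at 100 opportunities is a made-up input, picked only because it produces the constant of 60 used in the example above.

```python
def regression_constant(known_r, known_opps):
    """constant = (1 - KnownR)*KnownOpps/KnownR"""
    return (1 - known_r) * known_opps / known_r


def general_r(opps, constant):
    """Reliability at any opportunity level: opps/(constant + opps)."""
    return opps / (constant + opps)


def regressed_rate(successes, opps, pop_mean_rate, constant):
    """Add `constant` opportunities at the population mean to the player's record."""
    return (successes + pop_mean_rate * constant) / (opps + constant)


constant = regression_constant(0.625, 100)      # made-up KnownR and KnownOpps
print(constant)
print(general_r(50, constant), general_r(400, constant))
print(regressed_rate(50, 100, 0.40, constant))  # the 50/100 shooter from the text
```

The last line reproduces the .4625 from the worked example, and the two `general_r` values show how the same constant yields a lower r for 50 opportunities than for 400.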
Tango’s Z-scores method
Again I’m not sure who came up with this method, but Tango has used it a number of times (like here). This is a method for calculating r when your population contains players with varying numbers of opportunities for the stat in question. Basically what it does is simulate the distribution of observed rates that would be expected if all players in the population had the same true rate, and then it looks to see how much more spread there is in the actual observed rates than there is in the simulated distribution.
Calculating by this method is a little more involved. For each player in the population, one must first compute two numbers, the PlayerVarRand and PlayerZscore. Then r can be computed using the variance of the PlayerZscores of all the players.
PlayerVarRand = PlayerObsRate*(1 - PlayerObsRate)/PlayerOpps
PlayerZscore = (PlayerObsRate - PopMeanRate)/sqrt(PlayerVarRand)
r = 1 - (1/var(PlayerZscores))
r = (var(PlayerZscores) - 1)/var(PlayerZscores)
var(PlayerZscores) is actually an estimate of var(obs)/var(rand). So we can also use this method to get var(rand) and var(true) for the population:
var(PlayerZscores) = var(obs)/var(rand)
var(rand) = var(obs)/var(PlayerZscores)

var(PlayerZscores) = var(obs)/var(rand)
var(PlayerZscores) = var(obs)/(var(obs) - var(true))
var(true) = var(obs) - var(obs)/var(PlayerZscores)
The advantage this method has over calculating r from var(obs) and var(rand) is that it takes into account the varying number of opportunities that each player in the population had. However, I’m not completely sure how it works. Using this method one can calculate an r, but what opportunity level is this for? It can’t be the case that this r applies for all opportunity levels and that all players should be regressed the same amount. What should be plugged into KnownOpps in Tango’s regression equation in order to derive the constant? Again I don’t think that simply the arithmetic mean of all the players’ opportunities is the answer. This is also an issue when calculating r empirically from a correlation.
However, one still can adjust the r calculated from this method for different opportunity levels, and thus regress players different amounts based on their varying opportunities. This can be done by calculating player-specific r’s directly from var(true) (calculated as described above) and PlayerVarRand:
r = var(true)/(var(true) + PlayerObsRate*(1 - PlayerObsRate)/PlayerOpps)
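The z-scores calculation and the player-specific r can be sketched in Python as follows. Several details here are my assumptions, since the formulas above leave them open: PopMeanRate is taken as the pooled rate (total successes over total opportunities), var(obs) and var(PlayerZscores) are population variances (dividing by N), and players at exactly 0% or 100% are skipped because their PlayerVarRand is zero and the z-score is undefined. The function names and the `(successes, opps)` input layout are hypothetical.

```python
import statistics
from math import sqrt


def z_score_method(players):
    """players: list of (successes, opps). Returns (r, var_true)."""
    total_successes = sum(s for s, _ in players)
    total_opps = sum(n for _, n in players)
    pop_mean_rate = total_successes / total_opps    # assumed: pooled mean
    rates, z_scores = [], []
    for successes, opps in players:
        rate = successes / opps
        player_var_rand = rate * (1 - rate) / opps
        if player_var_rand == 0:
            continue  # z-score undefined at exactly 0% or 100%
        rates.append(rate)
        z_scores.append((rate - pop_mean_rate) / sqrt(player_var_rand))
    var_z = statistics.pvariance(z_scores)
    var_obs = statistics.pvariance(rates)
    r = 1 - 1 / var_z
    var_true = var_obs - var_obs / var_z
    return r, var_true


def player_specific_r(var_true, successes, opps):
    """r = var(true)/(var(true) + PlayerVarRand) for one player."""
    rate = successes / opps
    return var_true / (var_true + rate * (1 - rate) / opps)
```

Feeding the returned `var_true` into `player_specific_r` gives a separate r for each player, so a player with 1,000 opportunities is regressed less than one with 100.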
Andy Dolphin’s method from The Book
In the appendix of The Book, Andy Dolphin described a more detailed mathematical method for regressing to the mean. The basic idea of taking a weighted average of the player’s observed rate and the population mean rate remains, but here each measure is weighted by the reciprocal of its uncertainty squared (uncertainty is basically equivalent to the standard deviation). So if the player’s rate has a larger uncertainty than the population mean, it will be weighted less (1/uncertainty^2 will be smaller), and thus will be regressed further toward the population mean.
(Uncertainty(PlayerObsRate))^2 = PlayerVarRand = PlayerObsRate*(1 - PlayerObsRate)/PlayerOpps
(Uncertainty(PopMeanRate))^2 = PopVarTrue
PlayerWeight = 1/(Uncertainty(PlayerObsRate))^2 = 1/PlayerVarRand
PopWeight = 1/(Uncertainty(PopMeanRate))^2 = 1/PopVarTrue
Regressed rate = (PlayerObsRate/PlayerVarRand + PopMeanRate/PopVarTrue)/(1/PlayerVarRand + 1/PopVarTrue)
PopVarTrue (which is the same thing I’ve been referring to as var(true)) is calculated by a method somewhat similar to the z-scores method above. First calculate PlayerVarTrue for each player using PlayerVarObs - PlayerVarRand (with an additional factor of (1 - PlayerOpps/PopOpps) applied to PlayerVarRand), then take a weighted average of each of these where the weights are 1/uncertainty^2 of each player’s PlayerVarTrue (for the weighted average I’ve used the Excel formula of SUMPRODUCT(Measures,Weights)/SUM(Weights) rather than the mathematical summation notation):
PlayerVarTrue = PlayerVarObs - PlayerVarRand
PlayerVarObs = (PlayerObsRate - PopMeanRate)^2
PlayerVarRand = PlayerObsRate*(1 - PlayerObsRate)/PlayerOpps
PlayerVarTrue = (PlayerObsRate - PopMeanRate)^2 - (1 - PlayerOpps/PopOpps)*(PlayerObsRate*(1 - PlayerObsRate)/PlayerOpps)
PlayerVarTrueWeight = 1/(2*(PlayerObsRate*(1 - PlayerObsRate)/PlayerOpps + PopVarTrue)^2)
PopVarTrue = SUMPRODUCT(PlayerVarTrues,PlayerVarTrueWeights)/SUM(PlayerVarTrueWeights)
You may have noticed that PopVarTrue, the figure we are trying to calculate, appears within the formula we are using to calculate it. We can get around this circularity by iterating - first put in any arbitrary positive number (smaller than one) for PopVarTrue in each player’s PlayerVarTrueWeight (using the same number for each player). Then solve for PopVarTrue through the weighted average formula. Next, take that result, and plug it back into all the PlayerVarTrueWeights, and solve for PopVarTrue again. Take that result, and plug it back in, etc. Eventually things will stabilize (meaning that the number you get out of the weighted average formula will be the same as the number you had just plugged in to each PlayerVarTrueWeight).
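The iteration can be sketched in Python like this. It follows the formulas as I’ve written them above rather than the appendix itself, and it leans on two assumptions of mine: PopOpps is taken to be the total opportunities of all players in the population, and PopMeanRate is the pooled rate. The function name, the `(successes, opps)` input layout, the starting value, the tolerance, and the iteration cap are all hypothetical choices.

```python
def iterate_pop_var_true(players, start=0.01, tol=1e-12, max_iter=1000):
    """players: list of (successes, opps). Returns the stabilized PopVarTrue."""
    pop_opps = sum(n for _, n in players)                # assumed meaning of PopOpps
    pop_mean_rate = sum(s for s, _ in players) / pop_opps
    per_player = []
    for successes, opps in players:
        rate = successes / opps
        var_rand = rate * (1 - rate) / opps
        var_true = (rate - pop_mean_rate) ** 2 - (1 - opps / pop_opps) * var_rand
        per_player.append((var_true, var_rand))

    pop_var_true = start
    for _ in range(max_iter):
        # Weights depend on the current guess, which is why we must iterate.
        weights = [1 / (2 * (var_rand + pop_var_true) ** 2)
                   for _, var_rand in per_player]
        new_value = (sum(vt * w for (vt, _), w in zip(per_player, weights))
                     / sum(weights))
        if abs(new_value - pop_var_true) < tol:
            return new_value
        pop_var_true = new_value
    return pop_var_true  # not fully stabilized within max_iter
```

On well-behaved data the loop settles within a handful of passes, and different starting values should lead to the same answer. I have not tried to handle the case where the estimate turns negative, which can happen when the observed spread is smaller than randomness alone would produce.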
One can also calculate r from this method:
r = PlayerWeight/(PlayerWeight + PopWeight)
r = (1/PlayerVarRand)/(1/PlayerVarRand + 1/PopVarTrue)
r = 1/(PlayerVarRand*(1/PlayerVarRand + 1/PopVarTrue))
r = 1/(1 + PlayerVarRand/PopVarTrue)
r = PopVarTrue/(PopVarTrue + PlayerVarRand)
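As a quick numerical check on this algebra, the Python sketch below regresses one player both ways: by the inverse-variance weighted average, and by the r derived from it. The function names and the sample inputs (a 50% rate on 100 opportunities, a 40% population mean, a PopVarTrue of .0009) are made up for illustration.

```python
def player_var_rand(obs_rate, opps):
    """PlayerVarRand = PlayerObsRate*(1 - PlayerObsRate)/PlayerOpps"""
    return obs_rate * (1 - obs_rate) / opps


def regress_inverse_variance(obs_rate, opps, pop_mean_rate, pop_var_true):
    """Weight each estimate by 1/uncertainty^2 (fails if obs_rate is exactly 0 or 1)."""
    player_weight = 1 / player_var_rand(obs_rate, opps)
    pop_weight = 1 / pop_var_true
    return ((obs_rate * player_weight + pop_mean_rate * pop_weight)
            / (player_weight + pop_weight))


def regress_with_r(obs_rate, opps, pop_mean_rate, pop_var_true):
    """Same regression using r = PopVarTrue/(PopVarTrue + PlayerVarRand)."""
    r = pop_var_true / (pop_var_true + player_var_rand(obs_rate, opps))
    return pop_mean_rate + r * (obs_rate - pop_mean_rate)


print(regress_inverse_variance(0.50, 100, 0.40, 0.0009))
print(regress_with_r(0.50, 100, 0.40, 0.0009))
```

Both lines should print the same regressed rate, which is the point of the derivation: the weighted average and the r-based formula are the same calculation.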
Whew, that was a lot of math. Once you’ve got some of that down, you can get on to the fun stuff, which is applying these formulas to real players’ stats. But that will have to wait until my next post. I hope to look at some practical results both to see what they tell us about the reliability of different basketball stats, and to explore the differences between the various methods I’ve discussed in this post. I also plan on addressing the issues of choosing a population to regress to, dealing with varying opportunity levels, determining whether a “skill” exists, and using regression to the mean as part of a projection system (like Tango does with his Marcels). Until then, if anyone can answer some of the questions I’ve posed about these different methods for regressing to the mean (or correct any of the mistakes I’ve surely made), I’d love to hear from you in the comments.