The Reliability and Context-Dependency of Basic Stats: Methodology
At the end of my recent post on evaluating player ratings I said that the next step would be to take a step back from comprehensive ratings and look at how the component stats they are built from change in different contexts. That is what I will begin to look at in this post.
The methodology I’m going to use is pretty complicated, so instead of just presenting the results I’m going to use this post to explain in a step-by-step manner the techniques I plan on using. I’m also going to try to point out what I see as potential problems, but in many ways I’m learning as I’m going so I may miss some things. I’d welcome any critiques or suggestions from anyone who knows what they’re doing (or anyone who pretends to know what they’re doing, like me).
Calculating r
In addition to the previously discussed methods of looking at same-team and different-team year-to-year correlations or r’s, I will also be using a technique for calculating r mathematically. This is a technique covered in The Book that you can read more about in this thread (or this shorter PDF).
In looking at YTY r’s we’re really trying to eliminate binomial-based luck based on the assumption that even if a player gets lucky one season that will even out in the next season (i.e. a player’s measured rate for a year is a combination of skill and luck, and there is no year-to-year correlation in luck, so all the year-to-year correlation that is found comes from player skill). But we know how the binomial distribution works, so we can use a formula to calculate the contribution of binomial luck.
[An aside on binomials: Stats that have two possible outcomes (successes and failures) are binomials. Hence opportunity rates (which measure successes per opportunity) are binomials. So this method works for a lot of basic stats - 2P%, 3P%, FT%, DRB%, ORB%, Hollinger’s Assist Ratio, Pomeroy’s Assist Rate, etc. Stats with more than two possible outcomes are multinomials. One can also calculate r mathematically for multinomials, but the process is more involved. At least initially I will just be looking at binomials, and not attempting to calculate r for complex stats like PER or Wins Produced.]
Here’s how that calculation works. For starters, we’re not looking at season-pairs. This method allows one to calculate r without having to compare one year to the next. So the sample is all player-seasons (though I will again be throwing out seasons of players who played for multiple teams that year).
Definitions/abbreviations:
- var: variance, squared standard deviation
- obs: observed, the actual stats produced by the players
- true: the true talent/skill of the players in given statistical category (e.g. defensive rebounding skill)
- rand: random luck/chance/error, binomial-based randomness
- p: average rate for given stat in the whole population
- n: average number of opportunities/trials for the given slice of the population
- r: calculated correlation coefficient
Formulas:
- var(obs) = var(true) + var(rand)
- var(rand) = (p*(1-p))/n
- r = var(true)/var(obs)
- r = 1 - var(rand)/var(obs)
Calculating r step-by-step:
1. Select a slice of opportunities to look at (e.g. for 2P%, from 500-600 2PAs on the season)
2. Calculate var(obs) by finding the variance in the stat for all players in the opportunity slice
3. Calculate var(rand) using the average rate in the population (not just the slice) and the average number of opportunities in the slice (e.g. 550 2PA) in the (p*(1-p))/n formula
4. Calculate var(true) by subtracting the value from step 3 from the value from step 2
5. Calculate r by using the values from steps 2 and 4 in the var(true)/var(obs) formula
6. [Or you can combine steps 4 and 5 and calculate r by using the values from steps 2 and 3 in the 1 - var(rand)/var(obs) formula.]
Given these results, we can identify what portion of the variance in a stat (for a specific slice of opportunities) comes from binomial-based random luck.
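The calculation steps above can be sketched in a few lines of Python. This is only a rough illustration: the function names and argument names are made up for this example, and the choice of population variance (rather than sample variance) for var(obs) is an assumption, since either could be argued for.

```python
from statistics import mean, pvariance

def r_from_variances(var_obs, p, n):
    """r = 1 - var(rand)/var(obs), with var(rand) = p*(1-p)/n."""
    var_rand = p * (1 - p) / n
    return 1 - var_rand / var_obs

def calculated_r(slice_rates, slice_opps, population_rate):
    """Calculate r for one opportunity slice.

    slice_rates: observed rate of each player-season in the slice
    slice_opps: opportunities of each player-season in the slice
    population_rate: average rate in the whole population (not just the slice)
    """
    var_obs = pvariance(slice_rates)  # assumption: population variance
    n = mean(slice_opps)              # average opportunities in the slice
    return r_from_variances(var_obs, population_rate, n)

# Using var(obs) = 0.001927, p = 0.484 and n = 550 (the 2P% figures in this post)
print(round(r_from_variances(0.001927, 0.484, 550), 2))  # 0.76
```

With real data, `slice_rates` and `slice_opps` would hold one entry per player-season in the slice, after throwing out multi-team seasons.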
For example, let’s look at the two-point field-goal percentage of players from the 79-80 to 06-07 seasons who had between 500 and 600 two-point attempts in a season. The average two-point field-goal percentage for all players from those seasons is 48.4%.
- 500-600 opportunities, n = 550 opportunities
- var(obs) = 0.001927
- var(rand) = (0.484*(1-0.484))/550 = 0.000454
- var(true) = 0.001927 - 0.000454 = 0.001473
- r = 0.001473/0.001927 = 0.76
So for the 500-600 2PA slice, 76% of the variance in observed 2P% comes from variance in the true skill of the players (kind of - we’ll get to that later) and 24% comes from variance in binomial-based random luck. The 24% figure is what I was really after. It’s var(rand)/var(obs), or 1-r.
The difference between the calculated r and the measured YTY r
The calculated r can be seen as the upper-bound for any YTY r - the calculated r tells you the effect of binomial randomness, and because of that randomness you’ll never get a higher year-to-year correlation than the calculated r even if player skill is completely stable from year to year. But what does it mean if you get a YTY r that is lower than the calculated r? This is something that I haven’t seen addressed directly before (maybe because it’s not as big an issue in baseball).
So I’ll take a stab at interpreting what could be happening to result in YTY r’s that are lower than calculated r’s. The assumption behind looking at year-to-year correlations is that 1-r will isolate binomial randomness because it’s the only part of the observed stat that will vary from one season to the next. But as we saw when looking at YTY correlations of player ratings for players who changed teams between seasons, unless your starting point is a perfectly context-neutral stat, there’s going to be some additional variance from year to year beyond the variance in binomial randomness. This additional variance could come from a stat being dependent on changing contexts in a variety of ways - players changing teams, teams changing coaches, coaches changing systems, players changing roles, players getting hurt, etc. What this means is that we’re going to pick up more variance from year-to-year correlations (binomial randomness + context-dependence) than is reflected in calculated r’s (only binomial randomness). More variance means YTY r’s will be lower than calculated r’s, and the YTY r estimation of binomial randomness (1-r) will be too high.
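To make this argument concrete, here is a toy simulation, not an analysis of real data. Every parameter is an assumption chosen for illustration: 1000 simulated players, 550 opportunities per season, a skill spread of 0.038 (roughly the square root of the var(true) from the 2P% example), and an invented season-level "context" shift with a standard deviation of 0.02. Each player's skill is held perfectly stable across the two seasons, so any drop in YTY r comes only from the added context variance.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_yty_r(n_players, n_opps, skill_mean, skill_sd, context_sd, seed):
    """Simulate two seasons per player and return the year-to-year r."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_players):
        skill = rng.gauss(skill_mean, skill_sd)  # stable true skill
        season_rates = []
        for _ in range(2):
            # Hypothetical context effect: shifts the effective rate that season
            p = skill + rng.gauss(0, context_sd)
            p = min(max(p, 0.0), 1.0)
            made = sum(rng.random() < p for _ in range(n_opps))
            season_rates.append(made / n_opps)
        year1.append(season_rates[0])
        year2.append(season_rates[1])
    return pearson_r(year1, year2)

r_stable = simulate_yty_r(1000, 550, 0.484, 0.038, 0.0, seed=1)
r_context = simulate_yty_r(1000, 550, 0.484, 0.038, 0.02, seed=1)
print(round(r_stable, 2), round(r_context, 2))
```

In this sketch the run with no context shifts should land near the calculated r, and the run with context shifts should come out noticeably lower, which is the pattern described above.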
There are techniques of measuring r that will control for some of this context and yield values that are closer to the calculated r. I’ve already used one such method when I looked only at the YTY r’s for players who played on the same team both seasons rather than looking at the YTY r’s of all players (which would include those who changed teams). An even better method is to use split-season correlations. If you have game-by-game data, for each player you can divide their performance in a stat into two groups - how they did in odd-numbered games and how they did in even-numbered games. Then you can look at the correlation between all players’ odd-numbered game rates and their even-numbered game rates. By using data from within a single season, this method eliminates many of the context changes present in year-to-year correlations. If you have play-by-play data, you can go even further in eliminating context by correlating how players did in their odd-numbered opportunities (i.e. their 1st and 3rd and 5th, etc. two-point attempts of the season) to how they did in their even-numbered opportunities. Unfortunately I don’t have game logs or play-by-plays, so I will just be using year-to-year correlations in addition to calculating r’s.
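For anyone who does have play-by-play data, the odd/even split could look something like the sketch below. The input format (a list of 0/1 outcomes per player, in chronological order) and all function names are assumptions made for this example.

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_rates(outcomes):
    """Return (odd-numbered rate, even-numbered rate) for one player.

    outcomes: 0/1 results in chronological order (1 = success).
    """
    odd = outcomes[0::2]   # 1st, 3rd, 5th, ... opportunities
    even = outcomes[1::2]  # 2nd, 4th, 6th, ... opportunities
    return sum(odd) / len(odd), sum(even) / len(even)

def split_half_r(players_outcomes):
    """Correlate odd-numbered and even-numbered rates across players."""
    pairs = [split_half_rates(o) for o in players_outcomes if len(o) >= 2]
    odd_rates, even_rates = zip(*pairs)
    return pearson_r(odd_rates, even_rates)
```

Note that each half contains only half of a player's opportunities, so the resulting r describes roughly half as many opportunities as the full season. The same functions work for game-by-game splits if each entry is replaced by per-game totals and the rates are computed from those.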
Putting it all together
Now we can get to the fun stuff. If we combine what we can learn from calculated r’s with what we can learn from same-team vs. different-team YTY r’s, we can construct a breakdown of the component parts of any stat. The calculated r tells us the impact of binomial randomness, the difference between the calculated r and the same-team YTY r tells us the impact of within-team context (i.e. all the changes in context from one season to the next for a player who stays on the same team), and the difference between the same-team YTY r and the different-team YTY r tells us the impact of between-team context (i.e. the changes in context from switching teams). For any stat we should be able to come up with a breakdown that says, at around n opportunities, this stat is W% player skill, X% within-team context, Y% between-team context, and Z% luck. Or at least, that’s the plan. But there are some issues that have to be dealt with.
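As a sketch of how that breakdown could be computed: the function below is my reading of the plan, and it assumes that player skill is whatever is left over, i.e. the different-team YTY r. The same-team and different-team values in the usage example are hypothetical placeholders, not results.

```python
def decompose(calculated_r, same_team_r, diff_team_r):
    """Split a stat (at a given number of opportunities) into four shares."""
    return {
        "luck": 1 - calculated_r,                        # binomial randomness
        "within_team_context": calculated_r - same_team_r,
        "between_team_context": same_team_r - diff_team_r,
        "player_skill": diff_team_r,                     # assumption: the remainder
    }

# 0.76 is the calculated r from the 2P% example; 0.60 and 0.50 are made up
parts = decompose(calculated_r=0.76, same_team_r=0.60, diff_team_r=0.50)
for name, share in parts.items():
    print(f"{name}: {share:.0%}")
```

By construction the four shares sum to 100%, though with noisy data one of the differences could come out negative, which would be a sign that something in the inputs needs a closer look.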
Selective sampling issue
One problem with comparing same-team YTY r’s to different-team YTY r’s is that the two groups we’re comparing are not random mixes of players and may have different characteristics. The different-team group is made up of players who had their contracts expire and then signed elsewhere, players who were traded in the offseason, and players who were waived and then signed elsewhere. What if this group is of a different quality or consistency than the group of players who played for the same team in consecutive seasons? This could potentially distort our results. One way to check this is to compare the stats of each group as a whole. For any stat, we can look at whether one group had a higher average than the other, and we can also compare the observed variances of that stat in each group.
Number of opportunities issue
As I briefly mentioned in discussing the evaluation of player ratings, the authors of The Book have repeatedly emphasized an important issue when looking at year-to-year correlations. They usually refer to it as a problem of sample size. Binomial luck will have a greater impact the smaller the sample (e.g. the fewer the minutes played, the fewer the three-pointers attempted, etc.), so YTY r’s will increase as sample size increases. This makes it dangerous to simply compare the YTY r’s of two stats - if the stats have different sample sizes, that alone could be what’s causing the difference in YTY r’s. Thus it is important to indicate the sample size you are looking at along with the YTY r, and to compare like sample sizes to like sample sizes in comparing the YTY r’s of different stats (though it’s also important to note the differences in the sample sizes accumulated over a season for different stats).
I understand calling this a sample size issue, because that’s often how the term is used. When people say that we can’t draw firm conclusions from the first five games of the season because of the small sample size, they are getting at this very issue. But I don’t like calling it a “sample size” issue for a few reasons. First, it’s ambiguous - in the context of YTY r’s, the sample size could just as easily be referring to the number of players (or player-seasons) that one is looking at. And second, it’s not specific enough - the sample size being too small could mean too few games played, too few minutes played, or too few opportunities in a given stat. The last case is what we’re really talking about, so I’m going to refer to this as a “number of opportunities” issue. That also ties it in to my previous discussion of opportunity rates.
Below is a chart showing how binomial randomness decreases as opportunities increase. The different colored lines represent different league-average rates for whatever stat one wants to look at. Binomial randomness is also tied to the league-average rate, though not nearly to the degree that it’s tied to opportunities. Stats with league-averages further from 0.5 (both above and below) will be subject to less randomness. You can see that regardless of the stat’s league-average, if the number of opportunities looked at is fewer than a few hundred, there will be a lot of binomial randomness. This becomes an issue for a stat like three-point percentage, in which many players have fewer than 200 opportunities (3PA) per season. But it’s not as big a deal for stats like defensive rebounding percentage, where many players have more than 2000 opportunities (team defensive rebounds + opponent offensive rebounds while the player is on the court) per season. This difference between stats in how many opportunities they accumulate per season is vitally important in using past data to predict future performance and in knowing how much regression to the mean to expect (a topic I’ll be focusing on more in a future post).
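The relationship can also be tabulated directly from the (p*(1-p))/n formula. The sketch below prints the standard deviation of binomial luck, i.e. the square root of var(rand), for a few opportunity counts and league-average rates. The specific rates and opportunity counts are arbitrary picks for illustration, and expressing randomness as a standard deviation is my choice here, not necessarily the measure a chart of this kind would plot.

```python
from math import sqrt

def binomial_luck_sd(p, n):
    """Standard deviation of binomial luck: sqrt(p*(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

rates = [0.30, 0.50, 0.75]         # hypothetical league-average rates
opportunities = [50, 200, 500, 2000]

print("    n  " + "  ".join(f"p={p:.2f}" for p in rates))
for n in opportunities:
    row = "  ".join(f"{binomial_luck_sd(p, n):6.3f}" for p in rates)
    print(f"{n:>5}  {row}")
```

Reading down a column shows the randomness shrinking as opportunities grow; reading across a row shows the much smaller effect of the league-average rate, with the largest values at 0.5.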

The practical problem this causes is that in slicing up the data to control for opportunities we are reducing the number of player-seasons we have to work with (the other kind of “sample size”). This becomes an issue especially when looking at low opportunity stats (e.g. three-point field-goal percentage) for the group of players who changed teams between seasons (which is much smaller than the group of players who stayed on the same team). For a given slice of opportunities we may have only ten or fewer players on which to base the YTY r. This is because we’re not just looking at all players who had between X and Y opportunities in a season - we’re looking only at those players who had that many opportunities in one season and then again had between X and Y opportunities the next season. So the thinner the slices, the more players whose data has to be completely thrown out because their number of opportunities shifted too much between seasons.
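Here is a rough sketch of that filtering step, to show why the pairs disappear so quickly. The data layout (a dict keyed by player and season start year) and the inclusive slice bounds are assumptions for the example, and it leaves out the same-team vs. different-team split and the removal of multi-team seasons.

```python
def paired_seasons_in_slice(seasons, low, high):
    """Return (rate, next_rate) pairs where BOTH seasons fall in the slice.

    seasons: dict mapping (player, year) -> (opportunities, rate)
    low, high: inclusive bounds of the opportunity slice (an assumption)
    """
    pairs = []
    for (player, year), (opps, rate) in seasons.items():
        following = seasons.get((player, year + 1))
        if following is None:
            continue  # no consecutive season to pair with
        next_opps, next_rate = following
        if low <= opps <= high and low <= next_opps <= high:
            pairs.append((rate, next_rate))
    return pairs

# Hypothetical data: only player A's 2000 -> 2001 pair survives the 500-600 slice
seasons = {
    ("A", 2000): (550, 0.50),
    ("A", 2001): (580, 0.52),
    ("A", 2002): (300, 0.45),
    ("B", 2000): (520, 0.48),
}
print(paired_seasons_in_slice(seasons, 500, 600))  # [(0.5, 0.52)]
```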
There are a variety of ways of dealing with this problem. One can use larger slices, use minimum cutoffs (without maximums) instead of slices (e.g. players who shot at least 100 three-pointers in consecutive seasons), weight correlations by opportunities, and/or interpolate figures where data is limited. I expect to try out many of these methods. It’s also worth noting that while binomial randomness is tied to the number of opportunities, the relationship between same-team and different-team YTY r’s should not be, so for that part of the analysis it should be possible to avoid some of these problems.
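Of these options, weighting correlations by opportunities is the least obvious to implement, so here is one possible sketch of a weighted Pearson correlation. How to turn two seasons' opportunity counts into a single weight per player (the smaller of the two, their sum, something else) is left open; the function just takes whatever weights it is given, and its name is made up for this example.

```python
def weighted_pearson_r(xs, ys, weights):
    """Pearson correlation where each (x, y) pair counts in proportion to its weight."""
    total = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / total
    my = sum(w * y for w, y in zip(weights, ys)) / total
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    vx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    vy = sum(w * (y - my) ** 2 for w, y in zip(weights, ys))
    return cov / (vx * vy) ** 0.5
```

With equal weights this reduces to the ordinary correlation, and a pair with a weight of zero is ignored entirely, so low-opportunity players pull on the result less than high-opportunity ones.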
What’s next
So that sums up the methodology I plan on using to look at the reliability and context-dependency of basic stats. I’m sure that some adjustments will have to be made once I start looking at the data. In my upcoming posts I will be going through a bunch of stats to see what using these techniques can tell us.