### Comparing Player Ratings

I’m still working on that follow-up post on regression to the mean, but in the meantime I wanted to put up a post comparing various player rating systems. For the most part this will be a subjective rather than objective evaluation of the metrics, along the lines of Dean Oliver’s “laugh test” (as in, “a rating system that thinks Dennis Rodman was better than Michael Jordan doesn’t pass the laugh test”). I think looking at how players are rated differently in various systems can tell us a lot about both those players and those rating systems.

#### The Player Ratings

I took a look at seven popular player ratings. Two basic linear weights metrics based on boxscore stats - John Hollinger’s Player Efficiency Rating (PER), and Dave Berri’s Wins Produced (WP). Two metrics built on Dean Oliver’s individual offensive and defensive ratings - Justin Kubatko’s Win Shares (WS), and Davis21wylie’s Wins Above Replacement Player (WARP). And three plus/minus metrics based on team point differential while the player is on the court - Roland Beech’s Net Plus/Minus (Net +/-), Dan Rosenbaum’s Adjusted Plus/Minus (Adj +/-), and Dan Rosenbaum’s Statistical Plus/Minus (Stat +/-). For the purposes of comparison I looked at the per-minute (or per-possession) versions of all these metrics (e.g. WP48 instead of WP, WSAA/48 instead of WSAA, WARPr instead of WARP).

Using data from Basketball-Reference and Doug’s Stats I calculated PER and Wins Produced on my own, so the values may differ slightly from those you’ve seen elsewhere (I should note here that Wins Produced has a position adjustment that sets the average guard’s rating equal to the average big man’s rating, a feature (or bug?) which is not present in any of the other systems). For Win Shares and WARP I got this year’s ratings from Basketball-Reference and this APBRmetrics thread, respectively. For Win Shares I converted Win Shares Above Average (WSAA) to a per-48-minute rating (I was able to duplicate the calculations for Win Shares on my own but I wasn’t sure how to calculate Loss Shares). For Net +/- and Adjusted +/- I used data from BasketballValue, and I calculated Statistical +/- on my own. For all metrics other than Adjusted +/-, players who played for multiple teams during the season did not have their stats combined; each stint was rated separately.

First, the top and bottom 10 in each boxscore-based rating system among players who played at least 500 minutes in 07-08:

Next, the top and bottom 10 in each plus/minus-based rating system among players who played at least 500 minutes in 07-08:

Averaging how each player was ranked by all seven metrics, here is a consensus top and bottom ten, along with each player’s rank in each metric:

One thing that jumps out is that despite being rated the first or second best player in the league by five of the seven rating systems, Chris Paul is not among the consensus top ten. He dropped to 14th overall due to his very mediocre Adjusted Plus/Minus ranking (154th out of 329). Exactly why he rated so low in this metric has been the topic of some recent debate. Amare Stoudemire was another player who ranked much lower in Adjusted Plus/Minus (194th) than in the other metrics.

Another eye-popper is seeing Amir Johnson, the 21-year-old Detroit power forward who’s been riding the pine in the playoffs, ranked first in the league in Adjusted Plus/Minus. This actually isn’t as great an anomaly as might be expected - Johnson rated rather well across the board. His consensus ranking was 15th. He was rated lowest by PER (64th), but he ranked 11th in Win Shares and 20th in Statistical Plus/Minus. Obviously one has to use some caution considering he played under 800 minutes on the season, but the fact that he rated well in several metrics could be a good sign for the future.

#### Where the Rating Systems Differ

To further examine the rating systems, for each one I wanted to see which players it liked better (or worse) than the other systems. To do so I found the difference between each player’s ranking in that system and his average ranking in the six other systems. In the chart below if a player is ranked very high in the given metric but much lower in the other six, then he will appear near the top of the list, as a player that that metric “likes” more than other metrics do (e.g. PER likes Kevin Durant a lot more than other systems, but doesn’t like Shane Battier as much as other systems). This could be seen as a list of players that the metric overrates (or underrates, if you’re looking at the bottom players) relative to other rating systems. Or, if you think the players at the top of a list tend to be underrated (statistically), then maybe that metric is the one for you.
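For anyone who wants to reproduce that comparison, here is a minimal Python sketch of the calculation described above. The player names, metric names, and ranks are made up for illustration, and the sign convention (positive means the metric likes the player more than the other systems do) is an assumption rather than something taken from the original charts.

```python
from statistics import mean

# Hypothetical ranks (1 = best); player and metric names are made up.
ranks = {
    "Player A": {"metric_1": 1, "metric_2": 5, "metric_3": 6, "metric_4": 4},
    "Player B": {"metric_1": 10, "metric_2": 2, "metric_3": 3, "metric_4": 4},
    "Player C": {"metric_1": 5, "metric_2": 5, "metric_3": 5, "metric_4": 5},
}


def rank_gaps(ranks, metric):
    """Return (player, gap) pairs sorted from most liked to least liked.

    gap = average rank in the other systems - rank in `metric`, so a
    positive gap means `metric` ranks the player better than the rest do.
    """
    gaps = {}
    for player, by_metric in ranks.items():
        others = [r for name, r in by_metric.items() if name != metric]
        gaps[player] = mean(others) - by_metric[metric]
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


liked_by_metric_1 = rank_gaps(ranks, "metric_1")
```

With seven real systems, each player’s dictionary would simply hold seven ranks, and the same function would be called once per metric.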

Win Shares is highly tied in to team wins, and that can be seen clearly from how highly it rates role players from great teams and how poorly it rates stars from awful teams. One can make corresponding diagnoses for the other metrics based on these lists, in terms of systems over- or under-rating usage, rebounding, scoring, etc.

Next, expanding on the Chris Paul example from above, here are the players that the rating systems are either in greatest agreement on, or in greatest dispute over. To calculate this I just took the standard deviation of the players’ rankings in the seven metrics. All the rating systems agree that Manu Ginobili is pretty good and Acie Law is pretty bad, but they can’t agree on whether Al Jefferson is one of the best or one of the worst players in the league.
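The agreement measure can be sketched in a few lines of Python. The ranks below are invented, and the use of the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the post does not say which was used.

```python
from statistics import stdev

# Hypothetical ranks for two made-up players across four rating systems.
ranks = {
    "Consistent Player": [3, 4, 5, 4],
    "Disputed Player": [10, 200, 50, 300],
}

# Spread of each player's ranks: small = systems agree, large = they disagree.
spread = {player: stdev(player_ranks) for player, player_ranks in ranks.items()}

# Players ordered from greatest agreement to greatest dispute.
by_agreement = sorted(spread, key=spread.get)
```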

#### Correlating the Rating Systems

One thing that stands out on the last chart is that some of the metrics seem to group together in their evaluation of players. Net Plus/Minus and Adjusted Plus/Minus both rated Casey Jacobsen much better than the other five metrics, and they both rated Al Jefferson much worse. To quantify how much each player rating agrees with each of the others, I calculated the correlation coefficient between each pair of metrics for all players who played at least 500 minutes last season (here again it should be noted that Adjusted Plus/Minus was calculated using season totals even for players who changed teams during the season, unlike the other metrics, which split things up).

Here we can see that Net +/- and Adjusted +/- are similar to one another (correlation of 0.73) but very different from the boxscore-based ratings. Statistical +/-, which is meant to be a boxscore-based estimation of Adjusted +/-, does estimate it better than the other boxscore metrics with a correlation of 0.49, but also correlates pretty strongly with those boxscore metrics. WARP is somewhat surprisingly very highly correlated with PER (0.93), perhaps due to the weight both place on usage.
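As a rough illustration of how such a correlation table can be built, here is a small Python sketch that computes the Pearson correlation for every pair of metrics. The metric names and values are placeholders, not the real ratings, and Pearson (rather than a rank correlation) is assumed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical ratings; position i in every list is the same player.
ratings = {
    "metric_a": [1.0, 2.0, 3.0, 4.0],
    "metric_b": [2.0, 4.0, 6.0, 8.0],
    "metric_c": [4.0, 3.0, 2.0, 1.0],
}

# Correlation for every ordered pair of metrics.
names = list(ratings)
corr = {(a, b): pearson(ratings[a], ratings[b]) for a in names for b in names}
```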

I’ll try to put together a spreadsheet so anyone can download all this data soon. Until then I’d be interested to hear any interpretations of these charts that people have.

Is there any chance you can somehow use regression to the mean for Amir Johnson’s Adjusted Plus/Minus ranking? He only had around 800 minutes played, so what is the average for his position?

Comment by Ryan J. Parker — May 28, 2008

As Dave Berri alleges, PER seems to value inefficient scorers (Eddy Curry, anyone?) and completely fails to value non-scorers (Shane Battier, Bruce Bowen). Dave Berri’s linear-weights WP48 is immensely powerful at the team level (you tell me the team’s total stat-line for the season, I tell you how many wins they had) but has serious weaknesses when applied to individual players. The main problem is the value of defense: on the team level, a defensive rebound equals a defensive stop. When you apply the linear weights to an individual’s statline this amounts to assigning the defensive stop to the player who got the defensive rebound. Evaluating player defense is a problem for all box-score based methods.

The most theoretically sound system is Adjusted ±, especially because it is the only way I know of that measures defensive value. To assure ourselves that it is practically sound, however, the following test (inspired by Berri) needs to be done: the total Adjusted ± of the team, weighted by playing time, should give the team efficiency (scoring margin per possession). It would also be nice to see (as Berri does for WP) individual season-to-season correlations. For WP the “team total” test is touted by Berri despite being a tautology (a linear weights system is by definition additive, and the weights were chosen by regression to work on the team level) but this would not be the case for Adj ±.
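A rough Python sketch of that “team total” check, under some loud assumptions: each rating is a point margin per 100 possessions while the player is on the court, minutes are a fair stand-in for possessions, opponent strength is ignored, and the roster below is entirely made up.

```python
def team_margin(players, team_minutes):
    """Playing-time-weighted sum of player ratings.

    `players` is a list of (rating, minutes) pairs. Each player's weight is
    his share of the team's minutes, so the weights add up to 5 because
    five players are on the floor at any time.
    """
    return sum(rating * minutes / team_minutes for rating, minutes in players)


# Hypothetical roster: (adjusted +/- per 100 possessions, minutes played).
roster = [(4.0, 100), (2.0, 100), (0.0, 100), (-1.0, 100), (-2.0, 50), (-4.0, 50)]

# The player minutes must total five times the team minutes.
assert sum(minutes for _, minutes in roster) == 5 * 100

predicted_margin = team_margin(roster, team_minutes=100)
```

The test would then compare `predicted_margin` with the team’s actual scoring margin per 100 possessions.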

Comment by Lior — May 28, 2008

Ryan, that’s a good idea that I’ll look into. For a starting point one can look at the standard error, which Aaron lists on his site as 6.53 for Johnson ( http://basketballvalue.com/topplayers.php ). The higher the standard error, the more you’d want to regress.

Lior, adjusted plus/minus will pass that “team total” test you propose as long as you account for the strength of the opposing lineups faced as well. The starting point for adjusted +/- is player on-court point differential, which sums to team point differential exactly. As for year-to-year correlations, adjusted plus/minus doesn’t fare well in that area because it’s pretty noisy unless you have a large (multiple season) sample.

Comment by Eli — May 28, 2008

To make more meaningful RMS numbers, you need to start with a linear variable. The rankings are far from linear, as the “quality” of the players is drawn from some parent distribution with lots of players in the middle, probably an exponentially falling high edge, and a hard-to-guess-at low edge (where the GMs’ ability to diagnose subpar players, or willingness to pay them, comes in).

A standard bell curve should be a more reasonable starting point: translate the rankings using the inverse of the standard normal distribution function and then look at the RMS.

This should eliminate the artifact that the players which the systems most agree on are of extreme quality: 9 of the ten “most consistently ranked” are exceptionally good players, and one has exceptionally bad stats.
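A minimal Python sketch of that transformation, using the standard library’s `NormalDist`. The half-step offset that turns a rank into a percentile is an assumption (it just keeps the percentile strictly between 0 and 1), and 329 is the player count quoted in the post.

```python
from statistics import NormalDist, pstdev


def rank_to_normal_score(rank, n_players):
    """Map a rank (1 = best) to a standard normal score.

    The rank becomes a percentile in (0, 1), which is then passed through
    the inverse of the standard normal CDF.
    """
    percentile = (n_players - rank + 0.5) / n_players
    return NormalDist().inv_cdf(percentile)


n = 329

# Four consecutive ranks at the top and four in the middle of the league.
top_scores = [rank_to_normal_score(r, n) for r in (1, 2, 3, 4)]
middle_scores = [rank_to_normal_score(r, n) for r in (163, 164, 165, 166)]

# The same rank spread is much wider on the normal scale at the extremes.
top_spread = pstdev(top_scores)
middle_spread = pstdev(middle_scores)
```

On this scale a disagreement of a few places near the top counts for more than the same disagreement in the middle of the pack, which works against the artifact described above.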

Comment by Amnon — May 28, 2008

Here is a spreadsheet with all the data I used for the post (including players who played less than 500 minutes):

http://spreadsheets.google.com/pub?key=pLWcAQTLnESu2LzGKCY-gOQ

Comment by Eli — May 28, 2008

Eli,

This is very fascinating work, thank you for putting the effort forth and sharing. I have a lot of interest in this material, and the next immediate thought in my head is on how to apply these metrics and player data to team performances in 2007/2008. I’m not sure if that question interests you, but I’m subscribing to this blog anyway to see where you take it.

Thanks!

Erich

Comment by Erich — May 28, 2008

Eli, would it be correct to say that the probability that Amir Johnson’s true adjusted +/- rating is higher than Kevin Garnett’s is 55.5%? I calculated this by finding the difference between the two normal distributions (as they’re defined by the model). In R: 1-pnorm(0, mean=(12.46-11.33), sd=sqrt( 6.53^2 + 4.77^2 ))

Not only is this pertinent to this topic, but I’m interested in knowing if that is a correct statement in general for models where we assume the “predictions” (if you will) follow a normal distribution. We always get focused on the deterministic piece of the output, but I’m more interested in the probability aspect when comparing the ratings of players (or other output of other models in general).

Comment by Ryan J. Parker — May 29, 2008

Yeah, I think that’s the correct method. But I actually linked to the wrong page. Those are the figures combining the regular season and playoffs. The regular season page is here: http://basketballvalue.com/topplayers.php?&year=2007-2008

Using those numbers, the formula is 1-pnorm(0, mean=(14.65-9.68), sd=sqrt(6.84^2 + 5.25^2)), which gives a result of 72%.
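For readers without R, the same calculation can be sketched in Python with the standard library’s `NormalDist`. It treats the two estimates as independent normal distributions, as in the comments above; the function name is my own, and the pairing of each rating with its standard error is inferred from the formulas rather than stated outright.

```python
from math import sqrt
from statistics import NormalDist


def prob_a_above_b(rating_a, se_a, rating_b, se_b):
    """Probability that player A's true rating exceeds player B's.

    The difference of two independent normal estimates is itself normal,
    with mean rating_a - rating_b and variance se_a**2 + se_b**2.
    """
    difference = NormalDist(mu=rating_a - rating_b, sigma=sqrt(se_a**2 + se_b**2))
    return 1 - difference.cdf(0)


# Regular season figures from the formula above.
regular_season = prob_a_above_b(14.65, 6.84, 9.68, 5.25)

# Regular season plus playoffs figures from the earlier comment.
with_playoffs = prob_a_above_b(12.46, 6.53, 11.33, 4.77)
```

These should reproduce the 72% and 55.5% figures quoted in the thread.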

Comment by Eli — May 29, 2008

So KG wasn’t a good example then. Thanks (in large part) to Howard’s standard error, Amir Johnson has just a 53% chance of being ahead of Howard.

I think this is a piece of the puzzle that we don’t really focus on all that much. Although Amir heads that list, there’s still a 31% chance that Joe Johnson’s rating is better. Assuming independence, when you take say the top 10 and calculate the odds that someone is #1, you begin to realize how much uncertainty there really is.

Comment by Ryan J. Parker — May 29, 2008

Good points. I sorted through Aaron’s data and found that Amir’s standard error was the 12th largest among the 339 players that he calculated adjusted plus/minus for. Chris Paul had the 25th largest standard error. Ginobili had the 14th smallest.

Comment by Eli — May 29, 2008

I think the wild fluctuations in Al Jefferson’s value are tied to the fact that he is very, very, very bad at defense. As a Minnesota fan who watched the huge disparity between his effort on offense versus defense, I say he is definitely not the 13th best player in the league. Maybe the 13th best scorer, but not the 13th best player. His defense is just not there. Therefore I can see how the different metrics rate offense by seeing how they rate Big Al; the higher the rating he gets, the more the metric overrates scoring and underrates defense.

Comment by J. Prince Lawrence — May 30, 2008

The missing link in all of this is coaching. If you look at lineups, it matters if a coach has the right combination of players together. Put Chris Paul with the wrong 4 guys and he will look bad. Put him with the wrong guys, in the wrong situation, against a better matched unit, and his numbers, no matter what the parameters, will suffer relative to his ability.

If you want to start seeing some interesting information, take all the same information and formulas you have, and apply it to coaches. Look at what happens to the performance of players/teams and lineups before and after a coaching change.

Comment by mark cuban — May 30, 2008

Interesting stuff. Another way to average the ratings is to create z scores within each of the 7 systems, then average the z scores. Might be a little better than rank, as it captures magnitude of differences between players more precisely, and the average z score might reveal that the gap between player 4 and 5 is much larger than players 3 and 4 (or whatever). You could also then calculate an r for each metric against the average z scores.
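A quick Python sketch of that z-score approach, with made-up players and ratings. Using the population standard deviation (`pstdev`) is an assumption; the sample version would work just as well for this purpose.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical ratings on two different scales; position i is the same player.
players = ["Player A", "Player B", "Player C"]
ratings = {
    "metric_a": [20.0, 15.0, 10.0],
    "metric_b": [0.3, 0.1, 0.2],
}

# Standardize within each metric, then average across metrics per player.
per_metric = [z_scores(values) for values in ratings.values()]
consensus = {
    player: mean([scores[i] for scores in per_metric])
    for i, player in enumerate(players)
}

best = max(consensus, key=consensus.get)
```

Correlating each metric’s z-scores with `consensus` would then give the r for each metric against the average.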

Comment by Guy — May 30, 2008

Using z-scores, here’s the consensus top ten:

Pretty similar for the most part. Paul’s poor adjusted +/- gets balanced out by his extremely strong statistical +/-.

If we use z-scores for the standard deviation (consistency) chart, things change a lot. Here’s the new list of the players the metrics are in greatest agreement on:

I think I prefer using rankings for that, since they in a way admit that none of the metrics have that much precision, and flatten things out.

Comment by Eli — May 31, 2008

Eli: Do you think it’s meaningful to look for the correlation of each metric with the average z score?

Comment by Guy — May 31, 2008

I thought about that. Basically that would tell you which metric was most like all the others. I think that has some value, but it’s also pretty dependent on which metrics are included (i.e. the more boxscore-based metrics there are, the more each one will look representative). Here’s what I get:

So statistical plus/minus is first, because it has similarities to both the boxscore-based metrics and the plus/minus metrics. Win Shares is the least representative boxscore method, probably because of its strong tie-in to team wins. The pure plus/minus metrics are the most unique. Part of that could be that they are more prone to small sample size fluctuations.

Comment by Eli — May 31, 2008

Following up on Mark Cuban’s points, player stats are indeed impacted by lineup choices of coaches and their construction quality. The degree and location of impact will vary. You could attempt to estimate it or make guesses about it from careful study of lineup data. And adjust lineups to increase yield from players. Year to year comparison of lineup performance with a coaching change would be useful for evaluating coaching impact and player fit.

Coaches also decide what plays to call on offense when in a particular lineup or against a particular type of opposing lineup and the defense too. If the team tracked what coaching play calls were made (and I assume good teams do) you could evaluate not only the average overall efficiency of the lineup but the efficiency of the coaching calls within lineups and against various type matchups. And of course player calls (especially PG calls) within coaching play calls also matter and could be examined and tweaked. It is a many layered cake.

Player stats aren’t simply player stats or player vs opponent. It is about players in lineups. And Mark’s post correctly notes that even adding consideration of lineup is not enough. They are also influenced by coaching and player calls.

I will mention my last post in this thread: http://sonicscentral.com/apbrmetrics/viewtopic.php?t=1780

5 man lineups play the game and win or lose it.

They are composed of individuals. They are also directed by coaches. The goal is maximum net efficiency vs opponent.

Should the metric applied during a season be weighted according to how the league performs or how that team performs? In the off-season and long-term I can see the former. But within a season I wonder if team style of winning is real and should influence the weights. I guess I’d look at it both ways.

Comment by Mountain — May 31, 2008

Lately I’ve been focused on matching up player level performance measures with lineup performance and that remains of interest and could offer significant gains but these are roll-ups and the game is really at play level.

I can envision an evaluation system whereby you could with the right support track every level of choice / action / result down to play level. The better teams probably are more accurately self-aware and adjusting of not only players and lineups but also coaching. Manage at the margin not just optimize averages.

Comment by Mountain — May 31, 2008

The Carlisle / Kidd era will play out and I wonder if guidance from the numbers will be more strongly / effectively applied compared to this season, not knowing how much it was used under Johnson.

I have wondered if Dallas (and a few others) already have adjusted 4 factor information by player and if player-lineup performance crosswalk guidance is tighter than what has been assembled yet in public. Probably. I’d like to see the tools for fans / analysts catch up. Outgrowths from this thread and various apbr threads could take further steps in that direction.

Getting down to coaching play call / results probably not possible from the outside but public knowledge of coaching management of players and lineups can become deeper.

Though as I’ve noted previously if you look at the hundreds of lineups coaches use, most for small minutes and small minute lineups for 70% of all minutes league wide, it is hard to get much sense of design or guidance from it. It looks so scattered, so situational, mostly unsupported by decent sample size previous experience to provide meaningful guidance.

Dallas last season did not use a single lineup for over 200 minutes. Impacted by the Kidd trade, but still…

It looks to me that more study and use of that data, though certainly always hampered by small sample sizes especially when you consider opponent or game situation, could improve lineup and overall results. If that information was being worked hard I’d think that the lineup distributions should be quite different than what we typically see. Fewer lineups used more often, duds decreased or eliminated more frequently.

Looking at the Dallas information duds among the most used seem fairly low. Maybe Dallas has a more flexible team allowing a greater spread of lineups.

In playoffs 4 of the top 7 lineups used had big negatives- against the match-up. What guidance did regular season give on those 4 negative lineups? Two were positive in regular season … but in pretty small minutes compared to what you could have tested to gain more confidence in them. The other 2 were negative. Again in inconclusive small samples. More test of key lineups you end up using in playoffs would seem wise to me.

Comment by Mountain — May 31, 2008

The 2 lineups used in playoffs despite negative regular season results were non-Kidd (subject to much discussion) and one involved Bass (whose playing time reportedly was also a subject of debate). They only represent about 10% of playoff minutes, so maybe that seems too small to make a big deal about, but these lineups produced 25% of the total negative actual point differential experienced in the series.

Comment by Mountain — May 31, 2008

I’ve moved from the general topic to a specific case study but want to make a few more comments to finish up, because I think it flows from the general discussion and suggests an important role for data analysis of lineups in game strategy planning and review.

Trolling small sample lineups can yield returns.

Dallas used Bass-Howard-Kidd-Nowitzki-Terry just 38 minutes in the regular season but made it the 2nd most used lineup in the playoffs, perhaps due to the huge positive +/- it produced in the regular season. It was a big playoff winner too. So that was a nice success. You could ask though: should they have used it even more in the regular season to be more confident, and then used it more than 16% of the time in the playoffs?

On the other side of the ledger there was an Allen lineup that had good fortune in short minutes in regular season. Several Allen lineups were used in playoffs but not that one (with Dampier). Was the testing or the review of Allen lineups adequate? Time for it was short and maybe it was a small sample fluke. You catch some once, some again / some not. But you want to catch as many again as you can. Short tests leave it at guessing. Longer tests are still guesses but perhaps a little better informed.

The top 7 Dallas lineups in playoffs got used for about 2/3rds of those minutes but only got “tested” in under 20% of the regular season’s minutes. If 5 man lineups are unique creatures, knowing the players and their average performance is not really enough. Playing time slips away quickly and you really need to win regular season games first and foremost, but I still think disciplined testing could be advanced further for benefit.

Comment by Mountain — May 31, 2008

Heavy use of lineup analysis might assume and need to assume that lineup performance has an underlying true value, a value determined by how players behave in that specific lineup which might be different from the sum of their overall average performances. Else why bother? Adjusted lineup analysis finds an estimate of that value as best it can, though there is of course error, making it hard to read and use as a guide.

The boxscore based metrics have correlations under .5 with adjusted +/- for players. I would guess it is even lower when the sum of individual scores are compared to adjusted +/- for lineups. The best performance is from net counterpart +/- which benefits from having more of defense. Net statistical +/- might well outperform regular counterpart +/- and explain a large share of at least player adjusted +/-.

The size of the gap between the sum of net +/- (standard or statistical) for players and adjusted lineup +/- is the measure of the effects of lineup and randomness with this data. Comprehensive crosswalk of player and lineup adjusted scores would help address the age old question of how much player synergy matters. Adjusted 4 factors detail would help account for gains and slippages by factor and player.

I know this is rough in places and jumps around but I hope some of it may be helpful.

Comment by Mountain — June 1, 2008

Should I use this to select my fantasy basketball team?

Comment by Tim — June 1, 2008

It’d probably be more useful for fantasy football. The Patriots could have used Dwight Howard’s vertical leap to knock down that pass to David Tyree.

Comment by Eli — June 1, 2008

Tim, the weights of fantasy leagues are often quite different. Stick to the weights they list for that game.

Comment by Mountain — June 1, 2008

Thanks for all the comments, Mountain. I’ll have to let them digest some.

Comment by Eli — June 1, 2008

I am guessing I missed the intended humor of Tim’s remark here that I picked up from your exchange in the other thread.

That’s fine, Eli. Much of it was tangential. Thanks for tolerating it.

Good luck with the offensive / defensive splitting of adjusted +/-.

I look forward to seeing your statistical +/- and its correlation with pure adjusted.

Maybe I’ll ask again if you think doing it by team would have any value in addition to the league wide approach?

Comment by Mountain — June 2, 2008

No sweat, Mountain. Tim’s an old friend.

I’m always wary of reducing sample sizes, but I’m with you on the importance of context, whether it be teammates, opponents, or the coach and system a player plays under.

Comment by Eli — June 2, 2008

I was just wondering if a team specific set of weights for statistical +/- might get the correlation with pure adjusted up past .7. If it could, that would be attractive. But if you can get it there with your league-wide weights, great.

It just seems rough to say a defensive rebound or 3 pointer is worth x or y because that is the best fit value for the league when they will have somewhat different values on specific teams. An important point of the statistical approach is to get to actual game value over generic valuation.

Comment by Mountain — June 2, 2008

Or could there be a middle ground? Statistical +/- weights 50-75% from league-wide and 25-50% from specific team?

Recalling a recent academic study that used regression and found different weights (values) for box-score stats by position I’d also wonder if it would be a worthwhile experiment to apply a mix of league-wide and position weights, but I know that would be controversial.

Comment by Mountain — June 2, 2008

I think it is important to calculate whether a player holds the person he is guarding below his season field goal average. If he is able to do so, then he is an elite defender. Calculate the data for all his games and compare how many times in his matchups he holds the player he is guarding below his overall field goal average, and then you have a good measure. An important aspect to calculate is overall field goal average, which means including free throws. If a player shoots 2/10 from the field but hits 10/10 FTs, his overall field goal average should be 12/20, not 2/10. A 60 percent overall field goal average is a lot better than 20 percent, and this figure should be calculated for the entire year. Obviously this takes a lot of effort and you stat geeks want the easy way out, but this is the only way to do it.
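The arithmetic in that example can be written out as a tiny Python sketch. The function name is made up, and this “overall field goal average” is the definition proposed in this comment, not a standard statistic.

```python
def overall_shooting_average(fg_made, fg_attempted, ft_made, ft_attempted):
    """Made shots over attempted shots, counting free throws as attempts."""
    attempts = fg_attempted + ft_attempted
    if attempts == 0:
        return 0.0  # no attempts at all, so nothing to average
    return (fg_made + ft_made) / attempts


# The example from the comment: 2/10 from the field and 10/10 free throws.
field_only = 2 / 10
with_free_throws = overall_shooting_average(2, 10, 10, 10)
```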

Comment by wdynasty — January 13, 2009

Secondly, did the player hold the man he was guarding below his season’s points per game average? This is a separate category and one that is not as important as field goal average, but weight should be given to it. If a player is held below his season points per game average, then the offensive team has to rely on another player to make up his offensive production. Points should be awarded to a defensive player who causes a disruption in the opposing team’s average offensive production per player.

Comment by wdynasty — January 13, 2009

Finally, offensive rebounds should be counted as a serious defensive stat since, in theory, an offensive rebound amounts to a full stop on the defensive end and a new shot attempt for the offense. This should be calculated as a missed shot attempt for the person that they guard on defense! Obviously these challenges are for the best of the best in statistics, but major props should be given to the person who creates a system incorporating these fundamental elements of defense in a player defensive rating. I believe an offensive rebound, a steal, and a forced charge should all be calculated as missed field goal attempts for the person they are guarding. This is the only true way to measure the true impact of the defensive ACTION ON THE COURT AND ITS EFFECT ON THE GAME. LET’S GO STAT GEEKS.

Comment by wdynasty — January 13, 2009
