Calculating Adjusted Plus/Minus
Adjusted plus/minus is a way of rating players first developed by Wayne Winston and Jeff Sagarin in the form of their WINVAL system (more here). The basic idea is simple. For each player, it starts with the team’s average point differential for each possession when they are on the court (sometimes referred to as the player’s on-court plus/minus). This gives a number showing how effective the player’s team was when they were in the game. The problem with using this to evaluate individual players is that it is biased in favor of players who play alongside great teammates (and players who play against weak opponents). This can be seen by looking at the 2007-08 leaders in on-court plus/minus, which can be seen here (the Overall Rtg, On column) or here (the On Court/Off Court, On column). Kendrick Perkins rode his teammates’ coattails to the second highest mark in the league. Adjusted plus/minus uses regression analysis to control for these biases by controlling for the quality of the teammates a player played with and the opponents he played against.
That sounds easy enough, but it’s actually kind of complicated, and the specifics of WINVAL were never made public (Mark Cuban reportedly was paying a handsome sum to use the system for the Mavs). Thankfully, in 2004 Dan Rosenbaum spelled out the details of the methodology in an article. He called his version adjusted plus/minus, and released a series of analyses using the metric (here and here). Eventually Dan was hired to consult for the Cleveland Cavaliers, but because he had spelled out the methodology others were able to duplicate his work for future seasons. David Lewin published rankings for the 2004-05 and 2005-06 seasons, and Steve Ilardi and Aaron Barzilai have done the same for the 2006-07 and 2007-08 seasons (up-to-date ratings can be found here).
I always wanted to try to calculate adjusted plus/minus on my own, but I was intimidated. I figured that I didn’t know enough about running regressions and that I didn’t have the data, software, or computing power to run such a large analysis. But I finally sat down and tried to do it a few days ago, and I discovered that it’s not that difficult. Using Dan Rosenbaum’s description of his method, publicly available data from BasketballValue, Excel 2007, and the free statistics program R, I was able to set up and run the whole thing in less than an hour. Here’s how I did it.
Getting the software
I used Excel 2007 to set up the data. Unfortunately, I don’t think an earlier version of Excel (such as Excel 2003) will work because previous versions were limited to 256 columns, and we’ll be working with a big table with over 300 columns (basically one for each player in the league). I haven’t really looked into ways of getting around this issue if you don’t have Excel 2007. It’s probably possible to do the data setup in a database program, in R itself, or through some scripting. If anyone has an ideas on this please let me know.
I used R to run the regression. I used R because it’s free, but I’m sure other programs like SPSS, SAS, Stata or Minitab would work as well. I’m not going to get into the details of installing R, but I will point to the R for Windows FAQ and the download page for the latest version.
Getting the data
Thanks to Aaron Barzilai’s fantastic site BasketballValue, the raw data needed to calculate adjusted plus/minus is readily available in an easy to use format. The key is having play-by-play data grouped into stints by the lineups on the court. The file we want is labeled “List of each matchup of one unit against another” on this page. Download the one from the 2007-08 regular season (matchups20080417.zip). While you’re there also download the “Statistics of all players across all teams played for” file (playerstats20080417.zip). Unzip both of these files.
Setting up the data in Excel
Open a new Excel spreadsheet. Go to the “Data” tab and select “From Text” in the “Get External Data” section. Navigate to wherever you saved the unzipped matchups200804170110.txt file, select it, and click “Import”. Make sure “Delimited” is selected and click “Next”. Make sure “Tab” is selected and click “Next”. Click “Finish” and then click “OK”. This will import 35459 rows with data in columns A through AU.
In cell AV1 type “HomePlayers” and in cell AW1 type “AwayPlayers”. In AV2 type this formula:
="P"&F2&",P"&G2&",P"&H2&",P"&I2&",P"&J2&","
Double-click on the bottom right corner of the cell to autofill this formula into all of the cells in the column. In AW2 type this formula and then autofill it down:
="P"&K2&",P"&L2&",P"&M2&",P"&N2&",P"&O2&","
These will be used to quickly check whether a given player was on the court for a particular stint.
In AX1 type “MARGIN”. In AX2 type this formula and then autofill it down:
=100*(IF(AH2=0,125002/114854,AD2/AH2)-IF(AI2=0,120811/114896,AE2/AI2))
This is basically PointsScoredHome/PossessionsHome minus PointsScoredAway/PossessionsAway, which gives us the per-possession plus/minus of the lineup that was on the court for the home team. The complications come from stints in which one of the teams had zero possessions. In those cases we substitute the league average home (or away) team points per possession, which is what 125002/114854 and 120811/114896 represent (I got those four numbers by simply summing the PointsScoredHome, PossessionsHome, PointsScoredAway, and PossessionsAway columns). I’m not completely sure this is what should be done when one lineup has zero possessions, but it was the best sense I could make of Rosenbaum’s footnote.
At this point you may notice that this new MARGIN column has similar values to the OverallRtgHomevsAway column (AL), which is equal to OffensiveRtgHome (AJ) minus OffensiveRtgAway (AK). The reasons I didn’t just use the OverallRtgHomevsAway column are because it doesn’t have the adjustment for zero possessions detailed above, and because it contains some screwy numbers that I can’t explain. This can be seen in the very first row, where PointsScoredAway is 5 and PossessionsAway is 14, yet OffensiveRtgAway is 33.3333 (rather than 100*(5/14) = 35.7143). I don’t know if this is just an error in Aaron’s data or what, but as a result I decided to calculate things on my own.
Now that we have the dependent variable (MARGIN) for each observation, we have to set up the independent variables. Basically, every player in the league is an independent variable, other than those players who are excluded due to playing low minutes (these players become the reference players relative to which all other players’ adjusted plus/minus is calculated). So we first have to decide what our minutes cutoff will be. In his initial work Rosenbaum used 250 minutes over two seasons as his cutoff. Others have used 500 minutes in one season. For the purposes of this example I will use the cutoff used by Aaron Barzilai and Steve Ilardi of only those players who were in the top 75% in the league in minutes played. For 2007-08 this works out to a cutoff of 388 minutes, with 339 of the 467 players in the league qualifying and the other 128 serving as the reference players (I know this doesn’t come out to exactly 75%, but it’s the players that BasketballValue uses so it’s what I will use).
Here’s where that other file we downloaded comes in handy. Go to Sheet2 of the spreadsheet and then go to the “Data” tab and select “From Text” in the “Get External Data” section. Import the playerstats200804170110.txt file following the same steps as were described above to import the matchups200804170110.txt file. Once this is imported, if you scroll all the way over to the AdjustedPM column (AL), you’ll see that non-qualifying players are listed as “NULL”. Click on AL1 and sort the column A to Z. Scroll back over and insert a column immediately to the left of column B (right-click on “B” and select “Insert”). In this new column, type “PID” in B1. In B2 type the formula =”P”&A2 and then autofill this down. Scroll back over to column AM and scroll down until you find the last row with a value (before all the NULLs). This should be row 340. Scroll back over to column B. Click in B340, hold, and drag all the way up to B2. Hold down the Control key and hit C to copy the selected cells. Go back to Sheet1 of the spreadsheet and click in cell AY1. Go to the “Home” tab and click on the arrow below “Paste” in the “Clipboard” section. Select “Paste Special” and then select “Values” in the top section and “Transpose” in the bottom right before clicking “OK”. This will create the column headers for each independent variable, the 339 qualifying players (it doesn’t really matter what order they’re in, since we can identify them by their BasketballValue PlayerID number). At this point you should definitely save the spreadsheet, if you haven’t been doing so already.
While we’re working with the playerstats data, we’ll set something else up that will come in handy later when running the regression in R. Go back to Sheet2 and insert a column immediately to the left of column C (right-click on “C” and select “Insert”). In C1 type “fit.adj0708 <- lm(adj0708$MARGIN ~" (not including the quotation marks). In C2 type the formula ="adj0708$"&B2&" +" (including all the quotation marks and the space in the middle), and then autofill it down. Select column C, hit Ctrl-C to copy and then go to the "Home" tab, click the arrow below "Paste", and select "Paste Values". Click in cell C1 and drag down all the way down to cell C340 (but no further). Hit Ctrl-C to copy, then go to Sheet3 of the spreadsheet, go to the "Home" tab, click the arrow below "Paste", and select "Transpose". Immediately hit Ctrl-C to copy, then open Notepad (Start -> Run -> notepad) and paste (Ctrl-V). Scroll up to the top, and select the blank space between the tilde (~) and the “a” in adj0708$P235. Hit Ctrl-C to copy. Go to the “Edit” menu, select “Replace”, and hit Ctrl-V to paste into the “Find what” box (it will look like some blank space got pasted in). In the “Replace with” box type a space, and click the “Replace All” button. Then click “Cancel”. Click at the very bottom of the file which should put the cursor at the start of a blank line. Hit “Backspace” three times (which should put the cursor right after adj0708$P732), then type a comma, then a space, and then “weights=adj0708$Poss)” (not including the quotation marks). Save this text file, as we will need it later (if for some reason you can’t get this to work, you can download a completed version of the text file here).
This next step is the most computer processor-intensive of the whole process. For each stint we have to code each player as either being on the court for the home team in that stint (1), on the court for the away team (-1), or not on the court (0). This can be automated by a simple Excel formula, but then that formula has to run on over 12 million cells (339 x 35458). So before you do this, make sure you’ve saved the spreadsheet and that no other programs are running. This is probably a reason why the data setup should be done in a database program like MySQL rather than Excel, but oh well.
In AY2 type the following formula:
=IF(ISNUMBER(FIND(AY$1&",",$AV2)),1,IF(ISNUMBER(FIND(AY$1&",",$AW2)),-1,0))
Click on the bottom right corner of AY2, hold, and drag all the way over to the right until you reach the last column (NY). At this point you could double-click on the bottom right corner of NY2 to autofill down for all the columns, but I wouldn’t recommend it unless you have a very fast computer. Instead, I’d recommend splitting the columns up into chunks depending on your processor speed. Scroll back over to column AY, click in AY2 and drag over to the right about 10 cells. Then double-click on the bottom right corner of the right-most cell you selected and wait for it to autofill. This will give you a sense of how long the process will take. While Excel is working you might see a green progress bar in the bottom right of the screen, or the screen could appear to freeze with the Excel title bar saying that the program is “Not Responding”. Just ignore this and wait until it completes (if it never does, you can try to abort by pressing the “Break” key on the keyboard (usually it’s near the top right) or by using Ctrl-Alt-Delete to kill Excel). Once the autofill is done, before clicking anywhere else (while all the cells are still selected), hold down the “Control” key and hit “C.” Then go to the “Home” tab and click on the arrow below “Paste” in the “Clipboard” section. Select “Paste Values”. This also may take a little time to complete. If you did all that and your computer didn’t freeze, you can try a larger chunk. For me, 50-column wide chunks worked well. I selected the next 50 cells in row 2, autofilled them down (which took about 30 seconds), copied and pasted values (another 10 seconds), saved the whole file, and then went on to the next 50 columns. If 50 columns seems like pushing it, you could do 25 at a time. Overall there are 339 columns that have to be autofilled. Once you finish and save the file the filesize will be almost 50 megabytes.
If your computer doesn’t agree with all that, you could set up the data in a database program, or just download this file I created that has all the work done for you (including the step described in the next paragraph).
Now we have all the values for the independent variables. All we have to do now is clean up the spreadsheet and save the data we need to a CSV file. First select the MARGIN column (AX), hit Ctrl-C to copy, and go to Paste Values to paste (as described above). Next, select column AY, right-click on the AY heading, and select “Insert” to insert a column. In AY1 type “Poss”. In AY2 type the formula =AH2+AI2, and then autofill down. Select this newly created column, hit Ctrl-C to copy, and go to Paste Values to paste. This created a column of cells telling us how much each stint should be weighted (which is by the total number of possessions).
Save the spreadsheet. Then select all the columns from A to AW, right-click in one of the column headings (e.g. A) and select “Delete”. Do NOT click save after this. Instead, go to “Save As” -> “Other Formats” and in the “Save as type” dropdown select “CSV (Comma delimited)”. Name the file “adj0708″ and click “Save”. Excel will warn you that you can’t save multiple sheets, click “OK”. It will then warn you that there may be incompatible features, click “Yes”. You can now exit Excel (it will ask you whether you want to save changes to the CSV and you should say “No”). The CSV you created should be around 25 megabytes. Again, here’s a download link to the CSV that I created by following these steps - you can use it as a point of comparison if you want.
Running the regression in R
Now we have to import the CSV we created into R. To do this you’ll need the full path to wherever you saved the CSV file. The only trick is that when entering the path in R you should replace all the back-slashes (\) with forward-slashes (/). Enter the following command into R (replacing the path as necessary) to load the data into a table called “adj0708″ (it may take a minute to read the file):
adj0708 <- read.csv(file="c:/directoryname/adj0708.csv")
Next we have to enter the details of the regression we want to run. The basic format for using the lm() function is this:
RegressionName <- lm(TableName$DependentVariableColumn ~ TableName$IndependentVariable1Column + TableName$IndependentVariable2Column + ... , weights = TableName$WeightsColumn)
Here’s where that text file we created earlier comes in handy. It contains the exact command to enter into R to run the regression. Open the file, hit Ctrl-A to select all and Ctrl-C to copy. Go to the R command prompt and hit Ctrl-V to paste. You will see it paste everything in character by character. Once it has finished, hit enter, and the regression will run. This should take under a minute.
Once that’s complete, enter the following two commands to output the results of the regression to a file called results.txt in the R files directory:
sink("files/results.txt")
summary(fit.adj0708)
Viewing the results
Open the results.txt file which should be in your R files directory. First you’ll see the formula for the regression. Next is the coefficients section. Select this whole section (starting at the beginning of the line below “Coefficients:” and ending just before the “—” line) and copy it. Open a new Excel spreadsheet and paste in the selection. Go to the “Data” tab and select “Text to Columns” in the “Data Tools” section. Make sure “Fixed width” is selected and click “Next”. In the Data preview section insert a line by clicking just after the right parenthesis in Pr(>|t|). Click “Next”, then click “Finish”. In A1 type “PlayerCode” and in F1 type “Signif”. Go to Sheet2 of the spreadsheet then go to the “Data” tab and select “From Text” from the “Get External Data” tab. Navigate to the playerstats200804170110.txt file that we downloaded and unzipped earlier and click “Import”. Make sure “Delimited” is selected and click “Next”. Make sure “Tab” is selected and click “Next”. Then click “Finish” and then “OK”. Go back to Sheet1 of the spreadsheet. Right-click on the “B” column heading and select “Insert”. In B1 type “Player”. In B2 type the following formula and then autofill it down:
=VLOOKUP(VALUE(RIGHT(A2,LEN(A2)-9)),Sheet2!A:AM,2,FALSE)
Click on cell C1 and then sort Z to A. This will sort all players by their adjusted plus/minus. The magnitudes will be greater than those on BasketballValue because they center their results around the league average. To do this first right-click on the “C” column heading and select “Insert”. Type “Minutes” in C1. In C2 type this formula and then autofill down:
=IFERROR(VLOOKUP(VALUE(RIGHT(A2,LEN(A2)-9)),Sheet2!A:AM,4,FALSE),"")
Next right-click on the “E” column heading and select “Insert”. Type “Centered” in E1. In E2 type the formula =D2-SUMPRODUCT(D$2:D$341,C$2:C$341)/SUM(C$2:C$341) and then autofill it down (sumproduct(adj,min)/sum(min) is the minutes-weighted league average). This gives the final results, which should be close to the listings on BasketballValue (which are also listed in column AL on Sheet2). I’m not sure why the numbers aren’t identical - it may have something to do with the weird values in OverallRtgHomevsAway that I discussed earlier. Regardless, here’s the top ten that I got (which hopefully matches your results if you followed the steps outlined):
Player Adj +/- StErr --------------- ------- ----- Amir Johnson 13.57 6.53 Dwight Howard 12.71 10.12 Ronnie Price 12.18 6.73 Antawn Jamison 11.89 5.00 Thaddeus Young 11.86 3.96 Peja Stojakovic 11.85 5.52 Chris Bosh 11.64 4.13 Kobe Bryant 11.60 5.16 Manu Ginobili 11.56 3.63 Jamal Crawford 10.08 4.71
What’s next
I need to figure out how to separate this into offensive and defensive adjusted plus/minus. I’m not sure how to do this. Running a different regression using ORtg (or DRtg) as the dependent variable instead of MARGIN would seem to make sense, but the problem is that the independent variables would have to be a mix of player adjusted offensive +/-’s and opponent adjusted defensive +/-’s (but of course these would have to change for different observations). Using MARGIN as the dependent variable but doubling the number of independent variables (with each player having one entry for their offense and one for their defense) doesn’t seem to work either. Dan Rosenbaum’s original article suggested that he combined adjusted plus/minus with efficiency ratings to get separate offensive and defensive ratings, but I’m not sure exactly what he meant by that. Anyway, if anyone has any ideas, let me know. Once I figure out how to split things up I should also be able to calculate adjusted rebound rates (really adjusted on-court team offensive and defensive rebounding percentages). BasketballValue has the data needed to calculate those but again I’m not sure of the methodology.
I also hope to regress boxscore stats (and some other advanced stats) onto these adjusted plus/minus results to calculate my own weights for statistical plus/minus.
Nice article. You know what would be nice? Some cool graphics and pictures of players getting slam dunks and stuff, and maybe for controversial toplics have a picture of players arguing with the refs.
Comment by Tim — June 1, 2008
Ha! Thanks for the suggestion Tim.
Comment by Eli — June 1, 2008
I never realized you could fit adjusted +/- with lm(). I thought some weird variant of glm() would be needed. This will be a handy bookmark to have.
Comment by Ryan J. Parker — June 2, 2008
I think the problem you’re having with separating the defensive and offensive values is that in addition to doubling the number of independent variables, you should also think of each data line as two observations: rather than using MARGIN you should write two linear relationships: one using the home team offensive ratings and the away team defensive ratings to approximate the home team efficienty during that “mini match-up”, the other replacing “home” and “away”. Given the noisiness of the usual calculation you will probably need to increase the number of reference players.
Comment by Lior — June 2, 2008
Ah, so maybe two separate regressions, one with ORtg as the dependent variable and one with DRtg, each with twice as many independent variables? That sounds like it might work. In the Home ORtg one, the five home offensive players would get a 1, the five away defensive players would get a -1, and the five home defensive players and the five away offensive players would all get a 0. I’ll try that out later. (Or maybe you’re saying there should be one regression but with the number of observations doubled?)
Comment by Eli — June 2, 2008
Eli, what I’ve done is what Lior suggested, a single regression, twiche the independent variables, twice the observations. And you need to think about what the neutral ORtg is, and how to insert home advantage into the model (a dummy variable should suffice for this).
Regarding using a script to create the observations, that’s what I do, and it’s way faster than Excel (what takes the most time in my computer is the actual regression, I blame it to memory issues). At
http://sonicscentral.com/apbrmetrics/viewtopic.php?p=10823
you can see a Perl script I used for 2006 data. With little changes, it can handle the current basketballvalue format. I may post the updated script later.
Comment by cherokee_acb — June 2, 2008
Thanks for the transparency of your work here, it is really refreshing. Sagarin & Winston were charging something astronomical just to get the details of how they calculated their ratings.
Comment by JTapp — June 2, 2008
Great work, Eli… I’m bookmarking this page for the day when I finally take the plunge and try it for myself. :)
Comment by Neil — June 2, 2008
Eli,
When adjusting the ratings so league average is zero you want to use the minute weighted average, so that the average minute played has an adjusted plus-minus of zero, not the average player.
Comment by Dave Lewin — June 2, 2008
Thanks Cherokee, I’ll try that.
That makes sense, Dave. I’ll update the post with that alteration. Though any centering is going to be somewhat arbitrary regardless.
Comment by Eli — June 2, 2008
Another tip worth mentioning is that if when creating the CSV you use all players rather than just the qualifying (non-reference) ones, then you can later run regressions with different minutes cutoffs just by changing the regression formula you use (which is a lot quicker than having to rebuild the CSV over again).
Comment by Eli — June 2, 2008
I’ve just found that I don’t have an up-to-date script valid for the 2008 season format. If I decide to write it, I’ll share. I’ll offer another tip instead: in R, use “as.formula” and “paste” (see an example in the APBRmetrics thread linked above) to dynamically construct the regression formula from the data.
Comment by cherokee_acb — June 3, 2008