Batter's Box Interactive Magazine
Clay Davenport discusses Regression Towards the Mean, as it applies to a team's true talent level. My heart was sent aflutter. There is no more important topic in analyzing baseball performance numbers than sample size and regression towards the mean.

It's a great stab by Clay, but he goes slightly wrong. Here's an "open letter" to Clay that I hope will foster additional discussions among readers.


There are two issues with regard to regression towards the mean:

1 - Regressing the sample performance to reflect the likely true rate
2 - Determining the error range around that true rate

The factor common to both of these is the sample size. The more games you have, the less you regress, and the narrower your error range.

I put out a study on clutch hitting that made use of this issue:

On a thread at Primate Studies last year, we went into details as to exactly how to capture the above two issues. I would have to say that using an error range of 1 SD = .100 around a true rate after 140 games is flat-out wrong. As a guess, I'd say it should be under .050, and more likely something like .030. When I get a chance, I'll work out the exact numbers.

Thanks, Tom
Regression Towards the Mean | 30 comments
The following comments are owned by whomever posted them. This site is not responsible for what they say.
_tangotiger - Thursday, September 23 2004 @ 12:29 PM EDT (#31216) #
While I praised Andy Dolphin in the above article, I want to re-praise him for those who are not going to hit that link. I'm only his disciple in this matter.
Mike Green - Thursday, September 23 2004 @ 01:20 PM EDT (#31217) #
Funny that you should link to the clutch-hitting article today. Yesterday, we were tossing around the topic in the roundup thread.

Thanks, Tom.
_tangotiger - Thursday, September 23 2004 @ 01:25 PM EDT (#31218) #
Let's start with a subset of Clay's table, with a column added:

G   Pythag R2   Pythag 1-r

10 0.155 0.61
20 0.227 0.52
30 0.278 0.47
40 0.333 0.42
50 0.369 0.39
60 0.402 0.37
70 0.405 0.36
80 0.403 0.37
90 0.410 0.36
100 0.408 0.36
110 0.397 0.37
120 0.366 0.40
130 0.320 0.43
140 0.237 0.51

"1-r" is derived from r-squared. If r-squared = .155, then r = .39, and 1-r = .61.

What 1-r gives you is the amount to regress towards the mean. As you can see, up until the 70 game mark, as the number of games goes up, the amount to regress goes down (i.e., less luck involved).

Something interesting happens for the next 40 games, as teams get into a holding pattern, in terms of regression. This is probably indicative of teams changing personnel and focus at that point in the season. That is, the team from game 70 onwards might be somewhat different from the team at game 110 onwards.

After that, things get loopy. The amount of regression actually increases as the number of games increases. At this point, there must be a bunch of teams that simply change their personnel, especially for September. Remember, at game 140, we are trying to predict the next 22 games (i.e., September games).

What would be interesting to see is the "remaining record" up until Aug 31. If my thought is correct, then we should see the typical pattern: as the number of games increases, the regression amount decreases.

Anyway, that said, we have now established the amount to regress. Now, we just need a simple function of regression to games played. And, here's an easy one: x / ( x + y), where x = 5 and y = sqrt(G).

I will add a column to the above table that shows the PredictedRegression, using this function.

G   Pythag R2   Pythag 1-r   PredictedRegression

10 0.155 0.61 0.61
20 0.227 0.52 0.53
30 0.278 0.47 0.48
40 0.333 0.42 0.44
50 0.369 0.39 0.41
60 0.402 0.37 0.39
70 0.405 0.36 0.37
80 0.403 0.37 0.36
90 0.410 0.36 0.35
100 0.408 0.36 0.33
110 0.397 0.37 0.32
120 0.366 0.40 0.31
130 0.320 0.43 0.30
140 0.237 0.51 0.30

As you can see, an excellent fit (up until game 100 or so).
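For anyone who wants to reproduce the two derived columns, here is a minimal Python sketch. The function names are made up for illustration; the R2 values are copied from the table above, and nothing here comes from Clay's own code.

```python
import math

# Pythag R2 by games played, copied from the table above.
PYTHAG_R2 = {
    10: 0.155, 20: 0.227, 30: 0.278, 40: 0.333, 50: 0.369,
    60: 0.402, 70: 0.405, 80: 0.403, 90: 0.410, 100: 0.408,
    110: 0.397, 120: 0.366, 130: 0.320, 140: 0.237,
}

def amount_to_regress(r_squared):
    """Return 1 - r, where r is the square root of r-squared."""
    return 1.0 - math.sqrt(r_squared)

def predicted_regression(games, x=5.0):
    """Return x / (x + sqrt(G)), the simple fit described above."""
    return x / (x + math.sqrt(games))

for games, r2 in PYTHAG_R2.items():
    print(f"{games:4d}  {r2:.3f}  {amount_to_regress(r2):.2f}  "
          f"{predicted_regression(games):.2f}")
```

The last two printed columns can be compared row by row to see where the fit starts to drift.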

More to come...
_tangotiger - Thursday, September 23 2004 @ 01:29 PM EDT (#31219) #
One other sidebar, before I continue: it would have been good if a regression on BOTH actual and pythag had also been run. After all, there's no reason to only use one or the other.
_tangotiger - Thursday, September 23 2004 @ 01:37 PM EDT (#31220) #
I just read a good comment at BTF, which gives support to what I said in my long post. When you look at the amount to regress, comparing the actual to pythag records, we see that we need to regress less with the actual record at game 140 onwards.

Why would that be? Because if you are a really good team, your roster is pretty well set. If you are a bad team, you are much more likely to use September callups. What it probably shows is not (a) that the actual record is a better predictor than pythag because pythag is less reliable, but (b) that the actual record contains an extra piece of information that the pythag record doesn't contain. And that extra information is: should I keep using the same team going forward?

At this point, we are really comparing apples and oranges. Being a .480 actual team or .510 pythag team is not that big a deal. But, being those records at game 140 is a big deal, because it determines your fate for the next month.
_tangotiger - Thursday, September 23 2004 @ 01:56 PM EDT (#31221) #
Ok, now to part 2 of my intro: how large is the error range around the true talent level? Remember, we've already established the likely true talent level, but now we are trying to figure out the error range of that estimate.

If you look at all teams from 2000-2003, you will find that the standard deviation of the win% is .080. That is, two-thirds of teams have a win% between .420 and .580.

However, remember that even after 162 games, these actual records are only samples of their true underlying (and unknowable) talents. After an initial regression of 30%, and a little back and forth, the spread of the talent level of teams comes out around .060. That is, two-thirds of the teams have a true talent level of .440 to .560.

(Note: this assumes a sort of random distribution of players. This is somewhat hard to accept, but it'll probably still keep us close).

Therefore, if we know that the league, as a whole, is distributed such that 1 SD = .060, it would be impossible that our estimate of error of the true talent of a team in that distribution would be .100, as Clay mentions.

At this point, I'll have to try to remember what Andy taught me... it's been a while since I had to handle a similar situation.
Craig B - Thursday, September 23 2004 @ 02:40 PM EDT (#31222) #
At this point, there must be a bunch of teams that simply change their personnel, especially for September.

I think this is also the effect of the large numbers of changes which now occur at the August 31 trade deadline. In fact, I wouldn't be surprised if more of the change was due to the trade deadline and injured players being shut down for the year, than to September callups and the like.

Thanks, Tango. This is excellent work.
Craig B - Thursday, September 23 2004 @ 02:41 PM EDT (#31223) #
By the way, the guy who didn't like us talking about the all-born-in-September team is really going to blow his stack when he sees us talk about regressing "true talent" records towards the mean in September.
_tangotiger - Thursday, September 23 2004 @ 02:51 PM EDT (#31224) #
It is a rather shocking situation.

After the first 25 games of the season, you can predict the rest of the season by regressing the record by 50%. If amount to regress = .50 = 1-r, then r=.5, and r-squared = 25%. So, only 25% of the variation from games 26 to 162 can be explained by the performance of the team from games 1 to 25.

Incredibly, after 140 games, you can predict the rest of the season by regressing the exact same amount. That is, you know as much as to what the record will be from 140 onwards as you know about game 25 onwards.


I think it's important that Clay's numbers be broken down by: record to July 31 (as Craig is noting) and record to Aug 31. It's only with these kinds of breakdowns that you can determine how to estimate the rest of the season.

As well, you'd like to have "out of the money" games from making playoffs to go along with the pythag record, when doing the regression.


Thanks Craig!
Craig B - Thursday, September 23 2004 @ 02:56 PM EDT (#31225) #
Sorry, yes, July 31 trade deadline. Don't know where my proofreading skills have gone to.
_tangotiger - Thursday, September 23 2004 @ 04:58 PM EDT (#31226) #
Hmmm... something is bothering me. Let's look at the 10 game set, and remember a couple of equations:

variance(observed) = variance(true) + variance(luck)

regression towards the mean = variance (luck) / variance (observed)

variance(i) = standard deviation(i) ^ 2

r = 1 - regression towards the mean

If all teams were .500, figuring out the variance of luck would be easy.
SD(10 games) = sqrt(.5*.5/10) = .158

If we assume that the true spread of talent among teams is a standard deviation of .070, then we get:

variance (observed) = .070^ 2 + .158 ^ 2 = .173 ^ 2

regression towards the mean = .158 ^ 2 / .173 ^ 2 = .836

r = 1 - .836 = .164

However, what does Clay show? r-SQUARED = .155

I sent an email to Clay earlier, asking him to review this thread. I hope he can double-check his numbers, and confirm that what he is presenting might actually be r and not r-squared.

It seems hard to accept that after 10 games, the r would be .40.
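The arithmetic above is easy to check with a few lines of Python. This is only a sketch: the function names are illustrative, and the .070 talent spread and the all-teams-are-.500 luck variance are the same assumptions made in the text.

```python
import math

def luck_sd(games, p=0.5):
    """Binomial standard deviation of win% for a team with true rate p."""
    return math.sqrt(p * (1 - p) / games)

def regression_toward_mean(games, talent_sd=0.070):
    """variance(luck) / variance(observed), with observed = true + luck."""
    luck_var = luck_sd(games) ** 2
    observed_var = talent_sd ** 2 + luck_var
    return luck_var / observed_var

games = 10
regress = regression_toward_mean(games)
r = 1 - regress
print(f"SD(luck) = {luck_sd(games):.3f}")
print(f"regression towards the mean = {regress:.3f}")
print(f"r = {r:.3f}")
```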
_tangotiger - Thursday, September 23 2004 @ 05:13 PM EDT (#31227) #
What's interesting is if you do a year-to-year regression analysis, you end up with an r of .65, or regression towards the mean of .35.

This number would be somewhat consistent with Clay's findings, and my PredictedRegression function:
5 / (5 + sqrt(162) ) = .28

So, maybe I'm wrong, and Clay is presenting r-squared. I dunno, I'm going home now. My head is spinning.
_Andrew K - Thursday, September 23 2004 @ 05:36 PM EDT (#31228) #
This is the result of 10 minutes thinking. I don't have time to do more thinking than that tonight. So take with a pinch of salt.

I have a strong gut belief that the "best" solution is none of the above methods. The two main problems were 1) regression to mean and 2) estimating accuracy of monte carlo parameters. Actually performing a regression analysis for the former, and trying to guess standard errors for the latter, feel arbitrary to me.

Both problems can be solved, it appears to me, if one takes a Bayesian approach.

That is, set up priors for each team, and then compute posterior distributions for their future winning percentage. By nature of Bayes' Theorem (thinking off the top of my head here) the fact that the prior is distributed close to .500 will pull observations back to mean (but as more data comes in the prior has less leverage), and simultaneously provide quasi-confidence intervals from which to run prediction intervals using monte carlo methods.

The problem, as always with Bayesian methods, is to find a sensible prior. A simple method would be to use the historical distribution of teams' winning percentage. But this doesn't take account of a team's strength, as observed before the start of a season. Perhaps some clever analysis can produce a more finely honed prior, even if it simply uses historical records based on the team's previous year's performance. Nonetheless, it is often the case that a halfway sensible prior can produce very useful results.

I am not, however, a Bayesian. Perhaps someone who is can comment?

Incredibly, after 140 games, you can predict the rest of the season by regressing the exact same amount. That is, you know as much as to what the record will be from 140 onwards as you know about game 25 onwards.

While this may be true (it does seem very odd on its face, though) you must surely have more certainty in your results after more games.
_Andy Dolphin - Thursday, September 23 2004 @ 06:35 PM EDT (#31229) #
A few random notes...

The random variance equals ~0.24/N for the actual record (which is based on binomial statistics), not for the Pythagorean record (which is not). So Tango's attempt above to calculate the expected regression at 10 games is valid for the actual record (and indeed he finds A=0.173, while Clay's empirical data give A=0.176).

If the standard deviation in actual records is 0.080 after 162 games, the random variance is 0.038^2, so the Bayesian prior has a width of 0.070.

Assuming that team qualities remain fixed throughout the season, Tango is right that the amount to regress equals the ratio of random to total variance. There shouldn't be a sqrt() though; it's just 49/(N+49). The reason one regresses less than this is, as noted, because good teams get better and bad teams get worse with trades and use of minor leaguers in the second half of the season. So you're probably better off using the empirical regression factors than a theoretical one.

The uncertainty in the regressed record is given by 1/sqrt(1/V1+1/V2), where V1 and V2 are the random (0.24/N) and population (0.07^2) variances, respectively. This gives an uncertainty of sqrt(0.24/(N+49)). Thus, after zero games, the uncertainty equals the width of the prior, and after a large number of games it approaches sqrt(0.24/N), the random uncertainty.

Looking at the Pythagorean data, the best early season fit to the regression seems to be something like 34/(N+34). Since the "true talent" distribution has to be the same as for the actual record, this implies the random variance in the Pythagorean record to be 0.17/N (compared with 0.24/N for random variance in the actual record). Thus the uncertainty in the regressed Pythagorean record is sqrt(0.17/(N+34)).
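The formulas in this comment can be sketched in Python as follows. The function names are hypothetical; the constants (0.24, 0.17, 0.07, 49, 34) are the ones quoted above, and the fixed-team-quality assumption still applies.

```python
import math

TALENT_VAR = 0.07 ** 2  # population (prior) variance of true talent

def combined_sd(v1, v2):
    """Uncertainty from combining two variances: 1 / sqrt(1/V1 + 1/V2)."""
    return 1.0 / math.sqrt(1.0 / v1 + 1.0 / v2)

def regress_actual(n):
    """Amount to regress the actual record after n games: 49 / (N + 49)."""
    return 49.0 / (n + 49.0)

def uncertainty_actual(n):
    """Uncertainty of the regressed actual record: sqrt(0.24 / (N + 49))."""
    return math.sqrt(0.24 / (n + 49.0))

def uncertainty_pythag(n):
    """Uncertainty of the regressed Pythagorean record: sqrt(0.17 / (N + 34))."""
    return math.sqrt(0.17 / (n + 34.0))

for n in (10, 70, 140):
    # The general form needs n > 0 because the random variance is 0.24 / N.
    general = combined_sd(0.24 / n, TALENT_VAR)
    print(f"{n:4d}  regress={regress_actual(n):.2f}  "
          f"actual={uncertainty_actual(n):.3f}  general={general:.3f}  "
          f"pythag={uncertainty_pythag(n):.3f}")
```

The closed form and the general 1/sqrt(1/V1+1/V2) form agree to within rounding, because 0.24 / 0.07^2 is almost exactly 49.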
_tangotiger - Thursday, September 23 2004 @ 07:45 PM EDT (#31230) #
Andy, thanks for stopping by.

I agree that I should not have done the square root for my equation, but it was the only thing that worked. I should have realized that the changing teams throughout the season would account for the differences I was noticing.

Your entire post was excellent, and I hope that all sabermetricians bookmark this page, and reference it every time they need to do regressions.
_studes - Thursday, September 23 2004 @ 07:58 PM EDT (#31231) #
While this may be true (it does seem very odd on its face, though) you must surely have more certainty in your results after more games.

What I find I have to keep remembering here is that Clay is predicting the rest of the games of the year, not the overall "true winning percent" or whatever. At 140 games, he's using the data to predict the next 22 games, or whatever. That's why the R2 peaks at 100 games -- because there is likely to be so much variation as the number of remaining games decreases.

I don't know what all this means (I never did figure out Bayes in my pathetic home schooling way) but you actually have less "certainty" when you get to 140 games, because you're only predicting the next 22 games.
_studes - Thursday, September 23 2004 @ 08:37 PM EDT (#31232) #
To say it another way, the R2 at 22 games should be the same as the R2 at 140 games, and it looks like they are by Clay's table. The difference is that, when you work from the 140 games, you have a more correct slope. But the standard errors are still off by the same amount.

Now, I'm not enough of a mathematician to know if that means regression to the mean should be the same at both levels -- I think I'm discussing something else altogether. But I'm not sure.
_MGL - Thursday, September 23 2004 @ 09:35 PM EDT (#31233) #
Studes, that is another level of uncertainty, the random binomial variance around a 22 game sample w/l record. Clay is not concerned with that. The "uncertainty" he is talking about is the uncertainty surrounding the estimated true w% (or the true w% of each team going forward). That has nothing to do with how many games are left. Obviously the final result of the sim will have everything to do with how many games are left, but as for the model itself, it doesn't matter.

If we plug 140 or 150 games (N) into Andy's .17 / (N + 34) formula (if I am using the right formula for the variance of each team's w% going forward estimate) and take the square root, we do indeed get around .03 as Tango suggested might be the case.

I am going to link this thread from the BTF thread. Once again, nice work by Andy and Tango and others here. I miss the old Primer Studies!
_tangotiger - Friday, September 24 2004 @ 10:10 AM EDT (#31234) #
There's something that just looks wrong with Clay's numbers. Look at Game 40 actual, and Game 130 actual. The r-squared is around .30 for both.

Yet, the regression equations are:
.47win% + .265 for Game 40 and
.85win% + .074 for Game 130

It's hard to accept that if the current win% tells you not much, and needs to be heavily regressed, that you can have two such equations. I wouldn't be surprised that what we are seeing with the Ax+B equation is the full-season win%, and not the rest-of-season win%.

If the r-squared is .30, then the r is .55, then the regression towards the mean should be .45 (i.e., 1-r). So, the equation, in both cases, should be:

amount to regress = .45(win% - .50)
regressed win% = win% - amount to regress
= win% - .45(win% - .50)
= .55win% + .225

You'll note that this is simply:
= r * win% + (1-r)/2

For example, say you have these data points:

1 7
2 5
3 10
4 2
5 5
6 2
7 4
8 13
9 8
10 19
11 9
12 20
13 19
14 10
15 16
16 10
17 15
18 11
19 12
20 13

The r-squared is .32, and the slope (the A in Clay's equation) is .52. This is very much what we expected, since r = .57.

I can't imagine finding an equation where the slope is .9, yet the r-squared is at .30.
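The r-squared and slope quoted for the 20 data points can be reproduced with an ordinary least-squares fit. This is a plain-Python sketch; linear_fit is an illustrative helper, not something from Clay's workbook.

```python
import math

xs = list(range(1, 21))
ys = [7, 5, 10, 2, 5, 2, 4, 13, 8, 19, 9, 20, 19, 10, 16, 10, 15, 11, 12, 13]

def linear_fit(xs, ys):
    """Return (slope, r, r_squared) for a least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    return slope, r, r * r

slope, r, r_squared = linear_fit(xs, ys)
print(f"slope = {slope:.2f}, r = {r:.2f}, r-squared = {r_squared:.2f}")
```

Note that the slope equals r multiplied by SD(y) / SD(x), so the slope and r only match when the two columns have the same spread. Here the y values are slightly less spread out than the x values, which is why the slope comes out a little below r.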
_tangotiger - Friday, September 24 2004 @ 11:49 AM EDT (#31235) #
Using Andy's equations, here are the uncertainty levels, around the likely true talent estimates, assuming that the true talent spread in teams is 1 sd = .070.

G Actual Pythag
10 0.064 0.062
20 0.059 0.056
30 0.055 0.052
40 0.052 0.048
50 0.049 0.045
60 0.047 0.043
70 0.045 0.040
80 0.043 0.039
90 0.042 0.037
100 0.040 0.036
110 0.039 0.034
120 0.038 0.033
130 0.037 0.032
140 0.036 0.031

If you are confused, these are the numbers that should be used, instead of the ".100" that Clay references. It'll make a big difference during the season, but not so much with only a few games to go.
_clay - Friday, September 24 2004 @ 11:57 AM EDT (#31236) #

Thanks for all the comments.

My own interpretation of the results follows Studes. The regression errors, the uncertainty I am supposed to be talking about, are coming from both uncertainty in the input and uncertainty in the output. The spread in the input goes down as you play more games; the spread in the output goes up as you have fewer games left to play. When games played ~ games left, you get the best predictions, with dropoffs in either direction. The fact that the slope approaches 1 as games played go up suggests that record is becoming an increasingly better predictor of future results, but the low number of games to which it is applied spoils the results.

I did run regressions for both types of records; at N=75 games played, it came out as .511*Pyth + .214*Act + .137, with an R2 of .405 and a standard error of .0756. Compare to Pyth alone: .721P + .139, .399, .0759, or actual alone .658A+.171, .375, .0774. As you could probably guess from the individual components and knowing that they are highly correlated with one another, the combined equation typically weights the Pyth at about twice the actual, although the ratio declines with games played. The gain in r-squared, compared to pythagorean alone, is less than .01 for each N I have listed in this notebook (25, 50, 75, 100, 125).

As I said, I was supposed to be worried about both types of errors, but wound up producing a system that modeled both the input and output errors as belonging to the input error; since I still get the output error (variation) from the model, I'm basically double-counting it. A straight .100 error for the input is wrong, and the source of the problem. Since I wrote the article, I ran the playoff odds while tracking the variation in the rest-of-season winning percentages. Using a straight .100 value, I have too much variation, something like .130 at the times I should have .100. Using an estimate of sd=.5/sqrt(N), where N is games already played, to set the normal distribution, does a much better job of replicating the standard errors I found empirically. For right now, at ~152 games, that would be .041. That is going in ASAP.

I've also been directed toward using a beta function rather than a gaussian function, which I will look into. I don't remember (ever knew?) enough about the beta function to say more just yet.

_tangotiger - Friday, September 24 2004 @ 12:20 PM EDT (#31237) #
Clay, thanks for stopping by as well! Not often you get MGL, Andy, and Clay in the same room. Is Pete Palmer around?

Your shorthand for the uncertainty around the true likelihood won't work as n approaches 0. For example, after 25 games, your uncertainty level will be .5/sqrt(25) = .100. However, we know that the top-end (at n=0) must be .070, because that's your prior. This is why Andy's equation should be used.
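A small sketch makes the difference between the two uncertainty estimates concrete. The function names are invented for illustration; the formulas are Clay's .5/sqrt(N) shorthand and Andy's sqrt(0.24/(N+49)) as quoted in this thread.

```python
import math

PRIOR_SD = 0.070  # spread of true talent assumed in this thread

def shorthand_sd(n):
    """Clay's shorthand: .5 / sqrt(N). Undefined at N = 0."""
    return 0.5 / math.sqrt(n)

def regressed_sd(n):
    """Andy's formula: sqrt(0.24 / (N + 49)). Equals the prior width at N = 0."""
    return math.sqrt(0.24 / (n + 49.0))

for n in (10, 25, 75, 152):
    flag = "  <-- wider than the prior" if shorthand_sd(n) > PRIOR_SD else ""
    print(f"{n:4d}  shorthand={shorthand_sd(n):.3f}  "
          f"regressed={regressed_sd(n):.3f}{flag}")
```

Early in the season the shorthand exceeds the .070 prior width, which is the problem described above; late in the season the two are much closer.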

You make an interesting point on the out-of-sample (outputs) results. I will let Andy and the gang talk more to this, as I'm not very qualified. However, my guess is that because of your huge sample size (1900 teams), that the uncertainty in the out-of-sample is almost irrelevant.
_Moffatt - Friday, September 24 2004 @ 12:50 PM EDT (#31238) #
I've also been directed toward using a beta function rather than a gaussian function

I come to the Batter's Box to escape work talk. I've spent the entire morning working with beta functions in MAPLE. So much for escapism. :)

From a programming perspective beta functions are really nice since they're either equivalent to or a good approximation of more standard distributions. Want a uniform dist? Simply set your beta parameters to (1,1). By using them you don't have to re-write large segments of your code whenever you want to experiment with a new distribution; you only have to change a parameter value.

Also, I agree with Andrew K. This kind of model is just begging to be done using a Bayesian approach. I imagine even a prior of simply .500 would give you good results.
_tangotiger - Friday, September 24 2004 @ 01:34 PM EDT (#31239) #
At n=0, the prior would be .500, with the estimate of uncertainty being 1 SD = .070. At n=10, if the team is 7-3, the prior would be the regressed team's record (about .530), with the estimate of uncertainty being a shade under .064.

Mike, Andrew: this is what you are saying right? If so, then Andy's model should fit the bill.
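The 7-3 example can be written as a standard normal-normal update, which is one way to read the Bayesian approach being discussed. This is a sketch under stated assumptions: a normal prior centered at .500 with SD .070, and a random variance of 0.24/N for the observed record (Andy's figure). The function name posterior is hypothetical.

```python
import math

def posterior(prior_mean, prior_sd, observed, games, luck_var_per_game=0.24):
    """Precision-weighted combination of a normal prior and an observed win%.

    The observed record is treated as normal with variance
    luck_var_per_game / games, which is an approximation to the binomial.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = games / luck_var_per_game
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * observed)
    return post_mean, math.sqrt(post_var)

mean, sd = posterior(prior_mean=0.500, prior_sd=0.070, observed=0.700, games=10)
print(f"estimated true talent = {mean:.3f}, uncertainty = {sd:.3f}")
```

Under these assumptions the posterior mean is the regressed record and the posterior SD matches sqrt(0.24/(N+49)), so the regression formula and the Bayesian update are two descriptions of the same calculation.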
_Moffatt - Friday, September 24 2004 @ 02:20 PM EDT (#31240) #
RE: Bayesian analysis.

Here's a good page on Bayesian analysis.

It's pretty straightforward stuff. The difficult question is how informative you want to make your prior distribution, which is the issue Andrew K was talking about.
_tangotiger - Friday, September 24 2004 @ 02:42 PM EDT (#31241) #
Great, will check it out.

The prior should really be based on the actual talent level of your players, with the expected number of PAs and IPs (which itself would also require a prior).

If you resign yourself to only looking at year-to-date, team-level data, you obviously limit yourself to that data as your universe, and so, the prior becomes .5 to start the season, and a regressed value as the season goes on.

You'd have different priors as you get past July 31 and Aug 31, since we have historical knowledge of those time periods as well.
_Andy Dolphin - Friday, September 24 2004 @ 05:31 PM EDT (#31242) #
Given the narrow distribution, it really doesn't matter what functional form you choose for the prior.

Tango, the prior is always a distribution centered at 0.50 and with a standard deviation of about 0.07, no matter how many games have been played. It would be interesting to see the preseason projections used as priors, which would allow narrower prior distributions and thus would give more accurate posteriors.

Clay, in your 1/sqrt(N) uncertainty formulation, you're basically ignoring the whole point of regressing records in the first place -- to provide a more accurate estimate of team strength. What I worked out earlier is sqrt(0.24/(N+49)) for the estimate using record, and sqrt(0.17/(N+34)) for the estimate using the Pythagorean.
_tangotiger - Friday, September 24 2004 @ 09:14 PM EDT (#31243) #
Andy, I must be missing something then.

I understand that to start the season, we would use the same prior (.500) and the same uncertainty level (.070).

If a team starts 7-3, I would then regress that, and get a true talent level of .530, with an uncertainty level of .063. Doesn't that then become my prior for that particular team?

So, my .530 would be similar to using the pre-season projection for that team as a prior, as you are suggesting. Using the pre-season, I'd guess the uncertainty level to be somewhat lower than .030 (what you would get for 162 game season).
_AMBA - Friday, September 24 2004 @ 11:18 PM EDT (#31244) #

I posted some simulation results on BTF, but the good discussion seems to be going on here, so I will repost:

I essentially took up mgl's suggestion to run a simulation (as in post 54) to produce a distribution of probable team ability based on observed record. To do so, I first looked at mlb team records since 1987. The distribution of team records was centered around 81 wins with a SD of 11.6 games, or a variance of 133.89 games^2. Using the binomial formula for variance: var = p*(1-p)*n where p=probability of winning, and n = # of games, we can estimate the observational variability as 40.5. This leaves the true variability of 93.3944 games^2, or a standard deviation of 9.66 games. I then used a binomial distribution with a standard deviation of 9.66/162 = 0.0596 as the a priori distribution of team talent. I'll skip the rest of the implementation details here, and instead say that, for a team with an observed level of play of 0.560 after 162 games, the posterior distribution of their TRUE ability is centered around 0.543 with a standard deviation of 0.033. So, basically, the 0.100 SD used by Clay is too large by a factor of 3.
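Since the implementation details are skipped, here is one way such a calculation could be done, offered only as a sketch and not as the method actually used. It assumes a normal prior on true talent (mean .500, SD .0596), a binomial likelihood, and 91 wins in 162 games as the nearest whole-number record to .560; the posterior is approximated on a grid. The name grid_posterior is made up.

```python
import math

def grid_posterior(wins, games, prior_mean=0.5, prior_sd=0.0596, steps=2001):
    """Posterior mean and SD of true win% on an evenly spaced grid."""
    # Skip the endpoints 0 and 1, where the log-likelihood is undefined.
    grid = [i / (steps - 1) for i in range(1, steps - 1)]
    log_weights = []
    for p in grid:
        log_prior = -0.5 * ((p - prior_mean) / prior_sd) ** 2
        log_like = wins * math.log(p) + (games - wins) * math.log(1 - p)
        log_weights.append(log_prior + log_like)
    peak = max(log_weights)  # subtract the peak to avoid underflow
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    mean = sum(p * w for p, w in zip(grid, weights)) / total
    var = sum((p - mean) ** 2 * w for p, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

mean, sd = grid_posterior(wins=91, games=162)
print(f"posterior mean = {mean:.3f}, posterior SD = {sd:.3f}")
```

Changing wins and games lets you see how the center and width of the posterior move for other observed records.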

After reading this discussion (sorry for not having done so before) I see these are the same ranges of numbers being discussed here as well.

One interesting point that came out of my simulation approach was that the regression amount I was getting did not appear to be constant for each team winning level. In other words, if the observed win% of a team was 0.560, the posterior distribution mean was shifted to 0.543. If instead the observed talent was 0.650, the posterior distribution mean was shifted all the way down to 0.555. This second distribution was also wider than the first (0.04 SD), and was right-skewed significantly.

I am guessing that all of this work can be done analytically, since I was using only binomial and normal distributions, but I don't have the math chops to attempt it. Can anyone assist?

_tangotiger - Sunday, September 26 2004 @ 02:57 PM EDT (#31245) #
I was writing with Walt Davis, and he tried to straighten me out as to how it would be possible to have a low correlation, but the slope to be close to 1.

I'll try to illustrate his point with one of my own examples. Say you have 3 guys, and they each roll a die. If the first guy rolls a 1 to 4, that's a win. If the 2nd guy rolls a 1 to 3, that's a win. The third guy rolls a 1 or 2 for a win. So, it's like a weighted coin. And, we know which coins are weighted, and by how much.

They each roll the die 24 times. On average, the first guy will win 16 and lose 8 times. But, it will be a distribution around that. He does it for 1900 sets of 24 rolls. So, he'll be all over the map, but centered around 16. Do the same with the other 2 guys, and do it for a few hundred more, where the known "true rates" are predetermined.

The correlation will not be perfect. We know that it'll be y=x. The slope must be 1, but the correlation will be less.

So, this is what may be happening with Clay's data. The slope continues to approach 1, but with only 22 games in the out-of-sample data, the correlation will be rather weak.
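The dice example can be simulated directly. The sketch below uses only the three players described above (true rates 4/6, 3/6, 2/6), 24 rolls per set and 1900 sets each, then regresses the observed win rate on the known true rate. The seed and the function names are arbitrary choices for illustration.

```python
import math
import random

def simulate(true_rates, rolls=24, sets_per_player=1900, seed=0):
    """Return (true_rate, observed_rate) pairs for every simulated set."""
    rng = random.Random(seed)
    pairs = []
    for p in true_rates:
        for _ in range(sets_per_player):
            wins = sum(rng.random() < p for _ in range(rolls))
            pairs.append((p, wins / rolls))
    return pairs

def slope_and_r(pairs):
    """Least-squares slope of observed on true rate, and the correlation."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

pairs = simulate([4 / 6, 3 / 6, 2 / 6])
slope, r = slope_and_r(pairs)
print(f"slope = {slope:.2f}, r = {r:.2f}, r-squared = {r * r:.2f}")
```

The slope should land close to 1 while the correlation stays well short of 1, because each 24-roll set is a noisy sample of a known rate. Reducing rolls lowers the correlation further without pulling the slope away from 1.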