Regression to the Mean: some comments on its application to stock returns

Regression to the Mean comes in various flavours:

  • Tall fathers will have tall sons, but the height of the sons will be closer to the mean (or average) of the current adult male population. The same holds for short fathers and their short sons who, nevertheless, tend to be more average than their fathers. (That's the origin of the word regression; the phenomenon was first studied by Francis Galton over a hundred years ago. We'll look at "regression lines" below.)
  • The very-low (and very-high) scores on a mid-term test will tend to be more like the average score, when the final exam rolls around. (Meaning that low scores will increase toward the average and high scores ... well, you get the idea.)
  • And, the one I like best:
    High-flying stocks will Regress to the Mean; their returns will become more like the average of stocks in the same asset class ... over time.

It's the last example that we wish to discuss here ... eventually.

Once upon a time (see this, taken from the Fund Library discussion forum) there was a discussion of Regression to the Mean ... and it's taken me this long to write a tutorial (because I really didn't understand why financial analysts apply the concept to the stock market).

Let's start at the beginning, with Francis Galton and his father-son height correlation:


We run out and measure the heights of a father-son pair and, if x is the father's height and y the son's height (both in inches), we get one point on a chart: (x,y). We repeat this umpteen times and get umpteen points, which might look like Fig. 1.

It would appear that, in general, if the father is taller ... then the son is taller. To see this we draw in a best fit straight line (shown in blue). This is the regression line. More on this below.

The equation of this line (the best estimate of a son's height, given the father's height) is: y = 41 + 0.44 x


Fig. 1
This regression line suggests that if a father's height increases by 1 inch we might expect his son's height to increase by 0.44 inches since increasing x by "1" would increase y by ...

>Increases by an inch? How can that happen? Does he grow while we ...?
Uh ... what I mean is, if we consider two father-son pairs and the second father is one inch taller than the first, then we might expect the second son to be 0.44 inches taller than the first son ... on average ... sort of.

>On average? Sort of?
Well, it'd be our best guess, based upon our study of umpteen father-son pairs. Of course, if we considered thousands of pairs instead of just forty-five, we'd feel more comfortable with our estimate of a son's height, given the father's height.

>Forty-five? I count thirty-six points in Fig. 1.
Well, there are actually forty-five in my example, but many (x,y)-pairs are the same and give a single point (and I've made these points larger) and some are outliers ... outside the chart.

Anyway, let's consider a son's height as the "regression line height" plus some deviation. That is, we consider a particular father-son pair with heights xn and yn respectively. (The subscript "n" means it's just one pair out of "N" pairs that we'll eventually consider.)

The regression line gives an estimate of the son's height, namely   41+0.44 xn, but the son's actual height is unlikely to be exactly this ... so we consider the deviation: yn - (41+0.44 xn) = en (where we use e to represent the error - or deviation from the regression line).
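Just to make that deviation concrete, here's a minimal sketch in Python (the heights are made-up numbers, not one of the actual forty-five pairs):

```python
# One hypothetical father-son pair and its deviation from the regression line.
x_n = 72.0                        # father's height, in inches (made up)
y_n = 73.5                        # son's height, in inches (made up)

estimate = 41 + 0.44 * x_n        # height the line predicts for the son: 72.68
e_n = y_n - estimate              # the deviation (error) from the line: 0.82
print(estimate, e_n)
```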

Suppose we now calculate the average (or mean) of all the fathers' heights ... all N = 45 of 'em (in our example):

xmean = (1/N)(x1+x2+...+xN) = 69.4

Now calculate the mean of all N sons' heights:

ymean = (1/N)(y1+y2+...+yN) = 71.4

Now compare ymean with   41+0.44 xmean = 41 + 0.44 (69.4) = 71.5

>Mamma mia! They're nearly the same!
Yes. Although there are errors (which we called en), when we calculate an average, these errors almost cancel out. We write:

(1)           yn = 41+0.44 xn + en
so, when we add all the yn's and xn's and divide by N (in our example, it's N=45), we get:

(2)           ymean = 41+0.44 xmean + emean
and all those errors, both positive and negative, pretty well cancel out ... so emean is almost zero (compared to an average height).
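Here's a small check of equation (2) in Python, using made-up heights (the actual forty-five pairs aren't listed here). For a best-fit line the deviations really do average out to essentially zero, so the line passes (almost) through the point (xmean, ymean):

```python
import numpy as np

# Hypothetical father/son heights in inches (placeholder data, not the actual 45 pairs).
fathers = np.array([65.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 74.0])
sons    = np.array([69.5, 70.0, 70.5, 71.0, 71.5, 72.5, 72.0, 73.5])

slope, intercept = np.polyfit(fathers, sons, 1)   # best-fit line for THESE numbers
errors = sons - (intercept + slope * fathers)     # the e_n's

print(errors.mean())                                     # essentially zero
print(sons.mean(), intercept + slope * fathers.mean())   # the two sides of equation (2)
```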

Now we consider a potential new father; his wife is pregnant (with a son!). Suppose his height is x. On the basis of our investigation of umpteen father-son pairs we estimate (predict?) that his son will have height:

(3)           y = 41 + 0.44x

Now subtract (2) from (3) and we get:

(4)           y - ymean = 0.44 (x - xmean)

which says that ...

>Wait! What happened to the error guy?
Well, it's small so we'll ignore it ... at least for the time being. However, you can stick it in if you like ... but it won't make a difference to our conclusion.
>Conclusion? What conclusion?
We conclude that if the new father's height differs significantly from the mean (meaning x - xmean is large), then our best estimate for his son's height, y, is closer to the mean since y - ymean is only 0.44 (x - xmean).

>So the son's deviation from the mean is only 44% of the father's.
Well, I wouldn't go so far as saying that. Let's just say we'd expect to see the son closer to the mean than the father. So if the father's height is well below the mean, his son would be expected to be taller ... but still short. If the father is much taller than the mean, his son would be tall, but not so tall as his father.

That's Regression to the Mean.
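To put numbers on equation (4): using the means from our example (xmean = 69.4, ymean = 71.4) and the slope 0.44, here's what the prediction looks like for a couple of made-up fathers, one well above and one well below the mean:

```python
x_mean, y_mean, slope = 69.4, 71.4, 0.44    # values from the father-son example above

tall_father = 75.4                                # 6 inches above the mean (made up)
print(y_mean + slope * (tall_father - x_mean))    # 74.04: tall, but only 2.64 above the mean

short_father = 63.4                               # 6 inches below the mean (made up)
print(y_mean + slope * (short_father - x_mean))   # 68.76: short, but only 2.64 below the mean
```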

>Can we talk about the stock market?
Okay, but first let me say one other thing about our father-son stuff:

If we calculate the Mean of the Squares of these "errors", for all umpteen father-son pairs (namely, "N" pairs), we get:
(1/N)(e1² + e2² + ... + eN²)
and (surprise!) the numbers 0.44 and 41 have actually been chosen to minimize this Mean Square Error ... which tells us how people go about generating regression lines.
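Here's a minimal sketch of that recipe in Python (again with placeholder heights). The slope and intercept that minimize the Mean Square Error have the standard closed form slope = cov(x,y)/var(x) and intercept = ymean - slope·xmean, and a library fit gives the same answer:

```python
import numpy as np

# Hypothetical father/son heights in inches (placeholder data).
fathers = np.array([65.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 74.0])
sons    = np.array([69.5, 70.0, 70.5, 71.0, 71.5, 72.5, 72.0, 73.5])

# Closed-form least-squares coefficients: these minimize (1/N) * sum(e_n**2).
slope = np.cov(fathers, sons)[0, 1] / np.var(fathers, ddof=1)
intercept = sons.mean() - slope * fathers.mean()

# Same thing via a library routine, as a cross-check.
slope2, intercept2 = np.polyfit(fathers, sons, 1)

print(slope, intercept)     # about 0.45 and 40 for these made-up numbers --
                            # the same ballpark as the 0.44 and 41 above
print(slope2, intercept2)   # identical (up to rounding) to the closed-form answer
```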

Anyway, here's a plot where the (x,y) pairs are the monthly returns for the S&P 500 and Microsoft, throughout the 1990s.
There are lots of outliers: S&P monthly returns run from about -15% to +11% while the MSFT returns run from about -25% to +33% (!)

If we stare at the regression line's coefficient of 1.25, and at equation (4) above which, for this example, would read:

y - ymean = 1.25 (x - xmean)
it's tempting to conclude that this is some kind of "proof" that MSFT Returns do NOT regress to their Mean or to the Mean of the U.S. market (assuming the S&P 500 represents the U.S. Market) since the MSFT Returns are ...
>Larger, by 25%?
Well, no. It just means that the MSFT Returns are much more volatile than the S&P Returns and deviate from their Mean more than S&P Returns deviate from their Mean.
>More, by 25%.
Well, for this particular time period, but we're assuming that ...
>That the error term is negligible.
Quite so. To get y - ymean = 1.25 (x - xmean), we've tacitly assumed that MSFT Returns are related in some roughly linear, straight-line fashion to general market Returns, constructed a Regression Line, and assumed that, although the individual returns (x,y) = (S&P, MSFT) may not lie precisely on the line, the point built from the Mean S&P and Mean MSFT Returns does.
>But you've got the data. Is that error term negligible?

The neglected error term emean (the average of all the deviations from the regression line, which we expect to come to something negligible) is indeed small (0.2), and that gives us some confidence in neglecting it.

>The errors are huge, but their average is negligible ... so we ignore them. Is that what you're saying? Isn't that the old cliché: "My head's in the freezer and my feet are in the furnace so, on average, I'm quite comfortable"?
Well ...
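Just to pin down where a number like 1.25 comes from, here's a minimal sketch of the same regression-line recipe applied to two return series. The monthly returns below are made-up placeholders, not the actual 1990s S&P 500 and MSFT data:

```python
import numpy as np

# Hypothetical monthly returns, in percent (placeholder data, NOT the real 1990s numbers).
sp500 = np.array([ 2.1, -1.5,  3.0, -0.7,  1.2,  4.0, -2.3,  0.8])
msft  = np.array([ 3.5, -2.8,  4.1, -0.2,  0.9,  6.0, -3.9,  1.6])

# Regression line of MSFT returns on S&P returns -- the same least-squares
# recipe used for the father-son heights.
slope = np.cov(sp500, msft)[0, 1] / np.var(sp500, ddof=1)
intercept = msft.mean() - slope * sp500.mean()

print(slope, intercept)   # a slope above 1 says MSFT deviates from its mean
                          # MORE than the S&P deviates from its mean
```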

For the father-son stuff, we measure all their heights on Monday and generate the regression line. Then, on Tuesday, we measure the height of the "new" father and estimate the height of his son. But with the MSFT vs. S&P plot, the returns were spread over a ten-year period. What we should really do is investigate a "time series" to see if there is Reversion to the Mean as time progresses.
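What might such a time-series check look like? One simple (and very rough) thing to examine is whether a month with a return above the mean tends to be followed by a month below the mean. Here's a sketch, again on made-up returns:

```python
import numpy as np

# Hypothetical monthly returns, in percent (placeholder data).
returns = np.array([ 3.5, -2.8,  4.1, -0.2,  0.9,  6.0, -3.9,  1.6,  2.2, -1.1])

deviations = returns - returns.mean()

# Correlation between this month's deviation and next month's deviation:
# a clearly negative number would hint at a pull back toward the mean.
lag1_corr = np.corrcoef(deviations[:-1], deviations[1:])[0, 1]
print(lag1_corr)
```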

Here's a plot of their monthly returns, over time:

We'll investigate a possible Reversion to the Mean ... as time progresses.

Click for Part II.