Regression to the Mean: some comments on its application to stock returns 
Regression to the Mean comes in various flavours:
 Tall fathers will have tall sons, but the height of the sons will be closer to the
mean (or average) of the current adult male population. The same holds for short fathers and their
short sons who, nevertheless, tend to be more average than their fathers.
(That's the origin of the word
regression, first studied by Francis Galton over a hundred years ago. We'll look at
"regression lines" below.)
 The very low (and very high) scores on a midterm test will tend to be more like the
average score, when the final exam rolls around. (Meaning that low scores will increase toward
the average and high scores ... well, you get the idea.)
 And, the one I like best:
High-flying stocks will Regress to the Mean; their returns will become more
like the average of stocks in the same asset class ... over time.
It's the last example that we wish to discuss here ... eventually.
Once upon a time
(see this,
taken from the Fund Library discussion forum) there was a discussion of Regression to the Mean
... and it's taken me this long to write a tutorial (because I really didn't understand
why financial analysts apply the concept to the stock market).
Let's start at the beginning, with Francis Galton and his father-son height correlation:
We run out and measure the heights of a father-son pair and, if x is the
father's height and y the son's height (both in inches), we get one point on a chart:
(x,y). We repeat this umpteen times and get umpteen points which might look
like Fig. 1.
It would appear that, in general, if the father is taller ... then the son is taller. To see
this we draw in a best fit straight line (shown in blue).
This is the regression line. More on this below.
The equation of this line (the best estimate of a son's height, given the father's
height) is: y = 41 + 0.44 x
 Fig. 1
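To see how such a line might be computed, here's a rough sketch in Python. The heights are made up for illustration (they're not Galton's data, nor the forty-five pairs behind Fig. 1); numpy's polyfit does the least-squares fitting:

```python
import numpy as np

# Hypothetical father-son heights (inches) -- made up for illustration,
# not the forty-five pairs in Fig. 1.
rng = np.random.default_rng(0)
fathers = rng.normal(69.4, 2.5, 45)                   # x: fathers' heights
sons = 41 + 0.44 * fathers + rng.normal(0, 1.5, 45)   # y: sons' heights

# np.polyfit(..., 1) fits the least-squares (regression) line y = a + b x
b, a = np.polyfit(fathers, sons, 1)
print(f"regression line: y = {a:.1f} + {b:.2f} x")
```

With different data you'd get different numbers, of course; the point is only the recipe: umpteen (x,y) points in, one best-fit line out.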

This regression line suggests that if a father's height increases by 1 inch we might expect
his son's height to increase by 0.44 inches since,
increasing x by "1" would increase y by ...
>Increases by an inch? How can that happen? Does he grow while we ...?
Uh ... what I mean is, if we consider two father-son pairs and the second father is
one inch taller than the first, then we might expect the second son to be
0.44 inches taller than the first son ... on average ... sort of.
>On average? Sort of?
Well, it'd be our best guess, based upon our study of umpteen father-son pairs. Of course, if
we considered thousands of pairs instead of just forty-five, we'd feel more comfortable
with our estimate of a son's height, given the father's height.
>Forty-five? I count thirty-six points in Fig. 1.
Well, there are actually forty-five in my example, but many (x,y) pairs are the same and give
a single point (and I've made these points larger) and some are outliers ... outside the chart.
Anyway, let's consider a son's height as the "regression line height" plus some
deviation. That is, we consider a particular father-son pair with heights x_{n}
and y_{n} respectively. (The subscript "n" means it's just one pair out of
"N" pairs that we'll eventually consider.)
The regression line gives an estimate of the son's height, namely
41+0.44 x_{n},
but the son's actual height is unlikely to be
exactly this ...
so we consider the deviation: y_{n} - (41 + 0.44 x_{n}) = e_{n}
(where we use e to represent the error, or deviation from the regression line).
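As a quick numerical check, here's that deviation computed for one hypothetical pair (the heights 72 and 74 are invented; 41 and 0.44 come from our regression line):

```python
# One hypothetical pair: the father is 72 inches, his son is 74 inches.
x_n, y_n = 72.0, 74.0
estimate = 41 + 0.44 * x_n   # regression-line height: 41 + 0.44*72 = 72.68
e_n = y_n - estimate         # deviation from the line: 74 - 72.68 = 1.32
print(estimate, e_n)
```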
Suppose we now calculate the average (or mean) of all the fathers' heights
... all N = 45 of 'em (in our example):
x_{mean} = (1/N)(x_{1}+x_{2}+...+x_{N}) = 69.4
Now calculate the mean of all N sons' heights:
y_{mean} = (1/N)(y_{1}+y_{2}+...+y_{N}) = 71.4
Now compare y_{mean} with
41+0.44 x_{mean}
= 41 +
0.44 (69.4) = 71.5
>Mamma mia! They're nearly the same!
Yes. Although there are errors (which we called e_{n}), when we calculate
an average, these errors almost cancel out. We write:
(1)
y_{n} =
41+0.44 x_{n}
+ e_{n}
so, when we add all the y_{n}'s and x_{n}'s and divide by N
(in our example, it's N=45), we get:
(2)
y_{mean} =
41+0.44 x_{mean}
+ e_{mean}
and all those errors, both positive and negative, pretty well cancel out
... so e_{mean} is almost zero (compared to an average height).
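That near-cancellation isn't luck: for a least-squares line the errors average out to (essentially) zero. A small sketch, again with made-up heights:

```python
import numpy as np

# Made-up father-son heights, for illustration only.
rng = np.random.default_rng(1)
x = rng.normal(69.4, 2.5, 45)                  # fathers' heights
y = 41 + 0.44 * x + rng.normal(0, 1.5, 45)     # sons' heights

b, a = np.polyfit(x, y, 1)    # least-squares slope and intercept
e = y - (a + b * x)           # the individual errors e_n

# Individually the errors can be an inch or more, but the positive and
# negative ones cancel almost exactly in the average:
print(e.mean())
```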
Now we consider a potential new father; his wife is pregnant (with a son!).
Suppose his height is x. On the basis of our investigation of umpteen fatherson pairs
we estimate (predict?) that his son will have height:
(3)
y = 41 + 0.44x
Now subtract (2) from (3) and we get:
(4)
y - y_{mean} = 0.44 (x - x_{mean})
which says that ...
>Wait! What happened to the error guy?
Well, it's small so we'll ignore it ... at least for the time being. However, you can stick
it in if you like ... but it won't make a difference to our conclusion.
>Conclusion? What conclusion?
We conclude that if the new father's height differs significantly from the mean
(meaning x - x_{mean} is large), then
our best estimate for his son's height, y, is closer
to the mean since
y - y_{mean} is only 0.44
(x - x_{mean}).
>So the son's deviation from the mean is only 44% of the father's.
Well, I wouldn't go so far as saying that. Let's just say we'd expect to see the son closer
to the mean than the father. So if the father's height is well below the mean, his son
would be expected to be taller ... but still short. If the father is much taller than the mean,
his son would be tall, but not so tall as his father.
That's Regression to the Mean.
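Here's equation (4) in action, using the numbers from our example (the "new" father, 4 inches above the mean, is hypothetical):

```python
# Equation (4) with x_mean = 69.4, y_mean = 71.4 and slope 0.44.
x_mean, y_mean, slope = 69.4, 71.4, 0.44
x = x_mean + 4.0                    # a father well above the mean
y = y_mean + slope * (x - x_mean)   # predicted son: 71.4 + 0.44*4 = 73.16
print(y - y_mean)                   # the son is only 1.76 inches above HIS mean
```

The father sits 4 inches above the fathers' mean; his son's predicted height sits only 1.76 inches above the sons' mean. That's the regression.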
>Can we talk about the stock market?
Okay, but first let me say one other thing about our fatherson stuff:
If we calculate the Mean of the Squares of these "errors", for all umpteen father-son pairs
(namely, "N" pairs), we get:
(1/N)(e_{1}^{2} + e_{2}^{2} + ... + e_{N}^{2})
and (surprise!) the numbers 0.44 and
41 have actually been chosen to minimize this
Mean Square Error ... which tells us how people go about generating
regression lines.
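A sketch of that minimization, with made-up data: nudge either coefficient away from the least-squares values and the Mean Square Error can only go up.

```python
import numpy as np

# Made-up father-son heights again, for illustration.
rng = np.random.default_rng(2)
x = rng.normal(69.4, 2.5, 45)
y = 41 + 0.44 * x + rng.normal(0, 1.5, 45)

def mse(a, b):
    """Mean Square Error of the line y = a + b x over all N pairs."""
    e = y - (a + b * x)
    return np.mean(e ** 2)

b_hat, a_hat = np.polyfit(x, y, 1)   # the least-squares choice
best = mse(a_hat, b_hat)

# Any perturbation of intercept or slope increases the Mean Square Error:
assert best < mse(a_hat + 0.5, b_hat)
assert best < mse(a_hat, b_hat + 0.05)
```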
Anyway, here's a plot where the (x,y) pairs are the monthly returns for the
S&P 500 and Microsoft, throughout the 1990s.
There are lots of outliers: S&P monthly returns run from about -15% to +11%
while the MSFT returns run from about -25% to +33% (!)


If we stare at the coefficient
1.25 of the regression line, and equation (4) above which,
for this example, would read:
y - y_{mean} = 1.25 (x - x_{mean})
it's tempting to conclude that this is some kind of "proof" that MSFT Returns do NOT
regress to their Mean or to the Mean of the U.S. market (assuming the S&P 500 represents the U.S.
Market) since the MSFT Returns are ...
>Larger, by 25%?
Well, no. It just means that the MSFT Returns are much more volatile than the S&P Returns and
deviate from their Mean more than S&P Returns deviate from their Mean.
>More, by 25%.
Well, for this particular time period, but we're assuming that ...
>That the error term is negligible.
Quite so. To get
y - y_{mean} = 1.25
(x - x_{mean}),
we've tacitly assumed that MSFT Returns are related in some roughly linear, straight-line fashion
to general market Returns, constructed a Regression Line, and assumed that, although
the individual returns, (x,y) = (S&P, MSFT), may not lie precisely on the line,
the Means of the MSFT and S&P Returns do.
>But you've got the data. Is that error term negligible?

The
neglected error term e_{mean} (the average of all the deviations
from the regression line, which we expect to add up to something negligible)
... this error term is indeed small (0.2) and that gives us some confidence in neglecting it.
>The errors are huge, but their average is negligible ... so we
ignore them. Is that what you're saying? Isn't that the old cliché: "My head's in the freezer
and my feet are in the furnace so, on average, I'm quite comfortable"?
Well ...
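Before moving on, one more word on that slope of 1.25: it's what financial folks call the stock's "beta", the covariance of the two return series divided by the variance of the market's. A beta above 1 measures extra sensitivity to market moves, not a failure to revert. A sketch, with invented monthly returns (not the actual 1990s S&P/MSFT data):

```python
import numpy as np

# Invented monthly returns: a "market" series and a stock that's 1.25
# times as sensitive to it, plus extra noise of its own.
rng = np.random.default_rng(3)
market = rng.normal(0.01, 0.04, 120)                # hypothetical market returns
stock = 1.25 * market + rng.normal(0, 0.06, 120)    # a more volatile stock

# The regression slope equals cov(x,y)/var(x) -- the stock's "beta".
slope, intercept = np.polyfit(market, stock, 1)
beta = np.cov(market, stock, ddof=0)[0, 1] / np.var(market)

print(slope, beta)                  # the two agree
print(stock.std() > market.std())   # the stock IS more volatile
```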

For the father-son
stuff, we measure all their heights on Monday and generate the regression line. Then, on
Tuesday, we measure the height of the "new" father and estimate the height of his son.
But with the MSFT, S&P plot, it was over a ten-year period. What we should do is investigate
a "time series" to see if there is Reversion to the Mean as time progresses.
Here's a plot of their monthly returns, over time:


We'll investigate a possible Reversion to the Mean ... as time progresses.
Click for Part II.
