thanks to Ron M. for pointing out the topic
Cointegration & Unit Roots
When studying stock prices we usually start with the prices themselves
which constitute a series of numbers, each associated with a particular time.
For example: P0, P1, P2, ... where P0 is the starting price and
P1 is the next price etc.
>Are we talking daily closing prices ... or what?
It doesn't matter, but we can (if we wish) think of them as daily closing prices. In any case, the series of numbers is called a time series.
The prices don't change smoothly, but may be considered to have some trend upon which is superimposed some random component,
as in Figure 1 (where the "trend" is shown in red).
We'd like to associate with the random component some probability distribution characterized by a Mean, Variance, etc.
However, it's clear that any Mean we associate with the prices will increase with time if the trend is increasing and ...
>For me the trend is always down!
Pay attention. In order to analyze the evolution of prices (in particular the random component) one normally considers
not the prices themselves
but the returns P1/P0 -1, P2/P1-1, ... and their distribution. For example,
the distribution of returns which give rise to the time series in Figure 1 might have a distribution as in Figure 2.
>You're assuming a normal or lognormal distribution?
To investigate the random component one normally makes some assumption regarding the distribution of returns, and normal or lognormal
are the most popular assumptions. Indeed, the usual assumption is that the returns are random selections from a distribution where the parameters
of the distribution (Mean, Variance etc.) don't change with time.
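Here's a small sketch of turning a price series into a returns series and then estimating the Mean and Variance of those returns. The prices are made up, and the helper name simple_returns is just something chosen for this illustration.

```python
import statistics

def simple_returns(prices):
    """Return the list P1/P0 - 1, P2/P1 - 1, ... for a list of prices."""
    return [later / earlier - 1.0 for earlier, later in zip(prices, prices[1:])]

prices = [100.0, 102.0, 99.96, 104.958]   # made-up prices, not real data
returns = simple_returns(prices)

# Sample estimates of the distribution parameters of the returns
mean_return = statistics.mean(returns)
variance_return = statistics.pvariance(returns)

print("returns :", [round(r, 4) for r in returns])
print("Mean    :", round(mean_return, 6))
print("Variance:", round(variance_return, 6))
```

Note that a list of N prices gives only N - 1 returns.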
Note that the series of returns is also a time series. A significant difference between the return time series and the stock price time series is that
the stock prices have a Mean which changes with time. Such a time series is called non-stationary.
>And the returns time series ... is it stationary?
It would be if the distribution parameters (Mean, Variance etc.) are constant over time.
>And are they?
Are they constant over time? Look at Figure 3 where we note the Mean and Variance (that's Standard Deviation squared) of the monthly
returns for GE stock, over the 1960s, 1970s etc.
>I'd say non-stationary.
Yes, me too.
Another thing we note (about how one often analyzes stock prices and/or returns) is the connection between two stocks in our portfolio.
It's customary to consider the correlation between the two returns series measured by
R-squared / Pearson Correlation. Portfolio optimization and
risk-reward analysis have usually been based upon correlation of returns, but a more recent consideration is the so-called cointegration.
When dealing with stock prices, it's been customary to first consider the difference between successive prices (and consider, instead, returns).
This "differencing" eliminates any trends which may be involved and ...
>What's this cointegration stuff?
Correlation is used to measure the ... uh, correlation between returns. When the returns for one stock go up or down, does the other tend to
go up or down as well? It's a short term measure of interdependence.
On the other hand, cointegration attempts to measure common trends in prices over the long haul. For example, suppose that the time
series associated with two stock returns have a high correlation and the prices have high cointegration, as in Figure 4a.
See the similarity in stock price trends?
Okay, suppose we change the returns on one stock by a very small amount, replacing a return r by r - 0.001 (for example).
The correlation between r and r - 0.001 is high. (Indeed, it's 100% !!)
>But the stock prices are no longer cointegrated, eh?
Yes, as in Figure 4b.
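The r versus r - 0.001 example can be tried numerically. This is only a sketch: the returns are simulated from a normal distribution (an assumption), the seed and parameters are arbitrary, and pearson and prices_from_returns are helper names invented for the illustration.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def prices_from_returns(p0, returns):
    """Compound a starting price through a list of simple returns."""
    prices = [p0]
    for r in returns:
        prices.append(prices[-1] * (1.0 + r))
    return prices

rng = random.Random(0)                       # fixed seed so the run is repeatable
returns_a = [rng.gauss(0.0005, 0.01) for _ in range(2500)]   # simulated returns
returns_b = [r - 0.001 for r in returns_a]   # same returns, shifted down a little

corr = pearson(returns_a, returns_b)
prices_a = prices_from_returns(100.0, returns_a)
prices_b = prices_from_returns(100.0, returns_b)
ratio = prices_b[-1] / prices_a[-1]

print("correlation of returns:", round(corr, 6))
print("final price ratio B/A :", round(ratio, 4))
```

The correlation stays at 1 (up to rounding) because subtracting a constant doesn't change how the returns move together, yet the small shift compounds over the 2500 periods and the two price series drift far apart.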
>Okay, I know how to calculate correlation, but how do I calculate ...?
How to calculate cointegration? That's what we'll talk about.
First, notice that if stock price (at time = n) increased according to P_n = P_0 + r n + e_n where r is
some growth rate and e_n is a random variable with Mean = 0 and Variance = v (and the correlation between pairs, such as
e_n and e_{n-m}, is zero), then the mean of P_n
would be P_0 + r n (hence it's changing with the time, n) but the Variance would be constant at v
(introduced by the random component e_n).
You can find the stat stuff here.
Anyway, if you could identify "r" then the NEW variable P_n - (P_0 + r n) would have constant Mean = 0 as
well as constant Variance v. This NEW variable would be stationary, eh? In fact, from the relation P_n = P_0 + r n + e_n
we can see that P_n - P_{n-1} = r + e_n - e_{n-1} so this differencing has produced a
stationary time series since e_n - e_{n-1} has constant statistical parameters like: Mean = 0, Variance = 2v.
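A quick simulation can check those two claims: removing the trend leaves Variance v, and differencing leaves Mean r and Variance 2v. It's a sketch only. The values of P_0, r and v are made up, and the e_n are drawn from a normal distribution, which is an assumption (the argument above needs only Mean = 0, Variance = v and zero correlation).

```python
import random
import statistics

rng = random.Random(1)              # fixed seed so the run is repeatable
p0, r, v = 50.0, 0.1, 4.0           # made-up starting price, growth rate, Variance
n_steps = 20000

# P_n = P_0 + r n + e_n for n = 1, 2, ..., n_steps
prices = [p0 + r * n + rng.gauss(0.0, v ** 0.5) for n in range(1, n_steps + 1)]

# Subtract the (known) trend: what is left is just e_n
detrended = [p - (p0 + r * n) for n, p in enumerate(prices, start=1)]

# First differences: P_n - P_{n-1} = r + e_n - e_{n-1}
diffs = [b - a for a, b in zip(prices, prices[1:])]

detrended_var = statistics.pvariance(detrended)
diff_mean = statistics.mean(diffs)
diff_var = statistics.pvariance(diffs)

print("Variance after removing trend:", round(detrended_var, 3), "(v  =", v, ")")
print("Mean of differences          :", round(diff_mean, 4), "(r  =", r, ")")
print("Variance of differences      :", round(diff_var, 3), "(2v =", 2 * v, ")")
```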
>A stock price that increases in a straight line? I've never seen any ...?
That's just an example. However, we might also be talking about spending habits where, as time progresses, we spend more and the national
population expenditures might be considered to grow linearly ... with a random component. Or, the time series defined by
P_n = P_0 + r n + e_n might be a relation between the logarithms of successive stock prices,
like log[U_n] = log[U_0] + r n + e_n meaning that U_n = U_0 e^(r n + e_n)
implying a growth from some initial price of U_0 with annual (or monthly or daily) returns having a Mean equal to r.
Now let's consider a Random Walk
where P_n = P_{n-1} + e_n.
Then we'd have P_1 = P_0 + e_1 and P_2 = P_1 + e_2
= P_0 + e_1 + e_2 and, eventually,
P_n = P_0 + e_1 + e_2 + ... + e_n.
Then the Variance is:
 VAR[P_n] = VAR[e_1] + VAR[e_2] + ... + VAR[e_n] = n v.
>That makes the Standard Deviation grow like SQRT(n), right?
Yes, the volatility would increase as time progressed ... growing without bound as n gets large.
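To see the Variance of a Random Walk grow like n v, one can simulate lots of independent walks and measure the spread of their positions at a couple of times. A sketch, assuming normally distributed steps with v = 1 and an arbitrary number of walks:

```python
import random
import statistics

rng = random.Random(2)      # fixed seed so the run is repeatable
v = 1.0                     # Variance of each step e_n (an arbitrary choice)
n_paths, n_steps = 5000, 100
checkpoints = (25, 100)

# For each checkpoint n, collect P_n - P_0 across many independent walks
endpoints = {n: [] for n in checkpoints}
for _ in range(n_paths):
    position = 0.0          # this is P_n - P_0
    for n in range(1, n_steps + 1):
        position += rng.gauss(0.0, v ** 0.5)
        if n in endpoints:
            endpoints[n].append(position)

variances = {n: statistics.pvariance(values) for n, values in endpoints.items()}
for n in checkpoints:
    print(f"n = {n:3d}: sample Variance = {variances[n]:7.2f}, n v = {n * v:7.2f}")
```

Going from n = 25 to n = 100 should multiply the Variance by about 4, hence the Standard Deviation by about 2, which is the SQRT(n) growth.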
Now consider a time series defined by P_n = ρ P_{n-1} + e_n where, as before,
e_n is a random variable with Mean = 0 and Variance = v.
Then P_1 = ρ P_0 + e_1, P_2 = ρ P_1 + e_2
= ρ^2 P_0 + ρ e_1 + e_2 and, eventually,
P_n = ρ^n P_0 + ρ^(n-1) e_1 + ρ^(n-2) e_2 + ... + e_n.
Or, to put it differently:
P_n = e_n + ρ e_{n-1} + ρ^2 e_{n-2} + ρ^3 e_{n-3} + ...
Note that this expression defines the time series P_n as a moving average and ...
It's a sum with yesterday's e_{n-1} having a weight of ρ and the day before having a weight of ρ^2
and the day before that having ...
So the Mean of P_n is the sum of the means, which is zero, and the Variance is the sum of the variances (because the random
components e_n, e_{n-1}, etc. have zero correlation) so Variance is:
 VAR[P_n] = VAR[e_n] + VAR[ρ e_{n-1}] + VAR[ρ^2 e_{n-2}] + ...
= v + ρ^2 v + ρ^4 v + ρ^6 v + ... = v / (1 - ρ^2)
>Where's P0 and how did you get ...?
Okay, the fact that VAR[ρ x] = ρ^2 VAR[x] is a standard property of the Variance
and, uh ... we'll assume the time series is infinite and goes back forever.
>To simplify the math, eh?
Well ... yes, and it gives us a nice formula for the Variance since 1 + ρ^2 + ρ^4 + ρ^6 + ... = 1/(1 - ρ^2).
>Okay, I'll stick in ρ = 10 and get ...
Uh, no, we can only write 1 + ρ^2 + ρ^4 + ρ^6 + ... = 1/(1 - ρ^2)
if -1 < ρ < 1.
However, if -1 < ρ < 1 then we can see from the expressions above that the Mean and Variance (and other stochastic parameters associated with P_n,
like covariance) are constants so this time series is stationary. Of course, if ρ = 1 then we're in trouble.
>ρ = 1? Is that the unit root stuff?
Yes. In fact, the assumption that we started our time series in the infinite past gives rise to a stationary time series.
If we had started at some t = 0, we'd get a Variance sum which ended with the term ρ^(2n-2) v, so it depends upon n ... so it wouldn't be stationary.
Also, if we had ρ > 1 it wouldn't be stationary.
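Both Variance statements can be checked with a few lines. The sketch below assumes ρ = 0.8 and v = 1 (arbitrary choices), treats P_0 as a fixed number for the finite-start case, and draws the e_n from a normal distribution. The function name ar1_variance_finite_start is made up for the illustration.

```python
import random

def ar1_variance_finite_start(rho, v, n):
    """VAR[P_n] when P_0 is a fixed number: v (1 + rho^2 + ... + rho^(2n-2))."""
    return v * sum(rho ** (2 * k) for k in range(n))

rho, v = 0.8, 1.0                   # arbitrary choices with -1 < rho < 1
limit = v / (1.0 - rho ** 2)        # the infinite-past formula v / (1 - rho^2)

# Starting at a fixed P_0, the Variance depends upon n but approaches the limit
for n in (1, 5, 20, 100):
    print(f"n = {n:3d}: VAR[P_n] = {ar1_variance_finite_start(rho, v, n):.4f}")
print(f"limit v / (1 - rho^2) = {limit:.4f}")

# Simulate one long series and discard the early part so the start is forgotten
rng = random.Random(3)              # fixed seed so the run is repeatable
p = 0.0
samples = []
for step in range(210_000):
    p = rho * p + rng.gauss(0.0, v ** 0.5)
    if step >= 10_000:
        samples.append(p)

sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)
print(f"simulated Variance    = {sample_var:.4f}")
```

Sticking in ρ = 1 makes the finite-start sum equal to n v, which is the Random Walk result again.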
Now consider two time series defined by:
[3A] P_n = P_0 e^(r n) e_n
[3B] P_n = P_{n-1} e^r e_n
If e_n = 1, then [3A] and [3B] are the same. In fact, [3B] gives P_n = P_0 e^(r n) e_1 e_2 ... e_n.
>That notation ... using "e" for the random component and for er ... you've done that before ... it's confusing!
Anyway, one may be tempted to say that there is little to choose between [3A] and [3B], however, we can write:
[3A.1] log[P_n] = log[P_0] + r n + log[e_n]
[3B.1] log[P_n] = log[P_0] + r n + log[e_1] + log[e_2] + ... + log[e_n]
For the [3B] time series, there's a sum of random components (not just the current component, as in [3A]) and that makes the Variance grow with time, n.
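The difference between [3A.1] and [3B.1] shows up clearly if both are built from the same random components. In this sketch the log[e_n] are assumed to be normal with Mean 0 and Standard Deviation s, and r, s and the number of simulated paths are arbitrary.

```python
import random
import statistics

rng = random.Random(4)      # fixed seed so the run is repeatable
r, s = 0.001, 0.02          # assumed drift and Standard Deviation of log[e_n]
n_paths, n_steps = 2000, 50

log_growth_3a, log_growth_3b = [], []   # log[P_n] - log[P_0] at n = n_steps
for _ in range(n_paths):
    shocks = [rng.gauss(0.0, s) for _ in range(n_steps)]   # log[e_1], ..., log[e_n]
    log_growth_3a.append(r * n_steps + shocks[-1])    # [3A.1]: only the current shock
    log_growth_3b.append(r * n_steps + sum(shocks))   # [3B.1]: all the shocks so far

var_3a = statistics.pvariance(log_growth_3a)
var_3b = statistics.pvariance(log_growth_3b)
print("Variance for [3A]:", round(var_3a, 6), " s^2   =", round(s ** 2, 6))
print("Variance for [3B]:", round(var_3b, 6), " n s^2 =", round(n_steps * s ** 2, 6))
```

With n = 50 the [3B] Variance should come out roughly 50 times the [3A] Variance.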
>Is this stuff useful?
We'll see, but the point is that having a stationary time series is very handy. If, on the other hand, the Mean and/or Variance change from
day to day, it's more awkward to do the statistical analysis. We'd like to see if some combination of the terms in the time series will result
in a stationary time series.
>Like considering the returns instead of the prices?
Yes. Taking differences in successive prices or considering the percentage change ... that can eliminate trends as we saw in Figure 1.
>But calculating returns isn't the same as taking differences between successive prices ... is it?
Well, no. The returns involve a ratio: P_{n+1} / P_n. However, the difference
between the logarithms gives log[P_{n+1}] - log[P_n] = log[P_{n+1} / P_n].
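As a small check on that identity, here's a sketch with made-up prices comparing the ordinary returns with the differences of the logarithms:

```python
import math

prices = [100.0, 101.0, 99.5, 102.0]    # made-up prices, not real data

simple = [b / a - 1.0 for a, b in zip(prices, prices[1:])]
log_diff = [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

for s_ret, l_ret in zip(simple, log_diff):
    # the log difference equals log[1 + return], and is close to the return when it's small
    print(f"return = {s_ret:+.5f}   log difference = {l_ret:+.5f}")

# The log differences add up to the log of the overall price ratio
total_log_change = sum(log_diff)
```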
The effect of differencing is to remove the trend (called, would-you-believe, "de-trending") and that may be bad as it removes the possibility of detecting common trends
between two or more stocks. Nevertheless, it allows one to deal with a stationary time series. Imagine how one would determine the
correlation between two time series if we dealt with prices or other non-stationary series. Now, using cointegration, one hopes to eliminate the need to calculate correlation
coefficients yet identify common trends.
Suppose we wanted a portfolio of stocks chosen so as to "follow" the DOW. If we could arrange it so the tracking error (the difference
between our portfolio and the DOW) was stationary, we'd be happy. It'd mean that our portfolio might deviate from the DOW but these deviations
would have a Mean = 0 so portfolio values would oscillate about the DOW and ...
>Isn't that Mean Reversion?
Yes, if cointegration exists then there'd be mean reversion to the DOW index.
>Okay, so what IS cointegration? I mean ...
Two (or more) non-stationary time series are said to be cointegrated if a linear combination of the terms results in a stationary time series.
For example, if U_n and V_n are non-stationary but U_n - C V_n is stationary (for some constant C),
then the two series are cointegrated (and there's an underlying, common trend). This would be the case if the "error", e_n
= U_n - C V_n is stationary and therefore has time-independent statistical parameters: Mean, Variance and Autocovariance.
Autocovariance? Yes, that's the covariance between e_n and e_{n-m}, namely the Mean of (e_n - M)(e_{n-m} - M)
... where M is the Mean of the e_n.
This Mean can depend upon m (the "lag") but NOT upon n (the time). That (along with the time independence of
the Mean and Variance) would make the time series e_n stationary. Note that, if m = 0, then the mean of
(e_n - M)(e_{n-m} - M) = (e_n - M)^2 is just the Variance ... or (Standard Deviation)^2.
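To make the definition concrete, here's a sketch that builds a cointegrated pair by construction: V_n is a Random Walk and U_n is C V_n plus a stationary error. The values of C and ρ, the normal random components and the helper name autocovariance are all assumptions made for the illustration. It isn't a test for cointegration, just a demonstration of what the definition says.

```python
import random
import statistics

def autocovariance(series, lag):
    """Sample Mean of (e_n - M)(e_{n-lag} - M), where M is the sample Mean."""
    m = statistics.mean(series)
    products = [(series[i] - m) * (series[i - lag] - m) for i in range(lag, len(series))]
    return sum(products) / len(series)

rng = random.Random(5)                  # fixed seed so the run is repeatable
c, rho, n_steps = 2.0, 0.5, 5000        # all three are arbitrary choices

v_series, u_series = [], []
walk, noise = 0.0, 0.0
for _ in range(n_steps):
    walk += rng.gauss(0.0, 1.0)                 # V_n: a Random Walk (non-stationary)
    noise = rho * noise + rng.gauss(0.0, 1.0)   # stationary error since -1 < rho < 1
    v_series.append(walk)
    u_series.append(c * walk + noise)           # U_n = C V_n + error

# The linear combination U_n - C V_n recovers the stationary error
spread = [u - c * v for u, v in zip(u_series, v_series)]

print("Variance of U_n         :", round(statistics.pvariance(u_series), 2))
print("Variance of U_n - C V_n :", round(statistics.pvariance(spread), 3))
for lag in (0, 1, 2):
    print(f"Autocovariance at lag {lag} :", round(autocovariance(spread, lag), 3))
```

At lag 0 the Autocovariance is the Variance of the error, and for this particular error it should shrink by a factor of about ρ with each extra lag.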