Cointegration & Unit Roots

Thanks to Ron M. for pointing out the topic.
When studying stock prices we usually start with the prices themselves
which constitute a series of numbers, each associated with a particular time.
For example: P_{0}, P_{1}, P_{2}, ... where P_{0} is the starting price and
P_{1} is the next price etc.
>Are we talking daily closing prices ... or what?
It doesn't matter, but we can (if we wish) think of them as daily closing prices. In any case, the series of numbers is called a time series.
The prices don't change smoothly, but may be considered to have some trend upon which is superimposed some random component,
as in Figure 1 (where the "trend" is shown in red).
We'd like to associate with the random component some probability distribution characterized by a Mean, Variance, etc.
However, it's clear that any Mean we associate with the prices will increase with time if the trend is increasing and ...
>For me the trend is always down!
Pay attention. In order to analyze the evolution of prices (in particular the random component) one normally considers
not the prices themselves
but the returns P_{1}/P_{0} − 1, P_{2}/P_{1} − 1, ... and their distribution. For example,
the distribution of returns which give rise to the time series in Figure 1 might have a distribution as in Figure 2.
>You're assuming a normal or lognormal distribution?
To investigate the random component one normally makes some assumption regarding the distribution of returns, and normal or lognormal
are the most popular assumptions. Indeed, the usual assumption is that the returns are random selections from a distribution where the parameters
of the distribution (Mean, Variance etc.) don't change with time.
Note that the series of returns is also a time series. A significant difference between the return time series and the stock price time series is that
the stock prices have a Mean which changes with time. Such a time series is called nonstationary.
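Here's a little Python sketch of that difference (the numbers — a Mean return of 0.001 and a Standard Deviation of 0.01 per day — are made up, just for illustration): the Mean of the prices drifts with time, while the Mean of the returns doesn't.

```python
import random, statistics

random.seed(1)

# Simulate a trending price series: each day's (hypothetical) return is
# drawn from a fixed distribution (Mean 0.001, SD 0.01), so prices drift up.
prices = [100.0]
for _ in range(2000):
    prices.append(prices[-1] * (1 + random.gauss(0.001, 0.01)))

# Daily returns: P_{n+1}/P_{n} - 1
returns = [prices[k + 1] / prices[k] - 1 for k in range(len(prices) - 1)]

# The Mean price shifts between the first and second half (nonstationary) ...
print(statistics.mean(prices[:1000]), statistics.mean(prices[1000:]))
# ... while the Mean return sits near 0.001 in both halves.
print(statistics.mean(returns[:1000]), statistics.mean(returns[1000:]))
```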

Figure 1
Figure 2

Figure 3

>And the returns time series ... is it stationary?
It would be if the distribution parameters (Mean, Variance etc.) are constant over time.
>And are they?
Are they constant over time? Look at Figure 3 where we note the Mean and Variance (that's (Standard Deviation)^{2}) of the monthly
returns for GE stock, over the 1960s, 1970s etc.
>I'd say nonstationary.
Yes, me too.

Another thing we note (about how one often analyzes stock prices and/or returns) is the connection between two stocks in our portfolio.
It's customary to consider the correlation between the two returns series measured by
R-squared / Pearson correlation. Portfolio optimization and
risk-reward analysis have usually been based upon correlation of returns, but a more recent consideration is the so-called cointegration.
>Huh?
When dealing with stock prices, it's been customary to first consider the difference between successive prices (and consider, instead, returns).
This "differencing" eliminates any trends which may be involved and ...
>What's this cointegration stuff?
Correlation is used to measure the ... uh, correlation between returns. When the returns for one stock go up or down, does the other tend to
go up or down as well? It's a short-term measure of interdependence.
On the other hand, cointegration attempts to measure common trends in prices over the long haul. For example, suppose that the time
series associated with two stock returns have a high correlation and the prices have high cointegration, as in Figure 4a.
See the similarity in stock price trends?
>Yeah, so?
Okay, suppose we change the returns on one stock by a very small amount, replacing a return r by r + 0.001 (for example).
The correlation between r and r + 0.001 is high. (Indeed, it's 100% !!)
>But the stock prices are no longer cointegrated, eh?
Yes, as in Figure 4b.
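A little Python sketch of that (with made-up returns): shifting every return by a constant leaves the correlation at exactly 100%, yet the extra 0.1% per day compounds, so the two price series drift apart.

```python
import random

random.seed(2)

# Hypothetical daily returns for one stock, plus the same returns
# shifted by a tiny constant (r replaced by r + 0.001).
r = [random.gauss(0.0, 0.01) for _ in range(500)]
r_shifted = [x + 0.001 for x in r]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Adding a constant leaves the correlation at exactly 100% ...
print(pearson(r, r_shifted))

# ... but the shifted returns compound into a higher price path,
# so the prices drift apart (no more common trend).
p1 = p2 = 100.0
for x in r:
    p1 *= 1 + x
    p2 *= 1 + x + 0.001
print(p1, p2)
```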
>Okay, I know how to calculate correlation, but how do I calculate ...?
How to calculate cointegration? That's what we'll talk about.

Figure 4a
Figure 4b

First, notice that if stock price (at time = n) increased according to P_{n} = P_{0} + r n + e_{n} where r is
some growth rate and e_{n} is a random variable with Mean = 0 and Variance = v (and the correlation between pairs, such as
e_{n} and e_{n−m}, is zero), then the mean of P_{n}
would be P_{0} + r n (hence it's changing with the time, n) but the Variance would be constant at v
(introduced by the random component e_{n}).
>Huh?
You can find the stat stuff here.
Anyway, if you could identify "r" then the NEW variable P_{n} − (P_{0} + r n) would have constant Mean = 0 as
well as constant Variance v. This NEW variable would be stationary, eh? In fact, from the relation P_{n} = P_{0} + r n + e_{n}
we can see that P_{n} − P_{n−1} = r + e_{n} − e_{n−1} so this differencing has produced a
stationary time series since e_{n} − e_{n−1} has constant statistical parameters like: Mean = 0, Variance = 2v.
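A Python sketch of that differencing (the values r = 0.5 and v = 4 are made up, just for illustration): the differenced series has the constant Mean and Variance claimed above.

```python
import random, statistics

random.seed(3)

# Trend-plus-noise series: P_n = P_0 + r*n + e_n with e_n ~ Normal(0, v).
P0, r, sd = 100.0, 0.5, 2.0          # v = sd^2 = 4
P = [P0 + r * k + random.gauss(0, sd) for k in range(5000)]

# First differences: P_n - P_{n-1} = r + e_n - e_{n-1}
d = [P[k] - P[k - 1] for k in range(1, len(P))]

# The differenced series is stationary: Mean = r = 0.5, Variance = 2v = 8.
print(statistics.mean(d))
print(statistics.variance(d))
```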
>A stock price that increases in a straight line? I've never seen any ...?

Figure 5

That's just an example. However, we might also be talking about spending habits where, as time progresses, we spend more and the national
population expenditures might be considered to grow linearly ... with a random component. Or, the time series defined by
P_{n} = P_{0} + r n + e_{n} might be a relation between the logarithms of successive stock prices,
like log[U_{n}] = log[U_{0}] + r n + e_{n} meaning that U_{n} = U_{0}e^{r n + e_{n}}
implying a growth from some initial price of U_{0} with annual (or monthly or daily) returns having a Mean equal to r.
Now let's consider a Random Walk
where P_{n} = P_{n1} + e_{n}.
Then we'd have P_{1} = P_{0}+ e_{1} and P_{2} = P_{1}+e_{2}
= P_{0}+e_{1}+e_{2} and, eventually,
P_{n} = P_{0}+e_{1}+e_{2}+ ... +e_{n}.
Then the Variance is:
[1] VAR[P_{n}] = VAR[e_{1}]+VAR[e_{2}]+ ... +VAR[e_{n}] = n v.
>That makes the Standard Deviation grow like SQRT(n), right?
Yes, the volatility would increase as time progressed ... growing without bound.
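A quick Python check of [1], simulating a bunch of made-up random walks with v = 1: the Variance at n = 100 comes out about twice the Variance at n = 50.

```python
import random, statistics

random.seed(4)

# Simulate many independent random walks P_n = P_{n-1} + e_n,
# with e_n ~ Normal(0, v) and v = 1, starting from P_0 = 0.
walks, steps = 5000, 100
at_50, at_100 = [], []
for _ in range(walks):
    p = 0.0
    for step in range(1, steps + 1):
        p += random.gauss(0, 1)
        if step == 50:
            at_50.append(p)
    at_100.append(p)

# [1] says VAR[P_n] = n*v, so across the walks we expect
# Variance close to 50 at n = 50 and close to 100 at n = 100.
print(statistics.variance(at_50))
print(statistics.variance(at_100))
```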
Now consider a time series defined by P_{n} = ρ P_{n1} + e_{n} where, as before,
e_{n} is a random variable with Mean = 0 and Variance = v.
Then P_{1} = ρ P_{0} + e_{1}, P_{2} = ρ P_{1} + e_{2}
= ρ^{2} P_{0} + ρ e_{1} + e_{2} and, eventually,
P_{n} = ρ^{n} P_{0} + ρ^{n−1} e_{1} + ρ^{n−2}e_{2} + ... + e_{n}.
Or, to put it differently:
P_{n} = e_{n} + ρe_{n1} + ρ^{2}e_{n2} + ρ^{3}e_{n3} + ...
Note that this expression defines the time series P_{n} as a moving average and ...
>Huh?
It's a sum with yesterday's e_{n1} having a weight of ρ and the day before having a weight of ρ^{2}
and the day before that having ...
>Yeah. So?
So the Mean of P_{n} is the sum of the means, which is zero, and the Variance is the sum of the variances (because the random
components e_{n}, e_{n1}, etc. have zero correlation) so Variance is:
[2] VAR[P_{n}] = VAR[e_{n}]+VAR[ρe_{n−1}]+VAR[ρ^{2}e_{n−2}]+...
= v + ρ^{2}v + ρ^{4}v + ρ^{6}v + ... = v / (1 − ρ^{2})
>Where's P_{0} and how did you get ...?
Okay, the fact that VAR[ρ x] = ρ^{2}VAR[x] is
here
and, uh ... we'll assume the time series is infinite and goes back forever.
>To simplify the math, eh?
Well ... yes, and it gives us a nice formula for the Variance since 1+ρ^{2}+ρ^{4}+ρ^{6}+ ... = 1/(1 − ρ^{2}).
>Okay, I'll stick in ρ = 10 and get ...
Uh, no, we can only write 1+ρ^{2}+ρ^{4}+ρ^{6}+ ... = 1/(1 − ρ^{2})
if −1 < ρ < 1.
However, if −1 < ρ < 1 then we can see from [2] that the Mean and Variance (and other stochastic parameters associated with P_{n},
like covariance) are constants so this time series is stationary. Of course, if ρ = 1 then we're in trouble.
>ρ = 1? Is that the unit root stuff?
Yes. In fact, the assumption that we started our time series in the infinite past gives rise to a stationary time series.
If we had started at some t = 0, we'd get a Variance v(1 + ρ^{2} + ... + ρ^{2(n−1)}) = v(1 − ρ^{2n})/(1 − ρ^{2}) which depends upon n ... so it wouldn't be stationary.
Also, if we had ρ > 1 (or ρ < −1) it wouldn't be stationary.
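A Python check of formula [2], with made-up values ρ = 0.8 and v = 1 (a long "burn-in" stands in for the infinite past): the sample Variance lands near v/(1 − ρ^{2}) = 1/0.36 ≈ 2.78.

```python
import random, statistics

random.seed(5)

# AR(1) series: P_n = rho*P_{n-1} + e_n with |rho| < 1, e_n ~ Normal(0, v).
rho, v = 0.8, 1.0
burn, n = 1000, 200000

p = 0.0
samples = []
for i in range(burn + n):
    p = rho * p + random.gauss(0, v ** 0.5)
    if i >= burn:      # discard the start-up so the series "forgets" P_0
        samples.append(p)

# [2] says the stationary Variance is v / (1 - rho^2) = 1/0.36, about 2.78.
print(statistics.variance(samples))
```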
Now consider two time series defined by:
[3A] P_{n} = P_{0} e^{r n} e_{n}
[3B] P_{n} = P_{n1} e^{r} e_{n}
If e_{n} = 1, then [3A] and [3B] are the same. In fact, [3B] gives P_{n} = P_{0} e^{r n} e_{1}e_{2}...e_{n}.
>That notation ... using "e" for the random component and for e^{r} ... you've done that before ... it's confusing!
Then concentrate! Anyway, one may be tempted to say that there is little to choose between [3A] and [3B], however, we can write:
[3A.1] log[P_{n}] = log[P_{0}] + r n + log[e_{n}]
[3B.1] log[P_{n}] = log[P_{0}] + r n + log[e_{1}]+log[e_{2}]+...+log[e_{n}]
For the [3B] time series, there's a sum of random components (not just the current component, as in [3A]) and that makes the Variance grow with time, n.
>Is this stuff useful?
We'll see, but the point is that having a stationary time series is very handy. If, on the other hand, the Mean and/or Variance change from
day to day, it's more awkward to do the statistical analysis. We'd like to see if some combination of the terms in the time series will result
in a stationary time series.
>Like considering the returns instead of the prices?
Yes. Taking differences in successive prices or considering the percentage change ... that can eliminate trends as we saw in Figure 1.
>But calculating returns isn't the same as taking differences between successive prices ... is it?
Well, no. The returns involve a ratio: P_{n+1} / P_{n}. However, the difference
between the logarithms gives log[P_{n+1}] − log[P_{n}] = log[P_{n+1}/P_{n}].
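A tiny Python check of that identity (the prices are made up, just for illustration):

```python
import math

# Two successive (hypothetical) prices.
p0, p1 = 100.0, 103.0

simple_return = p1 / p0 - 1                # the ratio form: 0.03
log_return = math.log(p1) - math.log(p0)   # difference of logarithms

# The difference of logs equals the log of the ratio ...
print(log_return, math.log(p1 / p0))
# ... and for small moves it's close to (but not equal to) the simple return.
print(simple_return, log_return)
```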
The effect of differencing is to remove the trend (called, would-you-believe, "detrending") and that may be bad as it removes the possibility of detecting common trends
between two or more stocks. Nevertheless, it allows one to deal with a stationary time series. Imagine trying to determine the
correlation between two time series if we dealt with prices or some other nonstationary series! Now, using cointegration, one hopes to avoid calculating correlation
coefficients yet still identify common trends.
>Huh?
Suppose we wanted a portfolio of stocks chosen so as to "follow" the DOW. If we could arrange it so the tracking error (the difference
between our portfolio and the DOW) was stationary, we'd be happy. It'd mean that our portfolio might deviate from the DOW but these deviations
would have a Mean = 0 so portfolio values would oscillate about the DOW and ...
>Isn't that Mean Reversion?
Yes, if cointegration exists then there'd be mean reversion to the DOW index.
>Okay, so what IS cointegration? I mean ...
Two (or more) nonstationary time series are said to be cointegrated if a linear combination of the terms results in a stationary time series.
For example, if U_{n} and V_{n} are nonstationary but U_{n} − C V_{n} is stationary (for some constant C),
then the two series are cointegrated (and there's an underlying, common trend). This would be the case if the "error", e_{n}
= U_{n} − C V_{n}, is stationary and therefore has time-independent statistical parameters: Mean, Variance and Autocovariance.
>Autocovariance?
Yes, the covariance between e_{n} and e_{n−m}, namely the Mean of (e_{n} − M)(e_{n−m} − M)
... where M is the Mean of the e_{n}. This Mean can depend upon m (the "lag") but NOT upon n (the time). That (along with the time independence of
the Mean and Variance) would make the time series e_{n} stationary. Note that, if m = 0, then the mean of
(e_{n} − M)(e_{n−m} − M) = (e_{n} − M)^{2} is just the Variance ... or (Standard Deviation)^{2}.
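Here's a Python sketch of cointegration: an artificial pair of series, constructed (with made-up numbers and C = 1.5) so that U_{n} − C V_{n} is stationary by design. Each series wanders, but the spread doesn't.

```python
import random, statistics

random.seed(8)

# A hypothetical pair: V_n is a random walk (nonstationary) and
# U_n = C*V_n + e_n, where e_n is stationary noise with Variance 1.
# The spread U_n - C*V_n = e_n is then stationary: U and V are cointegrated.
C, n = 1.5, 4000
V, U = [], []
level = 0.0
for _ in range(n):
    level += random.gauss(0, 1)               # random-walk step
    V.append(level)
    U.append(C * level + random.gauss(0, 1))  # cointegrated partner

spread = [U[i] - C * V[i] for i in range(n)]

# The walk's Variance is huge (the level wanders all over), while the
# spread's Variance stays near 1 in both halves of the sample.
print(statistics.variance(V))
print(statistics.variance(spread[:n // 2]))
print(statistics.variance(spread[n // 2:]))
```

In practice, of course, C isn't known in advance; it's typically estimated by regressing U on V, and the residual is then checked for stationarity with a unit-root test such as the Augmented Dickey-Fuller test (available, for example, as adfuller in the statsmodels Python package).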
