the Kalman Filter

motivated by email from Mark F.
Every so often I get email asking about the Kalman filter and ...
>And you know nothing about it, right?
Uh ... not exactly, but I reckon it's about time to write about it since it seems to be popular these days in financial stuff.
Rudolf Emil Kalman, born in 1930 in Hungary, was trained as an Electrical Engineer.
He is most famous for the Kalman filter, a scheme for extracting a signal from a series of noisy
(or incomplete or corrupted or chaotic) measurements.
His original 1960 paper was, apparently, received with scepticism. It is now a popular prescription for extracting a signal from a noisy environment.
He is a member of the National Academy of Sciences (USA), the National Academy of
Engineering (USA), and the American Academy of Arts and Sciences (USA).
He is also a
foreign member of the Hungarian, French, and Russian Academies of Science and has numerous honorary doctorates.
>Are we talking electrical engineering?
Not at all.
 Suppose you wanted to measure the level of water in a large tank. The surface sloshes about and there's water flowing into the tank.
 Suppose you want to know your position at sea and you take measurements of star positions. Your readings have noise.
 Suppose you wanted to know the position of a moving object captured by two radars. The object is moving. There's noise in the radar images.
You use the radar info to estimate the possible range of positions at some time t. There's a probability distribution associated with this position.
Do you accept the average? The median? The position of maximum probability?
 Suppose ...
>I'd accept the Kalman position. Am I right?
Patience.
Note that there's a range of possible answers to your question: "What's the value of x?"
The probability that the true value of x is L will depend upon the measurement(s) you've taken. That makes it a "conditional probability".
If you've measured the level of water as z_{1} = 123 metres, then it's unlikely that the "true" level is x = 500 metres.
There's a distribution of possible x-values which, we'd expect, would vary about z_{1} = 123 metres.
Indeed, with a single measurement we'd expect a range of values like the blue curve in Figure 1.
That's probably the best you can do with a single measurement. It's your best estimate. It's centred on z_{1} = 123 metres.
(For our example, we've picked a Standard Deviation of 15 and a Normal distribution.)
Aah, but suppose you take a second measurement: z_{2} = 142.
Suppose, too, that you know that this second measurement is more accurate.
There's a distribution about z_{2} = 142 perhaps like the green curve in Figure 1.
It's more concentrated near 142, more compact, less spread out
... because 142 is presumably a better estimate.
(We've picked Standard Deviation = 10 and a Normal distribution.)
Here comes the big question:
What does the x-probability distribution look like if two independent measurements gave the
blue and green distributions in Figure 1?
That is, what's the conditional probability, given the probability distributions for z_{1} and z_{2}?
 Figure 1 
In fact, we'd get the red distribution in Figure 1.
It's less spread out than either of the two other distributions.
Indeed, if the two other distributions have Mean values of M_{1} = z_{1} and M_{2} = z_{2} and Variances of
V_{1} and V_{2}, then the red distribution will have:
[1] Mean = (V_{2} z_{1} + V_{1} z_{2}) / (V_{1}+V_{2})
= ( z_{1}/V_{1} + z_{2}/V_{2}) / (1/V_{1}+1/V_{2})
and 1/Variance = 1/V_{1} + 1/V_{2} 
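If you like to see formulas as code, here's formula [1] as a small Python sketch (the function name `combine` is our own, not anybody's official notation):

```python
# Combining two independent Gaussian estimates, per formula [1].
def combine(z1, V1, z2, V2):
    """Return (Mean, Variance) of the combined distribution, given
    means z1, z2 and variances V1, V2 of two independent estimates."""
    mean = (V2 * z1 + V1 * z2) / (V1 + V2)   # inverse-variance weighting
    variance = 1.0 / (1.0 / V1 + 1.0 / V2)   # 1/Variance = 1/V1 + 1/V2
    return mean, variance
```

Notice that the more accurate (smaller-variance) measurement gets the bigger weight.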
>Mamma mia! Is that really necessary? That looks bad and ...
The resultant Mean for the red distribution is just a linear combination of
z_{1} and z_{2}.
Look at it this way:
You have two objects, with masses M_{1} and M_{2}.
They're on a seesaw and you want to balance the seesaw by placing a fulcrum at the right spot
so as to balance the two weights.
The balance point must be placed so that the total mass, concentrated at its location, has the same moment about one end point.
 Figure 2 
That'd require:
M_{1} Z_{1} + M_{2} Z_{2} = (M_{1}+M_{2}) Z
hence:
Z = (M_{1} Z_{1} + M_{2} Z_{2}) / (M_{1}+M_{2})
See the similarity?
>No! Besides, you're calling the masses M_{1} and M_{2} and that's what you called the Means, earlier.
Uh ... sorry about that, but it's necessary to keep you awake.
Anyway, just replace the masses like so: M_{1} = 1/V_{1} and M_{2} = 1/V_{2} and the two formulas are the same.
The reciprocals of the Variance play the role of masses. That makes sense, right?
After all, the red distribution is supposed to combine the blue and green distributions
and you'd expect the resultant Mean would be closer to the Mean with the narrower distribution, since that's presumably the more accurate measurement.
Then, too, the "total mass" is the sum of the two individual masses, making 1/Variance = 1/V_{1} + 1/V_{2}.
Remember: Variance = (Standard Deviation)^{2}.
In our Figure 1 example, we've chosen M_{1} = 123 and V_{1} = 15^{2} = 225
and M_{2} = 142 and V_{2} = 10^{2} = 100.
That'd give (for the red distribution):
Mean = (100*123 + 225*142)/(225+100) = 31%(123) + 69%(142) = 136.2
and 1/Variance = 1/225 + 1/100 = 0.01444, so the Standard Deviation = 1/SQRT(0.01444) = 8.32.
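You can check that arithmetic in a few lines of Python (numbers as in our Figure 1 example):

```python
# Figure 1 numbers: z1 = 123 with SD 15, z2 = 142 with SD 10.
z1, z2 = 123.0, 142.0
V1, V2 = 15**2, 10**2                     # variances 225 and 100

mean = (V2 * z1 + V1 * z2) / (V1 + V2)    # formula [1]
inv_var = 1 / V1 + 1 / V2
sd = (1 / inv_var) ** 0.5

print(round(mean, 1), round(sd, 2))       # → 136.2 8.32
```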
In particular, notice that the red distribution is narrower, implying more confidence in the 136.2 estimate and ...
>Wait! You make a single measurement like 123 and you automatically get a distribution and you automatically know that your second measurement is more accurate and ...
Patience. We'll get to all that stuff soon enough.
In the meantime, let's assume that we DO have an estimate for the Standard Deviations of our two measurements
and that we DO adopt a Normal distribution with Means
z_{1} and z_{2} and Variance estimates of V_{1} and V_{2}.
Our analysis would then proceed as we've done above.
Of course, these assumptions should be reasonable, right? But look at the implications:
 If each measurement were equally accurate, the two variances would be equal and the conditional Mean would be the average of the two Means: (z_{1}+z_{2})/2.
 If one measurement were more accurate, the conditional Mean would be closer to that measurement.
 The conditional Variance is always less than or equal to the individual variances. That implies that every additional measurement improves (or at least never worsens) the accuracy of our result.
Let's rewrite the magic formulas [1] like this:
[a]
Mean = z_{1} + [V_{1}/(V_{1}+V_{2})] ( z_{2} - z_{1})
and
Variance = V_{1} - [V_{1}/(V_{1}+V_{2})] V_{1}
In other words, we can write:
[b]
Mean = z_{1} + K_{2} ( z_{2} - z_{1})
and
Variance = V_{1} - K_{2} V_{1}
where K_{2} = [V_{1}/(V_{1}+V_{2})].
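A quick sanity check, if you're suspicious: form [b] gives exactly the same Mean and Variance as formula [1]. (A sketch; the sample numbers are just our Figure 1 values.)

```python
# Check that the "gain" form [b] reproduces formula [1].
z1, V1 = 123.0, 225.0
z2, V2 = 142.0, 100.0

K2 = V1 / (V1 + V2)                        # the gain
mean_b = z1 + K2 * (z2 - z1)               # form [b]
var_b = V1 - K2 * V1

mean_1 = (V2 * z1 + V1 * z2) / (V1 + V2)   # formula [1]
var_1 = 1.0 / (1.0 / V1 + 1.0 / V2)

assert abs(mean_b - mean_1) < 1e-9
assert abs(var_b - var_1) < 1e-9
```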
We now imagine taking our first measurement at time t_{1} and our second measurement at time t_{2}.
Further, let's call X_{1}, X_{2}, etc. the increasingly accurate estimates of the Mean as we take more measurements.
We'll let U_{1}, U_{2}, etc. be the successive estimates of the Variance.
After the first measurement of z_{1}, our "best" estimates are X_{1} = z_{1} and U_{1} = V_{1}.
After the second measurement at time t_{2} our best estimate is obtained from [b], above:
[c]
X_{2} = X_{1} + K_{2} ( z_{2} - X_{1})
and
U_{2} = U_{1} - K_{2} U_{1}
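Here's [c] as a reusable update step, in Python (the function and variable names are ours, a sketch of the recursion rather than anybody's official API):

```python
def update(X_prev, U_prev, z_new, V_new):
    """Fold a new measurement (z_new, with variance V_new) into the
    running estimate (X_prev, with variance U_prev), per [c]."""
    K = U_prev / (U_prev + V_new)          # the gain
    X = X_prev + K * (z_new - X_prev)
    U = U_prev - K * U_prev
    return X, U

# First measurement: X1 = z1 = 123, U1 = V1 = 225.
X, U = 123.0, 225.0
# Second measurement z2 = 142 with V2 = 100, folded in via [c]:
X, U = update(X, U, 142.0, 100.0)   # gives the red distribution again
```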
Do you see where we're heading?
>No! It's mumbo-jumbo. You're just ...
We're marching ahead in time and, at each step, trying to predict the value (and associated distribution) of some variable.
>The value of what variable?
Maybe it's the alpha or beta of a stock or maybe the inherent Volatility or maybe ...
>Please proceed.
Okay, we'll assume that the "true" values of our variable are x(t_{1}), x(t_{2}), x(t_{3}) etc.
Further, we assume they evolve in time according to the following prescription:
[d] x(t_{n+1}) = F(t_{n}) x(t_{n}) + w(t_{n})
where w is random noise.
That is, each new value depends upon the previous value ... but there's some random variation introduced by the variable w.
Our observed (that is, "measured") values are z(t_{1}), z(t_{2}), z(t_{3}) etc.
and they evolve in time according to the following prescription:
[e] z(t_{n}) = H(t_{n}) x(t_{n}) + v(t_{n})
where v is random noise (or "error").
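To make [d] and [e] concrete, here's a tiny scalar simulation in Python. Everything numeric here (F, H, the noise sizes, the starting value) is invented for illustration:

```python
import random
random.seed(1)

F, H = 1.0, 1.0        # constant scalar versions of F(t_n) and H(t_n)
x = 100.0              # "true" starting value
xs, zs = [], []
for _ in range(5):
    x = F * x + random.gauss(0, 2)   # [d]: true value evolves, noise w
    z = H * x + random.gauss(0, 5)   # [e]: what we observe, noise v
    xs.append(x)
    zs.append(z)
# xs holds the hidden "true" values; zs holds the noisy observations.
```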
We'd like to get at the "true" x-values by looking carefully at our successive z-observations and ...
>With all that noise?
We have to make some assumptions about the noise.
In fact, we'll assume the noise is sometimes positive, sometimes negative but the average noise is zero.
In fact, we'll assume the noise is selected at random from a Normal distribution with Mean = 0 (and some as-yet-unspecified Variance).
In fact, we'll also assume the two random noise terms, w(t_{n}) and v(t_{n}), are independent.
In fact, we'll assume that after having made an observation like z(t_{n}) we estimate the "true" value of x(t_{n+1}) in terms of
a conditional probability, given that we've just added a new z-observation.
In fact ...
>Can you just continue?
Okay. Notice that we're trying to generate a "best estimate" of the true x-values by using the z-observations and, at each step, we (hopefully)
improve our estimate recursively.
>Recursively?
Yes. We don't go back and look at all the observed z-values, but just update at each time step.
That's like calculating an average by using: A_{n} = (1/n) (a_{1} + a_{2} +...+ a_{n}).
We use, instead: A_{n} = (1/n) [ (n-1)A_{n-1} + a_{n} ]. That way we can update A without looking at all previous a-values.
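That recursive-average trick, in Python (the numbers are made up):

```python
values = [3.0, 7.0, 5.0, 9.0]

A = 0.0
for n, a in enumerate(values, start=1):
    A = ((n - 1) * A + a) / n    # A_n = (1/n)[(n-1) A_{n-1} + a_n]

# A now equals the ordinary average, computed without revisiting old values.
print(A)   # → 6.0
```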
Then, too, if we make assumptions about the noise in our Kalman filter, w(t_{n}) and v(t_{n}), we'll have their covariances ... in advance.
Finally, we want our estimates to be such that their average over time approaches the "true" x-values.
In fact, we want a procedure such that our estimates give, on average, the smallest estimation error.
>You're dreaming ...
Actually, Kalman does just that: of all such algorithms it gives the "best" estimate in the sense that the mean-square error is minimized.
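Putting the pieces together, here's a minimal scalar Kalman filter in Python. The update step is our [c]; the predict step (pushing the estimate through model [d] and adding a process-noise variance Q) isn't derived in the text above, so take it as the standard recipe, stated here without proof. All the parameter values are invented:

```python
def kalman_scalar(zs, F=1.0, H=1.0, Q=1.0, R=25.0, x0=0.0, P0=1000.0):
    """Scalar Kalman filter: zs are the noisy observations, Q and R the
    process- and measurement-noise variances. Returns the estimates."""
    x, P = x0, P0
    estimates = []
    for z in zs:
        # predict: push the estimate through model [d]
        x = F * x
        P = F * F * P + Q
        # update: fold in the observation [e], exactly as in [c]
        K = P * H / (H * H * P + R)
        x = x + K * (z - H * x)
        P = (1.0 - K * H) * P
        estimates.append(x)
    return estimates

# Feed it a constant "true" level of 10 observed twenty times:
est = kalman_scalar([10.0] * 20)   # the estimates settle near 10
```

Note the filter never revisits old observations; each step uses only the previous estimate and the new z, which is the recursive property we wanted.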
