In the past 3 months, I’ve been trying to understand the Kalman Filter. I have tried to implement it, watched YouTube tutorials, and read some papers about it and its operation (update, predict, etc.). However, I am still unable to understand it fully or in depth.

Can someone explain it in a very simple way, and how it works with multiple measurements?

**Answer**

Let’s start from what a Kalman filter is: It’s a method of predicting the future state of a system based on the previous ones.

To understand what it does, take a look at the following data – if you were given the data in blue, it may be reasonable to predict that the green dot should follow, by simply extrapolating the linear trend from the few previous samples. However, how confident would you be predicting the dark red point on the right using that method? How confident would you be about predicting the green point, if you were given the red series instead of the blue one?

*(figure: a scatter plot showing a low-noise blue series, a noisier red series, a nearby green point, and a distant dark red point)*

From this simple example, we can learn three important principles:

- It’s not good enough to give a prediction – you also want to know the confidence level.
- Predicting far ahead into the future is less reliable than nearer predictions.
- The reliability of your data (the noise), influences the reliability of your predictions.

$$\color{red} {\longleftarrow \backsim *\ *\ *\sim\longrightarrow}$$

Now, let’s try and use the above to model our prediction.

The first thing we need is a **state**. The state is a description of all the parameters we will need to describe the current system and perform the prediction. For the example above, we’ll use two numbers: The current vertical position ($y$), and our best estimate of the current slope (let’s call it $m$). Thus, the **state is in general a vector**, commonly denoted $\bf{x}$, and you can of course include more parameters in it if you wish to model more complex systems.

The next thing we need is a **model**: The model describes how we think the system behaves. In an ordinary Kalman filter, the model is always a linear function of the state. In our simple case, our model is:

$$y(t) = y(t-1) + m(t-1)$$

$$m(t) = m(t-1)$$

Expressed as a matrix, this is:

$${\bf{x}}_{t} =
\left(\begin{array}{c} y(t)\\ m(t) \end{array} \right) =
\left(\begin{array}{cc} 1 & 1\\ 0 & 1 \end{array} \right)\cdot
\left(\begin{array}{c} y(t-1)\\ m(t-1) \end{array} \right)
\equiv F {\bf{x}}_{t-1}$$

Of course, our model isn’t perfect (else we wouldn’t need a Kalman Filter!), so we add an additional term to the state – the **process noise**, $\bf{v_t}$ which is assumed to be normally distributed. Although we don’t know the actual value of the noise, we assume we can estimate how “large” the noise is, as we shall presently see. All this gives us the **state equation**, which is simply:

$${\bf{x}}_{t} = F {\bf{x}}_{t-1} + {\bf{v}}_{t-1}$$
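The state equation can be sketched in a few lines of NumPy; the starting values here are made up purely for illustration:

```python
import numpy as np

# State: x = [y, m] (vertical position and slope).
# F encodes y(t) = y(t-1) + m(t-1),  m(t) = m(t-1).
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])

x = np.array([2.0, 0.5])   # hypothetical starting state: y = 2, slope = 0.5

# One noiseless prediction step: x_t = F @ x_{t-1}
x_pred = F @ x
print(x_pred)              # [2.5 0.5]
```

The process noise $\bf{v}$ would simply be an additional (unknown) term added to `x_pred`.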

The third and final part we are missing is the **measurement**. When we get new data, our parameters should change slightly to refine our current model, and the next predictions. What is important to understand is that one *does not* have to measure exactly the same parameters as those in the state. For instance, a Kalman filter describing the motion of a car may want to predict the car’s acceleration, velocity, and position, but only measure, say, the wheel angle and rotational velocity. In our example, we only “measure” the vertical position of the new points, not the slope. That is:

$$\text{measurement} =
\left(\begin{array}{cc} 1 & 0 \end{array} \right) \cdot
\left(\begin{array}{c} y(t) \\ m(t) \end{array} \right)$$

In the more general case, we may have more than one measurement, so the measurement is a vector, denoted by $\bf{z}$. Also, the measurements themselves are noisy, so the general measurement equation takes the form:

$${\bf{z}}_t = H {\bf{x}}_t +{\bf{w}}_t$$

where $\bf{w}$ is the **measurement noise**, and $H$ is in general a matrix with as many columns as there are state variables and as many rows as there are measurement variables.
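For the line example, the measurement matrix $H$ looks like this (the state values are again made up for illustration):

```python
import numpy as np

# H picks out which state variables we actually measure.
# Here we measure only the position y, not the slope m, so H is 1x2.
H = np.array([[1.0, 0.0]])

x = np.array([2.5, 0.5])   # hypothetical state [y, m]
z = H @ x                  # noiseless expected measurement: just y
print(z)                   # [2.5]
```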

$$\color{orange} {\longleftarrow \backsim *\ *\ *\sim\longrightarrow}$$

Now that we have understood what goes into modeling the system, we can now start with the **prediction** stage, the heart of the Kalman Filter.

Let’s start by assuming our model is perfect, with no noise. How will we predict what our state will be at time $t+1$? Simple! It’s just our state equation:

$${\bf{\hat x}}_{t+1} = F {\bf{x}}_{t}$$

What do we *expect* to measure? Simply what the measurement equation gives:

$${\bf{\hat z}}_{t+1} = H {\bf{\hat x}}_{t+1}$$

Now, what do we **actually measure**? Probably something a bit different:

$$\bf{y} \equiv {\bf{z}}_{t+1} - {\bf{\hat z}}_{t+1} \neq 0 $$

The difference $\bf{y}$ (also called the **innovation**) represents how wrong our current estimation is – if everything was perfect, the difference would be zero! To incorporate this into our model, we add the innovation to our state equation, multiplied by a matrix factor that tells us how much the state should change based on this difference between the expected and actual measurements:

$${\bf{\hat x}}_{t+1} = F {\bf{x}}_{t} + W \bf{y}$$
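A sketch of this corrected prediction step, using a made-up placeholder value for the gain $W$ (the answer derives its actual value further down) and invented numbers throughout:

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])

x = np.array([2.0, 0.5])       # current state estimate [y, m]
W = np.array([[0.6], [0.2]])   # placeholder gain, chosen arbitrarily here

x_pred = F @ x                 # model prediction: [2.5, 0.5]
z = np.array([2.8])            # hypothetical actual measurement of y
innovation = z - H @ x_pred    # y = z - z_hat = [0.3]

x_new = x_pred + W @ innovation
print(x_new)                   # [2.68 0.56]
```

Note how the innovation nudges both the position *and* the slope, even though we only measured the position.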

The matrix $W$ is known as the **Kalman gain**, and its determination is where things get messy, but understanding why the prediction takes this form is the really important part. Before we get to the formula for $W$, we should think about what it should look like:

- If the measurement noise is large, perhaps the error is only an artifact of the noise, and not “true” innovation. Thus, if the measurement noise is large, $W$ should be small.
- If the process noise is large, i.e., we expect the state to change quickly, we should take the innovation more seriously, since it’s plausible the state has actually changed.
- Adding these two together, we expect:

$$W \sim \frac{\text{Process Noise}}{\text{Measurement Noise}} $$
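This intuition can be sanity-checked in the scalar case (one state variable, one measurement, $F = H = 1$), where the covariance formulas derived below reduce to a simple ratio. The numeric values here are invented for illustration:

```python
# Scalar Kalman gain with F = H = 1:
#   P_pred = P + Q,  S = P_pred + R,  W = P_pred / S
def gain(P, Q, R):
    return (P + Q) / (P + Q + R)

print(gain(P=1.0, Q=1.0, R=0.1))    # ~0.95: low measurement noise -> trust the measurement
print(gain(P=1.0, Q=1.0, R=100.0))  # ~0.02: high measurement noise -> small gain
print(gain(P=1.0, Q=50.0, R=1.0))   # ~0.98: high process noise -> large gain
```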

$$\color{green} {\longleftarrow \backsim *\ *\ *\sim\longrightarrow}$$

One way to evaluate uncertainty of a value is to look at its **variance**. The first variance we care about is the variance of our prediction of the state:

$$P_t = Cov({\bf{\hat x}}_t)$$

Just like with ${\bf{x}}_t$, we can derive $P_t$ from its previous state:

$$P_{t+1} = Cov({\bf{\hat x}}_{t+1}) \\
= Cov(F {\bf{x}}_t) \\
= F\, Cov({\bf{x}}_t)\, F^\top \\
= F P_t F^\top$$

However, this assumes our process model is perfect and there is nothing we couldn’t predict. But normally there are many unknowns that might be influencing our state (maybe there’s wind, friction, etc.). We incorporate all of these as a covariance matrix $Q$ of the process noise ${\bf{v}}_t$, and the prediction variance becomes:

$$P_{t+1} = F P_t F^\top + Q$$
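In code, the covariance prediction is one line; the particular $P$ and $Q$ values here are invented for illustration:

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])
P = np.eye(2)               # hypothetical current state covariance
Q = np.diag([0.01, 0.01])   # hypothetical process-noise covariance

# P_{t+1} = F P_t F^T + Q
P_pred = F @ P @ F.T + Q
print(P_pred)               # [[2.01 1.  ]
                            #  [1.   1.01]]
```

Notice the off-diagonal terms: propagating through $F$ correlates the uncertainty in $y$ with the uncertainty in $m$.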

The last source of noise in our system is the measurement. Following the same logic, we obtain the covariance matrix of ${\bf{\hat z}}_{t+1}$:

$$S_{t+1} = Cov({\bf{\hat z}}_{t+1}) \\
= Cov(H {\bf{\hat x}}_{t+1}) \\
= H\, Cov({\bf{\hat x}}_{t+1})\, H^\top \\
= H P_{t+1} H^\top$$

As before, remember that we said that our measurements ${\bf{z}}_t$ have normally distributed noise ${\bf w}_t$. Let’s call the covariance matrix that describes this noise $R$. Adding it to the measurement covariance matrix gives:

$$S_{t+1} = H P_{t+1} H^\top + R$$
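Continuing the numeric sketch (with a made-up measurement-noise covariance $R$, and the predicted covariance from the previous step):

```python
import numpy as np

H = np.array([[1.0, 0.0]])
P_pred = np.array([[2.01, 1.0], [1.0, 1.01]])  # predicted state covariance
R = np.array([[0.25]])                         # hypothetical measurement-noise covariance

# S_{t+1} = H P_{t+1} H^T + R
S = H @ P_pred @ H.T + R
print(S)    # [[2.26]]
```

Since we measure only $y$, $H P H^\top$ just extracts the variance of $y$ from $P$.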

Finally, we can obtain $W$ by looking at how the two normally distributed estimates (predicted and measured) are combined:

$$W = P_{t+1} H^{\top} S_{t+1}^{-1}$$
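Putting all the pieces together, here is a minimal sketch of the complete filter running on the line example from this answer. The data, noise covariances, and initial values are all invented for illustration:

```python
import numpy as np

# Model from the answer: state x = [y, m], measure only y.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = np.diag([1e-4, 1e-4])   # assumed process-noise covariance
R = np.array([[0.25]])      # assumed measurement-noise covariance

def kalman_step(x, P, z):
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Innovation and its covariance
    y = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    # Kalman gain and update
    W = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + W @ y
    P_new = P_pred - W @ H @ P_pred
    return x_new, P_new

# Synthetic noisy samples of the line y = 0.5 * t + 1
rng = np.random.default_rng(0)
ts = np.arange(20)
zs = 0.5 * ts + 1 + rng.normal(0, 0.5, size=ts.size)

x = np.array([0.0, 0.0])    # start knowing nothing
P = np.eye(2) * 10.0        # large initial uncertainty
for z in zs:
    x, P = kalman_step(x, P, np.array([z]))

print(x)    # estimated [y, m]; the slope estimate should land near 0.5
```

Even though the slope is never measured directly, the filter recovers it from the sequence of noisy positions, and $P$ shrinks as evidence accumulates.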

To be continued…

**Attribution**
*Source: Link, Question Author: xsari3x, Answer Author: Community*