Intuitive Explanation of Non-stationary Gaussian Process Kernels

With the Tokyo Olympics coming to an end, this is a great time to look at how the athletes have performed over the years. Particularly, let us take a look at the pace of Marathon Gold Medal winners since 1896.

Which of the two images do you think better describes the expected trend and uncertainty of the runners' pace over the years?

The right-hand plot, which produces a better fit (and a more sensible confidence interval) for the Olympics data, is made using what is known as a non-stationary kernel in Gaussian processes.

1. Background

Let us first look at the basics of Gaussian Processes.

1.1 What are GPs

Gaussian Processes (GPs) are powerful machine learning methods designed to solve classification and regression problems. Unlike many traditional methods, GPs also provide uncertainty estimates along with their predictions. As the name suggests, Gaussian Processes are built on Gaussian distributions. In a regression problem, such as finding a function $y = f(X)$ that best describes the data $X$, we assume that the data is generated from a multivariate Gaussian distribution (this assumption is known as the prior in GPs).

The multivariate Gaussian distribution is defined by a mean vector μ and a covariance matrix Σ.

$X = \begin{bmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{n} \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma)$

The mean vector gives you the expected value, just like any other regression model, so the covariance matrix forms the core of a GP regression model. The covariance functions (also called kernels) used to build Σ describe the joint variability of the Gaussian process random variables.

Görtler et al. provide an excellent visual exploration of Gaussian Processes, with mathematical intuition as well as a deeper explanation of GP components. This article goes over some shortcomings of standard GPs and the extensions that can be built upon them.

1.2 Kernels

The kernel is a key component in building GP models. Let us look at an analogy for covariance functions, also called kernel functions or simply kernels.

Analogy for Covariance in Kernel functions

Assume we have a few children standing side-by-side in a line, and we ask them to move randomly back and forth. We’ll see that each child could be standing wherever they desire, unaffected by other children. However, if we do the same experiment with each child, holding the hands of the neighbouring child tightly, it would be observed that the neighbouring children would be standing relatively closer to each other in the final configuration.

The initial set-up of the experiment is analogous to low correlation among nearby data points, while the latter resembles the opposite. A kernel function in GP regression captures the overall trend in the data and generalizes to unseen data.

The radial basis function (RBF) kernel (also known as the Gaussian kernel), described in Eq. 1, is an example of a widely used covariance function in GP modelling.

\begin{align} K_{rbf}(\mathbf{x}_i, \mathbf{x}_j) &= \sigma^2 \exp\left(-\frac{||\mathbf{x}_i - \mathbf{x}_j||_2^2}{2l^2}\right)\\ \end{align}

There are a variety of kernels that can be used to model various kinds of functions. In most standard kernels, the following parameters play a significant role in modelling a GP (a small code sketch follows this list):

  • Lengthscale ($l$)
  • Variance ($\sigma^2$)
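To make Eq. 1 concrete, here is a minimal NumPy sketch of the RBF kernel with these two hyperparameters; the function name and default values are ours, chosen purely for illustration.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """RBF (Gaussian) covariance matrix for 1-D inputs, following Eq. 1."""
    # Pairwise squared Euclidean distances between the two sets of inputs
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    # Note the negative sign: covariance decays as points move apart
    return variance * np.exp(-sq_dists / (2.0 * lengthscale ** 2))

# Example: covariance matrix for five evenly spaced inputs
X = np.linspace(0, 1, 5)
K = rbf_kernel(X, X, lengthscale=0.5, variance=1.0)
print(K.round(3))
```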

We discuss two broad categories of kernels, stationary and non-stationary, in Section 4 and compare their performances on standard datasets.

Now, let us visualize GPs fitted on some standard regression datasets.

2. Stationary GP on noisy sine curve dataset

Notice that the noisy sine data has uniform noise over the entire input region. We can also see that the smoothness of the sine function remains similar for any value of the input $X$.
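As a rough sketch of how such a fit might be produced (the data and library behind the original figure are not shown, so this is an assumed setup using scikit-learn), a stationary GP with an RBF kernel can be fitted to synthetic noisy sine data:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic noisy sine data with uniform (homoscedastic) noise across X
X = np.sort(rng.uniform(0, 10, 60))[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.2, 60)

# Stationary kernel: signal variance * RBF + white-noise term for observation noise
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_test = np.linspace(0, 10, 200)[:, None]
mean, std = gp.predict(X_test, return_std=True)  # predictive mean and uncertainty band
```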

Now, we show the same model fit over a bit more complex data.

There are two similarities between the noisy sine curve dataset and the noisy complex dataset: i) noise in data points is uniform across $X$; ii) the underlying function that generates the dataset seems equally smooth (stationary) across $X$.

Real-world datasets may well violate one or both of the above properties. Next, we show the performance of stationary GPs on a real-world dataset.

3. Stationary GP on Olympic marathon dataset

The Olympic marathon dataset includes gold medal times for the Olympic marathon from 1896 to 2020. One notable point about this dataset is that the 1904 marathon was badly organized, leading to unusually long times (see Athletics at the 1904 St. Louis Summer Games: Men's Marathon).

Let us see how a standard GP performs on this dataset. These fits are obtained by optimizing the likelihood function over the two essential parameters: length scale ($l$) and variance ($\sigma^2$).
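In scikit-learn, for example, this optimization happens inside `fit`, which maximizes the log marginal likelihood over the kernel hyperparameters. Continuing the earlier noisy-sine sketch (the same procedure would apply to the Olympic data, which we do not load here), the learnt values can be inspected afterwards:

```python
# `fit` has already maximized the log marginal likelihood over the
# hyperparameters (length scale l, variance sigma^2, and noise level).
print("optimized kernel:", gp.kernel_)
print("log marginal likelihood:", gp.log_marginal_likelihood(gp.kernel_.theta))
```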

From the above fit, we can see that the data is more irregular, or has higher noise, until about 1950, after which the trend becomes clearer and narrower. In other words, the noise in the data decreases from the left to the right side of the plot. The predictive variance in the first fit is overestimated due to the anomaly in the year 1904. Once we adjust the 1904 observation to another value, the fit produces a reasonable predictive variance.

An ideal model for the original dataset should decrease its predictive variance and increase the smoothness of the fitted function as the year increases. Standard or stationary GPs are not well equipped internally to deal with such datasets. Such datasets are known as non-stationary, and we now formally discuss stationarity and non-stationarity.

4. Stationarity v/s Non-stationarity

Wikipedia defines a stationary process as follows:

  • In mathematics and statistics, a stationary process (or a strict/strictly stationary process or strong/strongly stationary process) is a stochastic process whose unconditional joint probability distribution does not change when shifted in time.

The definition above also applies to space, or any input space in general. Now, let us see what this means in the context of Gaussian processes.

So far, the learnt parameters have been constant over the whole input space. This means the model treats the parameters as stationary, which can be a constraint for real datasets. The length scale is an essential parameter in the RBF kernel, as it controls the smoothness of the learnt distributions. A large length scale makes the exponent in the RBF small in magnitude, so the correlation decays slowly with distance and we obtain a smoother distribution. In contrast, a small length scale lets the correlation drop off quickly, enabling the model to capture high-variance distributions of the data.
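To see the effect of the length scale directly, we can draw samples from a zero-mean GP prior under the RBF kernel at two different length scales; this is a small sketch reusing the `rbf_kernel` helper defined earlier.

```python
X = np.linspace(-3, 3, 200)
rng = np.random.default_rng(1)

for l in (0.3, 2.0):
    # Prior covariance under the RBF kernel; the jitter keeps it positive definite
    K = rbf_kernel(X, X, lengthscale=l) + 1e-8 * np.eye(len(X))
    samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
    # Small l -> wiggly samples; large l -> smooth, slowly varying samples
    print(f"l={l}: mean |increment| = {np.abs(np.diff(samples, axis=1)).mean():.3f}")
```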

Unlike stationary kernels, non-stationary kernels depend on the positions of the input points as well as the distance between them. So, how can we construct a kernel function in a way that addresses these input-dependent variations? (A minimal code contrast follows the list below.)

  • Stationary: $K(x_{1}, x_{2}) = f(x_{1}-x_{2})$
  • Non Stationary: $K(x_{1}, x_{2}) = f(x_{1}-x_{2}, x_{1}, x_{2})$
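As a minimal contrast, the RBF kernel above depends only on the difference between inputs, whereas a linear (dot-product) kernel, for example, depends on the input positions themselves and is therefore non-stationary:

```python
def stationary_rbf(x1, x2, lengthscale=1.0, variance=1.0):
    # Depends only on the difference x1 - x2: shifting both inputs changes nothing
    return variance * np.exp(-((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))

def nonstationary_linear(x1, x2, variance=1.0):
    # Depends on the positions themselves: shifting both inputs changes the value
    return variance * x1 * x2

print(stationary_rbf(1.0, 2.0), stationary_rbf(101.0, 102.0))              # identical
print(nonstationary_linear(1.0, 2.0), nonstationary_linear(101.0, 102.0))  # very different
```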

4.1 Non-stationary GPs with Local Length Scale (LLS) kernel

Now that we have built up all the necessary intuition for non-stationary GPs, let's look at one way of introducing non-stationarity: varying length scales. The LLS (Local Length Scale) GP is a two-tiered framework that allows the length scale to vary over the input space. More precisely, an additional, independent GP is used to model the distribution of the length scale.

The varying length scale can allow us to adjust the function's smoothness for different input positions.
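One well-known way to realize an input-dependent length scale is the Gibbs kernel, in which the length scale becomes a function l(x). The sketch below passes l(x) as a plain Python function for illustration, whereas in the LLS framework it would itself be modelled by the second, latent GP.

```python
import numpy as np

def gibbs_kernel(X1, X2, lengthscale_fn, variance=1.0):
    """Non-stationary RBF-like (Gibbs) kernel with an input-dependent length scale l(x)."""
    l1 = lengthscale_fn(X1)[:, None]                 # l(x_i)
    l2 = lengthscale_fn(X2)[None, :]                 # l(x_j)
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    prefactor = np.sqrt(2.0 * l1 * l2 / (l1 ** 2 + l2 ** 2))
    return variance * prefactor * np.exp(-sq_dists / (l1 ** 2 + l2 ** 2))

# Toy length-scale function: short length scales for small x, longer ones for large x
local_lengthscale = lambda x: 0.2 + 0.3 * np.clip(x, 0.0, None)
X = np.linspace(0, 5, 50)
K = gibbs_kernel(X, X, local_lengthscale)
```

Note that when l(x) is constant, the prefactor equals 1 and the kernel collapses back to the stationary RBF kernel of Eq. 1.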

4.2 Non-stationary GP Fit on Olympic Dataset

The above fit demonstrates the extra power of non-stationary GPs. We observe that the fit captures the outlier by using a lower length scale around it. Lower length scale values indicate weaker covariance between the points around the outlier. This allows the model to be less smooth close to the outlier, while remaining smooth in later years, where the athletes' pace times slowly converge.

Let us see some more comparisons to understand non-stationary GPs better.

Noisy Damped Sine Wave

The noise in this sine wave is introduced in proportion to the wave's amplitude at each input, so there is much more noise at lower x positions. The non-stationary LLS GP captures this high data variance at lower x positions by using a lower length scale there, and at later positions it adjusts the length scale to a higher value, which gives a smoother curve. The stationary GP, with its constraints, cannot capture this varying variance in the data.
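One way to generate such heteroscedastic data (the exact construction behind the figure is not given, so this is an assumption on our part) is to scale the noise by the local amplitude of the damped wave:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 200)

amplitude = np.exp(-0.3 * x)                       # decaying envelope of the damped wave
signal = amplitude * np.sin(3 * x)
noise = rng.normal(0.0, 0.3, x.shape) * amplitude  # noise proportional to local amplitude
y = signal + noise                                 # noisy at low x, nearly clean at high x
```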

4.3 Smooth 1-D Variation

Smooth 1-D is a smoothly varying function with a substantial bump close to x = 0. The figures here compare the stationary GP with the non-stationary LLS GP. If we observe the learnt length scale figure closely, we can see that the LLS GP captures the existence of the bump, i.e., the region that has a lower correlation with the other points, by assigning it lower length scales than on either side of the bump. Although the stationary GP seems to fit the data decently here, the smoothness of the trend does not remain intact near the outer regions (x < -2 and x > 2). In contrast, the LLS GP generalizes much better with the help of the learnt length scales.

5. Conclusion

Non-stationary GPs have shown great potential in overcoming the shortcomings of stationary GPs. In this article, we have only touched the tip of the iceberg of a massive area of research. We hope this article gives the reader a good intuition of where GP research is heading and how the use of non-stationary kernels can improve the performance of supervised learning.

With this, we also leave readers with several other non-stationary GP resources:

  1. Treed GPs: This method partitions data into several regions with decision trees, then fits stationary GPs on each region, making the overall fit non-stationary.
  2. Deep Kernel Learning: This method transforms input data with the help of a neural network and applies a stationary GP model on transformed data.
  3. Deep GPs: This method adopts a similar concept as deep neural networks to model the inputs of a GP by another GP.