
kernel density estimation

While reading through the neural empirical Bayes paper, I decided I want a concrete understanding of kernel density estimation.

kernel density estimation (KDE)

I’ve used KDE before in seaborn. If you have some histogram, you can throw a KDE curve on top to make it look nicer, and you can even control the smoothness of the curve with the bandwidth parameter. But how do we actually get this curve?

The KDE formula is defined as:

\[\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n}{K_\sigma(x-x_i)}\]

where the Gaussian kernel with bandwidth \(\sigma\) is:

\[K_\sigma(x-x_i) = \frac{1}{(2\pi\sigma^2)^\frac{d}{2}}\exp\left(-\frac{\lVert x-x_i\rVert^2}{2\sigma^2}\right)\]
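In code, this kernel is just a Gaussian density. Here’s a small NumPy transcription (my own sketch; the function name is mine):

```python
import numpy as np

def gaussian_kernel(x, x_i, sigma):
    """K_sigma(x - x_i): an isotropic Gaussian density in d dimensions,
    centered at x_i with standard deviation sigma."""
    x, x_i = np.atleast_1d(x), np.atleast_1d(x_i)
    d = x.shape[-1]
    diff = x - x_i
    return np.exp(-np.dot(diff, diff) / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (d / 2)

# in 1-d this is just the N(x_i, sigma^2) density evaluated at x
gaussian_kernel(0.5, 0.0, sigma=1.0)  # ≈ 0.352
```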

This video has a clear visualization that made this instantly click: \(K_\sigma(x-x_i)\) is just the equation for a Gaussian centered at \(x_i\). Once I saw that, the KDE formula made much more sense.

For each datapoint \(x_i\), we plot a Gaussian, \(\mathcal{N}(x_i, \sigma^2)\), then add all the Gaussians together. A valid PDF needs to have a total area of 1. While each individual Gaussian meets this criterion, the sum of all the Gaussians has an area of \(n\cdot1 = n\), where \(n\) is the number of datapoints. So to turn this sum into a valid PDF, we just divide it by \(n\), which is exactly the \(\frac{1}{n}\) in the formula.
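Here’s a minimal NumPy sketch of that construction for 1-d data (the function and variable names are my own): one Gaussian per datapoint, summed and divided by \(n\).

```python
import numpy as np

def kde(xs, data, sigma):
    """Evaluate the KDE f_hat at query points xs for 1-d datapoints."""
    xs = np.asarray(xs, dtype=float)[:, None]      # shape (m, 1): where we evaluate f_hat
    data = np.asarray(data, dtype=float)[None, :]  # shape (1, n): the datapoints x_i
    # one Gaussian N(x_i, sigma^2) per datapoint, evaluated at every query point
    gaussians = np.exp(-(xs - data) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    # dividing the sum by n rescales the total area from n back to 1
    return gaussians.mean(axis=1)

data = np.random.randn(100)
xs = np.linspace(-5, 5, 1000)
density = kde(xs, data, sigma=0.5)
print(density.sum() * (xs[1] - xs[0]))  # ≈ 1.0, so the curve is a valid PDF
```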

We can then change \(\sigma\) to change the spread of each individual Gaussian. A bigger \(\sigma\) increases the smoothness of the KDE, while a smaller \(\sigma\) results in a more jagged KDE. There’s a bias-variance tradeoff in \(\sigma\): set it too high and we get high bias and low variance (the estimate oversmooths the data), set it too low and we get low bias and high variance (the estimate chases individual datapoints).
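This is the knob seaborn exposes. Assuming a recent seaborn (0.11+), `kdeplot`’s `bw_adjust` parameter scales its default bandwidth, so something like this shows the jagged-vs-smooth tradeoff:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.randn(200)

fig, ax = plt.subplots()
# bw_adjust multiplies seaborn's default bandwidth: < 1 is jagged, > 1 is smooth
sns.kdeplot(data, bw_adjust=0.2, ax=ax, label="small bandwidth")
sns.kdeplot(data, bw_adjust=2.0, ax=ax, label="large bandwidth")
ax.legend()
plt.show()
```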

In code, to plot the curve we evaluate \(\hat{f}(x)\) on a grid of \(x\) values by plugging each one into the KDE formula. Other kernels besides the Gaussian can also be used, like triangular or box (square) kernels.
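For example, here’s a sketch that evaluates \(\hat{f}\) on a grid and swaps in a triangular kernel, to show that only the kernel function changes (the kernel choice and names here are my own illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

def triangular_kernel(u, sigma):
    """Triangular kernel: peaks at 0, decays linearly to 0 at distance sigma; total area is 1."""
    return np.maximum(1 - np.abs(u) / sigma, 0) / sigma

def kde_triangular(xs, data, sigma):
    xs = np.asarray(xs, dtype=float)[:, None]
    data = np.asarray(data, dtype=float)[None, :]
    return triangular_kernel(xs - data, sigma).mean(axis=1)

# evaluate f_hat on a grid of x values and plot the resulting curve
data = np.random.randn(100)
xs = np.linspace(-4, 4, 500)
plt.plot(xs, kde_triangular(xs, data, sigma=0.5))
plt.show()
```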