YOUNGSU KO

noise contrastive estimation

Initially, I was going to learn more about InfoNCE. But seeing that I didn’t know anything about NCE, I decided to take a detour and read through the paper introducing noise contrastive estimation (NCE)1.

introduction

We have a random vector \(\mathbf{x} \in \mathbb{R}^n\) that follows an unknown PDF \(p_d(.)\). We assume the PDF can be modeled by some family of functions \(p_m(.;\alpha)\), where \(\alpha\) is a vector of parameters. Then, for some specific parameters \(\alpha^\star\), \(p_m(.;\alpha^\star) = p_d(.)\).

The standard approach is to estimate \(\alpha\) by maximizing the likelihood of the observed sample.

\(\hat{\alpha}\) needs to result in a valid PDF (\(\textbf{u}\) is just the dummy variable representing all possible values of \(\textbf{x}\)):

\[\int p_m(\mathbf{u};\hat{\alpha}) \text{d}\mathbf{u} = 1\]

The PDF can be re-written as2:

\[p_m(.;\alpha) = \frac{p_m^0(.;\alpha)}{Z(\alpha)} \quad Z(\alpha) = \int p_m^0(\textbf{u};\alpha) \text{d}\textbf{u}\]

\(p_m^0(\textbf{u};\alpha)\) is the un-normalized score function, and \(Z(\alpha)\) is called the partition function, a normalization constant that brings the total area under the PDF to 1.

Apparently, the calculation of the partition function is intractable. Let me try to reason why this is the case. In order to calculate \(Z(\alpha)\), we would need to integrate over all values of \(\textbf{u}\). If we were using 32x32 pixel images, \(\textbf{u} \in \mathbb{R}^{32 \times 32}\). Even if the pixels were binary (black or white), that is \(2^{1024}\) possible values of \(\mathbf{u}\) to sum over. So yes, calculating the partition function will be a problem for most real-world cases.
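
To make the size of that sum concrete, here is a quick back-of-the-envelope check. The 32x32 binary-image setup is just the illustrative case from above, and the snippet only counts the terms; it does not attempt the sum.

```python
# Rough count of how many terms a brute-force partition function would need
# for 32x32 binary images (an illustrative assumption, not from the paper).

num_pixels = 32 * 32            # dimensionality of u
num_configs = 2 ** num_pixels   # number of binary configurations of u

# Z(alpha) would be a sum of p_m^0(u; alpha) over every one of these configurations.
print(num_pixels)               # 1024
print(len(str(num_configs)))    # 309 -- the count has 309 decimal digits
```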

Instead of calculating the partition function, we could treat it as just another parameter to be learned. However, this idea does not play nicely with maximum likelihood estimation.

We can trivially increase the log-likelihood of a datapoint by simply setting \(Z(\alpha)\) to be a very small positive value.

\[\lim_{Z(\alpha) \to 0^+} \left[ \log p_m^0(.;\alpha) - \log Z(\alpha) \right] = \infty\]

This means the optimizer can learn a poorly fitting \(\hat{\alpha}\) by simply “cheating” with a tiny \(Z(\alpha)\).
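
Here is a tiny numerical sketch of that cheating, using a made-up, fixed unnormalized log-score for a single datapoint; only \(Z(\alpha)\) changes.

```python
import numpy as np

# Hypothetical unnormalized log-score of one datapoint under some fixed alpha.
log_p0 = -5.0

# Shrinking Z(alpha) alone inflates the log-likelihood without touching alpha.
for Z in [1.0, 1e-3, 1e-6, 1e-9]:
    log_likelihood = log_p0 - np.log(Z)
    print(f"Z = {Z:>7.0e}  ->  log-likelihood = {log_likelihood:6.2f}")
```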

a better way to estimate \(p_m^0\) and \(Z(\alpha)\)

To address the limitation above, the authors introduce a new way to estimate the parameters by learning to discriminate between data and noise. Thus, the new method is referred to as noise contrastive estimation (NCE).

We reparameterize the model as:

\[\log p_m(.;\theta) = \log p_m^0(.;\alpha) + c\]

Where \(\theta = \{\alpha, c \}\), and \(c\) is a new parameter that stands in for \(-\log Z(\alpha)\) but is learned freely rather than computed by integration.
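
As a concrete sketch of this reparameterization, here is what it might look like for a toy zero-mean Gaussian model where \(\alpha\) is the precision. The toy model is my own choice for illustration, not something from the paper.

```python
import numpy as np

def log_p_m0(u, alpha):
    """Unnormalized log-score of a zero-mean Gaussian with precision alpha."""
    return -0.5 * alpha * u ** 2

def log_p_m(u, theta):
    """Reparameterized log-density: log p_m^0(u; alpha) + c, with theta = (alpha, c)."""
    alpha, c = theta
    return log_p_m0(u, alpha) + c

# For this toy model the normalizer is actually known: Z(alpha) = sqrt(2*pi/alpha),
# so the "right" value of c is -log Z(alpha). In NCE, c is learned instead of computed.
alpha = 2.0
c_star = -0.5 * np.log(2 * np.pi / alpha)
print(log_p_m(0.5, (alpha, c_star)))   # matches the normalized N(0, 1/alpha) log-density at 0.5
```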

Let \(X = ( \mathbf{x}_1, ..., \mathbf{x}_T)\) be the observed dataset with \(T\) datapoints, and \(Y = ( \mathbf{y}_1, ..., \mathbf{y}_T)\) a set of noise samples drawn from a known noise distribution \(p_n(.)\).

\(\hat{\theta}_T\) is defined to be the parameters that maximize the objective function:

\[J_T(\theta) = \frac{1}{2T} \sum_t \left\{ \ln[h(\mathbf{x}_t;\theta)] + \ln[1-h(\mathbf{y}_t;\theta)] \right\}\]

where \(h(.;\theta)\) is the logistic function:

\[h(\mathbf{u};\theta) = \frac{1}{1+\exp[-G(\mathbf{u};\theta)]}\]

and

\[G(\mathbf{u};\theta) = \ln p_m(\mathbf{u};\theta) - \ln p_n(\mathbf{u})\]

This is a lot, so let me try to break things down into bite-sized pieces I can wrap my head around. First, let’s start with \(G(\mathbf{u};\theta)\).

If we re-write it as a fraction:

\[G(\mathbf{u};\theta) = \ln \left( \frac{ p_m(\mathbf{u};\theta)}{p_n(\mathbf{u})} \right)\]

We see that \(G\) is the log-likelihood ratio between the model’s density and the noise density. And \(h(\mathbf{u};\theta)\) is just the sigmoid function applied to that log-likelihood ratio.

If the ratio is large, it means the datapoint is more likely under the model than under the noise distribution. A large \(G\) means a smaller \(\exp[-G]\), meaning \(h\) is close to 1.

If \(\textbf{u}\) is more likely under the noise distribution, then the opposite is true and \(h\) is closer to 0.
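
To see \(G\) and \(h\) in action, here is a small sketch that reuses the toy Gaussian model from above and takes a standard Gaussian \(N(0, 1)\) as the noise distribution \(p_n\); again, both choices are assumptions made for illustration.

```python
import numpy as np

def log_p_m(u, theta):
    """Toy model: unnormalized zero-mean Gaussian score plus the learned offset c."""
    alpha, c = theta
    return -0.5 * alpha * u ** 2 + c

def log_p_n(u):
    """Noise distribution p_n: standard Gaussian N(0, 1) log-density."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * u ** 2

def G(u, theta):
    """Log-ratio of the model density to the noise density."""
    return log_p_m(u, theta) - log_p_n(u)

def h(u, theta):
    """Sigmoid of G: the model's probability that u is data rather than noise."""
    return 1.0 / (1.0 + np.exp(-G(u, theta)))

theta = (2.0, -0.5 * np.log(np.pi))   # alpha = 2, c set to the true -log Z(alpha)
print(h(0.1, theta))   # model density beats the noise here -> h > 0.5
print(h(3.0, theta))   # noise density dominates here       -> h close to 0
```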

That means in order to maximize \(J_T\), for each pair of datapoints \(\mathbf{x}_t\) and \(\mathbf{y}_t\):

  • \(h(\mathbf{x}_t;\theta)\) needs to be as close to 1 as possible.
  • \(h(\mathbf{y}_t;\theta)\) needs to be as close to 0 as possible.

so it’s just binary cross-entropy?

Binary cross-entropy loss is defined as:

\[\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right]\]

Where \(y_i\) is the true label, \(\hat{y}_i\) is the predicted probability (sigmoid applied to the logit), and \(N\) is the total number of datapoints. If the true label is 1, the second term disappears. If the true label is 0, the first term disappears. Minimizing this loss is the same as maximizing the sum inside the brackets.

In NCE, we have \(T\) data samples and \(T\) noise samples, for a total of \(2T\) datapoints. If we give real data the true label 1, give noise the true label 0, and use \(h(\mathbf{u};\theta)\) (the sigmoid of the logit \(G(\mathbf{u};\theta)\)) as the predicted probability, then minimizing the BCE loss is the same as maximizing:

\[\frac{1}{2T} \left[ \sum_{t=1}^T \log h(\mathbf{x}_t;\theta) + \sum_{t=1}^T \log (1-h(\mathbf{y}_t;\theta)) \right]\]

If we combine the sums, we recover NCE’s objective function, \(J_T\):

\[J_T(\theta) = \frac{1}{2T} \sum_{t=1}^T \left\{ \log h(\mathbf{x}_t;\theta) + \log (1-h(\mathbf{y}_t;\theta)) \right\}\]

So maximizing the NCE objective is just minimizing a BCE loss, with a few details (a quick numerical check follows the list):

  1. We have an equal number of data and noise examples
  2. The logits come from \(G(\mathbf{u};\theta)\), the log-likelihood ratio
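
Here is that numerical check: a minimal sketch, assuming the same toy zero-mean Gaussian model (precision \(\alpha = 2\)) and standard Gaussian noise as in the earlier snippets. It evaluates \(J_T\) directly and via the BCE formula and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000

def G(u, theta):
    """Log-ratio of the toy Gaussian model density to the N(0, 1) noise density."""
    alpha, c = theta
    log_p_m = -0.5 * alpha * u ** 2 + c
    log_p_n = -0.5 * np.log(2 * np.pi) - 0.5 * u ** 2
    return log_p_m - log_p_n

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(0.0, np.sqrt(0.5), size=T)   # "data" from N(0, 1/alpha) with alpha = 2
y = rng.normal(0.0, 1.0, size=T)            # noise from p_n = N(0, 1)
theta = (2.0, -0.5 * np.log(np.pi))         # alpha = 2, c set to the true -log Z(alpha)

# NCE objective J_T, written directly as in the paper.
h_x = sigmoid(G(x, theta))
h_y = sigmoid(G(y, theta))
J_T = 0.5 * np.mean(np.log(h_x) + np.log(1.0 - h_y))

# The same quantity as a BCE over 2T points, with labels 1 (data) and 0 (noise).
labels = np.concatenate([np.ones(T), np.zeros(T)])
probs = np.concatenate([h_x, h_y])
bce = -np.mean(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))

print(J_T, -bce)   # identical up to floating point: J_T = -BCE
```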

conclusion

Let’s return to the original problem: we want to learn the best parameters \(\theta = \{\alpha, c\}\). If we tried to maximize the log-likelihood directly, we would run into the trivial solution from earlier: set \(c\), which stands in for \(-\log Z(\alpha)\), to a very large positive number. That achieves a very high log-likelihood without the model fitting the data at all.

The NCE objective prevents this “cheating” because, in order to maximize it, the model needs to selectively assign high density to real data and low density to noise.

Simply using a large \(c\) would inflate the density assigned to both, which results in a poor NCE score. In other words, a model that does well on the NCE objective must actually learn a good set of parameters \(\theta\), not just game the normalization constant.
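
And here is a small numerical illustration of that point: a sketch, again assuming the toy Gaussian model and \(N(0, 1)\) noise from the earlier snippets. Inflating \(c\) alone pushes \(h\) toward 1 for noise points as well, and the NCE objective drops.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
x = rng.normal(0.0, np.sqrt(0.5), size=T)   # "data" from N(0, 1/2), i.e. alpha = 2
y = rng.normal(0.0, 1.0, size=T)            # noise from p_n = N(0, 1)

def J_T(c, alpha=2.0):
    """NCE objective for the toy Gaussian model with log-normalizer offset c."""
    def G(u):
        log_p_m = -0.5 * alpha * u ** 2 + c
        log_p_n = -0.5 * np.log(2 * np.pi) - 0.5 * u ** 2
        return log_p_m - log_p_n
    h = lambda u: 1.0 / (1.0 + np.exp(-G(u)))
    return 0.5 * np.mean(np.log(h(x)) + np.log(1.0 - h(y)))

c_star = -0.5 * np.log(np.pi)               # the true -log Z(alpha) for alpha = 2
for c in [c_star, c_star + 5.0, c_star + 20.0]:
    print(f"c = {c:6.2f}  ->  J_T = {J_T(c):9.4f}")   # J_T gets worse as c is inflated
```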

But I realized I didn’t address a central question: how do we know a good classifier equals a good density model? This post seems to get at the heart of the problem, and once I’ve read through it I’ll close the loop on the logic.


  1. Gutmann, M. & Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS. 

  2. The dot vs \(\mathbf{u}\) notation confused me for a bit. \(p_m(.;\alpha)\) denotes the density as a function of an arbitrary input. We can evaluate the unnormalized score at any input, but to get a density we need to divide by the partition function, which requires integrating over \(\mathbf{u}\), i.e., over all possible values of the input.