gearnet
Notes from the paper Protein Representation Learning By Geometric Structure Pretraining. Part one of my deep dive into the seminal papers of protein structure representation learning.
representation learning from structures
Given the relatively low cost of sequencing, unlabeled sequence data is abundant and has been the fuel for protein language models. However, the authors, and many others before them, recognized the disconnect between protein sequence and function. Thus, the natural idea is to take advantage of the available structures and learn structural protein representations.
Geometry-aware relational graph neural network (GearNet) is not the first structural encoder, but is notable for a few reasons:
- Pre-trained on the largest available structural database (AFDB)
- Uses protein graphs with sequential AND structural edges
- SimCLR-inspired contrastive pre-training with protein-specific augmentation strategies
The results showed that GearNet can outperform existing protein encoders on a variety of tasks.
motivation
The goal of protein representation learning is to get representations that are useful for specific applications of interest, e.g. predicting protein function, properties, or interactions. While sequence-based representation learning methods like pLMs are incredibly popular, structure-based protein encoders are relatively underexplored.
gearnet
The authors propose GearNet, a model that takes a structure as input and extracts a representation capturing its spatial and chemical information.
protein graph construction
A desirable property of any representation of protein structure is invariance under translations, rotations, and reflections in 3D space[^1]. Given a 3D structure of a protein, like a PDB file, it can be represented as a graph: $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{R})$, where $\mathcal{V}$ and $\mathcal{E}$ represent the nodes and edges, and $\mathcal{R}$ is the set of edge types.
- The edge between node $i$ and node $j$ with type $r$ is denoted as $(i,j,r)$
- $m$: # nodes; $n$: # edges
- Each node is the alpha-carbon of a residue
- $f_i$: node $i$ feature; $f_{(i,j,r)}$: edge $(i,j,r)$ feature
relational graph convolutional layer
After defining a protein graph, a GNN is used to extract per-residue and per-protein representations. A relational graph convolutional layer builds upon GCNs by introducing an edge-type-specific kernel ($\vert \mathcal{R} \vert$ different kernels).
The authors use 3 main types of edges (there are multiple sequential edge types):
- Sequential edges
- K-nearest neighbor edges
- Radius edges
Sequential edges connect nodes that are fewer than $d_{\text{seq}}$ positions apart in sequence; each is assigned the edge type $k$, where $k=j-i$. There are a total of $2d_\text{seq}-1$ sequential edge types.
For example, given a toy sequence MLPGLA, and $d_{\text{seq}}=3$, we would have the following edges:
| $k$ | Edges |
|---|---|
| $1$ | L-M, P-L, G-P… |
| $-1$ | M-L, L-P, P-G… |
| $2$ | P-M, G-L, L-P… |
| $-2$ | M-P, L-G, P-L… |
| $0$ | M-M, L-L, P-P (self-edge) |
Radius edges are spatial edges defined by two nodes that have a Euclidean distance less than some threshold $d_\text{radius}$.
K-nearest neighbor edges connect each node to its $k$ nearest neighbors by Euclidean distance. These complement radius edges because the scale of distances between nodes can vary between proteins[^2].
For these spatial edges, the authors filter out edges between nodes close to each other in sequence (i.e. $\vert i-j\vert < d_\text{long}$).
| Parameter | Value |
|---|---|
| $d_\text{seq}$ | 3 |
| $d_\text{radius}$ | 10.0 Å |
| $d_\text{long}$ | 5 |
| $k$ neighbors | 10 |
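To make this concrete, here is a minimal NumPy sketch of how the three edge types could be computed from alpha-carbon coordinates. This is my own illustration, not the authors' code; in particular, the relation-id numbering scheme is an assumption.

```python
import numpy as np

def build_edges(ca_coords, d_seq=3, d_radius=10.0, k=10, d_long=5):
    """Sketch of GearNet-style edge construction.

    ca_coords: (n, 3) array of alpha-carbon positions.
    Returns a list of (i, j, relation_type) tuples.
    """
    n = len(ca_coords)
    # Pairwise Euclidean distances between alpha-carbons, shape (n, n).
    dist = np.linalg.norm(ca_coords[:, None] - ca_coords[None, :], axis=-1)
    edges = []

    # Sequential edges: one relation per offset k = j - i with |k| < d_seq.
    # Relation ids 0 .. 2*d_seq - 2 index the offsets -(d_seq-1) .. d_seq-1.
    for i in range(n):
        for j in range(max(0, i - d_seq + 1), min(n, i + d_seq)):
            edges.append((i, j, (j - i) + d_seq - 1))

    r_radius = 2 * d_seq - 1  # next relation id after the sequential ones
    r_knn = r_radius + 1

    for i in range(n):
        # Radius edges: within d_radius angstroms, filtered to be
        # sequence-distant (|i - j| >= d_long).
        for j in range(n):
            if i != j and dist[i, j] < d_radius and abs(i - j) >= d_long:
                edges.append((i, j, r_radius))
        # KNN edges: the k nearest nodes by distance, same sequence filter.
        for j in np.argsort(dist[i])[1 : k + 1]:  # index 0 is i itself
            if abs(i - int(j)) >= d_long:
                edges.append((i, int(j), r_knn))

    return edges
```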
update rule
\[\bm{u}_i^{(l)} = \sigma \Big( \text{BN}\Big( \sum_{r\in\mathcal{R}} \bm{W}_r \sum_{j\in\mathcal{N}_r(i)} \bm{h}_j^{(l-1)}\Big)\Big)\]

\[\bm{h}_i^{(l)} = \bm{h}_i^{(l-1)} + \bm{u}_i^{(l)}\]

To summarize:
\[\begin{aligned} &\text{For each node } i: \\ &\quad \text{For each relation } r \in \mathcal{R}: \\ &\quad\quad \text{Aggregate features of neighbors connected to } i \text{ via } r \\ &\quad\quad \text{Apply relation-specific transformation } \mathbf{W}_r \\ &\quad \text{Sum over relations, then BN + activation} \\ &\quad \text{Add the result to } \bm{h}_i^{(l-1)} \text{ (residual)} \end{aligned}\]
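To make the update rule concrete, here is a minimal PyTorch sketch of the layer. It is an illustration under my own naming, not the authors' implementation; a real version would replace the Python loop with a scatter-add.

```python
import torch
import torch.nn as nn

class RelationalGraphConv(nn.Module):
    """Relational graph convolution as described above: one weight matrix
    per relation, per-relation neighbor sums, then BN + ReLU + residual."""

    def __init__(self, dim, num_relations):
        super().__init__()
        self.linears = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(num_relations)
        )
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, h, edges):
        # h: (n, dim) node features; edges: list of (i, j, r) tuples.
        agg = torch.zeros(len(self.linears), *h.shape, device=h.device)
        for i, j, r in edges:
            agg[r, i] += h[j]  # sum neighbor features, grouped by relation
        # Apply W_r per relation, sum over relations, then BN + activation.
        update = sum(lin(agg[r]) for r, lin in enumerate(self.linears))
        return h + torch.relu(self.bn(update))  # residual connection
```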
gearnet-edge
While the algorithm above does message passing between node features, the authors mention that many geometric encoders benefit from message passing between edges as well.
edge graph construction
We construct the edge graph $\mathcal{G}^\prime = (\mathcal{V}^\prime,\mathcal{E}^\prime,\mathcal{R}^\prime)$[^3]. We treat each edge of $\mathcal{G}$ as a node of $\mathcal{G}^\prime$, creating an edge between two of them if the corresponding edges in $\mathcal{G}$ form a path ($i \rightarrow j, j \rightarrow k$, i.e. $(i, j, r_1)$ and $(w, k, r_2)$ with $w = j$). The edge type corresponds to the angle between the edges $(i, j, r_1)$ and $(w,k,r_2)$ in the original graph. For computational simplicity, the range of angles $[0, \pi]$ is discretized into 8 bins, and the bin index is used as the edge type.
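Here is a sketch of this construction, reusing the edge-list format from the earlier snippet. Taking the angle between the two edge vectors meeting at the shared node is my reading of "the angle between the edges":

```python
import numpy as np

def build_edge_graph(edges, ca_coords, num_bins=8):
    """Treat each edge of G as a node of G'; connect two of them when they
    form a path i -> j -> k, with the discretized angle as the relation."""
    by_source = {}  # source node -> indices of edges leaving it
    for idx, (i, j, r) in enumerate(edges):
        by_source.setdefault(i, []).append(idx)

    edges_out = []
    for idx1, (i, j, r1) in enumerate(edges):
        for idx2 in by_source.get(j, []):  # edges (j, k, r2) continuing the path
            w, k, r2 = edges[idx2]
            v1 = ca_coords[i] - ca_coords[j]  # edge vectors meeting at node j
            v2 = ca_coords[k] - ca_coords[j]
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
            angle = np.arccos(np.clip(cos, -1.0, 1.0))  # in [0, pi]
            bin_idx = min(int(angle / np.pi * num_bins), num_bins - 1)
            edges_out.append((idx1, idx2, bin_idx))
    return edges_out
```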
edge message passing layer
We use the same idea as in node message passing (aggregate features of connected nodes per relation, apply relation-specific weights, then sum over relations).
\[\bm{m}_{(i,j,r_1)}^{(l)} = \sigma \Big( \text{BN}\Big( \sum_{r\in\mathcal{R}^\prime} \bm{W}_r^\prime \sum_{(w,k,r_2)\in\mathcal{N}^\prime_r((i,j,r_1))} \bm{m}_{(w,k,r_2)}^{(l-1)}\Big)\Big)\]

In GearNet-Edge, the edge message passing is used to update the node features of $\mathcal{G}$:
\[\bm{u}_i^{(l)} = \sigma \Big( \text{BN}\Big( \sum_{r\in\mathcal{R}} \bm{W}_r \sum_{j\in\mathcal{N}_r(i)} \big(\bm{h}_j^{(l-1)} + \text{FC}(\bm{m}_{(i,j,r)}^{(l)})\big)\Big)\Big)\]

$\text{FC}(\cdot)$ is a linear transformation applied after the edge message passing layer.
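Assuming the `RelationalGraphConv` sketch from earlier, the only change in the GearNet-Edge node update is the inner sum. A hypothetical helper to illustrate (`m[e]` holds the output of the edge message passing layer for edge `e`):

```python
import torch

def node_update_with_edge_messages(h, edges, m, linears, fc, bn):
    """GearNet-Edge node update: as in the relational conv, but each
    neighbor feature is augmented with the transformed edge message."""
    agg = torch.zeros(len(linears), *h.shape, device=h.device)
    for e, (i, j, r) in enumerate(edges):
        agg[r, i] += h[j] + fc(m[e])  # neighbor feature + its edge message
    update = sum(lin(agg[r]) for r, lin in enumerate(linears))
    return torch.relu(bn(update))
```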
geometric pretraining
Given a large number of unlabeled protein structures, what’s the best self-supervised learning strategy?
multiview contrastive learning
In computer vision, one of the most notable frameworks for self-supervised learning is SimCLR[^4]. The core idea is to create multiple "views" of one image (e.g. cropping, rotation, distortion) and teach the model that the representations of these views should be similar. The same idea can be applied to proteins if we have a good way to generate these views.
The authors propose two methods: subsequence cropping and subspace cropping. In subsequence cropping, we take a left residue $l$ and a right residue $r$ and select all residues between $l$ and $r$. However, since this doesn't take into account the 3D relationships, we also use subspace cropping, where we select a random residue $p$, then select all residues within a pre-defined Euclidean distance $d$. We reconstruct $\mathcal{G}$ using the subsampled nodes. Lastly, we apply one of two noising functions to increase the diversity of views. The noising functions are identity (do nothing) and random edge masking (each edge has a 15% chance of being masked[^5]).
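A rough sketch of the two cropping strategies and edge masking. The minimum crop length and subspace radius below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def subsequence_crop(n, min_len=50):
    """Pick a random contiguous span of residue indices (assumes n > min_len)."""
    l = np.random.randint(0, n - min_len + 1)
    r = np.random.randint(l + min_len, n + 1)
    return np.arange(l, r)

def subspace_crop(ca_coords, d=15.0):
    """Pick a random center residue p and keep residues within d angstroms."""
    p = np.random.randint(len(ca_coords))
    dist = np.linalg.norm(ca_coords - ca_coords[p], axis=-1)
    return np.nonzero(dist < d)[0]

def mask_edges(edges, p_mask=0.15):
    """Noising: independently drop each edge with probability p_mask."""
    return [e for e in edges if np.random.rand() >= p_mask]
```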
For each protein $\mathcal{G}$, we sample $\mathcal{G}_x$ and $\mathcal{G}_y$ using some combination of the sampling and noising described above. The structure encoder produces the graph-level representations $\bm{h}_x$ and $\bm{h}_y$, which are then passed to an MLP to obtain the lower-dimensional representations $\bm{z}_x$ and $\bm{z}_y$. We use the InfoNCE loss to align the views.
\[\mathcal{L} = -\log \frac{\exp(\text{sim}(\bm{z}_x,\bm{z}_y)/\tau)}{\sum^{2B}_{k=1}\mathbb{1}_{[k \neq x]}\exp(\text{sim}(\bm{z}_x,\bm{z}_k)/\tau)}\]

$B$ is the batch size, $\tau$ is the temperature, and cosine similarity is used to measure the similarity of views.
To intuitively understand InfoNCE, let's start from the most obvious idea: the goal is to minimize the loss, i.e. to minimize $-\log(\cdot)$. Given the behavior of $-\log(\cdot)$, the model's objective is to make the fraction inside as big as possible. In other words, the numerator has to dominate the denominator.
The numerator says to take the cosine similarity of the latent representations of the two views and exponentiate it.
The denominator says to sum the exponentiated similarities of $\bm{z}_x$ to all other proteins in the batch, excluding the self-similarity. Since the denominator also includes the similarity between $\bm{z}_x$ and $\bm{z}_y$, the fraction is largest when all other similarities are minimized.
Putting everything together, InfoNCE tells the model: within a batch of protein views, maximize the similarity between correlated views, and decrease the similarity to all other views. The last detail is the purpose of $\tau$. If we use a low temperature, the exponentiated values become amplified, and vice versa for a high temperature.
If we had a similarity of 0.5 and $\tau=1$, $\exp(0.5/1)\approx 1.65$. With $\tau=0.2$, we get $\exp(0.5/0.2)\approx 12.2$. With $\tau=2$, $\exp(0.5/2)\approx 1.28$. So lowering the temperature exaggerates the similarities, while raising it smooths them. With a lower temperature, having similar non-correlated views results in a higher loss.
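Putting the loss into code makes the mechanics explicit. This is a generic SimCLR-style InfoNCE sketch, not the authors' implementation; the default $\tau$ is a common SimCLR choice, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def info_nce(z_x, z_y, tau=0.07):
    """InfoNCE over a batch of paired views.

    z_x, z_y: (B, d) projections of the two views of each protein.
    Positive pairs are (z_x[i], z_y[i]); the other 2B - 2 samples
    in the batch act as negatives.
    """
    B = z_x.shape[0]
    z = F.normalize(torch.cat([z_x, z_y], dim=0), dim=-1)  # (2B, d)
    sim = z @ z.t() / tau               # cosine similarities over temperature
    sim.fill_diagonal_(float("-inf"))   # the 1[k != x] term: drop self-similarity
    # The positive for row i is its counterpart in the other half of the batch.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)

# Usage sketch: loss = info_nce(mlp(encoder(g_x)), mlp(encoder(g_y)))
```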
[^1]: Intuitively, a protein floating in solution experiences translation and rotation without any change in its properties. The authors state reflection invariance as desirable, but I would say the chirality of proteins does affect their properties.

[^2]: For example, if we set the radius threshold to be 4 angstroms, and we have a protein where the average distance between nodes is 10 angstroms, very few radius edges may be formed.

[^3]: AKA a line graph.

[^4]: A Simple Framework for Contrastive Learning of Visual Representations

[^5]: I could be wrong, but I'm interpreting this as removing the edge.