Transforming everything to vectors with Deep Learning: from Word2Vec, Node2Vec, to Code2Vec and Data2Vec


In this article, we will discuss the state-of-the-art methods for transforming every kind of input data into fixed-length vectors, including Word2Vec, Doc2Vec, Image2Vec, Node2Vec, Edge2Vec, Code2Vec, and Data2Vec.

Word2Vec

Word2Vec is the umbrella name for all techniques that convert words into vectors.

One-hot encoding

While the simplest method is to represent each word by a one-hot vector, one-hot encodings do a poor job of preserving the relationships between similar (or opposite) words. In a one-hot representation, the difference between 2 words that share similar meanings is indistinguishable from that of a pair of unrelated words. For example, both pairs of words (image, photo) and (image, delicious) differ in exactly 2 positions of their one-hot encodings, even though the former pair consists of 2 similar words that can be used interchangeably in many cases, while the latter pair is completely unrelated.

Another disadvantage of one-hot encoding is its inefficient use of dimensionality. The vector representation of every word has a length equal to the size of the vocabulary, which can range from tens of thousands up to millions in normal situations. This enormous size not only consumes a lot of computer memory but also invites the various problems associated with the curse of dimensionality.

In the next sub-sections, we will introduce great approaches that address these problems. These methods map words to vectors of a much smaller size (from hundreds to a few thousand dimensions) while preserving the relationships in their meanings (more similar words are represented by more similar vectors).

Continuous Bag-of-words (CBOW) and Skip-gram

Efficient Estimation of Word Representations in Vector Space, Mikolov et al.

Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al.

CBOW and Skip-gram [1] were invented by exploiting 2 key observations:

  • Words in a similar context often have similar meanings. Put differently, similar words tend to appear in similar word neighborhoods. For example, since image and photo are similar, they are often surrounded by the same words, as in this image looks nice and this photo looks nice, while something like this delicious looks nice almost never appears in normal corpora.
  • Neural networks with gradient descent can easily and effectively optimize the vector representation (or embeddings) of words to reflect these similarities in meaning.

Training

Both CBOW and Skip-gram use a neural network as their main computational model. Their architectural differences are shown in the figure below.

The computational schemes of CBOW and Skip-gram.

CBOW: At first, the sum of the (embeddings of the) neighborhood words is computed. Then, the neural model uses this sum to predict the center word. This method uses multiple (surrounding) words to predict one word (i.e. the center word).

Skip-gram: The neural network uses the (embedding of the) center word to predict the neighborhood words. This method uses one word (i.e. the center word) to predict multiple (surrounding) words. Skip-gram showed superior performance compared to CBOW in most cases and is therefore reused in subsequent works presented in this and the following sections.
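To make the two training schemes concrete, here is a minimal sketch using the gensim library (the toy corpus and hyper-parameter values are illustrative assumptions, not the settings from the original papers):

```python
# A minimal sketch of training CBOW vs. Skip-gram with gensim.
# The toy corpus and hyper-parameter values are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["this", "image", "looks", "nice"],
    ["this", "photo", "looks", "nice"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=0 -> CBOW: predict the center word from the sum of its context embeddings.
cbow = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=1, sg=0)

# sg=1 -> Skip-gram: predict each context word from the center word.
skipgram = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=1, sg=1)

print(skipgram.wv["image"].shape)                  # (100,)
print(skipgram.wv.similarity("image", "photo"))    # higher after enough training data
```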

Experimental results

Note that both methods are self-supervised, meaning they do not require explicit labels from the training data. Nevertheless, the pretrained word embeddings from these models sharply enhance performance in almost all downstream tasks, especially in cases when training data is small or labels are scarce.

Furthermore, since they use networks without non-linear hidden layers, the linear regularities of words are preserved in their vector representations. For example, vector("King") - vector("Man") + vector("Woman") is most similar to vector("Queen").
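As a toy illustration of this arithmetic, the analogy can be answered with vector addition followed by a nearest-neighbor search (the 3-D vectors below are made up purely for the example; real embeddings have hundreds of dimensions):

```python
import numpy as np

# Toy 3-D embeddings, invented only to illustrate the arithmetic.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# vector("King") - vector("Man") + vector("Woman") ...
target = emb["king"] - emb["man"] + emb["woman"]

# ... is closest to vector("Queen") among the remaining words.
candidates = {w: v for w, v in emb.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen (with these toy vectors)
```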

An improved version

Putting all these fancy results aside, these 2 methods have a big technical drawback. Recall that when the objective is classification, we usually apply the Softmax operation to the last layer of the neural network. Since Softmax is a costly operation (mind the exponentials), a large number of classes requires a lot of computational power. In this setup, the last layer of the neural network (the classification layer) has a size equal to the vocabulary size (normally tens to hundreds of thousands), so applying Softmax to all neurons in this layer is not really practical.

In their second paper [2], Mikolov et al. proposed a trick called Negative Sampling to mitigate this issue. The idea is to apply Softmax not to all neurons in the last layer, but only to the positive neurons (i.e. the neurons that correspond to the correct classifications) and a random sample of negative neurons. This dramatically reduces the computation while retaining similar training effectiveness. It is worth mentioning that before Negative Sampling, another trick called Hierarchical Softmax [3] was used for the same purpose. However, Negative Sampling is simpler, easier to implement, and is empirically observed to be better for frequent words and lower-dimensional vectors (though Hierarchical Softmax is better for infrequent words).
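A rough numpy sketch of the Negative Sampling objective for a single (center, context) pair; the random embedding matrices, vocabulary size, and uniform negative sampling are simplifying assumptions (the paper samples negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input (center-word) embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # output (context-word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_id, context_id, num_negatives=5):
    """Loss for one (center, context) pair: push the true pair's score up
    and push the scores of a few randomly sampled 'negative' words down."""
    v_c = W_in[center_id]
    pos_score = sigmoid(W_out[context_id] @ v_c)
    neg_ids = rng.integers(0, vocab_size, size=num_negatives)   # uniform sampling for simplicity
    neg_scores = sigmoid(-(W_out[neg_ids] @ v_c))
    return -np.log(pos_score) - np.log(neg_scores).sum()

print(negative_sampling_loss(center_id=42, context_id=7))
```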

Also in the same paper, the authors suggested composing phrases out of words prior to training the model. For example, New York should be treated as one token instead of a sequence of the two tokens New and York. The model then operates on sequences of tokens (a token can be either a word or a phrase) rather than sequences of words. How are phrases identified? A word pair is merged into a phrase when its co-occurrence frequency is much larger than what the individual word frequencies alone would predict.
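The paper scores candidate bigrams by comparing their joint count with their individual counts, with a discount term for rare words; a simplified count-based sketch (the toy corpus, threshold, and discount value are illustrative assumptions):

```python
from collections import Counter

# Toy corpus; threshold and discount delta are illustrative only.
tokens = "i visited new york and new york was great then i flew to boston".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def phrase_score(w1, w2, delta=1.0):
    # Bigrams whose joint count is much higher than chance get a high score.
    return (bigrams[(w1, w2)] - delta) / (unigrams[w1] * unigrams[w2])

threshold = 0.1
phrases = [(w1, w2) for (w1, w2) in bigrams if phrase_score(w1, w2) > threshold]
print(phrases)  # [('new', 'york')] with this toy corpus and threshold
```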

Global Vectors (GloVe)

Glove: Global vectors for word representation, Pennington et al.

GloVe is another approach to represent words by vectors (and is claimed to be better than CBOW and Skip-gram). GloVe shares the same goal with the above 2 methods, which is to learn embeddings of words such that the distances between vectors reflect the differences in the meaning of the words, and that the linear regularities of words are preserved.

Training

GloVe captures the global relationship between words by using a co-occurrence table constructed from the whole training corpus. In simplified terms, GloVe tries to learn the word vectors so that the following equation is satisfied as much as possible:

w_i^Tw_j = \log Y_{ij}

for all 1 \le i, j \le \text{vocabulary size}, where:

  • w_i is the embedding of the i-th word in the vocabulary,
  • w_i^Tw_j is the dot product of the embeddings corresponding to the i-th and j-th words,
  • Y is the normalized co-occurrence table of words. Y is a square matrix with \text{vocabulary-size} rows and columns. Y_{ij} is proportional to the likelihood of observing the j-th word in the context of the i-th word (i.e. the j-th word appears near the i-th word).

This equation makes intuitive sense. If 2 words i and j often appear together, then that means 2 things:

  • word i and word j have some relation (or similarity) with each other
  • Y_{ij} is large, or in other words, \log Y_{ij} is large. This implies that w_i^Tw_j is also large, because the model is optimized to make w_i^Tw_j = \log Y_{ij}. Since w_i and w_j behave like roughly normalized vectors, their dot product is large if and only if they are similar to each other (think of Cosine similarity).

Thus, optimizing for this equation ensures that GloVe embeds similar words by similar vectors.

To go into a bit more detail, GloVe prioritizes optimizing for more-frequent co-occurrences than less-frequent co-occurrences. For example, say Apple and iPhone co-occur 1000 times in the corpus, while New and Washington co-occur only 10 times. Then, the model would prioritize making w_\text{Apple}^T w_\text{iPhone} closer to \log Y_\text{Apple iPhone} than w_\text{New}^T w_\text{Washington} to \log Y_\text{New Washington}.

To preserve the linear regularities of words in their embeddings, as in CBOW and Skip-gram, GloVe uses only linear transformations in its computation.
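A simplified numpy sketch of this objective, following the simplified equation above (the full model also learns bias terms and a separate set of context vectors; the co-occurrence matrix here is a randomly generated stand-in, while the weighting function uses the paper's choice of x_max = 100 and α = 3/4):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1_000, 50
W = rng.normal(scale=0.1, size=(vocab_size, dim))    # word embeddings w_i
Y = rng.poisson(1.0, size=(vocab_size, vocab_size))  # stand-in co-occurrence counts

def weight(x, x_max=100, alpha=0.75):
    # Down-weight rare co-occurrences, cap the influence of very frequent ones.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, Y):
    """Weighted least squares between dot products and log co-occurrences,
    summed over the pairs with non-zero co-occurrence."""
    i, j = np.nonzero(Y)
    dots = np.einsum("nd,nd->n", W[i], W[j])
    return np.sum(weight(Y[i, j]) * (dots - np.log(Y[i, j])) ** 2)

print(glove_loss(W, Y))
```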

Experimental results

In the demonstrated experimental tasks (i.e. word analogy, named entity recognition, word similarity), GloVe outperforms CBOW and Skip-gram by a noticeable margin.

Fasttext

Enriching Word Vectors with Subword Information, Bojanowski et al.

Fasttext was born to address a problem that previous word embedding techniques ignored: the morphology of words. For example, in German, Tischtennis (which is Table tennis in English) is a combination of Tisch (table) and tennis. This morphology implies that Tischtennis has some relationships with Tisch and tennis, and so it is beneficial to jointly learn their embeddings together. Older embedding techniques did not take this into account and treated the 3 words as completely independent words.

Training

Fasttext handles this issue by representing a word through its character n-grams. The idea is: for each word in the vocabulary, we extract all of its character n-grams (typically for 3 \le n \le 6), and each n-gram, as well as the entire word, is embedded by a vector. The embedding of a word is then the sum of the embeddings of its n-grams plus that of the word itself.

With this simple idea, now, representation learning of Tischtennis is intrinsically connected to that of Tisch and tennis as they share some of their n-grams. All n-grams of Tisch and tennis are also n-grams of Tischtennis. Furthermore, some n-grams of Tischtennis, e.g. schten, are not n-grams of the other 2 words, leaving freedom to distinguish their meanings.
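A small sketch of decomposing a word into character n-grams and embedding it as the sum of their vectors (fastText additionally wraps each word in boundary markers < and >, which we include here; the hashing of n-grams into a fixed number of buckets is omitted, and the random embedding table is illustrative):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of a word, plus the full word itself.
    The word is wrapped in boundary markers '<' and '>' before extraction."""
    w = f"<{word}>"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return sorted(grams)

# Toy embedding table keyed by n-gram; real fastText hashes n-grams into buckets.
rng = np.random.default_rng(0)
dim = 64
ngram_emb = {}

def word_vector(word):
    grams = char_ngrams(word)
    for g in grams:
        ngram_emb.setdefault(g, rng.normal(scale=0.1, size=dim))
    # The word embedding is the sum of its n-gram embeddings (plus the word itself).
    return np.sum([ngram_emb[g] for g in grams], axis=0)

v1, v2 = word_vector("Tischtennis"), word_vector("Tisch")
cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"shared n-grams pull the two vectors together: cos = {cos:.2f}")
```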

For training, Fasttext re-uses the Skip-gram model with Negative Sampling on n-grams (instead of on words as in the original version). Note that despite its name, Fasttext is not necessarily faster than other methods (in fact, the authors reported it to be 1.5 times slower to train than Skip-gram) due to its larger vocabulary.

Experimental results

Regarding performance, Fasttext significantly outperforms the Skip-gram baseline on morphologically rich languages like German, Czech, and Russian. For less morphologically rich languages like English, the improvement over previous methods is less clear.

Contextual embedding methods

The above approaches try to learn a fixed embedding for each word (or phrase). However, the meaning of a word can be different in different contexts (e.g. bank might mean a financial establishment that allows us to deposit money or the land alongside a river, depending on the surrounding words). Using one embedding for different meanings might result in ambiguities and affect model performance. Contextual embedding methods, therefore, learn to embed words conditioned on their surrounding contexts. A word (e.g. bank) can then be vectorized differently depending on the words around it (e.g. I loan from that <bank> vs. the river <bank> is nice).

The most renowned contextual embedding model is probably BERT (Bi-directional Encoder Representations from Transformers) [6]. BERT is a large model pretrained on text data. It incorporates the self-attention mechanism to let the embedding of a word focus on other words. For example, take the sentence above, the river <bank> is nice: the model can learn to attend to the word river when embedding the word bank, and thus it knows exactly which meaning of bank is intended in this sentence.
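A sketch of extracting contextual embeddings with the Hugging Face transformers library, using the two bank sentences above (bert-base-uncased is a standard example checkpoint; treat the snippet as an illustrative recipe, not the setup from the BERT paper):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I loan from that bank", "the river bank is nice"]

with torch.no_grad():
    vectors = []
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        vectors.append(hidden[tokens.index("bank")])           # embedding of "bank" in context

# The two embeddings of the same word differ because their contexts differ.
cos = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity between the two 'bank' embeddings: {cos.item():.2f}")
```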

To dig deeper into how BERT and similar contextual models work, we refer the reader to the original paper [6] and a comprehensive survey of prominent pretrained language models.

Doc2Vec

The ability to vectorize words is great, however, it is not enough for all cases. For example, the task of sentiment analysis requires the model not only to understand the sentiment of words but also how those words are used together to express the sentiment of the whole text.

doc2Vec

Distributed Representations of Sentences and Documents, Le and Mikolov

This paper by Le and Mikolov [7] proposed an interesting approach for this purpose. Their model is trained to represent any variable-length piece of text (e.g. phrase, sentence, paragraph, document; from now we will call it paragraph for short) by a dense vector.

Training

There are 2 types of representation for a paragraph, as described below. Each representation results in a dense encoding vector for every paragraph. The final embedding of a paragraph is the concatenation of these 2 vectors.

Representation 1: a distributed memory model of the paragraph.

In this representation, a paragraph is considered a background context such that given this context and a sequence of words extracted from that paragraph, we can predict the next word for this sequence. For example, given a paragraph I think the cat sat on the mat and a sequence of words (the, cat, sat), the model should be able to predict that the next word for this sequence is on. This is illustrated in the figure below.

Figure 2 from the paper Distributed Representations of Sentences and Documents, Le and Mikolov
The embedding of a paragraph is considered the background context so that the model can predict the next word for any given sequence of words extracted from that paragraph.

In the beginning, each paragraph is initialized with a random vector. The same happens for each word (i.e. initialized with a random vector). The model takes as input the concatenation of the paragraph vector and a random sequence of word vectors from that paragraph, and is trained with gradient descent to predict the next word. This process repeats for all paragraphs in the training corpus until convergence. In the end, we have a vector representation for each paragraph that serves well as a context for any sequence of words in that paragraph.

Representation 2: distributed Bag-of-words.

In this representation, the paragraph vector is used to predict words in the paragraph without any order (hence the name BOW), as below.

Figure 3 from the paper Distributed Representations of Sentences and Documents, Le and Mikolov
The paragraph vector is used to predict a random sequence of words in the paragraph (without considering their order).

The initialization is the same as in Representation 1: each word/paragraph’s embedding is initialized with a vector of random values. Then, these embeddings are optimized by the model with the objective of predicting a random sequence of words in the paragraph without order (note that this is a multi-label classification problem, in which one input, the paragraph vector, corresponds to multiple labels, the words to be predicted).

After both models for the 2 representations converge, the 2 embeddings for each paragraph are concatenated to form the final embedding of that paragraph.

At inference time (i.e. given a new paragraph, we need to output an embedding for that paragraph), the paragraph vector is inferred by first fixing model parameters and the word vectors, and then training the new paragraph’s vector until convergence.

As an implementation detail, this approach uses a Huffman-tree Hierarchical Softmax to mitigate the problem of using Softmax with many classes.
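For practitioners, the gensim library ships an implementation of this approach; a minimal sketch of training and inference (dm=1 corresponds to the distributed-memory representation and dm=0 to the distributed bag-of-words; the toy corpus and hyper-parameters are illustrative assumptions):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "i think the cat sat on the mat",
    "the dog chased the cat in the garden",
    "stock prices fell sharply after the announcement",
]
documents = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

# dm=1 -> distributed memory (Representation 1); dm=0 -> distributed bag-of-words (Representation 2).
model = Doc2Vec(documents, vector_size=50, window=3, min_count=1, epochs=40, dm=1)

# Inference: the model parameters stay fixed and only the new paragraph's vector is optimized.
new_vec = model.infer_vector("the cat sat on a chair".split())
print(new_vec.shape)  # (50,)
```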

Experimental results

Experiments show that doc2Vec significantly enhances the performance of the classification model in the tasks of sentiment analysis and information retrieval (i.e. classifying whether a document should be retrieved or not). It not only outperforms baselines like Bag-of-words (which retains no information about the order of words), Bag-of-word-n-gram (which suffers from sparsity), but also more advanced methods like Recursive Neural Networks and Support Vector Machines.

Sentence-BERT

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers and Gurevych

SentenceBERT (or SBERT) is a more recent and more performant approach for embedding pieces of text. Instead of just averaging the BERT output layer or taking the output of the first token (the [CLS] token), which yields lower performance, SentenceBERT attempts to fine-tune the pretrained BERT with a siamese structure to better optimize for the similarities (or differences) between sentences. (Note that while we refer to a piece of text as a sentence, it is equivalent to any variable-length sequence of words, like a phrase, paragraph, or document.)

Training

During the training phase, a siamese structure is constructed with 2 BERT networks (sharing the same weights) as its foundation. This architecture is depicted in the figure below.

Figure 1 from Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers and Gurevych
The SBERT architecture for training with classification objective.

Given a dataset in which each instance consists of a pair of texts and a label describing whether the 2 texts are similar, neutral, or opposite, the training architecture tries to predict the label given the input. At first, each input text is fed through BERT to obtain a sequence of embeddings for its tokens, these embeddings are then Mean-pooled to get a fixed-length vector representation for any variable-length text. The 2 fixed-length embeddings for the 2 pieces of text are called u and v. A concatenation of u, v, and |u - v| is then used to classify the label with a Softmax function at the end. Losses are backpropagated to fine-tune the weights of the BERT model in the first blocks.
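To make the objective concrete, here is a PyTorch sketch of the classification head on top of the two pooled sentence embeddings u and v (the BERT encoders and mean pooling are abstracted away; the 768-dimensional embeddings and 3-way label set follow the figure, while the batch size and random inputs are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SBERTClassificationHead(nn.Module):
    """Softmax classifier over the concatenation (u, v, |u - v|)."""
    def __init__(self, embedding_dim=768, num_labels=3):
        super().__init__()
        self.classifier = nn.Linear(3 * embedding_dim, num_labels)

    def forward(self, u, v):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)  # logits; softmax / cross-entropy applied outside

# u and v would come from mean-pooling BERT token embeddings of the two input texts.
u = torch.randn(8, 768)
v = torch.randn(8, 768)
head = SBERTClassificationHead()
labels = torch.randint(0, 3, (8,))
loss = nn.CrossEntropyLoss()(head(u, v), labels)
loss.backward()  # in the real setup, gradients also flow back into the shared BERT weights
print(loss.item())
```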

The total training time for SBERT, according to the authors, is less than 20 minutes on a single V100 GPU.

After this training phase, BERT is expected to learn the embeddings of texts more effectively in terms of sentence similarities (or differences). This is true since it was fine-tuned for the task of identifying the relationships between texts.

At inference time, we compute the sentence embeddings of texts in the same way u and v are computed during training. Then, these embeddings can be compared with Cosine similarity (or negative Manhattan or negative Euclidean distance, or whatever similarity measure we like). See the figure below for an illustration.

Figure 2 from Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers and Gurevych
The SBERT architecture for inference.
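In practice, the sentence-transformers package wraps this inference pipeline; a short usage sketch (all-MiniLM-L6-v2 is a commonly used example checkpoint, not the exact model from the paper):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

sentences = ["the river bank is nice", "I loan from that bank", "a beautiful riverside"]
embeddings = model.encode(sentences)             # one fixed-length vector per sentence

# Compare sentence pairs with cosine similarity.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```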

Experimental results

SBERT is evaluated on the Semantic Textual Similarity (STS) tasks (unsupervised and supervised) and with the SentEval toolkit [24]. On most evaluation datasets, SBERT not only outperforms existing SOTA methods (InferSent [25], Universal Sentence Encoder [26]) but also beats them in terms of inference speed.

The only outlier is the Argument Facet Similarity corpus, on which SBERT underperforms BERT (but this seems to be the case not just for SBERT but for sentence-embedding methods in general).

Image2Vec

Interestingly, while there are multiple works trying to occupy the trending name of <something>2Vec (e.g. Word2Vec, Node2Vec, Edge2Vec), no prominent research papers claimed the term Image2Vec. What are the reasons?

One fundamental reason could be that, unlike other data types, digital images are technically already numeric arrays: 2-D for greyscale and 3-D for color images. So, since images are vectors (or arrays) by default, does that mean there is no need for any technique to transform them into vectors? Not really. Words can also be easily transformed into one-hot vectors, but as we mentioned, there are many disadvantages to doing so. Similar problems occur if we use raw pixels as the representation of images.

Another reason could be that embedding images into fixed-length vectors is a canonical research field that has been actively worked on for a long time (probably since some time in the 20th century), but the trendy <something>2Vec names only started with Word2Vec in 2013. Thus, it does not make sense to rename a long-existing field to follow the new naming style.

That said, even though there are no approaches explicitly named Image2Vec, there are a lot of models that do the job. In fact, most deep neural networks for computer vision compress images into vectors before stacking a few final layers on top for downstream tasks like classification or generation. Interesting models include, but are not limited to, AlexNet [9], VGG [10], ResNet [11], EfficientNet [12], and the attention-based Vision Transformer (ViT) [13]. As details about these models go beyond the scope of this article, we refer the reader to the original papers for in-depth views of how they work.

Node2Vec

Node2Vec methods work in the context of graphs, in which we want to embed each node by a dense vector of real values. Similar to other data types, we also require the embeddings to preserve the similarities between nodes. A node’s characteristics are often defined by its type (in the case of heterogeneous graphs, in which there are multiple node- and/or edge types), its neighbors, and the relationships with its neighbors (e.g. edge type, distance).

In this section, we describe 4 eminent models for transforming nodes into vectors: DeepWalk [14], LINE [15], node2vec [16], and Metapath2Vec [17]. While the first 3 models primarily target homogeneous graphs (where there is only a single type of nodes as well as edges), the last one targets heterogeneous graphs.

DeepWalk

DeepWalk: Online Learning of Social Representations, Perozzi et al.

In simple terms, DeepWalk uses random walks to generate sequences of nodes from the input graph; each sequence is considered analogous to a sentence (and each node to a word). This set of sequences is then used to train a Skip-gram model, and once it converges, we obtain a meaningful embedding for each node.
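A compact sketch of this recipe, using networkx for the graph and gensim's Skip-gram for the embedding step (the walk length, number of walks per node, and embedding size are illustrative assumptions):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

graph = nx.karate_club_graph()

def random_walk(g, start, length=20):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return [str(node) for node in walk]  # gensim expects "words", i.e. strings

# Generate several walks per node; each walk plays the role of a sentence.
walks = [random_walk(graph, node) for _ in range(10) for node in graph.nodes()]

# Train Skip-gram (sg=1) on the walks to obtain one embedding per node.
model = Word2Vec(walks, vector_size=32, window=5, min_count=1, sg=1)
print(model.wv[str(0)].shape)            # (32,)
print(model.wv.most_similar(str(0))[:3]) # nodes that co-occur in walks with node 0
```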

The figure below illustrates the use of DeepWalk on a clustering task. It embeds graph nodes into vectors of length 2 before visualizing them on a plot. As can be seen from the figure, most nodes in the same clusters are close to each other, meaning their vector representations are pretty similar.

Figure 1 from DeepWalk: Online Learning of Social Representations, Perozzi et al.
An illustration of DeepWalk for the Node clustering task on the Karate Graph. Nodes with the same color are intrinsically similar.

LINE

LINE: Large-scale Information Network Embedding, Tang et al.

LINE is another interesting approach. The authors claim that LINE optimizes an objective that takes into account both the local and global structure of the graph when learning the embeddings of nodes. It does so by maximizing the embedding similarities between pairs of nodes with first-order proximity (i.e. preserving the local structure) and second-order proximity (i.e. preserving the global structure). The figure below demonstrates the 2 orders of proximity.

Figure 1 from LINE: Large-scale Information Network Embedding, Tang et al.
An illustration of first- and second-order proximity. Nodes 6 and 7 are an example of nodes with first-order proximity because there is an edge connecting them. Nodes 5 and 6 are an example of nodes with second-order proximity as they share some similar neighbors.

Training

In simple words, LINE defines a node (i.e. decides the embedding of a node) by making it similar to its first- and second-order proximity neighbors, while at the same time making it different from other, far-away nodes.

To make neighboring nodes similar, LINE repeatedly samples an edge (in cases of first-order proximity) or a path of length 2 (in cases of second-order proximity) in the graph and maximizes the dot product of the 2 end points’ embeddings. We remind the reader that we can think of the embeddings as normalized vectors, and thus, the larger the dot product the more similar the 2 embeddings (mind the Cosine similarity).

To make far-away nodes different, LINE, on the other hand, samples far-apart pairs of nodes and minimizes the dot product of their embeddings. This trick is a slight modification of Negative Sampling, as described earlier in this article.

As a side note, if the edges in the graph are weighted, we can do weighted samplings of edges and paths according to the edge weights. This preserves the importance of edges in the graph when encoding the nodes.

As an implementation detail, LINE optimizes for the first- and second-order proximity separately, then concatenates the results to get the final embeddings of the nodes.
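To make the first-order objective concrete, here is a simplified numpy sketch with negative sampling (the toy edge list, uniform negative sampling, and unweighted edges are simplifying assumptions; the full model also handles edge weights and learns separate context vectors for the second-order proximity):

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 100, 32
emb = rng.normal(scale=0.1, size=(num_nodes, dim))
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]  # toy edge list

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def first_order_loss(edges, num_negatives=5):
    """Make connected nodes similar (large dot product) and randomly sampled,
    likely far-apart pairs dissimilar (negative sampling)."""
    loss = 0.0
    for i, j in edges:
        loss -= np.log(sigmoid(emb[i] @ emb[j]))
        negatives = rng.integers(0, num_nodes, size=num_negatives)
        loss -= np.log(sigmoid(-(emb[negatives] @ emb[i]))).sum()
    return loss

print(first_order_loss(edges))
```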

Experimental results

The authors evaluate LINE on 3 different tasks: word analogy, document classification in graphs, and compression for visualization. For the first 2 tasks, LINE outperforms then-SOTA models like Skip-gram [1], DeepWalk [14], and GF (Graph Factorization) [18]. For the last task, the result is compared with GF and DeepWalk and is shown in the figure below (read the figure caption for more details). Qualitatively, we can see that the compression given by t-SNE on the embeddings from LINE is much more meaningful and reasonable than the other 2 methods.

Figure 2 from LINE: Large-scale Information Network Embedding, Tang et al.
Visualization of the 3 methods on the co-author graph, which was extracted from the DBLP data. An author is a node in the graph. Each node is mapped to the 2-D space using t-SNE with learned embeddings as input. The node colors indicate the communities of authors (red – data mining, blue – machine learning, green – computer vision).

node2vec

node2vec: Scalable Feature Learning for Networks, Grover and Leskovec

This is another paper that showed improvement in transforming from nodes to vectors. Its goal is to learn embeddings of nodes that preserve the similarity between (1) nodes in the same community as well as (2) nodes in different communities but have similar roles. The figure below illustrates these 2 types of similarities.

figure showing nodes with 2 types of similarities
The graph shows similarities between nodes. V1 and V2 can be considered as belonging to the same community, while V1 and V8 are in different communities but have similar roles.

Training

Similar to DeepWalk [14], this work also leverages random walks and Skip-gram to learn embeddings of nodes. The random walks generate paths (or sequences of nodes) from the graph, which are treated similarly to sentences, while each node is analogous to a word. The Skip-gram model is then used to learn the embeddings of nodes, ensuring nodes that appear in similar neighborhoods have similar embeddings. The difference to (or improvement over) DeepWalk is the use of second-order biased random walks instead of just the original random walks.

This means that when deciding which node to walk to next, the probability of selecting a candidate node depends on the distance from that candidate to the previously visited node. For example, look at the figure below.

Figure 2 from Node2vec: Scalable Feature Learning for Networks, Grover and Leskovec
Illustration of the second-order biased random walk strategy. t is the last visited node and v is the current node.

The figure illustrates the situation when the walk just traversed the edge (t, v) and is now staying at v. \alpha specifies the unnormalized probability of going to the next node, this probability for the neighbors of v depends on the distance from them to the last node t.

\alpha_{v, x} = \begin{cases}1 & \text{if } d_{tx} = 1 \\\frac{1}{p} & \text{if } d_{tx} = 0 \\\frac{1}{q} & \text{if } d_{tx} = 2\end{cases}

where:

  • \alpha_{v, x} is the unnormalized probability to choose x as the next node in the walk,
  • d_{tx} is the distance between 2 nodes t and x,
  • p and q are 2 hyper-parameters to control the behavior of the walks.

At a high level, this random walk strategy is somewhat of a hybrid of DFS (as in DeepWalk) and BFS (similar to LINE [15]). How to assign values to these 2 hyper-parameters depends on the task and validation performance. (Note that by adjusting p and q, we can also recover pure DFS- or BFS-like random walk strategies.)
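A sketch of one step of the second-order biased random walk, implementing the α rule above with networkx (the p and q values are illustrative; real implementations typically precompute transition probabilities for efficiency):

```python
import random
import networkx as nx

def biased_step(g, prev, curr, p=1.0, q=2.0):
    """Choose the next node from curr, biased by the distance of each
    candidate x to the previously visited node prev (the alpha rule)."""
    neighbors = list(g.neighbors(curr))
    weights = []
    for x in neighbors:
        if x == prev:                      # d(prev, x) == 0: step back
            weights.append(1.0 / p)
        elif g.has_edge(prev, x):          # d(prev, x) == 1: stay close (BFS-like)
            weights.append(1.0)
        else:                              # d(prev, x) == 2: move away (DFS-like)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

g = nx.karate_club_graph()
walk = [0, 1]                               # (t, v) = (0, 1)
for _ in range(10):
    walk.append(biased_step(g, walk[-2], walk[-1]))
print(walk)
```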

Experimental results

The authors tested node2vec on 3 datasets for the task of node classification. In all experiments, node2vec showed superior performance against DeepWalk and LINE.

In this paper, the authors also suggest a simple way to obtain the representation of edges. For an edge e that connects 2 nodes (x, y), they find that having vector(e) = Hadamard(vector(x), vector(y)) (Hadamard product is the element-wise product of 2 similar-size vectors) gives the best performance for all demonstrated edge-based prediction tasks.

Metapath2Vec

Metapath2vec: Scalable Representation Learning for Heterogeneous Networks, Dong and Chawla

Different from DeepWalk [14], LINE [15], and node2vec [16] which only focus on homogeneous graphs, metapath2vec is specifically designed to work on heterogeneous graphs whose nodes and edges may have multiple types.

Training

Similar to most other Node2Vec approaches, metapath2vec also makes use of random walks and Skip-gram for its embedding optimization process. However, the random walks here are not fully random: they need to follow a pre-defined structure of node-/edge-type sequences.

To make things more concrete, let us take an academic publication graph as an example. In this graph, a node can have one of 3 types (author, paper, and venue, abbreviated as A, P, and V, respectively), while there are 2 types of edges (a publish edge connecting A-P and P-V, and a coauthor edge connecting A-A). An exemplar graph is shown in the figure below.

an exemplar graph of academic publication
An exemplar academic publication graph. Yellow nodes are authors, red nodes are papers, and pink nodes are venues.

An essential hyper-parameter we need to decide for metapath2vec is the meta-path scheme, which describes the structure of node-/edge types that the random walks should follow. For example, a meta-path scheme could be A - publish - P - publish - V - publish - P - publish - A. This scheme captures the publishing-in-the-same-venue relationship between authors. The walks generated by the algorithm need to follow this scheme, i.e. start at an author node, then follow a publish edge to a paper node, and so on, until they finish at another author node. All these walks are then fed to Skip-gram to learn the embeddings of nodes, as in DeepWalk and node2vec.
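A sketch of a meta-path-constrained walk on a toy heterogeneous graph, following an A - P - V - P - A scheme (node types are stored as networkx node attributes; the graph, scheme, and walk length are illustrative assumptions):

```python
import random
import networkx as nx

# Toy heterogeneous graph: authors (A), papers (P), venues (V).
g = nx.Graph()
g.add_nodes_from(["a1", "a2"], node_type="A")
g.add_nodes_from(["p1", "p2"], node_type="P")
g.add_nodes_from(["v1"], node_type="V")
g.add_edges_from([("a1", "p1"), ("a2", "p2"), ("p1", "v1"), ("p2", "v1")])

def metapath_walk(g, start, scheme=("A", "P", "V", "P", "A")):
    """A walk that only moves to neighbors whose type matches the next
    entry in the meta-path scheme; returns None if it gets stuck."""
    walk = [start]
    for required_type in scheme[1:]:
        candidates = [n for n in g.neighbors(walk[-1])
                      if g.nodes[n]["node_type"] == required_type]
        if not candidates:
            return None
        walk.append(random.choice(candidates))
    return walk

print(metapath_walk(g, "a1"))  # e.g. ['a1', 'p1', 'v1', 'p2', 'a2']
```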

The authors also suggest a modification of metapath2vec, called metapath2vec++ which alters the Skip-gram model a bit to further enable the simultaneous modeling of structural and semantic correlations of nodes in the graph. This change happens in the classification layer of Skip-gram, in which instead of applying Softmax on all classes as a whole, metapath2vec++ applies Softmax on each type of node separately.

Experimental results

Both variants of metapath2vec are tested on the 2 tasks of node classification and node clustering on heterogeneous graphs. They outperform prior type-ignorant methods (DeepWalk, LINE), showing that their utilization of types is indeed effective.

Edge2Vec

Edge2Vec techniques learn embeddings of edges in graphs.

Earlier in this article, we introduced a simple way to obtain the embedding of an edge by applying the Hadamard product to the embeddings of 2 nodes that this edge connects (the idea is from the node2vec paper [16]).

Another way is through the line graph of the original graph. Suppose we have a graph G; the line graph L(G) of G is constructed as follows: each node in L(G) corresponds to an edge in G, and 2 nodes in L(G) are connected if and only if the corresponding 2 edges in G share a common node. Then, computing the embeddings of the nodes in L(G) gives us the edge embeddings of G.
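The line-graph construction is available directly in networkx, so any of the Node2Vec methods above can be reused on L(G) to obtain edge embeddings for G; a quick sketch (the toy graph is illustrative):

```python
import networkx as nx

g = nx.Graph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])

# Each node of L(G) is an edge of G; two such nodes are adjacent
# iff the corresponding edges of G share an endpoint.
lg = nx.line_graph(g)

print(g.number_of_edges(), lg.number_of_nodes())  # 4 4: one L(G) node per edge of G
print(list(lg.nodes())[:2])                       # each node of L(G) is a 2-tuple naming an edge of G

# Any node-embedding method (DeepWalk, node2vec, ...) applied to lg now
# yields one vector per edge of the original graph g.
```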

However, neither of the 2 methods above preserves all properties of edges, and they are not really performant (compared to the method we are about to introduce next).

Edge2vec: Edge-based Social Network Embedding, Wang et al.

Since the number of edges in a graph can be very large (up to quadratic in the number of nodes), computing embeddings for the edges of an arbitrary graph can be very memory-consuming. To facilitate research, the proposed method in this paper is trained and tested only on social network graphs, in which the average node degree is small, resulting in a reasonable number of edges.

The work defines 2 types of proximity: local and global proximities.

  • Local Edge Proximity refers to whether 2 edges share one vertex.
  • Global Edge Proximity considers the distances between all pairs of edges. In simple terms, 2 edges are globally proximal if their distances to all nodes in the graph are similar (see the figure below).
Figure 2 from Edge2vec: Edge-based Social Network Embedding, Wang et al.
Example of a directed graph. 2 edges (A-C) and (A-D), and 2 edges (E-H) and (F-H) are both locally proximal. 2 edges (E-C) and (F-D) are globally proximal.

Training

The training process uses a deep autoencoder together with Skip-gram to preserve both types of proximity in edge embeddings.

For Local Edge Proximity, we first list out all pairs of adjacent edges (i.e. pairs of edges that share 1 vertex). The “co-occur” probability of 2 edges in all these pairs is then maximized by Skip-gram with Negative Sampling. As compared to text processing, this is analogous to maximizing the similarity of 2 words in the same context in word2vec [1].

For Global Edge Proximity, given an edge e, we define a neighborhood vector n_e of length 2|V| (where V is the set of nodes) in which each value is in the range [0, 1], showing the information about how close each of the 2 ends of e is to nodes in the graph. The closer the distance, the higher the corresponding value. The global proximity of 2 edges is the similarity of their neighborhood vectors (where the “similarity” can be quantified by, for example, Cosine similarity). Furthermore, to reduce the size of the neighborhood vectors, we compress them with a deep autoencoder.

Experimental results

The approach is evaluated on the tasks of link prediction, social tie sign prediction, and social tie direction prediction. In all experiments, it consistently outperforms other embedding-based deep models (DeepWalk [14], LINE [15], node2vec [16]).

Code2Vec

Computer programs can also be embedded.

While code embedding/understanding is an active research area with many recent works on it, in this article, we will introduce only 1 paper while leaving a comprehensive review of this field to a future thread.

code2vec: Learning Distributed Representations of Code, Alon et al.

The main objective of this method is to represent each code snippet by a fixed-length vector so that similar snippets result in similar vectors. Moreover, these vectors should also preserve the linear relationship (as in word2vec [1]). For example, given 3 functions equals, toLowerCase, and equalsIgnoreCase, we would like to have vector(equals) + vector(toLowerCase) = vector(equalsIgnoreCase).

Training

code2vec decomposes the code into paths in its Abstract syntax tree (AST) and then learns the representation of each path simultaneously with learning how to aggregate them using soft attention.

The AST of a code snippet is a tree that describes its computational structure. The figure below shows the AST of a simple piece of code.

An example of a simple code snippet with its abstract syntax tree
An exemplar Java code and its abstract syntax tree.

Within the scope of this paper, a path in the AST is a walk that connects 2 leaves of the tree without repeating any node (and thus there is only 1 path connecting any pair of nodes).

At first, we randomly initialize 2 matrices, one for the embeddings of paths and the other for the embeddings of nodes.

A path is characterized by the list of nodes and the directions of the edges that connect these nodes (directions can be either up or down the AST). As an implementation detail, only the most frequent paths in the training data are considered, the vocabulary size of paths is 1M. A node is characterized by its type and name.

A path-context representation is a concatenation of 3 vectors: the path’s vector and the vectors of its 2 leaves. Finally, a code snippet is represented by a random subset of path-contexts extracted from the AST.

There is a global context vector a for attention. Note that the attention here is global and is not similar to the self-attention mechanism from, e.g. the Transformer [21].

The path embeddings, node embeddings, and the global context vector a are jointly trained with gradient descent on the objective of classifying function names given function bodies. The steps of this process are summarized below (a small sketch of the attention-based aggregation follows the figure):

  • Given a function’s code, we build its AST.
  • From the AST, a subset of path-contexts is randomly extracted.
  • Each path-context is vectorized as specified earlier.
  • This set of path-context vectors is fed through a neural network (with the attention mechanism) to predict the corresponding function name.
  • After the network converges, we obtain the learned embeddings for all nodes and paths (and thus path-contexts). These are used later when we want to embed new functions.
Figure 2 from code2vec: Learning Distributed Representations of Code, Alon et al.
An example that shows the effectiveness of using attention. While the 3 functions have similar syntactic structures, code2vec, using the attention mechanism, can still distinguish them and predict meaningful function names. For each code snippet, the 4 paths that receive the most attention are highlighted in color, with path width proportional to the attention weight.
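A PyTorch sketch of the soft-attention aggregation over path-context vectors (dimensions are illustrative; the path/node embedding lookups, the fully connected layer that mixes the 3 concatenated vectors, and the function-name classifier of the full model are collapsed into the essentials):

```python
import torch
import torch.nn as nn

class PathContextAttention(nn.Module):
    """Aggregate a bag of path-context vectors into one code vector
    using a learned global attention vector a."""
    def __init__(self, context_dim=384):
        super().__init__()
        self.attention = nn.Parameter(torch.randn(context_dim))  # global context vector a

    def forward(self, contexts):                  # contexts: (num_contexts, context_dim)
        scores = contexts @ self.attention        # one scalar score per path-context
        weights = torch.softmax(scores, dim=0)    # attention weights sum to 1
        return weights @ contexts                 # weighted sum -> the code vector

# Each path-context concatenates [leaf1, path, leaf2] embeddings (e.g. 3 x 128 dims).
contexts = torch.randn(200, 384)                  # 200 randomly sampled path-contexts
code_vector = PathContextAttention()(contexts)
print(code_vector.shape)                          # torch.Size([384])

# During training, the code vector feeds a linear layer + softmax over function names.
```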

Experimental results

Being tested on the task of function name prediction, code2vec sharply outperforms existing approaches which use Path + Conditional Random Fields, or LSTM/CNN + attention.

Data2Vec

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, Baevski et al.

Last but not least, we introduce a recent work from Meta. Interestingly, this proposed approach can transform multiple modalities (texts, images, and voice) into vectors using the same training scheme while still maintaining SOTA performance.

Note, however, that each modality needs to be processed separately (i.e. not at the same time); the model cannot yet take in multiple modalities at once.

Training

As a one-sentence summary, the approach uses an off-the-shelf transformer architecture with the pretraining objective of predicting the representation of the original input given a masked view of that input.

There are 2 models, a teacher and a student; both have the same architecture (a transformer) and size, but different weights. The input to the teacher is always the original data, while the input to the student is the masked data. At each training step, the student learns to predict the average of the last K layers of the teacher, while the teacher’s parameters are updated as an exponential moving average of the student’s. At a high level, we try to make the encodings of the original input and of the masked input as similar as possible.
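A schematic PyTorch sketch of this teacher-student update (the transformer encoders, learned mask tokens, and modality-specific preprocessing are collapsed into a tiny placeholder encoder; K, the EMA decay, the zero-masking, and the smooth-L1 regression loss follow the spirit of the paper but the exact values here are illustrative assumptions):

```python
import copy
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Placeholder for the transformer encoder; returns all 'layer' outputs."""
    def __init__(self, dim=64, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        outputs = []
        for layer in self.layers:
            x = torch.relu(layer(x))
            outputs.append(x)
        return outputs                       # list of per-layer representations

student = TinyEncoder()
teacher = copy.deepcopy(student)             # same architecture, separate weights
for p in teacher.parameters():
    p.requires_grad_(False)

def training_step(x, masked_positions, k=3, ema_decay=0.999):
    # Teacher sees the original input; the target is the average of its last k layers.
    with torch.no_grad():
        target = torch.stack(teacher(x)[-k:]).mean(dim=0)
    # Student sees the masked input (zeroing stands in for learned mask tokens).
    student_in = x.masked_fill(masked_positions.unsqueeze(-1), 0.0)
    prediction = student(student_in)[-1]
    loss = nn.functional.smooth_l1_loss(prediction[masked_positions], target[masked_positions])
    loss.backward()                          # only the student receives gradient updates
    # The teacher's weights track the student via an exponential moving average.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)
    return loss

x = torch.randn(10, 64)                       # a toy sequence of 10 time steps
masked_positions = torch.tensor([True, False] * 5)  # mask half of the time steps
print(training_step(x, masked_positions).item())
```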

Notice that these learned targets are contextualized, unlike in well-known models such as BERT [6]. In BERT, the target for each masked token is a fixed (correct) token, while in data2vec, the target for each masked position is a vector representation that depends on the surrounding context.

Also, while the training scheme is the same regardless of input modality, the preprocessing still differs per data type. For images, the ViT strategy [13] is used: an image is encoded as a sequence of patches, each spanning 16×16 pixels. For speech, a multi-layer 1-D convolutional neural network maps the 16 kHz waveform to a 50 Hz representation (to reduce the input sequence length). For text, the input is split into sub-words, which are encoded with embedding vectors.

The input is then masked (15% for text, 60% for images, and 49% for speech) before being fed to the student network.

The figure below illustrates the workflow of data2vec.

Figure 1 from data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, Baevski et al.
Illustration of data2vec’s workflow for different modalities. While different input data types are preprocessed and masked differently, the same architecture and training scheme are used for all of them.

Experimental results

data2vec achieves SOTA performance on all 3 modalities. For image classification and speech recognition, it noticeably beats prior SOTA approaches. For text data, its results on the GLUE tasks are comparable to those of BERT [6] and RoBERTa [23].

References:

  • [1] Efficient Estimation of Word Representations in Vector Space, Mikolov et al. (2013), paper.
  • [2] Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. (2013), paper.
  • [3] Hierarchical Probabilistic Neural Network Language Model, Morin and Bengio (2005), paper.
  • [4] Glove: Global vectors for word representation, Pennington et al. (2014), paper.
  • [5] Enriching Word Vectors with Subword Information, Bojanowski et al. (2017), paper.
  • [6] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. (2018), paper.
  • [7] Distributed Representations of Sentences and Documents, Le and Mikolov (2014), paper.
  • [8] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers and Gurevych (2019), paper.
  • [9] ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al. (2012), paper.
  • [10] Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan and Zisserman (2014), paper.
  • [11] Deep Residual Learning for Image Recognition, He et al. (2016), paper.
  • [12] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Tan and Le (2019), paper.
  • [13] An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al. (2020), paper.
  • [14] DeepWalk: Online Learning of Social Representations, Perozzi et al. (2014), paper.
  • [15] LINE: Large-scale Information Network Embedding, Tang et al. (2015), paper.
  • [16] node2vec: Scalable Feature Learning for Networks, Grover and Leskovec (2016), paper.
  • [17] Metapath2vec: Scalable Representation Learning for Heterogeneous Networks, Dong and Chawla (2017), paper.
  • [18] Distributed large-scale natural graph factorization, Ahmed et al. (2013), paper.
  • [19] Edge2vec: Edge-based Social Network Embedding, Wang et al. (2020), paper.
  • [20] code2vec: Learning Distributed Representations of Code, Alon et al. (2019), paper.
  • [21] Attention is all you need, Vaswani et al. (2017), paper.
  • [22] data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, Baevski et al. (2022), paper.
  • [23] RoBERTa: A robustly optimized BERT pretraining approach, Liu et al. (2019), paper.
  • [24] SentEval: An Evaluation Toolkit for Universal Sentence Representations, Conneau and Kiela (2018), paper.
  • [25] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, Conneau et al. (2017), paper.
  • [26] Universal Sentence Encoder, Cer et al. (2018), paper.
