NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs

Mikhail Galkin, Etienne Denis, Jiapeng Wu, William L. Hamilton

arXiv.org Artificial Intelligence 

Drawing parallels with the subword tokenization commonly used in NLP, we explore the landscape of more parameter-efficient node embedding strategies. To this end, we propose NodePiece, an anchor-based approach to learning a fixed-size entity vocabulary. In NodePiece, a vocabulary of subword/sub-entity units is constructed from anchor nodes in a graph with known relation types. Given such a fixed-size vocabulary, it is possible to bootstrap an encoding and embedding for any entity, including those unseen during training. Experiments show that NodePiece performs competitively in node classification, link prediction, and relation prediction tasks while retaining less than 10% of the explicit nodes in a graph as anchors and often having 10x fewer parameters.

Representation learning tasks on knowledge graphs (KGs) often require a parameterization of each unique atom in the graph with a vector or matrix. Traditionally, in multi-relational KGs such atoms constitute the set of all nodes (entities) n ∈ N and relations (edge types) r ∈ R (Nickel et al., 2016). In what follows we concentrate on nodes, as their number is usually orders of magnitude larger than the number of edge types. While this is efficient on small conventional benchmarking datasets based on Freebase (Toutanova & Chen, 2015) (∼15K nodes) and WordNet (Dettmers et al., 2018) (∼40K nodes), training on larger graphs (e.g., YAGO 3-10 (Mahdisoltani et al., 2015) with ∼120K nodes) becomes computationally challenging. Scaling further up to larger subsets (Hu et al., 2020; Wang et al., 2021; Safavi & Koutra, 2020) of Wikidata (Vrandecic & Krötzsch, 2014) requires a top-level GPU or a CPU cluster, as done in, e.g., PyTorch-BigGraph (Lerer et al., 2019), which maintains a 78M × 200d embedding matrix in memory (we list sizes of the current best-performing models in Table 1).

Taking the perspective from NLP, shallow node encoding in KGs corresponds to the shallow word embeddings popularized by word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which learned vocabularies of the 400K-2M most frequent words and treated rarer ones as out-of-vocabulary (OOV). The OOV issue was resolved by subword units, which make it possible to build an unbounded number of combinations from a finite vocabulary. Subword-powered algorithms such as fastText (Bojanowski et al., 2017), Byte-Pair Encoding (Sennrich et al., 2016), and WordPiece (Schuster & Nakajima, 2012) became a standard step in the preprocessing pipelines of large language models and allowed the construction of fixed-size token vocabularies; e.g., BERT (Devlin et al., 2019) contains 30K tokens.

Table 1: Node embedding sizes of state-of-the-art KG embedding models compared to BERT Large. Parameters of type float32 take 4 bytes each. FB15k-237, WN18RR, and YAGO3-10 models are as reported in Sun et al. (2019); OGB WikiKG2 as in Zhang et al. (2020c); Wikidata 5M as in Wang et al. (2021); PBG Wikidata as in Lerer et al. (2019); BERT Large as in Devlin et al. (2019).
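As a concrete illustration of the float32 note in the table caption, the memory footprint of a shallow node embedding table follows from a one-line calculation (a minimal sketch; the helper name is ours, not part of any library):

    # Back-of-the-envelope memory cost of a shallow embedding table
    # (float32 = 4 bytes per parameter).
    def embedding_size_gb(num_nodes, dim, bytes_per_param=4):
        return num_nodes * dim * bytes_per_param / 1e9

    print(embedding_size_gb(78_000_000, 200))  # PBG Wikidata: ~62.4 GB for node embeddings alone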
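Returning to the anchor-based vocabulary described in the abstract, the sketch below shows one way such a scheme could be wired up: each node is tokenized into its k nearest anchors (found here by BFS) plus the relation types incident to it, and an MLP encoder maps the concatenated token embeddings to a node embedding. This is a minimal illustration under our own assumptions (names such as AnchorNodeEncoder and k_nearest_anchors are ours), not the official NodePiece implementation.

    # Minimal sketch of anchor-based node tokenization and encoding.
    # Illustrative only: names are assumptions, not the official NodePiece API.
    from collections import deque

    import torch
    import torch.nn as nn


    def k_nearest_anchors(adj, node, anchors, k):
        """BFS over an adjacency dict; return up to k anchor ids ordered by hop distance."""
        seen, found, queue = {node}, [], deque([node])
        while queue and len(found) < k:
            current = queue.popleft()
            if current in anchors and current != node:
                found.append(current)
            for nbr in adj.get(current, []):
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        return found


    class AnchorNodeEncoder(nn.Module):
        """Encode a node from k anchor tokens plus m incident relation-type tokens."""

        def __init__(self, num_anchors, num_relations, k, m, dim):
            super().__init__()
            self.anchor_emb = nn.Embedding(num_anchors + 1, dim)  # +1 for a [PAD] anchor token
            self.rel_emb = nn.Embedding(num_relations + 1, dim)   # +1 for a [PAD] relation token
            self.encoder = nn.Sequential(
                nn.Linear((k + m) * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def forward(self, anchor_ids, rel_ids):
            # anchor_ids: (batch, k), rel_ids: (batch, m), padded/truncated to fixed length
            tokens = torch.cat([self.anchor_emb(anchor_ids), self.rel_emb(rel_ids)], dim=1)
            return self.encoder(tokens.flatten(start_dim=1))

In this sketch, only the anchor and relation embedding tables carry parameters, so model size is governed by the number of anchors rather than by the total number of nodes, which mirrors the parameter savings described in the abstract.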