Pinter, Yuval
Incorporating Context into Subword Vocabularies
Yehezkel, Shaked, Pinter, Yuval
Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models' highly contextualized settings. We present SaGe, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary creation phase. We show that SaGe does a better job than current widespread tokenizers in keeping token contexts cohesive, while not incurring a large price in terms of encoding efficiency or domain robustness. SaGe improves performance on English GLUE classification tasks as well as on NER, and on Inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.
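The abstract contrasts frequency-only vocabulary training with a vocabulary built under a contextualized signal. The toy sketch below is not the SaGe objective; it is a hypothetical illustration (the helpers frequency_score and context_cohesion_score, the negative-entropy measure, and the tiny corpus are all our own) of the kind of information a context-aware criterion sees that raw counts do not.

from collections import Counter
from math import log2

def frequency_score(token, corpus_tokens):
    """Raw count: the kind of signal frequency-based trainers rely on."""
    return corpus_tokens.count(token)

def context_cohesion_score(token, corpus_tokens, window=2):
    """Toy cohesion measure: negative entropy of the token's context distribution.
    Values closer to zero mean the token keeps appearing in similar contexts."""
    contexts = Counter()
    for i, t in enumerate(corpus_tokens):
        if t == token:
            lo, hi = max(0, i - window), min(len(corpus_tokens), i + window + 1)
            contexts.update(corpus_tokens[lo:i] + corpus_tokens[i + 1:hi])
    total = sum(contexts.values())
    if total == 0:
        return float("-inf")
    return sum((c / total) * log2(c / total) for c in contexts.values())

corpus = "the new tokenizer splits the new word into new subword units".split()
for cand in ("new", "the"):
    print(cand, frequency_score(cand, corpus), round(context_cohesion_score(cand, corpus), 3))

Two candidates with similar counts can score quite differently once the consistency of their surrounding tokens is taken into account, which is the intuition behind preferring subwords whose contexts stay cohesive.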
Lost in Space Marking
Jacobs, Cassandra L., Pinter, Yuval
Modern NLP is dominated by large pre-trained models, systems which are large, complex, and costly to train. As a result, much research effort is put into questions of tuning and configuring the various layers and training regimes for improving prediction quality on a growing number of tasks (Rogers et al., 2020). Unfortunately, not as much research asks questions about the decisions made at the most upstream parts of the models, those that deal with input tokenization and subword vocabulary creation. In this exploratory work, we isolate a single decision point which appears to be resolved arbitrarily by existing model developers, with no consensus but also no underlying theory: should subword tokenizers mark word boundaries at the beginning or the end?
Such a claim requires empirical support, but consideration of common practice can also be offered to challenge it: for one, pre-tokenization such as punctuation separation and accent normalization is not always applied consistently when moving on to a downstream text. A model that was trained on untreated text may find it difficult to process an NER dataset (for example) where punctuation is separated from preceding words, rendering a word-final-marking tokenizer more robust to change; some tokenizers like BERT's Wordpiece (Devlin et al., 2019) "mark" a class of tokens by omission, i.e. marking the non-initial pieces rather than initial ones. This discrepancy surfaces edge case effects when compared with a seemingly-equivalent tokenizer like GPT-2's (Radford et al., 2019), which marks initial pieces but only if they are prepended by a space.
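As a concrete illustration of the boundary-marking discrepancy described above, here is a minimal sketch assuming the Hugging Face transformers library and its public bert-base-uncased and gpt2 checkpoints; the exact subword splits depend on those pretrained vocabularies.

from transformers import AutoTokenizer

# WordPiece "marks by omission": non-initial pieces carry a "##" prefix.
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
# GPT-2's byte-level BPE marks pieces that follow a space with a "Ġ" prefix.
gpt2 = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenizers disagree, unsurprisingly."
print(bert.tokenize(text))  # non-initial subwords appear as ##-prefixed pieces
print(gpt2.tokenize(text))  # space-preceded pieces appear with a Ġ prefix

# The edge case mentioned above: GPT-2 marks an initial piece only when the word
# is actually preceded by a space, so a sentence-initial word carries no mark.
print(gpt2.tokenize("word"), gpt2.tokenize(" word"))

The last line makes the inconsistency visible: the same word receives a different boundary mark depending only on whether a space precedes it.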
Will it Unblend?
Pinter, Yuval, Jacobs, Cassandra L., Eisenstein, Jacob
Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV blends to quantify the difficulty of interpreting the meanings of blends by large-scale contextual language models such as BERT. We first show that BERT's processing of these blends does not fully access the component meanings, leaving their contextual representations semantically impoverished. We find this is mostly due to the loss of characters resulting from blend formation. Then, we assess how easily different models can recognize the structure and recover the origin of blends, and find that context-aware embedding systems outperform character-level and context-free embeddings, although their results are still far from satisfactory.
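A minimal sketch of the segmentation issue studied here, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; the blend "innoventor" is taken from the abstract, and reading it as a fusion of "innovator" and "inventor" is our own gloss.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# The OOV blend is split into whatever subword pieces the vocabulary happens to
# contain, and those pieces need not align with either base word, which is one
# way the characters lost during blend formation impoverish its representation.
for word in ("innoventor", "innovator", "inventor"):
    print(word, "->", tok.tokenize(word))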