Lost in Space Marking

Jacobs, Cassandra L., Pinter, Yuval

arXiv.org Artificial Intelligence 

Modern NLP is dominated by large pre-trained models, systems which are large, complex, and costly to train. As a result, much research effort is put into questions of tuning and configuring the various layers and training regimes for improving prediction quality on a growing number of tasks (Rogers et al., 2020). Unfortunately, not as much research asks questions about the decisions made at the most upstream parts of the models, those that deal with input tokenization and subword vocabulary creation. In this exploratory work, we isolate a single decision point which appears to be resolved arbitrarily by existing model developers, with no consensus but also no underlying theory: should subword tokenizers mark word boundaries at the beginning or the end?

Such a claim requires empirical support, but consideration of common practice can also be offered to challenge it: for one, pre-tokenization such as punctuation separation and accent normalization is not always applied consistently when moving on to a downstream text. A model that was trained on untreated text may find it difficult to process an NER dataset (for example) where punctuation is separated from preceding words, rendering a word-final-marking tokenizer more robust to change. Some tokenizers like BERT's Wordpiece (Devlin et al., 2019) "mark" a class of tokens by omission, i.e. marking the non-initial pieces rather than the initial ones. This discrepancy surfaces edge-case effects when compared with a seemingly equivalent tokenizer like GPT-2's (Radford et al., 2019), which marks initial pieces, but only if they are prepended by a space.
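
To make the contrast concrete, the sketch below (assuming the HuggingFace transformers library and its pretrained bert-base-uncased and gpt2 checkpoints, which are not part of this excerpt) prints the two marking conventions side by side; the token splits shown in the comments are illustrative rather than guaranteed.

```python
# Contrast of boundary-marking conventions: WordPiece marks non-initial pieces,
# GPT-2's byte-level BPE marks initial pieces that follow a space.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
gpt2 = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE

text = "Tokenization matters."

# WordPiece marks *non-initial* pieces with "##", so word starts are marked
# by omission: e.g. ['token', '##ization', 'matters', '.']
print(bert.tokenize(text))

# GPT-2 marks *initial* pieces, but only when a space precedes them
# ("Ġ" encodes the leading space): e.g. ['Token', 'ization', 'Ġmatters', '.']
print(gpt2.tokenize(text))

# Sensitivity to pre-tokenization: separating punctuation from the preceding
# word changes GPT-2's output, since the space now becomes part of the mark.
print(gpt2.tokenize("matters."))   # e.g. ['matters', '.']
print(gpt2.tokenize("matters ."))  # e.g. ['matters', 'Ġ.']
```

Under this framing, WordPiece's "##" continuation marks are unaffected by whether punctuation is split off during pre-tokenization, whereas GPT-2's space-dependent marks shift with it, which is the kind of edge-case discrepancy described above.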
