Context based lemmatizer for Polish language
Karwatowski, Michal, Pietron, Marcin
arXiv.org Artificial Intelligence
Natural Language Processing consists of many tasks, each of which extracts and processes human-understandable meaning from text data. Some tasks, such as classification, encompass the complete flow from data to answer; in others, such as part-of-speech tagging, the results are often used as input for subsequent algorithms. An interesting and complex problem is translation, where the meaning of a text must be extracted and encoded back into text in a different language. Translation belongs to a family of NLP tasks called text-to-text or sequence-to-sequence processing. Another example of text-to-text processing is lemmatisation, which finds the base form of a given word or expression. The complexity of this problem varies from language to language. In English the number of word variations is usually low, the rules are simple, and there are few exceptions. In Slavic languages such as Polish, however, the inflection of words is significantly more complicated, and effective lemmatisation is beyond the capabilities of rule-based or edit-tree classification methods [1], [2]. The situation becomes more difficult when multi-segment expressions are included.
Jul-23-2022