GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian
Aleksei Dorkin, Kairit Sirts
arXiv.org Artificial Intelligence
We present GliLem, a novel hybrid lemmatization system for Estonian that enhances the highly accurate rule-based morphological analyzer Vabamorf with an external disambiguation module based on GliNER, an open vocabulary NER model that is able to match text spans with text labels in natural language. We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf by 10% compared to its original disambiguation module and achieve an improvement over the token classification-based baseline. To measure the impact of improvements in lemmatization accuracy on the information retrieval downstream task, we first created an information ...

Effective lemmatization enhances various downstream NLP tasks, including information retrieval based on lexical search and text analysis. Although dense vector retrieval is gaining traction in information retrieval, lexical search methods remain highly relevant, particularly in modern hybrid systems. Lexical search excels as a first-stage retriever due to its efficiency with inverted indices, and provides reliable exact term matching that dense retrievers may miss (Gao et al., 2021). Recent research demonstrates that lexical and dense retrieval are complementary: lexical matching provides a strong foundation for precise word-level matches, while dense retrieval captures semantic relationships and handles vocabulary mismatches. The complementary nature of these approaches has led to state-of-the-art hybrid systems that outperform either method alone (Lee et al., 2023).
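To illustrate why lemmatization matters for lexical search over an inverted index, the following is a minimal sketch and not the paper's pipeline. The lemma lookup `TOY_LEMMAS`, the helper names, and the three example sentences are all hypothetical and written by hand; a real system would obtain lemmas from an analyzer such as Vabamorf with a disambiguation step.

```python
from collections import defaultdict

# Toy lemma lookup written by hand for illustration only;
# this is NOT Vabamorf or GliLem output.
TOY_LEMMAS = {
    "raamatud": "raamat",   # "books" (nominative plural)
    "raamatut": "raamat",   # "book" (partitive singular)
    "loen": "lugema",       # "I read"
    "loeb": "lugema",       # "he/she reads"
}

def lemmatize(token):
    """Return the toy lemma for a token, falling back to the lowercased token."""
    token = token.lower()
    return TOY_LEMMAS.get(token, token)

def build_index(docs, normalize):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query, normalize):
    """Return ids of documents that contain every normalized query term."""
    postings = [index.get(normalize(t), set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "Ma loen raamatut", 2: "Raamatud on laual", 3: "Ta loeb ajalehte"}

# Index the same documents twice: by lowercased surface form and by lemma.
surface_index = build_index(docs, str.lower)
lemma_index = build_index(docs, lemmatize)
```

In this toy setup, the query `raamat` finds nothing in the surface-form index because only inflected forms occur in the documents, while the lemma-keyed index returns documents 1 and 2. Exact term matching is only as good as the normalization applied before indexing, which is why lemmatization accuracy can affect retrieval quality in a morphologically rich language.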
Jan-11-2025
- Country:
- Asia > Japan
- Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Estonia > Tartu County
- Tartu (0.05)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Finland > Southwest Finland
- Turku (0.04)
- France > Île-de-France
- Romania > Sud - Muntenia Development Region
- Giurgiu County > Giurgiu (0.04)
- Spain > Aragón (0.04)
- North America > Mexico
- Mexico City > Mexico City (0.04)
- Genre:
- Research Report > New Finding (0.48)
- Technology: