SentAlign: Accurate and Scalable Sentence Alignment
Steingrímsson, Steinþór, Loftsson, Hrafn, Way, Andy
arXiv.org Artificial Intelligence
We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE bilingual sentence representations. SentAlign outperforms five other sentence alignment tools when evaluated on two different evaluation sets, German-French and English-Icelandic, and on a downstream machine translation task.
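The idea of scoring candidate sentence pairs by embedding similarity and then searching for the best monotone alignment path can be illustrated with a small dynamic program. This is only a minimal sketch under stated assumptions, not SentAlign's actual implementation: it handles one-to-one links and skipped sentences only, uses hand-made toy vectors in place of LaBSE embeddings, and the names `cosine`, `align`, and `skip_penalty` are hypothetical. The divide-and-conquer step for very long documents is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(src_vecs, tgt_vecs, skip_penalty=0.0):
    """Return the monotone 1-1 alignment with the highest total similarity.

    src_vecs, tgt_vecs: lists of sentence vectors (toy stand-ins for embeddings).
    skip_penalty: hypothetical cost charged for leaving a sentence unaligned.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    # best[i][j]: best score for the first i source and first j target sentences.
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                # Link source sentence i-1 with target sentence j-1.
                score = best[i - 1][j - 1] + cosine(src_vecs[i - 1], tgt_vecs[j - 1])
                candidates.append((score, (i - 1, j - 1)))
            if i > 0:
                # Leave source sentence i-1 unaligned.
                candidates.append((best[i - 1][j] - skip_penalty, (i - 1, j)))
            if j > 0:
                # Leave target sentence j-1 unaligned.
                candidates.append((best[i][j - 1] - skip_penalty, (i, j - 1)))
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Walk the back-pointers from the end to recover the linked pairs.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        prev_i, prev_j = back[i][j]
        if prev_i == i - 1 and prev_j == j - 1:
            pairs.append((i - 1, j - 1))
        i, j = prev_i, prev_j
    pairs.reverse()
    return pairs


# Toy example: the middle source sentence has no counterpart on the target side.
source = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
target = [[1, 0, 0], [0, 0, 1]]
alignment = align(source, target)
```

The table in this sketch has one cell per (source, target) position, so its cost grows with the product of the two document lengths. That growth is one plausible reason a tool would split very long documents before aligning them, though the abstract does not give the details of how SentAlign does this.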
Nov-15-2023
- Country:
- Oceania > Australia
- North America
- United States
- Pennsylvania (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Colorado > Denver County
- Denver (0.04)
- California > Los Angeles County
- Long Beach (0.04)
- Canada > British Columbia
- Europe
- Bulgaria (0.04)
- Spain (0.04)
- Latvia > Riga Municipality
- Riga (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Middle East > Malta
- Port Region > Southern Harbour District > Valletta (0.04)
- Finland > Southwest Finland
- Turku (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Italy > Trentino-Alto Adige/Südtirol
- Trentino Province > Trento (0.04)
- Sweden > Östergötland County
- Linköping (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia > China
- Genre:
- Research Report (0.64)
- Technology: