Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs
You, Doohee; Fraiberger, Samuel
arXiv.org Artificial Intelligence
This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore several pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and an sBERT model for semantic evaluation. The semantic similarity observed across these methods suggests a low prevalence of duplicates in the dataset. For a more conclusive assessment, we also evaluate against a human-annotated ground-truth set; the result supports the findings from the NLP- and LLM-based distance metrics.
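To make the two distance measures named in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the whitespace tokenization, and the bag-of-words vectors used for cosine similarity are all assumptions made for illustration. In an sBERT-based setup the same cosine formula would presumably be applied to sentence embeddings rather than word counts.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (assumed tokenization:
    lowercase, split on whitespace)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical titles, invented for this example.
    t1 = "Trade openness and economic growth"
    t2 = "Trade Openness and Economic Growth: Evidence from panel data"
    print(levenshtein_similarity(t1.lower(), t2.lower()))
    print(cosine_similarity(t1, t2))
```

A pair of titles would be flagged as a candidate duplicate when one or both scores exceed a chosen threshold; the threshold itself is a design choice that the abstract does not specify.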
Dec-11-2024
- Country:
- North America > United States > New York (0.04)
- Genre:
- Research Report > New Finding (0.55)
- Technology: