AITopics | kenlm

kenlm

Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora

Kim, Yungi, Ha, Hyunsoo, Lee, Sukyung, Kim, Jihoo, Yang, Seonghoon, Park, Chanjun

arXiv.org Artificial Intelligence, Sep-15-2024

With the increasing demand for substantial amounts of high-quality data to train large language models (LLMs), efficiently filtering large web corpora has become a critical challenge. For this purpose, KenLM, a lightweight n-gram-based language model that operates on CPUs, is widely used. However, the traditional method of training KenLM utilizes only high-quality data and, consequently, does not explicitly learn the linguistic patterns of low-quality data. To address this issue, we propose an ensemble approach that leverages two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on low-quality data. Experimental results demonstrate that our approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This indicates that our method can be a practical solution with minimal computational overhead for resource-constrained environments.
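
The abstract does not spell out how the two models are combined, so the following is only a minimal sketch of one plausible ensemble, under stated assumptions: each document already has a perplexity from a Good KenLM and from a Bad KenLM (for instance via the `perplexity` method of the `kenlm` Python binding), the two perplexity lists are standardized with z-scores, and they are mixed with a hypothetical weight `alpha`. The function names and the weighting scheme are illustrative and should not be read as the authors' exact method.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no document stands out.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def ensemble_scores(good_ppl, bad_ppl, alpha=0.5):
    """Combine Good and Bad perplexities; a lower score suggests higher quality.

    A low Good-model perplexity lowers the score (text resembles clean data),
    while a low Bad-model perplexity raises it (text resembles noisy data).
    `alpha` is an assumed mixing weight, not a value taken from the paper.
    """
    zg = zscores(good_ppl)
    zb = zscores(bad_ppl)
    return [alpha * g - (1 - alpha) * b for g, b in zip(zg, zb)]


def keep_indices(scores, keep_fraction):
    """Return indices of the best-scoring (lowest) fraction of documents."""
    n_keep = int(len(scores) * keep_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_keep])


# Toy perplexities (made up for illustration): documents 0 and 2 look clean.
good_ppl = [50.0, 400.0, 60.0, 300.0]
bad_ppl = [900.0, 80.0, 850.0, 120.0]
scores = ensemble_scores(good_ppl, bad_ppl, alpha=0.5)
print(keep_indices(scores, 0.5))  # [0, 2]
```

With `alpha=1.0` the score depends only on the Good model, which corresponds to the traditional single-model filtering the abstract contrasts against.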

bad kenlm, dataset, kenlm

arXiv:2409.09613

Country:

North America > United States (0.04)
Europe > Ireland (0.04)
Europe > Hungary (0.04)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Exploiting Language Relatedness in Machine Translation Through Domain Adaptation Techniques

Kumar, Amit, Baruah, Rupjyoti, Pratap, Ajay, Swarnkar, Mayank, Singh, Anil Kumar

arXiv.org Artificial Intelligence, Mar-3-2023

One of the significant challenges of Machine Translation (MT) is the scarcity of large amounts of data, mainly parallel, sentence-aligned corpora. Given such large amounts of data, and evaluation as rigorous as for resource-rich languages, both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) can produce good results. However, it is challenging to improve the quality of MT output for low-resource languages, in both NMT and SMT. To tackle these challenges, we present a novel approach that uses a scaled similarity score of sentences, especially for related languages, based on a 5-gram KenLM language model with Kneser-Ney smoothing, to filter in-domain data from out-of-domain corpora and thereby boost the translation quality of MT. Furthermore, we employ other domain adaptation techniques, such as multi-domain, fine-tuning, and iterative back-translation, to compare against our novel approach on the Hindi-Nepali language pair for NMT and SMT. Our approach gains ~2 BLEU points on the multi-domain approach, ~3 BLEU points on fine-tuning for NMT, and ~2 BLEU points on the iterative back-translation approach.
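
The abstract names a "scaled similarity score" but does not define it, so the sketch below only illustrates the general idea under explicit assumptions: each out-of-domain sentence has a perplexity under an in-domain 5-gram language model, perplexities are min-max scaled into a similarity in [0, 1] (1 meaning closest to the in-domain data), and sentences above a hypothetical `threshold` are kept. The function names, the min-max scaling, and the threshold are all assumptions for illustration, not the paper's formula.

```python
def scaled_similarity(perplexities):
    """Map perplexities to [0, 1]; the lowest perplexity gets similarity 1.0."""
    lo, hi = min(perplexities), max(perplexities)
    if hi == lo:
        # Every sentence is equally close to the in-domain model.
        return [1.0 for _ in perplexities]
    return [(hi - p) / (hi - lo) for p in perplexities]


def select_in_domain(sentences, perplexities, threshold=0.5):
    """Keep sentences whose scaled similarity reaches the threshold.

    `threshold` is an assumed cut-off chosen for this example only.
    """
    sims = scaled_similarity(perplexities)
    return [s for s, sim in zip(sentences, sims) if sim >= threshold]


# Toy input: placeholder sentences with made-up perplexities.
sentences = ["sentence a", "sentence b", "sentence c"]
perplexities = [100.0, 500.0, 300.0]
print(select_in_domain(sentences, perplexities, threshold=0.5))
```

In a real pipeline the perplexities would come from a trained language model rather than hand-written numbers, and the selected sentences would be added to the in-domain training data.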

artificial intelligence, computational linguistic, natural language

arXiv:2303.01793

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre:

Research Report > Promising Solution (0.86)
Overview > Innovation (0.54)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

facebookresearch/wav2letter

@machinelearnbot, Jan-1-2018, 16:10:41 GMT

If you want to get started transcribing speech right away, we provide pre-trained models for the Librispeech dataset. If you use wav2letter or related pre-trained models, then please cite one of these papers. If you plan to train on CPU, it is highly recommended to install Intel MKL. If you want a system-wide installation, remove the -DCMAKE_INSTALL_PREFIX $HOME/usr option. In the next sections, we assume luarocks and luajit are in $PATH.

artificial intelligence, machine learning, speech recognition

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.35)
Information Technology > Artificial Intelligence > Machine Learning (0.34)
