AITopics | ccmatrix

Collaborating Authors

ccmatrix

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

Nagata, Masaaki, Morishita, Makoto, Chousa, Katsuki, Yasuda, Norihito

arXiv.org Artificial IntelligenceMay-14-2024

Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.

proceedings, sentence pair, translation, (15 more...)

arXiv.org Artificial Intelligence

2405.09017

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Europe > Slovenia (0.05)
(16 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Communications > Social Media > Crowdsourcing (0.95)
Information Technology > Data Science > Data Mining > Web Mining (0.81)

Add feedback

CCMatrix: A billion-scale bitext data set for training translation models

#artificialintelligenceFeb-7-2020, 18:41:06 GMT

CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year. Gathering a data set of this size required modifying our previous bitext mining approach used for WikiMatrix, assuming that the translation of one sentence could be found anywhere on CommonCrawl, which functions as an open archive of the internet. To address the significant computational challenges posed by comparing billions of sentences to determine which ones are mutual translations, we used massively parallel processing, as well as our highly efficient FAISS library for fast similarity searches. We're sharing details about how we created CCMatrix, and the tools needed for other researchers to reproduce our results and use this corpus for their work.

billion-scale bitext data, ccmatrix, training translation model, (7 more...)

#artificialintelligence

Genre: Research Report (0.37)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Communications > Social Media (0.87)

Add feedback