AITopics | russian language

GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture

GigaChat team, Valentin, Mamedov, Kosarev, Evgenii, Leleytner, Gregory, Shchuckin, Ilya, Berezovskiy, Valeriy, Smirnov, Daniil, Kozlov, Dmitry, Averkiev, Sergei, Ivan, Lukyanenko, Proshunin, Aleksandr, Israfilova, Ainur, Baskov, Ivan, Chervyakov, Artem, Shakirov, Emil, Kolesov, Mikhail, Khomich, Daria, Latortseva, Darya, Porkhun, Sergei, Fedorov, Yury, Kutuzov, Oleg, Kudriavtseva, Polina, Soldatova, Sofiia, Egor, Kolodin, Pyatkin, Stanislav, Menshykh, Dzmitry, Sergei, Grafov, Damirov, Eldar, Vladimir, Karlov, Gaitukiev, Ruslan, Shatenov, Arkadiy, Fenogenova, Alena, Savushkin, Nikita, Minkin, Fedor

arXiv.org Artificial Intelligence, Jun-12-2025

Generative large language models (LLMs) have become crucial for modern NLP research and applications across various languages. However, the development of foundational models specifically tailored to the Russian language has been limited, primarily due to the significant computational resources required. This paper introduces the GigaChat family of Russian LLMs, available in various sizes, including base models and instruction-tuned versions. We provide a detailed report on the model architecture, pre-training process, and experiments to guide design choices. In addition, we evaluate their performance on Russian and English benchmarks and compare GigaChat with multilingual analogs. The paper presents a system demonstration of the top-performing models accessible via an API, a Telegram bot, and a Web interface. Furthermore, we have released three GigaChat models as open source (https://huggingface.co/ai-sage), aiming to expand NLP research opportunities and support the development of industrial solutions for the Russian language.

large language model, machine learning, natural language

2506.0944

Country:

Europe (1.00)
Asia (0.67)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Building Russian Benchmark for Evaluation of Information Retrieval Models

Kovalev, Grigory, Tikhomirov, Mikhail, Kozhevnikov, Evgeny, Kornilov, Max, Loukachevitch, Natalia

arXiv.org Artificial Intelligence, Apr-18-2025

We introduce RusBEIR, a comprehensive benchmark designed for zero-shot evaluation of information retrieval (IR) models in the Russian language. Comprising 17 datasets from various domains, it integrates adapted, translated, and newly created datasets, enabling systematic comparison of lexical and neural models. Our study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets, but face challenges with long-document retrieval due to input size constraints. RusBEIR offers a unified, open-source framework that promotes research in Russian-language information retrieval.
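To make the lexical baseline named in the abstract concrete, here is a minimal sketch of Okapi BM25 scoring. It is not RusBEIR's implementation: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75` (common defaults), the smoothed IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps the weight positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenized by whitespace only.
docs = [
    "кошка сидит на окне".split(),
    "собака лежит на полу".split(),
    "кошка и собака играют".split(),
]
scores = bm25_scores("кошка окне".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The sketch matches exact word forms only, so a query containing "окно" would not match "окне". That gap is one reason preprocessing such as lemmatization matters for lexical models in a morphologically rich language.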

artificial intelligence, information retrieval, natural language

2504.12879

Country:

Europe > Russia (0.15)
Asia > Russia (0.15)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems

Iakovenko, Olga, Bondarenko, Ivan, Borovikova, Mariya, Vodolazsky, Daniil

arXiv.org Artificial Intelligence, Oct-3-2024

This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts for speech-related tasks, such as Automatic Speech Recognition (ASR). The two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases. Accentuation is based on A.A. Zaliznyak's "Grammatical Dictionary of the Russian Language" and the Wiktionary corpus. To distinguish homographs, the accentuation system also utilises morphological information about the sentences obtained with Recurrent Neural Networks (RNN). Transcription algorithms apply the rules presented in the monograph of B.M. Lobanov and L.I. Tsirulnik, "Computer Synthesis and Voice Cloning". The rules described in the present paper are implemented in an open-source module, which can be of use to any scientific study connected to ASR or Speech To Text (STT) tasks. Automatically marked-up text annotations of the Russian Voxforge database were used as training data for an acoustic model in CMU Sphinx. The resulting acoustic model was evaluated with cross-validation, yielding a mean Word Accuracy of 71.2%. The developed toolkit is written in Python and is available on GitHub to any interested researcher.
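The abstract reports Word Accuracy without defining it. The sketch below assumes the conventional definition, one minus the word error rate, where errors are the minimum number of word substitutions, deletions, and insertions needed to turn the reference into the hypothesis. The function name `word_accuracy` and the example sentences are hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - WER, using word-level Levenshtein distance (assumed definition)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return 1 - prev[-1] / len(ref)


# One substituted word out of four reference words.
acc = word_accuracy("мама мыла раму вчера", "мама мыла рамы вчера")  # 0.75
```

Under this definition the value can drop below zero when the hypothesis contains many insertions, so it is not a percentage of correctly recognized words in the strict sense.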

artificial intelligence, machine learning, transcription

doi: 10.1007/978-3-319-99579-3_78

2410.02538

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.05)
Asia > Russia > Siberian Federal District > Novosibirsk Oblast > Novosibirsk (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Belarus > Minsk Region > Minsk (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for Russian

Nikolich, Aleksandr, Korolev, Konstantin, Shelmanov, Artem, Kiselev, Igor

arXiv.org Artificial Intelligence, Jun-19-2024

There has been a surge in the development of various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and reduced computational performance due to the disproportionate representation of tokens in the model's vocabulary. In this work, we address these issues and introduce Vikhr, a new state-of-the-art open-source instruction-tuned LLM designed specifically for the Russian language. Unlike previous efforts for Russian that utilize computationally inexpensive LoRA adapters on top of English-oriented models, Vikhr features an adapted tokenizer vocabulary and undergoes continued pre-training and instruction tuning of all weights. This approach not only enhances the model's performance but also significantly improves its computational and contextual efficiency. The remarkable performance of Vikhr across various Russian-language benchmarks can also be attributed to our efforts in expanding instruction datasets and corpora for continued pre-training. Vikhr not only sets the new state of the art among open-source LLMs for Russian, but even outperforms some proprietary closed-source models on certain benchmarks. The model weights, instruction sets, and code are publicly available.

dataset, instruction, llm

2405.13929

Country: South America > Suriname > Marowijne District > Albina (0.04)

Genre: Research Report (0.50)

Industry: Education > Educational Setting > Online (0.61)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

RuBia: A Russian Language Bias Detection Dataset

Grigoreva, Veronika, Ivanova, Anastasiia, Alimova, Ilseyar, Artemova, Ekaterina

arXiv.org Artificial Intelligence, Mar-26-2024

Warning: this work contains upsetting or disturbing content. Large language models (LLMs) tend to learn the social and cultural biases present in the raw pre-training data. To test if an LLM's behavior is fair, functional datasets are employed, and due to their purpose, these datasets are highly language- and culture-specific. In this paper, we address a gap in the scope of multilingual bias evaluation by presenting a bias detection dataset specifically designed for the Russian language, dubbed RuBia. The RuBia dataset is divided into 4 domains: gender, nationality, socio-economic status, and diverse; each domain is further divided into multiple fine-grained subdomains. Every example in the dataset consists of two sentences, with the first reinforcing a potentially harmful stereotype or trope and the second contradicting it. These sentence pairs were first written by volunteers and then validated by native-speaking crowdsourcing workers. Overall, there are nearly 2,000 unique sentence pairs spread over 19 subdomains in RuBia. To illustrate the dataset's purpose, we conduct a diagnostic evaluation of state-of-the-art or near-state-of-the-art LLMs and discuss the LLMs' predisposition to social biases.

computational linguistic, dataset, linguistic

2403.17553

Country:

Asia > Russia (0.28)
Europe > Ukraine (0.14)
North America > United States > Washington > King County > Seattle (0.14)

Genre: Research Report (0.64)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Impact of Tokenization on LLaMa Russian Adaptation

Tikhomirov, Mikhail, Chernyshev, Daniil

arXiv.org Artificial Intelligence, Dec-5-2023

The latest instruction-tuned large language models (LLMs) show great results on various tasks; however, they often face performance degradation for non-English input. There is evidence that the reason lies in inefficient tokenization caused by low language representation in pre-training data, which hinders the comprehension of non-English instructions, limiting the potential of target-language instruction-tuning. In this work we investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation. We explore three variants of vocabulary adaptation and test their performance on Saiga instruction-tuning and fine-tuning on the Russian SuperGLUE benchmark. The results of automatic evaluation show that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (35%) and inference (up to 60%) while reducing memory consumption. Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference than the original Saiga-LLaMa model.

adaptation, arxiv preprint arxiv, tokenization

2312.02598

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.05)
Asia > Russia (0.05)
Europe > Germany > Saarland > Saarbrücken (0.04)

Genre: Research Report (0.64)

Industry: Education > Curriculum > Subject-Specific Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Milimili. Collecting Parallel Data via Crowdsourcing

Antonov, Alexander

arXiv.org Artificial Intelligence, Jul-23-2023

We present a methodology for gathering a parallel corpus through crowdsourcing, which is more cost-effective than hiring professional translators, albeit at the expense of quality. Additionally, we have made available experimental parallel data collected for Chechen-Russian and Fula-English language pairs.

collecting parallel data, source sentence, translation

2307.12282

Country: Europe > Spain (0.05)

Genre: Research Report (0.50)

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (0.73)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.50)

Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

Karpov, Dmitry, Burtsev, Mikhail

arXiv.org Artificial Intelligence, Jul-4-2023

This article investigates knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of samples (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We have prepared this dataset from the "Yandex Que" raw data. By evaluating the RuQTopics-trained models on the six matching classes of the Russian MASSIVE subset, we have shown that the RuQTopics dataset is suitable for real-world conversational tasks, as the Russian-only models trained on this dataset consistently yield an accuracy of around 85% on this subset. We have also found that for the multilingual BERT, trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy closely correlates (Spearman correlation 0.773 with p-value 2.997e-11) with the approximate size of BERT's pretraining data for the corresponding language. At the same time, the correlation of the language-wise accuracy with the linguistic distance from Russian is not statistically significant.
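For readers unfamiliar with the statistic quoted above, the following sketch computes a Spearman rank correlation as the Pearson correlation of average ranks. The helper names `ranks` and `spearman` and the four data points are hypothetical; they are not the paper's per-language figures, and the sketch does not compute a p-value.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical pretraining-data sizes and per-language accuracies.
data_size = [10, 20, 30, 40]
accuracy = [0.60, 0.50, 0.80, 0.90]
rho = spearman(data_size, accuracy)  # 0.8: only the first two languages swap ranks
```

Because only ranks enter the computation, the statistic measures whether accuracy rises monotonically with data size, not whether the relationship is linear.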

dataset, knowledge transfer, topic classification

2306.07797

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)

Genre: Research Report > Experimental Study > Negative Result (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.66)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Detecting Human Rights Violations on Social Media during Russia-Ukraine War

Nemkova, Poli, Ubani, Solomon, Polat, Suleyman Olcay, Kim, Nayeon, Nielsen, Rodney D.

arXiv.org Artificial Intelligence, Jun-6-2023

The present-day Russia-Ukraine military conflict has exposed the pivotal role of social media in enabling the transparent and unbridled sharing of information directly from the frontlines. In conflict zones where freedom of expression is constrained and information warfare is pervasive, social media has emerged as an indispensable lifeline. Anonymous social media platforms, as publicly available sources for disseminating war-related information, have the potential to serve as effective instruments for monitoring and documenting Human Rights Violations (HRV). Our research focuses on the analysis of data from Telegram, the leading social media platform for reading independent news in post-Soviet regions. We gathered a dataset of posts sampled from 95 public Telegram channels that cover politics and war news, which we have utilized to identify potential occurrences of HRV. Employing an mBERT-based text classifier, we have conducted an analysis to detect any mentions of HRV in the Telegram data. Our final approach yielded an $F_2$ score of 0.71 for HRV detection, representing an improvement of 0.38 over the multilingual BERT base model. We release two datasets that contain Telegram posts: (1) a large corpus with over 2.3 million posts and (2) a dataset annotated at the sentence level to indicate HRVs. The Telegram posts are in the context of the Russia-Ukraine war. We posit that our findings hold significant implications for NGOs, governments, and researchers by providing a means to detect and document possible human rights violations.
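The $F_2$ score reported above is the $F_\beta$ measure with $\beta = 2$, which weights recall more heavily than precision. A minimal sketch of the standard formula follows; the function name `f_beta` and the precision and recall values are hypothetical, since the abstract reports only the final score.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is retrieved correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Swapping precision and recall shows the asymmetry of F2.
high_recall = f_beta(precision=0.5, recall=1.0)     # 2.5 / 3.0
high_precision = f_beta(precision=1.0, recall=0.5)  # 2.5 / 4.5
```

The higher value for the high-recall case fits a monitoring setting in which missing a violation is costlier than flagging a false positive.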

large language model, machine learning, natural language

2306.0537

Country:

Asia > Russia (1.00)
Europe > Russia (0.94)
North America > United States > Texas > Denton County > Denton (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Law > Criminal Law (1.00)
Law > Civil Rights & Constitutional Law (1.00)
Government > Military (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.68)

A big data approach towards sarcasm detection in Russian

Gurin, A. A., Sadykov, T. M., Zhukov, T. A.

arXiv.org Artificial Intelligence, Jun-1-2023

We present a set of deterministic algorithms for Russian inflection and automated text synthesis. These algorithms are implemented in a publicly available web service, www.passare.ru. This service provides functions for inflection of single words, word matching, and synthesis of grammatically correct Russian text. Selected code and datasets are available at https://github.com/passare-ru/PassareFunctions/. Performance of the inflectional functions has been tested against OpenCorpora, an annotated corpus of the Russian language, compared with that of other solutions, and used for estimating the morphological variability and complexity of different parts of speech in Russian.

inflection, machine learning, natural language

2306.00445

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Communications > Social Media (0.95)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
