Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents
Jagadeeshan, Manoj Balaji, Raj, Prince, Goyal, Pawan
The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures and share their philosophical importance widely. Our dataset is publicly available at https://huggingface.co/datasets/manojbalaji1/anveshana
- Research Report > New Finding (0.93)
- Research Report > Promising Solution (0.66)
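The tripartite setup described in the abstract can be sketched as three small retrieval strategies. This is a minimal illustration, not the paper's implementation: `embed`, `translate_doc`, and `translate_query` are hypothetical stand-ins for a multilingual encoder (e.g. mDPR or Contriever) and a translation system, and the toy bag-of-words embedding exists only to make the example runnable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending similarity to the query."""
    return sorted(range(len(doc_vecs)), key=lambda i: -cosine(query_vec, doc_vecs[i]))

def direct_retrieval(query, docs, embed):
    """DR: score the English query against Sanskrit documents in one shared space."""
    return rank(embed(query), [embed(d) for d in docs])

def document_translation_retrieval(query, docs, embed, translate_doc):
    """DT: translate each document into English first, then retrieve monolingually."""
    return rank(embed(query), [embed(translate_doc(d)) for d in docs])

def query_translation_retrieval(query, docs, embed, translate_query):
    """QT: translate the English query into Sanskrit, then retrieve in Sanskrit."""
    return rank(embed(translate_query(query)), [embed(d) for d in docs])

# Toy bag-of-words "embedding" over a fixed vocabulary, standing in for a
# fine-tuned multilingual encoder.
vocab = ["dharma", "karma", "yoga"]
embed = lambda text: [text.split().count(w) for w in vocab]
ranking = direct_retrieval("dharma", ["dharma dharma karma", "yoga yoga"], embed)
```

The three functions differ only in where the language barrier is crossed — at embedding time (DR), on the document side (DT), or on the query side (QT) — which is what makes them directly comparable in one evaluation.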
Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data
Litschko, Robert, Artemova, Ekaterina, Plank, Barbara
Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages. Motivated by this, we propose to train ranking models on artificially code-switched data instead, which we generate by utilizing bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use the mMARCO dataset to extensively evaluate reranking models on 36 language pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual IR (MLIR). Our results show that code-switching can yield consistent and substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while maintaining stable performance in MoIR. Encouragingly, the gains are especially pronounced for distant languages (up to 2x absolute gain). We further show that our approach is robust towards the ratio of code-switched tokens and also extends to unseen languages. Our results demonstrate that training on code-switched data is a cheap and effective way of generalizing zero-shot rankers for cross-lingual and multilingual retrieval.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- North America > Dominican Republic (0.04)
- (9 more...)
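The lexicon-based code-switching described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the lexicon entries are made up, and in the paper the lexicons are induced from cross-lingual word embeddings or parallel Wikipedia page titles.

```python
import random

def code_switch(tokens, lexicon, ratio=0.5, seed=0):
    """Replace each token that has a bilingual-lexicon entry with its
    translation, with probability `ratio`; other tokens pass through.

    `ratio` is therefore an upper bound on the fraction of switched tokens.
    """
    rng = random.Random(seed)
    return [lexicon[t.lower()] if t.lower() in lexicon and rng.random() < ratio else t
            for t in tokens]

# Illustrative English->German lexicon entries.
lexicon = {"capital": "hauptstadt", "city": "stadt", "country": "land"}
query = ["what", "is", "the", "capital", "city", "of", "this", "country"]
switched = code_switch(query, lexicon, ratio=1.0)
# With ratio=1.0, every in-lexicon token is switched.
```

Training pairs generated this way expose the ranker to mixed-language query–document pairs, which is the property the paper credits for the CLIR and MLIR gains.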
A General-Purpose Multilingual Document Encoder
Galoğlu, Onur, Litschko, Robert, Glavaš, Goran
Massively multilingual pretrained transformers (MMTs) have tremendously pushed the state of the art on multilingual NLP and cross-lingual transfer of NLP models in particular. While a large body of work leveraged MMTs to mine parallel data and induce bilingual document embeddings, much less effort has been devoted to training a general-purpose (massively) multilingual document encoder that can be used for both supervised and unsupervised document-level tasks. In this work, we pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE) in which a shallow document transformer contextualizes sentence representations produced by a state-of-the-art pretrained multilingual sentence encoder. We leverage Wikipedia as a readily available source of comparable documents for creating training data, and train HMDE by means of a cross-lingual contrastive objective, further exploiting the category hierarchy of Wikipedia for the creation of difficult negatives. We evaluate the effectiveness of HMDE on two of the arguably most common and prominent cross-lingual document-level tasks: (1) cross-lingual transfer for topical document classification and (2) cross-lingual document retrieval. HMDE is significantly more effective than (i) aggregations of segment-based representations and (ii) multilingual Longformer. Crucially, owing to its massively multilingual lower transformer, HMDE successfully generalizes to languages unseen in document-level pretraining. We publicly release our code and models at https://github.com/ogaloglu/pre-training-multilingual-document-encoders.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (8 more...)
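The hierarchical idea — a shallow transformer contextualizing precomputed sentence vectors, then pooling into one document embedding — can be sketched with a single self-attention layer. This is a toy sketch under stated assumptions, not HMDE itself: weights are random, the sentence embeddings are random stand-ins for a pretrained multilingual sentence encoder, and the real model is trained with a cross-lingual contrastive objective.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One self-attention layer over the n sentence vectors in X (shape (n, d))."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V

def encode_document(sentence_embeddings, Wq, Wk, Wv):
    """Contextualize sentence vectors with the shallow document transformer,
    then mean-pool them into a single document embedding."""
    return self_attention(sentence_embeddings, Wq, Wk, Wv).mean(axis=0)

rng = np.random.default_rng(0)
d = 8                                  # toy embedding width
sents = rng.normal(size=(5, d))        # stand-in for pretrained sentence embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
doc_vec = encode_document(sents, Wq, Wk, Wv)
```

Because the lower (sentence) encoder is massively multilingual and frozen-in-spirit here, the document-level layer only has to learn how sentences combine — which is what lets the approach generalize to languages unseen in document-level pretraining.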
GCube launches renewable energy offering, backed by Clir - Reinsurance News
GCube has launched a new data-powered insurance offering to support the growth of the renewable energy industry, with support from Clir, a company dedicated to maximizing project returns from renewable energy assets. The offering will leverage AI-led analytics and data sets to offer enhanced terms and reduced premiums for wind and solar operating companies. By having Clir onboard a wind portfolio's data set onto its platform, GCube will aim to uncover an asset's meteorological and operational loading, overall component health and reliability, and the impact of current operations and maintenance. These insights will give GCube clarity on its underwriting pricing, and offer more competitive terms where operating projects model with lower risk factors. "Insuring renewable energy has been a tumultuous process over the last decade," said Fraser McLachlan, Chief Executive Officer, GCube Insurance Inc. "Claims from equipment failure, natural catastrophe loss and contractor error have forced some underwriters to exit the market. To continue to offer insurance at sustainable rates for clients, we need to have deeper insights into the risk of failure and operational management of renewable energy equipment."
Cross-language Information Retrieval
Galuščáková, Petra, Oard, Douglas W., Nair, Suraj
Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for cross-language information retrieval and outlines some open research questions.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- North America > United States > Maryland > Baltimore (0.14)
- (38 more...)
- Overview (1.00)
- Research Report > New Finding (0.34)
- Health & Medicine (0.93)
- Media (0.92)
- Government > Regional Government (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.45)
Finding relevant data in a sea of languages
"About 6,000 languages are currently spoken in the world today," says Elizabeth Salesky of MIT Lincoln Laboratory's Human Language Technology (HLT) Group. "Within the law enforcement community, there are not enough multilingual analysts who possess the necessary level of proficiency to understand and analyze content across these languages," she continues. This problem of too many languages and too few specialized analysts is one Salesky and her colleagues are now working to solve for law enforcement agencies, but their work has potential application for the Department of Defense and Intelligence Community. The research team is taking advantage of major advances in language recognition, speaker recognition, speech recognition, machine translation, and information retrieval to automate language processing tasks so that the limited number of linguists available for analyzing text and spoken foreign languages can be used more efficiently. "With HLT, an equivalent of 20 times more foreign language analysts are at your disposal," says Salesky.
A cross-language search engine enables English monolingual researchers to find relevant foreign-language documents
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- Asia > Middle East > Qatar (0.05)
Utilisation of Metadata Fields and Query Expansion in Cross-Lingual Search of User-Generated Internet Video
Khwileh, Ahmad, Ganguly, Debasis, J. F. Jones, Gareth
Recent years have seen significant efforts in the area of Cross-Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion in online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has, to date, been no significant investigation of CLVR for the rapidly growing archives of informal user-generated content (UGC). Key differences between such UGC and professionally produced content are the nature and structure of the textual UGC metadata associated with it, as well as the form and quality of the content itself. In this setting, retrieval effectiveness may suffer not only from translation errors common to all CLIR tasks, but also from recognition errors associated with the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video, and from the informality and inconsistency of the user-created metadata associated with each video. This work proposes and evaluates techniques to improve CLIR effectiveness for such noisy UGC content. Our experimental investigation shows that different sources of evidence, e.g., the content from different fields of the structured metadata, significantly affect CLIR effectiveness. Our results also show that each metadata field varies in its robustness to query expansion (QE), which can therefore negatively affect CLIR effectiveness. Our work proposes a novel adaptive QE technique that predicts the most reliable source for expansion, and shows how this technique can effectively improve CLIR effectiveness for UGC content.
- North America > United States > Maryland (0.04)
- Europe > Ireland (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
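The adaptive-QE idea — predict the most reliable metadata field, then expand only from it — can be sketched as follows. This is a hypothetical illustration, not the paper's predictor: the reliability proxy (query-term coverage) and the toy field data are invented for the example.

```python
from collections import Counter

def field_reliability(query_terms, field_texts):
    """Crude reliability proxy: mean fraction of query terms covered per text."""
    if not field_texts:
        return 0.0
    covered = sum(sum(t in set(doc.lower().split()) for t in query_terms)
                  for doc in field_texts)
    return covered / (len(field_texts) * len(query_terms))

def adaptive_expand(query, fields, top_k=2):
    """Pick the metadata field most consistent with the query, then append its
    most frequent non-query terms as expansion terms."""
    q = query.lower().split()
    best = max(fields, key=lambda f: field_reliability(q, fields[f]))
    counts = Counter(t for doc in fields[best]
                     for t in doc.lower().split() if t not in q)
    return q + [t for t, _ in counts.most_common(top_k)], best

# Toy pseudo-relevant metadata for a video query (illustrative only).
fields = {
    "title": ["total solar eclipse 2017", "solar eclipse timelapse"],
    "description": ["filmed on my phone lol"],
    "tags": ["eclipse"],
}
expanded, chosen_field = adaptive_expand("solar eclipse video", fields)
```

Expanding only from the field that best matches the query is what protects against the finding in the abstract: a noisy field (here, the free-text description) would otherwise inject off-topic expansion terms.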
Learning Inter-Related Statistical Query Translation Models for English-Chinese Bi-Directional CLIR
Zhang, Yuejie (Fudan University); Cen, Lei (Fudan University); Jin, Cheng (Fudan University); Xue, Xiangyang (Fudan University); Fan, Jianping (The University of North Carolina at Charlotte)
To support more precise query translation for English-Chinese Bi-Directional Cross-Language Information Retrieval (CLIR), we have developed a novel framework by integrating a semantic network to characterize the correlations between multiple inter-related text terms of interest and learn their inter-related statistical query translation models. First, a semantic network is automatically generated from large-scale English-Chinese bilingual parallel corpora to characterize the correlations between a large number of text terms of interest. Second, the semantic network is exploited to learn the statistical query translation models for such text terms of interest. Finally, these inter-related query translation models are used to translate the queries more precisely and achieve more effective CLIR. Our experiments on large-scale official public datasets have yielded very positive results.
- Asia > China > Shanghai > Shanghai (0.05)
- North America > United States > North Carolina (0.04)
- Asia > China > Hong Kong (0.04)
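The inter-related translation idea — choosing each term's translation partly by how well it correlates with the translations of the query's other terms — can be sketched greedily. All probabilities and correlation weights below are made up for illustration; in the paper they are learned from bilingual parallel corpora via the semantic network.

```python
def translate_query(terms, trans_probs, correlation):
    """Greedily pick one translation per term, scoring each candidate by its
    lexical translation probability plus its correlation (from the semantic
    network) with translations already chosen for earlier terms."""
    chosen = []
    for term in terms:
        candidates = trans_probs[term]
        def score(cand):
            bonus = sum(correlation.get((prev, cand), 0.0) for prev in chosen)
            return candidates[cand] + bonus
        chosen.append(max(candidates, key=score))
    return chosen

# Toy English->Chinese translation tables (illustrative values only).
trans_probs = {
    "interest": {"利息": 0.9, "兴趣": 0.1},
    "bank": {"河岸": 0.55, "银行": 0.45},
}
correlation = {("利息", "银行"): 0.3}
translated = translate_query(["interest", "bank"], trans_probs, correlation)
# Alone, "bank" would translate as 河岸 (riverbank); its correlation with
# 利息 (interest) pulls it to 银行 (the financial sense).
```

The example shows the mechanism the abstract describes: term-term correlations disambiguate translations that a per-term model would get wrong.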