Goto

Collaborating Authors

 Information Retrieval


Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation

arXiv.org Artificial Intelligence

Benefiting from transformer-based pre-trained language models, neural ranking models have made significant progress. More recently, the advent of multilingual pre-trained language models provides great support for designing neural cross-lingual retrieval models. However, due to unbalanced pre-training data in different languages, multilingual language models have already shown a performance gap between high and low-resource languages in many downstream tasks. And cross-lingual retrieval models built on such pre-trained models can inherit language bias, leading to suboptimal result for low-resource languages. Moreover, unlike the English-to-English retrieval task, where large-scale training collections for document ranking such as MS MARCO are available, the lack of cross-lingual retrieval data for low-resource language makes it more challenging for training cross-lingual retrieval models. In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. To transfer a model from high to low resource languages, OPTICAL forms the cross-lingual token alignment task as an optimal transport problem to learn from a well-trained monolingual retrieval model. By separating the cross-lingual knowledge from knowledge of query document matching, OPTICAL only needs bitext data for distillation training, which is more feasible for low-resource languages. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages, including neural machine translation.


A Comparative Study of Pretrained Language Models for Long Clinical Text

arXiv.org Artificial Intelligence

Objective: Clinical knowledge enriched transformer models (e.g., ClinicalBERT) have state-of-the-art results on clinical NLP (natural language processing) tasks. One of the core limitations of these transformer models is the substantial memory consumption due to their full self-attention mechanism, which leads to the performance degradation in long clinical texts. To overcome this, we propose to leverage long-sequence transformer models (e.g., Longformer and BigBird), which extend the maximum input sequence length from 512 to 4096, to enhance the ability to model long-term dependencies in long clinical texts. Materials and Methods: Inspired by the success of long sequence transformer models and the fact that clinical notes are mostly long, we introduce two domain enriched language models, Clinical-Longformer and Clinical-BigBird, which are pre-trained on a large-scale clinical corpus. We evaluate both language models using 10 baseline tasks including named entity recognition, question answering, natural language inference, and document classification tasks. Results: The results demonstrate that Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT and other short-sequence transformers in all 10 downstream tasks and achieve new state-of-the-art results. Discussion: Our pre-trained language models provide the bedrock for clinical NLP using long texts. We have made our source code available at https://github.com/luoyuanlab/Clinical-Longformer, and the pre-trained models available for public download at: https://huggingface.co/yikuan8/Clinical-Longformer. Conclusion: This study demonstrates that clinical knowledge enriched long-sequence transformers are able to learn long-term dependencies in long clinical text. Our methods can also inspire the development of other domain-enriched long-sequence transformers.


An Experimental Study on Pretraining Transformers from Scratch for IR

arXiv.org Artificial Intelligence

Finetuning Pretrained Language Models (PLM) for IR has been de facto the standard practice since their breakthrough effectiveness few years ago. But, is this approach well understood? In this paper, we study the impact of the pretraining collection on the final IR effectiveness. In particular, we challenge the current hypothesis that PLM shall be trained on a large enough generic collection and we show that pretraining from scratch on the collection of interest is surprisingly competitive with the current approach. We benchmark first-stage ranking rankers and cross-encoders for reranking on the task of general passage retrieval on MSMARCO, Mr-Tydi for Arabic, Japanese and Russian, and TripClick for specific domain. Contrary to popular belief, we show that, for finetuning first-stage rankers, models pretrained solely on their collection have equivalent or better effectiveness compared to more general models. However, there is a slight effectiveness drop for rerankers pretrained only on the target collection. Overall, our study sheds a new light on the role of the pretraining collection and should make our community ponder on building specialized models by pretraining from scratch. Last but not least, doing so could enable better control of efficiency, data bias and replicability, which are key research questions for the IR community.


Conversational Information Seeking

arXiv.org Artificial Intelligence

Conversational information seeking (CIS) is concerned with a sequence of interactions between one or more users and an information system. Interactions in CIS are primarily based on natural language dialogue, while they may include other types of interactions, such as click, touch, and body gestures. This monograph provides a thorough overview of CIS definitions, applications, interactions, interfaces, design, implementation, and evaluation. This monograph views CIS applications as including conversational search, conversational question answering, and conversational recommendation. Our aim is to provide an overview of past research related to CIS, introduce the current state-of-the-art in CIS, highlight the challenges still being faced in the community. and suggest future directions.


Automated Identification of Disaster News For Crisis Management Using Machine Learning

arXiv.org Artificial Intelligence

A lot of news sources picked up on Typhoon Rai (also known locally as Typhoon Odette), along with fake news outlets. The study honed in on the issue, to create a model that can identify between legitimate and illegitimate news articles. With this in mind, we chose the following machine learning algorithms in our development: Logistic Regression, Random Forest and Multinomial Naive Bayes. Bag of Words, TF-IDF and Lemmatization were implemented in the Model. Gathering 160 datasets from legitimate and illegitimate sources, the machine learning was trained and tested. By combining all the machine learning techniques, the Combined BOW model was able to reach an accuracy of 91.07%, precision of 88.33%, recall of 94.64%, and F1 score of 91.38% and Combined TF-IDF model was able to reach an accuracy of 91.18%, precision of 86.89%, recall of 94.64%, and F1 score of 90.60%.


Topic Ontologies for Arguments

arXiv.org Artificial Intelligence

Many computational argumentation tasks, like stance classification, are topic-dependent: the effectiveness of approaches to these tasks significantly depends on whether the approaches were trained on arguments from the same topics as those they are tested on. So, which are these topics that researchers train approaches on? This paper contributes the first comprehensive survey of topic coverage, assessing 45 argument corpora. For the assessment, we take the first step towards building an argument topic ontology, consulting three diverse authoritative sources: the World Economic Forum, the Wikipedia list of controversial topics, and Debatepedia. Comparing the topic sets between the authoritative sources and corpora, our analysis shows that the corpora topics-which are mostly those frequently discussed in public online fora - are covered well by the sources. However, other topics from the sources are less extensively covered by the corpora of today, revealing interesting future directions for corpus construction.


Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering

arXiv.org Artificial Intelligence

Retrieval augmented language models have recently become the standard for knowledge intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory to encode an index of knowledge for the model to retrieve over. Most prior work has employed text passages as the unit of knowledge, which has high coverage at the cost of interpretability, controllability, and efficiency. The opposite properties arise in other methods which have instead relied on knowledge base (KB) facts. At the same time, more recent work has demonstrated the effectiveness of storing and retrieving from an index of Q-A pairs derived from text \citep{lewis2021paq}. This approach yields a high coverage knowledge representation that maintains KB-like properties due to its representations being more atomic units of information. In this work we push this line of research further by proposing a question-answer augmented encoder-decoder model and accompanying pretraining strategy. This yields an end-to-end system that not only outperforms prior QA retrieval methods on single-hop QA tasks but also enables compositional reasoning, as demonstrated by strong performance on two multi-hop QA datasets. Together, these methods improve the ability to interpret and control the model while narrowing the performance gap with passage retrieval systems.


Keyword Embeddings for Query Suggestion

arXiv.org Artificial Intelligence

Nowadays, search engine users commonly rely on query suggestions to improve their initial inputs. Current systems are very good at recommending lexical adaptations or spelling corrections to users' queries. However, they often struggle to suggest semantically related keywords given a user's query. The construction of a detailed query is crucial in some tasks, such as legal retrieval or academic search. In these scenarios, keyword suggestion methods are critical to guide the user during the query formulation. This paper proposes two novel models for the keyword suggestion task trained on scientific literature. Our techniques adapt the architecture of Word2Vec and FastText to generate keyword embeddings by leveraging documents' keyword co-occurrence. Along with these models, we also present a specially tailored negative sampling approach that exploits how keywords appear in academic publications. We devise a ranking-based evaluation methodology following both known-item and ad-hoc search scenarios. Finally, we evaluate our proposals against the state-of-the-art word and sentence embedding models showing considerable improvements over the baselines for the tasks.


Proactive and Reactive Engagement of Artificial Intelligence Methods for Education: A Review

arXiv.org Artificial Intelligence

Quality education, one of the seventeen sustainable development goals (SDGs) identified by the United Nations General Assembly, stands to benefit enormously from the adoption of artificial intelligence (AI) driven tools and technologies. The concurrent boom of necessary infrastructure, digitized data and general social awareness has propelled massive research and development efforts in the artificial intelligence for education (AIEd) sector. In this review article, we investigate how artificial intelligence, machine learning and deep learning methods are being utilized to support students, educators and administrative staff. We do this through the lens of a novel categorization approach. We consider the involvement of AI-driven methods in the education process in its entirety - from students admissions, course scheduling etc. in the proactive planning phase to knowledge delivery, performance assessment etc. in the reactive execution phase. We outline and analyze the major research directions under proactive and reactive engagement of AI in education using a representative group of 194 original research articles published in the past two decades i.e., 2003 - 2022. We discuss the paradigm shifts in the solution approaches proposed, i.e., in the choice of data and algorithms used over this time. We further dive into how the COVID-19 pandemic challenged and reshaped the education landscape at the fag end of this time period. Finally, we pinpoint existing limitations in adopting artificial intelligence for education and reflect on the path forward.


GoogleやBingの検索結果にChatGPTを表示させる拡張機能「ChatGPT for Search Engines」 - GIGAZINE

#artificialintelligence

OpenAIのChatGPTは高度な自然言語処理モデルを利用した対話型AIで、文章を入力するとまるで人間が書いたような自然な文章を返してくれます。これまでの検索エンジンでは検索クエリに複数の単語を入力する必要がありましたが、このChatGPTを応用すれば、調べたいことを直接文章で入力することでより適切な検索結果を示す次世代の検索エンジンが可能になると期待されています。そんなChatGPTの回答を実際にGoogleやBingなどの検索結果に表示させる拡張機能「ChatGPT for Search Engines」が、Chrome・Firefox・Edge向けにリリースされています。