AITopics | Naous, Tarek

Collaborating Authors

Naous, Tarek

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

What are Foundation Models Cooking in the Post-Soviet World?

Lavrouk, Anton, Naous, Tarek, Ritter, Alan, Xu, Wei

arXiv.org Artificial IntelligenceFeb-25-2025

The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multimodal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multimodal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pretraining data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models' abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding. To foster further research, we will make BORSch publicly available at https://github.com/alavrouk/BORSch.

bor ch, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2502.18583

Country:

South America (1.00)
Europe (1.00)
Asia > Middle East (1.00)
(3 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena

Naous, Tarek, Xu, Wei

arXiv.org Artificial IntelligenceJan-8-2025

Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2501.04662

Country:

Europe (1.00)
Africa > Middle East (1.00)
North America (0.93)
Asia > Middle East > Lebanon (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Measuring, Modeling, and Helping People Account for Privacy Risks in Online Self-Disclosures with AI

Krsek, Isadora, Kabra, Anubha, Dou, Yao, Naous, Tarek, Dabbish, Laura A., Ritter, Alan, Xu, Wei, Das, Sauvik

arXiv.org Artificial IntelligenceDec-19-2024

In pseudonymous online fora like Reddit, the benefits of self-disclosure are often apparent to users (e.g., I can vent about my in-laws to understanding strangers), but the privacy risks are more abstract (e.g., will my partner be able to tell that this is me?). Prior work has sought to develop natural language processing (NLP) tools that help users identify potentially risky self-disclosures in their text, but none have been designed for or evaluated with the users they hope to protect. Absent this assessment, these tools will be limited by the social-technical gap: users need assistive tools that help them make informed decisions, not paternalistic tools that tell them to avoid self-disclosure altogether. To bridge this gap, we conducted a study with N = 21 Reddit users; we had them use a state-of-the-art NLP disclosure detection model on two of their authored posts and asked them questions to understand if and how the model helped, where it fell short, and how it could be improved to help them make more informed decisions. Despite its imperfections, users responded positively to the model and highlighted its use as a tool that can help them catch mistakes, inform them of risks they were unaware of, and encourage self-reflection. However, our work also shows how, to be useful and usable, AI for supporting privacy decision-making must account for posting context, disclosure norms, and users' lived threat models, and provide explanations that help contextualize detected risks.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2412.15047

Country: North America > United States (1.00)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)
Personal > Interview (0.92)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Law (0.93)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
(2 more...)

Add feedback

Stanceosaurus 2.0: Classifying Stance Towards Russian and Spanish Misinformation

Lavrouk, Anton, Ligon, Ian, Naous, Tarek, Zheng, Jonathan, Ritter, Alan, Xu, Wei

arXiv.org Artificial IntelligenceFeb-5-2024

The Stanceosaurus corpus (Zheng et al., 2022) was designed to provide high-quality, annotated, 5-way stance data extracted from Twitter, suitable for analyzing cross-cultural and cross-lingual misinformation. In the Stanceosaurus 2.0 iteration, we extend this framework to encompass Russian and Spanish. The former is of current significance due to prevalent misinformation amid escalating tensions with the West and the violent incursion into Ukraine. The latter, meanwhile, represents an enormous community that has been largely overlooked on major social media platforms. By incorporating an additional 3,874 Spanish and Russian tweets over 41 misinformation claims, our objective is to support research focused on these issues. To demonstrate the value of this data, we employed zero-shot cross-lingual transfer on multilingual BERT, yielding results on par with the initial Stanceosaurus study with a macro F1 score of 43 for both languages. This underlines the viability of stance classification as an effective tool for identifying multicultural misinformation.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2402.03642

Country:

Asia > Russia (1.00)
Europe > Ukraine (0.68)
North America > United States > Texas (0.14)

Genre: Research Report (0.64)

Industry:

Media > News (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(2 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)

Add feedback

ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment

Naous, Tarek, Ryan, Michael J., Lavrouk, Anton, Chandra, Mohit, Xu, Wei

arXiv.org Artificial IntelligenceNov-15-2023

We present a systematic study and comprehensive evaluation of large language models for automatic multilingual readability assessment. In particular, we construct ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian collected from 112 different data sources. ReadMe++ offers more domain and language diversity than existing readability datasets, making it ideal for benchmarking multilingual and non-English language models (including mBERT, XLM-R, mT5, Llama-2, GPT-4, etc.) in the supervised, unsupervised, and few-shot prompting settings. Our experiments reveal that models fine-tuned on ReadMe++ outperform those trained on single-domain datasets, showcasing superior performance on multi-domain readability assessment and cross-lingual transfer capabilities. We also compare to traditional readability metrics (such as Flesch-Kincaid Grade Level and Open Source Metric for Measuring Arabic Narratives), as well as the state-of-the-art unsupervised metric RSRS (Martinc et al., 2021). We will make our data and code publicly available at: https://github.com/tareknaous/readme.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2305.14463

Country:

Europe > Ukraine (0.28)
North America > United States (0.28)
Europe > Greece (0.28)
Asia > Middle East > UAE (0.14)

Genre: Research Report (1.00)

Industry:

Health & Medicine (1.00)
Government > Regional Government (1.00)
Education (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Having Beer after Prayer? Measuring Cultural Bias in Large Language Models

Naous, Tarek, Ryan, Michael J., Ritter, Alan, Xu, Wei

arXiv.org Artificial IntelligenceNov-15-2023

It is important that language models appropriately adapt to specific cultural contexts. However, as we show in this paper, multilingual and Arabic monolingual language models default to Western culture even when prompted in Arabic and contextualized by an Arab cultural setting. To measure this Western bias, we introduce CAMeL, a dataset of naturally occurring Arabic prompts spanning eight diverse cultural aspects and an extensive list of 20,504 cultural targets corresponding to Arab or Western culture. Using CAMeL, we show that models favor Western targets and demonstrate cultural unfairness on downstream tasks such as named entity recognition and sentiment analysis. Our analyses of pretraining corpora also reveal that commonly used sources such as Wikipedia may not be suited to build culturally aware models, underscoring the importance of carefully curating pretraining data in constructing language models to serve a global population.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2305.14456

Country:

Asia (1.00)
Europe (0.93)
North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.67)

Industry: Media (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Reducing Privacy Risks in Online Self-Disclosures with Language Models

Dou, Yao, Krsek, Isadora, Naous, Tarek, Kabra, Anubha, Das, Sauvik, Ritter, Alan, Xu, Wei

arXiv.org Artificial IntelligenceNov-15-2023

Self-disclosure, while being common and rewarding in social media interaction, also poses privacy risks. In this paper, we take the initiative to protect the user-side privacy associated with online self-disclosure through identification and abstraction. We develop a taxonomy of 19 self-disclosure categories, and curate a large corpus consisting of 4.8K annotated disclosure spans. We then fine-tune a language model for identification, achieving over 75% in Token F$_1$. We further conduct a HCI user study, with 82\% of participants viewing the model positively, highlighting its real world applicability. Motivated by the user feedback, we introduce the task of self-disclosure abstraction. We experiment with both one-span abstraction and three-span abstraction settings, and explore multiple fine-tuning strategies. Our best model can generate diverse abstractions that moderately reduce privacy risks while maintaining high utility according to human evaluation.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2311.09538

Country:

Europe (1.00)
North America > United States (0.67)
Asia > Middle East > UAE (0.14)

Genre:

Research Report (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Information Technology > Security & Privacy (0.94)
Media (0.70)
Government (0.67)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Revisiting non-English Text Simplification: A Unified Multilingual Benchmark

Ryan, Michael J., Naous, Tarek, Xu, Wei

arXiv.org Artificial IntelligenceMay-24-2023

Recent advancements in high-quality, large-scale English resources have pushed the frontier of English Automatic Text Simplification (ATS) research. However, less work has been done on multilingual text simplification due to the lack of a diverse evaluation benchmark that covers complex-simple sentence pairs in many languages. This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs. This benchmark will encourage research in developing more effective multilingual text simplification models and evaluation metrics. Our experiments using MultiSim with pre-trained multilingual language models reveal exciting performance improvements from multilingual training in non-English settings. We observe strong performance from Russian in zero-shot cross-lingual transfer to low-resource languages. We further show that few-shot prompting with BLOOM-176b achieves comparable quality to reference simplifications outperforming fine-tuned models in most languages. We validate these findings through human evaluation.

machine learning, natural language, simplification, (19 more...)

arXiv.org Artificial Intelligence

2305.15678

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Media > News (1.00)
Government (0.92)
Health & Medicine > Therapeutic Area > Neurology (0.46)
Education > Educational Setting > K-12 Education (0.45)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(3 more...)

Add feedback

Stanceosaurus: Classifying Stance Towards Multilingual Misinformation

Zheng, Jonathan, Baheti, Ashutosh, Naous, Tarek, Xu, Wei, Ritter, Alan

arXiv.org Artificial IntelligenceOct-28-2022

We present Stanceosaurus, a new corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. As far as we are aware, it is the largest corpus annotated with stance towards misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, we introduce a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance. Pre-trained transformer-based stance classifiers that are fine-tuned on our corpus show good generalization on unseen claims and regional claims from countries outside the training data. Cross-lingual experiments demonstrate Stanceosaurus' capability of training multi-lingual models, achieving 53.1 F1 on Hindi and 50.4 F1 on Arabic without any target-language fine-tuning. Finally, we show how a domain adaptation method can be used to improve performance on Stanceosaurus using additional RumourEval-2019 data. We make Stanceosaurus publicly available to the research community and hope it will encourage further work on misinformation identification across languages and cultures.

computational linguistic, information retrieval, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2210.15954

Country:

Europe > United Kingdom (1.00)
Asia (1.00)
North America > Canada (0.69)
(2 more...)

Genre: Research Report (0.82)

Industry:

Media > News (1.00)
Materials > Chemicals (1.00)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (1.00)
(9 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Clustering Plotted Data by Image Segmentation

Naous, Tarek, Sarkar, Srinjay, Abid, Abubakar, Zou, James

arXiv.org Artificial IntelligenceOct-6-2021

Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data. Existing clustering methods typically treat samples in a dataset as points in a metric space and compute distances to group together similar points. In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data: by training neural networks to perform instance segmentation on plotted data. Our approach, Visual Clustering, has several advantages over traditional clustering algorithms: it is much faster than most existing clustering algorithms (making it suitable for very large datasets), it agrees strongly with human intuition for clusters, and it is by default hyperparameter free (although additional steps with hyperparameters can be introduced for more control of the algorithm). We describe the method and compare it to ten other clustering methods on synthetic data to illustrate its advantages and disadvantages. We then demonstrate how our approach can be extended to higher dimensional data and illustrate its performance on real-world data. The implementation of Visual Clustering is publicly available and can be applied to any dataset in a few lines of code.

artificial intelligence, dataset, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2110.05187

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback