Collaborating Authors

 Samir, Farhan


Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset

arXiv.org Artificial Intelligence

Curating datasets that span multiple languages is challenging. To make collection more scalable, researchers often incorporate one or more imperfect classifiers in the process, like language identification models. These models, however, are prone to failure, resulting in some language subsets being unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable subsets. By annotating only 20 samples per language subset, we are able to identify systematic transcription errors in 10 language subsets of a recent large multilingual transcribed audio dataset, X-IPAPack (Zhu et al., 2024). We find that filtering out this low-quality data when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in out-of-distribution languages. Our method lays a path forward for systematic and reliable multilingual dataset auditing.
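The abstract does not spell out the test's exact form, but a preference judgment over 20 annotated samples naturally reduces to a one-sided binomial test. The sketch below makes that reading concrete; the null proportion, significance level, and "prefer the alternative transcription" framing are assumptions for illustration, not the paper's published procedure.

```python
from scipy.stats import binomtest

def preference_proportion_test(n_prefer_alternative: int, n_total: int = 20,
                               null_p: float = 0.5, alpha: float = 0.05) -> bool:
    """Flag a language subset as unreliable.

    Hypothetical reading of the Preference Proportion Test: an annotator
    compares the dataset's transcription against an alternative for each
    of n_total samples; we test whether the alternative is preferred more
    often than chance (null_p) and flag the subset if so.
    """
    result = binomtest(n_prefer_alternative, n_total, null_p,
                       alternative="greater")
    return result.pvalue < alpha

# Example: 16 of 20 annotations prefer the alternative transcription;
# p = P(X >= 16 | p = 0.5) ~ 0.006 < 0.05, so the subset is flagged.
print(preference_proportion_test(16))  # True
```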


Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia

arXiv.org Artificial Intelligence

To explain social phenomena and identify systematic biases, much research in computational social science focuses on comparative text analyses. These studies often rely on coarse corpus-level statistics or local word-level analyses, mainly in English. We introduce the InfoGap method -- an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level, across languages. We evaluate InfoGap by analyzing LGBT people's portrayals across 2.7K biography pages on the English, Russian, and French Wikipedias. We find large discrepancies in factual coverage across the languages. Moreover, our analysis reveals that biographical facts carrying negative connotations are more likely to be highlighted in Russian Wikipedia. Crucially, InfoGap both facilitates large-scale analyses and pinpoints local document- and fact-level information gaps, laying a new foundation for targeted and nuanced comparative language analysis at scale.
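As a rough illustration of fact-level gap detection, the sketch below embeds each fact from two articles and flags source facts whose best cross-article match falls below a similarity threshold. The `toy_embed` character-trigram encoder and the `threshold` value are stand-ins; the actual InfoGap pipeline decomposes paragraphs into facts and relies on a proper multilingual encoder and verification step.

```python
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding (character-trigram hashing). For real use,
    swap in a multilingual sentence encoder."""
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    return v

def find_info_gaps(facts_src, facts_tgt, embed, threshold=0.75):
    """Return source facts with no sufficiently similar target fact."""
    src = np.array([embed(f) for f in facts_src])
    tgt = np.array([embed(f) for f in facts_tgt])
    src /= np.linalg.norm(src, axis=1, keepdims=True) + 1e-9
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-9
    sims = src @ tgt.T              # pairwise cosine similarity
    best = sims.max(axis=1)         # best target match per source fact
    return [f for f, s in zip(facts_src, best) if s < threshold]

# Facts assumed pre-translated into a shared language for the toy encoder.
en = ["She won a national literary award in 2015.",
      "She was born in Toronto."]
ru = ["She was born in Toronto."]
print(find_info_gaps(en, ru, toy_embed))  # the award fact is the gap
```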


The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

arXiv.org Artificial Intelligence

In this project, we demonstrate that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We curated the IPAPACK, a massively multilingual speech corpus with phonemic transcriptions, encompassing more than 115 languages from diverse language families, selectively checked by linguists. Based on the IPAPACK, we propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences. The proposed model was tested on 95 unseen languages, showing strong generalizability across languages. Temporal alignments between phonemes and speech signals also emerged from contrastive training, enabling zero-shot forced alignment in unseen languages. We further introduce a neural forced aligner, IPA-ALIGNER, by fine-tuning CLAP-IPA with the Forward-Sum loss to learn better phone-to-audio alignment. Evaluation results suggest that IPA-ALIGNER can generalize to unseen languages without adaptation.
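The objective sketched below is a standard CLIP-style symmetric InfoNCE loss between paired speech and phoneme-sequence embeddings, one plausible instantiation of the contrastive training the abstract describes; the encoders, batch construction, and temperature value are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb: torch.Tensor, phone_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings: matched
    (speech, phoneme-sequence) pairs sit on the diagonal and are pulled
    together; mismatched pairs within the batch are pushed apart."""
    speech = F.normalize(speech_emb, dim=-1)
    phones = F.normalize(phone_emb, dim=-1)
    logits = speech @ phones.T / temperature     # (B, B) similarities
    targets = torch.arange(logits.size(0))       # diagonal = positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Stand-in encoder outputs: a batch of 8 pairs with 256-dim embeddings.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```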


Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

arXiv.org Artificial Intelligence

Data augmentation techniques are widely used in low-resource automatic morphological inflection to overcome data sparsity. However, the full implications of these techniques remain poorly understood. In this study, we aim to shed light on the theoretical aspects of the prominent data augmentation strategy StemCorrupt (Silfverberg et al., 2017; Anastasopoulos and Neubig, 2019), a method that generates synthetic examples by randomly substituting stem characters in gold-standard training examples. To begin, we conduct an information-theoretic analysis, arguing that StemCorrupt improves compositional generalization by eliminating spurious correlations between morphemes, specifically between the stem and the affixes. Our theoretical analysis further leads us to study the sample efficiency with which StemCorrupt reduces these spurious correlations. Through evaluation across seven typologically distinct languages, we demonstrate that selecting a subset of datapoints with both high diversity and high predictive uncertainty significantly enhances the data efficiency of StemCorrupt. However, we also explore the impact of typological features on the choice of data selection strategy and find that languages with a high degree of allomorphy and phonological alternations derive less benefit from synthetic examples with high uncertainty. We attribute this effect to phonotactic violations induced by StemCorrupt, emphasizing the need for further research to ensure optimal performance across the entire spectrum of natural language morphology.
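The core StemCorrupt operation is simple to state: substitute random alphabet characters into the shared stem of a gold (lemma, inflected form) pair, so the affix no longer correlates with any particular stem material. A minimal sketch, approximating the stem by the pair's longest common prefix (the published method uses character alignment):

```python
import random

def stem_corrupt(lemma: str, inflected: str, alphabet: str,
                 p: float = 0.5, seed: int = 0) -> tuple:
    """Replace stem characters with random symbols in both strings,
    keeping the pair consistent so the affix stays intact."""
    rng = random.Random(seed)
    # Approximate the shared stem as the longest common prefix.
    stem_len = 0
    for a, b in zip(lemma, inflected):
        if a != b:
            break
        stem_len += 1
    stem = list(lemma[:stem_len])
    for i in range(stem_len):
        if rng.random() < p:          # substitute each stem char w.p. p
            stem[i] = rng.choice(alphabet)
    stem = "".join(stem)
    return stem + lemma[stem_len:], stem + inflected[stem_len:]

# e.g. ('wblk', 'wblked'): the stem is corrupted, the suffix preserved.
print(stem_corrupt("walk", "walked", "abcdefghijklmnopqrstuvwxyz"))
```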