AITopics | Schütze, Hinrich


Schütze, Hinrich


Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning

He, Linyang, Nie, Ercong, Schmid, Helmut, Schütze, Hinrich, Mesgarani, Nima, Brennan, Jonathan

arXiv.org Artificial Intelligence, Nov-11-2024

This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM evaluation paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent LLMs' true linguistic capabilities. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. Our contributions are three-fold: (1) We compare neurolinguistic and psycholinguistic methods, revealing distinct patterns in LLM assessment; (2) We demonstrate that LLMs exhibit higher competence in form compared to meaning, with the latter largely correlated to the former; (3) We present new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.07533

Country:

North America > United States (0.68)
Europe > Middle East > Malta (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Kargaran, Amir Hossein, Yvon, François, Schütze, Hinrich

arXiv.org Artificial Intelligence, Oct-31-2024

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it (including the pipeline, language identification model, and filters) available to the research community. Corpus v. 1.0: https://huggingface.co/datasets/cis-lmu/GlotCC-v1; Pipeline v. 3.0: https://github.com/cisnlp/GlotCC

computational linguistic, large language model, machine learning, (23 more...)

arXiv.org Artificial Intelligence

2410.23825

Country:

Europe > France (0.28)
North America > Canada (0.28)
Asia > Middle East > UAE (0.14)
North America > Mexico > Mexico City (0.14)

Genre: Research Report (1.00)

Industry:

Law (0.93)
Information Technology (0.93)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)


MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

Kargaran, Amir Hossein, Modarressi, Ali, Nikeghbal, Nafiseh, Diesner, Jana, Yvon, François, Schütze, Hinrich

arXiv.org Artificial Intelligence, Oct-8-2024

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code: https://github.com/cisnlp/Mexa.
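The alignment step described in this abstract can be pictured with a small sketch. It is not the authors' implementation: the function name `alignment_score`, the toy embeddings, and the strict criterion (a parallel pair counts as aligned only if its similarity beats every non-parallel pairing in the same row and column) are assumptions made for illustration. In practice the embeddings would come from an intermediate layer of the model under evaluation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_score(en_embs, xx_embs):
    """Fraction of parallel pairs whose similarity exceeds all non-parallel rivals.

    en_embs[i] and xx_embs[i] are assumed to embed the same sentence in
    English and in the other language (hypothetical inputs).
    """
    n = len(en_embs)
    if n == 0 or n != len(xx_embs):
        raise ValueError("need two non-empty lists of equal length")
    sim = [[cosine(en_embs[i], xx_embs[j]) for j in range(n)] for i in range(n)]
    hits = 0
    for i in range(n):
        # Rivals: the same English sentence against other translations,
        # and other English sentences against this translation.
        rivals = [sim[i][j] for j in range(n) if j != i]
        rivals += [sim[j][i] for j in range(n) if j != i]
        if not rivals or sim[i][i] > max(rivals):
            hits += 1
    return hits / n


if __name__ == "__main__":
    english = [[1.0, 0.0], [0.0, 1.0]]
    other = [[0.9, 0.1], [0.2, 0.8]]
    print(alignment_score(english, other))
```

A score near 1 means parallel sentences land close together in the model's representation space; the abstract reports that such scores correlate with downstream task performance.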

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.05873

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization

Wang, Mingyang, Lange, Lukas, Adel, Heike, Strötgen, Jannik, Schütze, Hinrich

arXiv.org Artificial Intelligence, Oct-3-2024

To ensure large language models contain up-to-date knowledge, they need to be updated regularly. However, model editing is challenging as it might also affect knowledge that is unrelated to the new data. State-of-the-art methods identify parameters associated with specific knowledge and then modify them via direct weight updates. However, these locate-and-edit methods suffer from heavy computational overhead and lack theoretical validation. In contrast, directly fine-tuning the model on requested edits affects the model's behavior on unrelated knowledge, and significantly damages the model's generation fluency and consistency. To address these challenges, we propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization. Evaluations on three model editing benchmarks show that SAUL is a practical and reliable solution for model editing outperforming state-of-the-art methods while maintaining generation quality and reducing computational overhead.

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2410.02433

Country:

Asia (0.69)
Europe > Germany > Baden-Württemberg (0.14)

Genre: Research Report > Promising Solution (0.68)

Industry:

Leisure & Entertainment (0.50)
Media > Television (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)


LangSAMP: Language-Script Aware Multilingual Pretraining

Liu, Yihong, Ye, Haotian, Ma, Chunlan, Wang, Mingyang, Schütze, Hinrich

arXiv.org Artificial Intelligence, Sep-26-2024

Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings (learnable vectors assigned to different languages). These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model's ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
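The architectural idea in this abstract (language and script embeddings joined to the transformer output just before the language modeling head) can be sketched in a few lines. This is only an illustration: the abstract says the embeddings are "integrated", and the element-wise addition used here is an assumption, as are the function names and the toy dimensions.

```python
def add_vectors(*vecs):
    """Element-wise sum of equal-length vectors."""
    return [sum(vals) for vals in zip(*vecs)]


def lm_head(hidden, weight):
    """Linear projection from a hidden vector to vocabulary logits."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def langsamp_logits(token_states, lang_emb, script_emb, weight):
    """Fuse language and script embeddings into each final hidden state,
    then project to logits. Addition is an assumed fusion operator."""
    return [
        lm_head(add_vectors(state, lang_emb, script_emb), weight)
        for state in token_states
    ]


if __name__ == "__main__":
    states = [[1.0, 2.0]]               # one token, hidden size 2
    lang = [0.5, 0.0]                   # hypothetical language embedding
    script = [0.0, -1.0]                # hypothetical script embedding
    head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocabulary of 3
    print(langsamp_logits(states, lang, script, head))
```

Because the extra embeddings enter only at the output side, the encoder itself still runs without language IDs, which matches the abstract's goal of keeping the model usable as a universal text encoder.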

computational linguistic, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2409.18199

Country:

Europe (1.00)
Asia (0.93)
North America > United States > Minnesota (0.27)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.66)


EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

Ji, Shaoxiong, Li, Zihao, Paul, Indraneil, Paavola, Jaakko, Lin, Peiqin, Chen, Pinzhen, O'Brien, Dayyán, Luo, Hengyu, Schütze, Hinrich, Tiedemann, Jörg, Haddow, Barry

arXiv.org Artificial Intelligence, Sep-26-2024

In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.

large language model, latn, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2409.17892

Country:

North America > United States (0.45)
Europe > Germany (0.28)
Europe > Austria > Vienna (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Media (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

Yüksel, Arda, Köksal, Abdullatif, Şenel, Lütfi Kerem, Korhonen, Anna, Schütze, Hinrich

arXiv.org Artificial Intelligence, Jul-17-2024

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU.

artificial intelligence, large language model, turkishmmlu, (1 more...)

arXiv.org Artificial Intelligence

2407.12402

Country: Asia > Middle East > Republic of Türkiye (0.24)

Genre: Research Report (0.69)

Industry: Education > Educational Setting > K-12 Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)


Consistent Document-Level Relation Extraction via Counterfactuals

Modarressi, Ali, Köksal, Abdullatif, Schütze, Hinrich

arXiv.org Artificial Intelligence, Jul-9-2024

Many datasets have been developed to train and evaluate document-level relation extraction (RE) models. Most of these are constructed using real-world data. It has been shown that RE models trained on real-world data suffer from factual biases. To evaluate and address this issue, we present CovEReD, a counterfactual data generation approach for document-level relation extraction datasets using entity replacement. We first demonstrate that models trained on factual data exhibit inconsistent behavior: while they accurately extract triples from factual data, they fail to extract the same triples after counterfactual modification. This inconsistency suggests that models trained on factual data rely on spurious signals such as specific entities and external knowledge, rather than on the input context, to extract triples. We show that by generating document-level counterfactual data with CovEReD and training models on them, consistency is maintained with minimal impact on RE performance. We release our CovEReD pipeline as well as Re-DocRED-CF, a dataset of counterfactual RE documents, to assist in evaluating and addressing inconsistency in document-level RE.
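The consistency check described in this abstract can be illustrated with a toy example. This is a sketch under stated assumptions, not the CovEReD pipeline: the helper names, the plain string replacement, and the two toy extractors are all hypothetical, and real entity replacement would need to handle mention boundaries and entity types.

```python
def replace_entity(document, triples, old, new):
    """Swap one entity for another in both the text and its gold triples."""
    cf_doc = document.replace(old, new)
    cf_triples = [
        (new if head == old else head, rel, new if tail == old else tail)
        for head, rel, tail in triples
    ]
    return cf_doc, cf_triples


def is_consistent(extract, document, triples, old, new):
    """True if the extractor is right on the factual document and on its
    counterfactual version."""
    cf_doc, cf_triples = replace_entity(document, triples, old, new)
    return (set(extract(document)) == set(triples)
            and set(extract(cf_doc)) == set(cf_triples))


def context_extractor(doc):
    """Toy extractor that reads the answer from the input text."""
    head, _, tail = doc.rstrip(".").partition(" was born in ")
    return [(head, "place_of_birth", tail)]


def memorizing_extractor(doc):
    """Toy extractor that ignores the text and answers from memory."""
    return [("Marie Curie", "place_of_birth", "Warsaw")]


if __name__ == "__main__":
    doc = "Marie Curie was born in Warsaw."
    gold = [("Marie Curie", "place_of_birth", "Warsaw")]
    print(is_consistent(context_extractor, doc, gold, "Warsaw", "Lyon"))
    print(is_consistent(memorizing_extractor, doc, gold, "Warsaw", "Lyon"))
```

Both toy extractors are correct on the factual sentence; only the one that reads the context stays correct after the entity is replaced, which is the kind of inconsistency the abstract attributes to reliance on external knowledge.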

artificial intelligence, natural language, replacement, (17 more...)

arXiv.org Artificial Intelligence

2407.06699

Country:

Europe (1.00)
North America > United States (0.69)
Asia > Middle East > UAE (0.14)

Genre: Research Report (0.50)

Industry:

Media (0.70)
Leisure & Entertainment (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)


Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts

Ma, Chunlan, Liu, Yihong, Ye, Haotian, Schütze, Hinrich

arXiv.org Artificial Intelligence, Jul-2-2024

Decoder-only large language models (LLMs) excel in high-resource languages across various tasks through few-shot or even zero-shot in-context learning (ICL). However, their performance often does not transfer well to low-resource languages, especially those written in non-Latin scripts. Inspired by recent work that leverages transliteration in encoder-only models, we investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts. To this end, we propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. We apply these methods to several representative LLMs of different sizes on various tasks including text classification and sequential labeling. Our findings show that the effectiveness of transliteration varies by task type and model size. For instance, all models benefit from transliterations for sequential labeling (with increases of up to 25%).
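The three prompt variants mentioned in this abstract (original script, Latin script, or both) can be sketched as follows. The wording of the templates, the function names, and the tiny Cyrillic lookup table are assumptions for illustration; the abstract does not specify the templates or the transliteration tool, and a real setup would use a proper transliterator.

```python
# Toy transliteration table covering only the letters used in the example.
TOY_TABLE = {"п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t"}


def toy_transliterate(text):
    """Map each known character to Latin; leave everything else unchanged."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in text)


def build_prompt(text, mode, task="Classify the topic of the text."):
    """Build one of three hypothetical prompt templates for in-context learning."""
    latin = toy_transliterate(text)
    if mode == "original":
        body = f"Text: {text}"
    elif mode == "latin":
        body = f"Text: {latin}"
    elif mode == "both":
        body = f"Text: {text}\nTransliteration: {latin}"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{task}\n{body}\nLabel:"


if __name__ == "__main__":
    for mode in ("original", "latin", "both"):
        print(build_prompt("привет", mode))
        print("---")
```

Keeping the task instruction and label slot identical across the three modes means any difference in model output can be attributed to the script of the input text.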

script, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2407.0232

Country:

North America > Canada (0.28)
Europe > Middle East > Malta (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)


A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation

Liu, Yongkang, Nie, Ercong, Feng, Shi, Hua, Zheng, Ding, Zifeng, Wang, Daling, Zhang, Yifei, Schütze, Hinrich

arXiv.org Artificial Intelligence, Jun-28-2024

Current state-of-the-art dialogue systems heavily rely on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMD²G. The AMD²G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a de-domaining data processing technique used to remove domain-specific features. By mitigating the effects of domain-specific features, the model trained on the de-domained corpora can effectively learn common expression patterns in different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMD²G achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMD²G as a viable alternative solution for low-resource multi-domain dialogue generation. Code and data associated with our work are available in a GitHub repository.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2406.09881

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology (0.89)
Leisure & Entertainment (0.68)
Media > Film (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
