AITopics | Mæhlum, Petter

Mæhlum, Petter

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Burchell, Laurie, de Gibert, Ona, Arefyev, Nikolay, Aulamo, Mikko, Bañón, Marta, Chen, Pinzhen, Fedorova, Mariia, Guillou, Liane, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Henriksson, Erik, Klimaszewski, Mateusz, Komulainen, Ville, Kutuzov, Andrey, Kytöniemi, Joona, Laippala, Veronika, Mæhlum, Petter, Malik, Bhavitvya, Mehryary, Farrokh, Mikhailov, Vladislav, Moghe, Nikita, Myntti, Amanda, O'Brien, Dayyán, Oepen, Stephan, Pal, Proyag, Piha, Jousia, Pyysalo, Sampo, Ramírez-Sánchez, Gema, Samuel, David, Stepachev, Pavel, Tiedemann, Jörg, Variš, Dušan, Vojtěchová, Tereza, Zaragoza-Bernabeu, Jaume

arXiv.org Artificial Intelligence, Mar-14-2025

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

artificial intelligence, machine translation, natural language, (18 more...)

2503.10267

Country:

Europe (1.00)
Asia > Middle East (0.92)
South America (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology (0.67)
Government (0.46)
Education (0.46)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Multi-label Scandinavian Language Identification (SLIDE)

Fedorova, Mariia, Frydenberg, Jonas Sebulon, Handford, Victoria, Langø, Victoria Ovedie Chruickshank, Willoch, Solveig Helene, Midtgaard, Marthe Løken, Scherrer, Yves, Mæhlum, Petter, Samuel, David

arXiv.org Artificial Intelligence, Feb-10-2025

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.

large language model, machine learning, natural language, (17 more...)

2502.06692

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Mixed Feelings: Cross-Domain Sentiment Classification of Patient Feedback

Rønningstad, Egil, Storset, Lilja Charlotte, Mæhlum, Petter, Øvrelid, Lilja, Velldal, Erik

arXiv.org Artificial Intelligence, Jan-31-2025

Sentiment analysis of patient feedback from the public health domain can aid decision makers in evaluating the provided services. The current paper focuses on free-text comments in patient surveys about general practitioners and psychiatric healthcare, annotated with four sentence-level polarity classes -- positive, negative, mixed and neutral -- while also attempting to alleviate data scarcity by leveraging general-domain sources in the form of reviews. For several different architectures, we compare in-domain and out-of-domain effects, as well as the effects of training joint multi-domain models.

artificial intelligence, natural language, text classification, (16 more...)

2501.19134

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Industry: Health & Medicine > Health Care Providers & Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.72)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.72)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.50)

A Collection of Question Answering Datasets for Norwegian

Mikhailov, Vladislav, Mæhlum, Petter, Langø, Victoria Ovedie Chruickshank, Velldal, Erik, Øvrelid, Lilja

arXiv.org Artificial Intelligence, Jan-19-2025

This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both of the written standards of Norwegian - Bokmål and Nynorsk - our datasets comprise over 10k question-answer pairs, created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.

artificial intelligence, dataset, natural language

2501.11128

Country: Europe > Norway (0.24)

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.60)

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

de la Rosa, Javier, Mikhailov, Vladislav, Zhang, Lemei, Wetjen, Freddy, Samuel, David, Liu, Peng, Braaten, Rolv-Arild, Mæhlum, Petter, Birkenes, Magnus Breder, Kutuzov, Andrey, Enstad, Tita, Brygfjeld, Svein Arne, Gulla, Jon Atle, Oepen, Stephan, Velldal, Erik, Østgulen, Wilfred, Øvrelid, Lilja, Myhre, Aslak Sira

arXiv.org Artificial Intelligence, Dec-12-2024

The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

artificial intelligence, large language model, natural language, (17 more...)

2412.0946

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Media > News (0.38)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

It's Difficult to be Neutral -- Human and LLM-based Sentiment Annotation of Patient Comments

Mæhlum, Petter, Samuel, David, Norman, Rebecka Maria, Jelin, Elma, Bjertnæs, Øyvind Andresen, Øvrelid, Lilja, Velldal, Erik

arXiv.org Artificial Intelligence, Apr-29-2024

Sentiment analysis is an important tool for aggregating patient voices, in order to provide targeted improvements in healthcare services. A prerequisite for this is the availability of in-domain data annotated for sentiment. This article documents an effort to add sentiment annotations to free-text comments in patient surveys collected by the Norwegian Institute of Public Health (NIPH). However, annotation can be a time-consuming and resource-intensive process, particularly when it requires domain expertise. We therefore also evaluate a possible alternative to human annotation, using large language models (LLMs) as annotators. We perform an extensive evaluation of the approach for two openly available pretrained LLMs for Norwegian, experimenting with different configurations of prompts and in-context learning, comparing their performance to human annotators. We find that even for zero-shot runs, models perform well above the baseline for binary sentiment, but still cannot compete with human annotators on the full dataset.

large language model, natural language, sentiment, (19 more...)

2404.18832

Country:

Europe (1.00)
North America > United States > Oregon (0.14)

Genre:

Questionnaire & Opinion Survey (0.66)
Research Report (0.64)
Overview (0.46)

Industry:

Health & Medicine > Consumer Health (0.48)
Health & Medicine > Therapeutic Area (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Estimating Lexical Complexity from Document-Level Distributions

Wold, Sondre, Mæhlum, Petter, Hove, Oddbjørn

arXiv.org Artificial Intelligence, Apr-1-2024

Existing methods for complexity estimation are typically developed for entire documents. This limitation in scope makes them inapplicable for shorter pieces of text, such as health assessment tools. These typically consist of lists of independent sentences, all of which are too short for existing methods to apply. The choice of wording in these assessment tools is crucial, as both the cognitive capacity and the linguistic competency of the intended patient groups could vary substantially. As a first step towards creating better tools for supporting health practitioners, we develop a two-step approach for estimating lexical complexity that does not rely on any pre-annotated data. We implement our approach for the Norwegian language and verify its effectiveness using statistical testing and a qualitative evaluation of samples from real assessment tools. We also investigate the relationship between our complexity measure and certain features typically associated with complexity in the literature, such as word length, frequency, and the number of syllables.

artificial intelligence, machine learning, natural language, (19 more...)

2404.01196

Country:

Europe (0.68)
North America > United States > Utah (0.14)
North America > United States > Louisiana (0.14)
North America > United States > Iowa (0.14)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.68)
