
Collaborating Authors

 Šuppa, Marek


MMTEB: Massive Multilingual Text Embedding Benchmark

arXiv.org Artificial Intelligence

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB), a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
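The abstract describes the inter-task-correlation downsampling only at a high level. As a rough illustration of the idea, a greedy selection over a matrix of per-task model scores might look like the sketch below; the function name, the "most representative task" starting heuristic, and the max-correlation selection criterion are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def downsample_tasks(scores: np.ndarray, k: int) -> list[int]:
    """Greedily pick k task columns whose score profiles are least
    correlated with the tasks already selected.

    scores: (n_models, n_tasks) matrix of per-task model scores.
    Returns the indices of the selected tasks.
    """
    n_tasks = scores.shape[1]
    corr = np.abs(np.corrcoef(scores, rowvar=False))  # (n_tasks, n_tasks)
    # Start from the task whose scores correlate most with the mean
    # score profile, i.e. the single most "representative" task.
    mean_score = scores.mean(axis=1)
    start = int(np.argmax([np.corrcoef(scores[:, t], mean_score)[0, 1]
                           for t in range(n_tasks)]))
    selected = [start]
    while len(selected) < k:
        remaining = [t for t in range(n_tasks) if t not in selected]
        # Add the task with the lowest maximum correlation to the
        # already-selected set, keeping the subset diverse.
        nxt = min(remaining, key=lambda t: corr[t, selected].max())
        selected.append(nxt)
    return selected
```

With two perfectly correlated tasks in the pool, the selection keeps at most one of them, which is the behavior that preserves diversity while shrinking the benchmark.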


RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification

arXiv.org Artificial Intelligence

Space debris presents a critical challenge for the sustainability of future space missions, emphasizing the need for robust and standardized identification methods. However, a comprehensive benchmark for rocket body classification remains absent. This paper addresses this gap by introducing the RoBo6 dataset for rocket body classification based on light curves. The dataset, derived from the Mini Mega Tortora database, includes light curves for six rocket body classes: CZ-3B, Atlas 5 Centaur, Falcon 9, H-2A, Ariane 5, and Delta 4. With 5,676 training and 1,404 test samples, it addresses data inconsistencies using resampling, normalization, and filtering techniques. Several machine learning models were evaluated, including CNN and transformer-based approaches, with Astroconformer achieving the best performance. The dataset establishes a common benchmark for future comparisons and advancements in rocket body classification tasks.
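As an illustration of the kind of resampling and normalization the dataset description mentions, here is a minimal sketch that linearly interpolates an irregularly sampled light curve onto a fixed grid and scales it to zero mean and unit variance. Both choices are assumptions for illustration, not necessarily the exact preprocessing used for RoBo6.

```python
import numpy as np

def preprocess_light_curve(t: np.ndarray, flux: np.ndarray,
                           n_points: int = 256) -> np.ndarray:
    """Resample an irregularly sampled light curve onto a uniform
    time grid and normalize it to zero mean / unit variance."""
    grid = np.linspace(t.min(), t.max(), n_points)
    resampled = np.interp(grid, t, flux)   # linear resampling
    resampled -= resampled.mean()          # zero-center
    std = resampled.std()
    if std > 0:
        resampled /= std                   # unit variance
    return resampled
```

Fixing the length and scale this way is what lets curves of different durations and brightnesses be batched into a single input format for CNN or transformer classifiers.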


Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA

arXiv.org Artificial Intelligence

This study details our approach for the CASE 2024 Shared Task on Climate Activism Stance and Hate Event Detection, focusing on Hate Speech Detection, Hate Speech Target Identification, and Stance Detection as classification challenges. We explored the capability of Large Language Models (LLMs), particularly GPT-4, in zero- or few-shot settings enhanced by retrieval augmentation and re-ranking for tweet classification. Our goal was to determine if LLMs could match or surpass traditional methods in this context. We conducted an ablation study with LLaMA for comparison, and our results indicate that our models significantly outperformed the baselines, securing second place in the Target Detection task. The code for our submission is available at https://github.com/NaiveNeuron/bryndza-case-2024
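Retrieval augmentation for few-shot classification can be sketched as follows: embed the labeled pool, retrieve the examples most similar to the incoming tweet, and format them as in-context demonstrations for the LLM. The bag-of-words embedding, the prompt template, and the stance labels below are illustrative stand-ins; the actual system relies on an LLM together with its own retriever and re-ranker.

```python
import numpy as np
from collections import Counter

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words embedding (a stand-in for a real encoder)."""
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

def build_few_shot_prompt(query: str, pool: list[tuple[str, str]],
                          vocab: list[str], k: int = 2) -> str:
    """Retrieve the k labeled tweets most similar to the query and
    format them as in-context examples for an LLM classifier."""
    q = embed(query, vocab)

    def sim(text: str) -> float:
        v = embed(text, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        return float(q @ v) / denom if denom else 0.0

    top = sorted(pool, key=lambda ex: sim(ex[0]), reverse=True)[:k]
    shots = "\n".join(f"Tweet: {t}\nStance: {y}" for t, y in top)
    return f"{shots}\nTweet: {query}\nStance:"
```

The prompt ends with an unfilled "Stance:" slot, so the LLM's completion is the predicted label; only the retrieved neighbors, not the whole pool, consume context window.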


Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

arXiv.org Artificial Intelligence

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines in both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.


WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

arXiv.org Artificial Intelligence

Named Entity Recognition (NER) is a lower-level Natural Language Processing (NLP) task in which the aim is to both identify and classify named entity expressions in text into a pre-defined set of semantic types, such as Location, Organization or Person (Goyal et al., 2018). It is a key component of many downstream NLP tasks, ranging from information extraction, machine translation and question answering to entity linking and co-reference resolution, among others. Since its introduction at MUC-6 (Grishman and Sundheim, 1996), the task has been studied extensively, usually as a form of token classification. In recent years, the advent of pre-trained language models (PLMs) combined

In this paper we focus on Slovak, a language of the Indo-European family, spoken by 5 million native speakers, which is still missing a manually annotated NER dataset of substantial size. To fill this gap, we propose the following contributions: We introduce a novel, manually annotated NER dataset called WikiGoldSK, built by annotating articles sampled from Slovak Wikipedia and labeled with four entity classes. We evaluate a selection of multilingual NER baseline models on the presented dataset to compare its quality with that of existing silver-standard Slovak NER datasets.
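Treating NER as token classification typically means converting annotated entity spans into per-token BIO tags, one label per token. A minimal sketch of that conversion follows; the example tokens and the PER/LOC labels are illustrative, not taken from WikiGoldSK.

```python
def spans_to_bio(tokens: list[str],
                 spans: list[tuple[int, int, str]]) -> list[str]:
    """Convert (start, end, label) token-index spans (end exclusive)
    into BIO tags, the standard token-classification format for NER."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags
```

Once in this format, any sequence-labeling model (including the multilingual PLM baselines mentioned above) can be trained and evaluated on the dataset with standard entity-level metrics.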