
77b5aaf2826c95c98e5eb4ab830073de-Supplemental-Conference.pdf

Neural Information Processing Systems

A system of regions (also referred to as a network) can comprise multiple disjoint regions that exhibit shared activity patterns across a range of tasks. The auditory system is located in the superior temporal region of the brain. This region uniquely encodes pitch, speech, and music, but is not involved in high-level language comprehension and production [Norman-Haignere et al., 2015, 2019]. In our experiments pertaining to programming language comprehension, we use the activity seen in the auditory system as a negative control. For the Python program comprehension experiment, individual programs were modeled using the period from the onset of the code/sentence problem until the button press. See Fedorenko et al. [2010] for a discussion of the functional localization approach as it pertains to the language network.


A Brain regions

Neural Information Processing Systems

A system of regions (also referred to as a network) can comprise multiple disjoint regions that exhibit shared activity patterns across a range of tasks. The auditory system is located in the superior temporal region of the brain. The voxels were then filtered using gray-matter masking and (for the MD and Language systems) network localization. See Fedorenko et al. [2010] for a discussion of the functional localization approach as it pertains to the language network. For each brain system and each code property or code model, we run a separate MVPA analysis.
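The MVPA analyses mentioned above can be illustrated with a minimal sketch: a decoder is trained on voxel activity patterns for each condition and asked to assign a condition to a new pattern. The toy data and the nearest-centroid decoder below are illustrative assumptions, not the authors' actual pipeline or data.

```python
# Minimal sketch of multi-voxel pattern analysis (MVPA): decode a condition
# (e.g. "code" vs "sentence") from a voxel activity pattern by comparing it
# to the mean training pattern of each condition. Toy data, illustrative only.

def centroid(patterns):
    """Element-wise mean of a list of equal-length voxel patterns."""
    n = len(patterns)
    return [sum(vals) / n for vals in zip(*patterns)]

def decode(pattern, centroids):
    """Assign the condition whose centroid is nearest in Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda label: dist(pattern, centroids[label]))

# Hypothetical training patterns: 4 voxels per pattern, 2 runs per condition.
train = {
    "code":     [[1.0, 0.2, 0.1, 0.9], [0.9, 0.3, 0.2, 1.1]],
    "sentence": [[0.1, 1.0, 0.9, 0.2], [0.2, 0.8, 1.1, 0.1]],
}
centroids = {label: centroid(ps) for label, ps in train.items()}

print(decode([0.95, 0.25, 0.15, 1.0], centroids))  # a code-like test pattern
```

Real analyses would use cross-validated classifiers over many voxels and runs; the centroid decoder only conveys the shape of the computation.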


PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts

Huang, Ziyi, Cui, Xia

arXiv.org Artificial Intelligence

This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components: document representation, dimensionality reduction, and model training in 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while contextual embeddings like FastText and transformer-based document representations, such as those produced by Sentence-BERT, exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLP). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.
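As a concrete reference point for the finding above, TF-IDF weighting can be computed from scratch in a few lines. A production system would use a library vectorizer; the toy corpus and tokenization here are purely illustrative.

```python
# TF-IDF from first principles: term frequency within a document, scaled by
# the log-inverse document frequency of the term across the corpus.
import math
from collections import Counter

def tfidf(docs):
    """docs: list of tokenized documents. Returns one {term: weight} dict each."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return out

# Toy "short texts" already tokenized.
docs = [["happy", "joy"], ["sad", "fear"], ["happy", "fear"]]
vecs = tfidf(docs)
```

Terms that appear in fewer documents ("joy", "sad") receive higher weights than common ones ("happy", "fear"), which is exactly the discriminative signal that makes TF-IDF competitive when training data are scarce.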


AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection

Demirok, Basak, Kutlu, Mucahid

arXiv.org Artificial Intelligence

With their rapid advancement, large language models (LLMs) have become useful in a wide range of fields. While these AI systems can be used for code generation, significantly simplifying and accelerating developers' work, their use by students to complete assignments has raised ethical concerns in education. In this context, determining the author of a particular piece of code becomes important. In this study, we introduce AIGCodeSet, a dataset for AI-generated code detection tasks, specifically for the Python programming language. We obtain the problem descriptions and human-written codes from the CodeNet dataset. Using the problem descriptions, we generate AI-written codes with the CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash models in three approaches: i) generating code from the problem description alone, ii) generating code using the description along with human-written source code containing runtime errors, and iii) generating code using the problem description and human-written code that resulted in wrong answers. Lastly, we conduct a post-processing step to eliminate LLM output irrelevant to the code snippets. Overall, AIGCodeSet consists of 2,828 AI-generated and 4,755 human-written code snippets. We share our code with the research community to support studies on this important topic and provide performance results for baseline AI-generated code detection methods.
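A typical baseline for this kind of detection task starts from stylometric features of the source code. The feature set below (comment ratio, mean line length, blank-line ratio) is a hypothetical illustration of that idea, not the features or baselines reported in the paper.

```python
# Illustrative stylometric features for human- vs AI-written code detection.
# A classifier would be trained on vectors like these; here we only extract them.

def code_features(source: str) -> dict:
    lines = source.splitlines()
    nonblank = [l for l in lines if l.strip()]
    comments = [l for l in nonblank if l.lstrip().startswith("#")]
    return {
        "comment_ratio": len(comments) / max(len(nonblank), 1),
        "mean_line_len": sum(map(len, nonblank)) / max(len(nonblank), 1),
        "blank_ratio": (len(lines) - len(nonblank)) / max(len(lines), 1),
    }

snippet = "# read input\nn = int(input())\n\nprint(n * 2)\n"
feats = code_features(snippet)
```

Whether such surface features separate the two classes well is an empirical question the dataset is designed to answer; stronger baselines would use token-level language models.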


MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations

Retyk, Federico, Gasco, Luis, Carrino, Casimiro Pio, Deniz, Daniel, Zbib, Rabih

arXiv.org Artificial Intelligence

We present the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO was built using high-quality, pre-existing human annotations. We conduct experiments with simple lexical models and general-purpose sentence encoders, evaluated as bi-encoders in a zero-shot setup, to establish baselines for future research.
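The zero-shot bi-encoder setup can be sketched as follows: the mention and every taxonomy entry are encoded independently, and the mention is linked to the entry with the highest cosine similarity. A toy bag-of-words "encoder" and a three-entry mini-taxonomy stand in for a real sentence encoder and the full ESCO taxonomy.

```python
# Bi-encoder entity linking in miniature: encode both sides separately,
# then rank taxonomy entries by cosine similarity to the mention.
import math
from collections import Counter

def encode(text):
    """Toy encoder: bag-of-words counts (a sentence encoder in practice)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link(mention, taxonomy):
    m = encode(mention)
    return max(taxonomy, key=lambda entry: cosine(m, encode(entry)))

# Hypothetical ESCO-style occupation labels.
taxonomy = ["software developer", "data scientist", "nurse"]
print(link("senior python developer", taxonomy))
```

Because neither side needs to see the other at encoding time, taxonomy embeddings can be precomputed once, which is what makes the bi-encoder setup practical for large taxonomies.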


MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness

Goswami, Dhiman, Puspo, Sadiya Sayara Chowdhury, Raihan, Md Nishat, Emran, Al Nahian Bin, Ganguly, Amrita, Zampieri, Marcos

arXiv.org Artificial Intelligence

This paper presents the MasonTigers entry to SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches across 14 different languages. MasonTigers stands out as one of only two teams that participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C. Adhering to the task-specific constraints, our best performing approaches utilize an ensemble of statistical machine learning approaches combined with language-specific BERT-based models and sentence transformers.
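The ensembling idea can be sketched as a simple per-pair score average over the component models. The mean combination rule below is an illustrative assumption; the paper's exact weighting scheme may differ.

```python
# Mean ensemble over relatedness scores: each model scores every sentence
# pair, and the ensemble prediction is the per-pair average.

def mean_ensemble(score_lists):
    """score_lists: one list of relatedness scores per model, aligned by pair."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]

model_a = [0.9, 0.1, 0.5]   # hypothetical per-pair scores from one model
model_b = [0.7, 0.3, 0.5]   # ... and from another
combined = mean_ensemble([model_a, model_b])  # ≈ [0.8, 0.2, 0.5]
```

Averaging tends to cancel the uncorrelated errors of individual models, which is why even this simple rule is often competitive for graded-similarity tasks.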


Evaluating Embeddings for One-Shot Classification of Doctor-AI Consultations

Ojo, Olumide Ebenezer, Adebanji, Olaronke Oluwayemisi, Gelbukh, Alexander, Calvo, Hiram, Feldman, Anna

arXiv.org Artificial Intelligence

Effective communication between healthcare providers and patients is crucial to providing high-quality patient care. In this work, we investigate how doctor-written and AI-generated texts in healthcare consultations can be classified using state-of-the-art embeddings and one-shot classification systems. By analyzing embeddings such as bag-of-words, character n-grams, Word2Vec, GloVe, fastText, and GPT2 embeddings, we examine how well our one-shot classification systems capture semantic information within medical consultations. Results show that the embeddings capture semantic features from text in a reliable and adaptable manner. Overall, the Word2Vec, GloVe, and character n-gram embeddings performed well, indicating their suitability for this task. GPT2 embeddings also showed notable performance, indicating their suitability for models tailored to this task as well. Our machine learning architectures perform well even when training data are scarce, supporting better communication between patients and healthcare providers.
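One-shot classification in this setting reduces to a similarity search: with a single labeled example per class, a new text is assigned to the class whose example it most resembles. The sketch below uses character n-gram counts (one of the representations studied) with a toy overlap similarity; the example texts are invented, not from the paper's data.

```python
# One-shot classification with character n-gram representations: one labeled
# example per class, nearest example wins.
from collections import Counter

def char_ngrams(text, n=3):
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def similarity(a, b):
    """Overlap of n-gram counts (a stand-in for cosine similarity)."""
    return sum(min(a[g], b[g]) for g in a)

def one_shot(text, examples):
    v = char_ngrams(text)
    return max(examples, key=lambda label: similarity(v, char_ngrams(examples[label])))

examples = {  # hypothetical single labeled example per class
    "doctor": "Please take the prescribed medication twice daily.",
    "ai":     "As an AI language model, I recommend consulting a physician.",
}
print(one_shot("As an AI, I cannot provide a diagnosis.", examples))
```

Character n-grams make the comparison robust to small spelling and inflection differences, one reason they performed well alongside word-level embeddings.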


Disaster Tweets Classification using BERT-Based Language Model

Le, Anh Duc

arXiv.org Artificial Intelligence

Social networking services have become an important communication channel in times of emergency. The aim of this study is to create a machine learning language model that can determine whether a person or area is in danger. The ubiquity of smartphones enables people to announce an emergency they are observing in real time. Because of this, more agencies, such as disaster relief organizations and news agencies, are interested in programmatically monitoring Twitter. Designing a language model that can understand and acknowledge when a disaster is happening based on social network posts will become increasingly necessary over time.


Hybrid consistency and plausibility verification of product data according to FIC

Schorr, Christian

arXiv.org Artificial Intelligence

The labelling of food products in the EU is regulated by the Food Information to Consumers (FIC) regulation. Companies are required to provide the corresponding information regarding nutrients and allergens, among others. With the rise of e-commerce, more and more food products are sold online. Online product descriptions often contain errors in the FIC-relevant information due to low data quality in the vendors' product databases. In this paper, we propose a hybrid approach combining rule-based methods and machine learning to verify nutrient declarations and allergen labelling according to FIC requirements. Special focus is given to the problem of false negatives in allergen prediction, since these pose a significant health risk to customers. Results show that a neural net trained on a subset of a product's ingredients can predict the allergens it contains with high reliability.
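The hybrid idea can be sketched as a rule-based keyword check running alongside a statistical model, with the model's decision threshold set low so that allergen false negatives stay rare. The keyword lexicon, scores, and threshold below are illustrative assumptions, not the paper's actual rules or network.

```python
# Hybrid allergen verification sketch: union of rule hits and (stubbed)
# model predictions, with a deliberately low threshold to favor recall.

ALLERGEN_KEYWORDS = {  # hypothetical mini-lexicon
    "gluten": ["wheat", "barley", "rye"],
    "milk": ["milk", "butter", "whey"],
}

def rule_flags(ingredients):
    """Allergens whose keywords appear anywhere in the ingredient list."""
    text = " ".join(ingredients).lower()
    return {a for a, kws in ALLERGEN_KEYWORDS.items() if any(k in text for k in kws)}

def verify(ingredients, model_scores, threshold=0.2):
    """Union of rule hits and model hits; a low threshold reduces false negatives."""
    model_hits = {a for a, s in model_scores.items() if s >= threshold}
    return rule_flags(ingredients) | model_hits

declared = {"gluten"}  # what the online product description claims
predicted = verify(["wheat flour", "whey powder"], {"milk": 0.3, "gluten": 0.9})
missing = predicted - declared  # allergens likely missing from the label
```

Taking the union of the two detectors trades precision for recall, which matches the paper's emphasis that a missed allergen is far costlier than a spurious warning.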


How to detect novelty in textual data streams? A comparative study of existing methods

Christophe, Clément, Velcin, Julien, Cugliari, Jairo, Suignard, Philippe, Boumghar, Manel

arXiv.org Machine Learning

Since datasets annotated for novelty at the document and/or word level are not easily available, we present a simulation framework that allows us to create different textual datasets in which we control how novelty occurs. We also present a benchmark of existing methods for novelty detection in textual data streams. We define a few tasks to solve and compare several state-of-the-art methods. The simulation framework allows us to evaluate their performance on a limited set of scenarios and test their sensitivity to some parameters.
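One simple baseline family for stream novelty detection flags a document when its vocabulary overlaps too little with everything the stream has shown so far. The Jaccard measure, whitespace tokenization, and threshold below are toy choices for illustration, not a method benchmarked in the paper.

```python
# Vocabulary-overlap novelty detection over a text stream: a document is
# novel if its token set barely overlaps the accumulated seen vocabulary.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def stream_novelty(docs, threshold=0.1):
    """Yield (doc, is_novel) pairs; the seen vocabulary grows as the stream runs."""
    seen = set()
    for doc in docs:
        tokens = set(doc.lower().split())
        yield doc, bool(seen) and jaccard(tokens, seen) < threshold
        seen |= tokens

stream = [
    "power grid maintenance report",
    "grid maintenance schedule update",
    "quarterly earnings beat analyst forecasts",  # topically novel
]
flags = [novel for _, novel in stream_novelty(stream)]
```

A simulation framework like the one described makes exactly this kind of baseline testable: since the injected novelty points are known, the flags can be scored against ground truth.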