Bucharest
Big5PersonalityEssays: Introducing a Novel Synthetic Generated Dataset Consisting of Short State-of-Consciousness Essays Annotated Based on the Five Factor Model of Personality
Psychology, with a focus on psychometry, heavily relies on statistical models of analysis to create a cohesive understanding of personality and preferences of individuals. One of the fields that showed a proper evolution over time is the psychology of personality. Statistics-wise, personality can be modeled using the Five Factor Model (FFM), this model being the most scientifically validated personality model to date. It consists of five personality traits, each divided into 6 facets, usually. These traits can be memorized using the acronym OCEAN: openness to experience (O), conscienciousness (C), extraversion (E), agreeableness (A) and neuroticism (N). The traits are not correlated with one another, as evidence suggests. However, these 5 personality traits can be mapped into two metatraits: plasticity and stability.
A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus
Poesina, Eduard, Caragea, Cornelia, Ionescu, Radu Tudor
Natural language inference (NLI), the task of recognizing the entailment relationship in sentence pairs, is an actively studied topic serving as a proxy for natural language understanding. Despite the relevance of the task in building conversational agents and improving text classification, machine translation and other NLP tasks, to the best of our knowledge, there is no publicly available NLI corpus for the Romanian language. To this end, we introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs, which are obtained via distant supervision, and 6K validation and test sentence pairs, which are manually annotated with the correct labels. We conduct experiments with multiple machine learning methods based on distant learning, ranging from shallow models based on word embeddings to transformer-based neural networks, to establish a set of competitive baselines. Furthermore, we improve on the best model by employing a new curriculum learning strategy based on data cartography. Our dataset and code to reproduce the baselines are available at https://github.com/Eduard6421/RONLI.
Goal-conditioned reinforcement learning for ultrasound navigation guidance
Amadou, Abdoul Aziz, Singh, Vivek, Ghesu, Florin C., Kim, Young-Ho, Stanciulescu, Laura, Sai, Harshitha P., Sharma, Puneet, Young, Alistair, Rajani, Ronak, Rhode, Kawal
Transesophageal echocardiography (TEE) plays a pivotal role in cardiology for diagnostic and interventional procedures. However, using it effectively requires extensive training due to the intricate nature of image acquisition and interpretation. To enhance the efficiency of novice sonographers and reduce variability in scan acquisitions, we propose a novel ultrasound (US) navigation assistance method based on contrastive learning as goal-conditioned reinforcement learning (GCRL). We augment the previous framework using a novel contrastive patient batching method (CPB) and a data-augmented contrastive loss, both of which we demonstrate are essential to ensure generalization to anatomical variations across patients. The proposed framework enables navigation to both standard diagnostic as well as intricate interventional views with a single model. Our method was developed with a large dataset of 789 patients and obtained an average error of 6.56 mm in position and 9.36 degrees in angle on a testing dataset of 140 patients, which is competitive or superior to models trained on individual views. Furthermore, we quantitatively validate our method's ability to navigate to interventional views such as the Left Atrial Appendage (LAA) view used in LAA closure. Our approach holds promise in providing valuable guidance during transesophageal ultrasound examinations, contributing to the advancement of skill acquisition for cardiac ultrasound practitioners.
A Survey of Artificial Intelligence in Gait-Based Neurodegenerative Disease Diagnosis
Rao, Haocong, Zeng, Minlin, Zhao, Xuejiao, Miao, Chunyan
Recent years have witnessed an increasing global population affected by neurodegenerative diseases (NDs), which traditionally require extensive healthcare resources and human effort for medical diagnosis and monitoring. As a crucial disease-related motor symptom, human gait can be exploited to characterize different NDs. The current advances in artificial intelligence (AI) models enable automatic gait analysis for NDs identification and classification, opening a new avenue to facilitate faster and more cost-effective diagnosis of NDs. In this paper, we provide a comprehensive survey on recent progress of machine learning and deep learning based AI techniques applied to diagnosis of five typical NDs through gait. We provide an overview of the process of AI-assisted NDs diagnosis, and present a systematic taxonomy of existing gait data and AI models. Through an extensive review and analysis of 164 studies, we identify and discuss the challenges, potential solutions, and future directions in this field. Finally, we envision the prospective utilization of 3D skeleton data for human gait representation and the development of more efficient AI models for NDs diagnosis. We provide a public resource repository to track and facilitate developments in this emerging field: https://github.com/Kali-Hac/AI4NDD-Survey.
Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models
Bai, Yang, Pei, Ge, Gu, Jindong, Yang, Yong, Ma, Xingjun
Large language models (LLMs) have achieved remarkable performance on a wide range of tasks. However, recent studies have shown that LLMs can memorize training data and simple repeated tokens can trick the model to leak the data. In this paper, we take a step further and show that certain special characters or their combinations with English letters are stronger memory triggers, leading to more severe data leakage. The intuition is that, since LLMs are trained with massive data that contains a substantial amount of special characters (e.g. structural symbols {, } of JSON files, and @, # in emails and online posts), the model may memorize the co-occurrence between these special characters and the raw texts. This motivates us to propose a simple but effective Special Characters Attack (SCA) to induce training data leakage. Our experiments verify the high effectiveness of SCA against state-of-the-art LLMs: they can leak diverse training data, such as code corpus, web pages, and personally identifiable information, and sometimes generate non-stop outputs as a byproduct. We further show that the composition of the training data corpus can be revealed by inspecting the leaked data -- one crucial piece of information for pre-training high-performance LLMs. Our work can help understand the sensitivity of LLMs to special characters and identify potential areas for improvement.
Equilibria in multiagent online problems with predictions
Istrate, Gabriel, Bonchiş, Cosmin, Bogdan, Victor
We study the power of (competitive) algorithms with predictions in a multiagent setting. To this extent we introduce a multiagent version of the ski-rental problem. In this problem agents can collaborate by pooling resources to get a group license for some asset. If the license price is not met agents have to rent the asset individually for the day at a unit price. Otherwise the license becomes available forever to everyone at no extra cost. Our main contribution is a best-response analysis of a single-agent competitive algorithm that assumes perfect knowledge of other agents' actions (but no knowledge of its own renting time). We then analyze the setting when agents have a predictor for their own active time, yielding a tradeoff between robustness and consistency. We investigate the effect of using such a predictor in an equilibrium, as well as the new equilibria formed in this way.
Designing NLP Systems That Adapt to Diverse Worldviews
Creanga, Claudiu, Dinu, Liviu P.
Natural Language Inference (NLI) is foundational for evaluating language understanding in AI. However, progress has plateaued, with models failing on ambiguous examples and exhibiting poor generalization. We argue that this stems from disregarding the subjective nature of meaning, which is intrinsically tied to an individual's \textit{weltanschauung} (which roughly translates to worldview). Existing NLP datasets often obscure this by aggregating labels or filtering out disagreement. We propose a perspectivist approach: building datasets that capture annotator demographics, values, and justifications for their labels. Such datasets would explicitly model diverse worldviews. Our initial experiments with a subset of the SBIC dataset demonstrate that even limited annotator metadata can improve model performance.
Transformer based neural networks for emotion recognition in conversations
Creanga, Claudiu, Dinu, Liviu P.
This paper outlines the approach of the ISDS-NLP team in the SemEval 2024 Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF). For Subtask 1 we obtained a weighted F1 score of 0.43 and placed 12 in the leaderboard. We investigate two distinct approaches: Masked Language Modeling (MLM) and Causal Language Modeling (CLM). For MLM, we employ pre-trained BERT-like models in a multilingual setting, fine-tuning them with a classifier to predict emotions. Experiments with varying input lengths, classifier architectures, and fine-tuning strategies demonstrate the effectiveness of this approach. Additionally, we utilize Mistral 7B Instruct V0.2, a state-of-the-art model, applying zero-shot and few-shot prompting techniques. Our findings indicate that while Mistral shows promise, MLMs currently outperform them in sentence-level emotion classification.
Automated Text Identification Using CNN and Training Dynamics
Creanga, Claudiu, Dinu, Liviu Petrisor
We used Data Maps to model and characterize the AuTexTification dataset. This provides insights about the behaviour of individual samples during training across epochs (training dynamics). We characterized the samples across 3 dimensions: confidence, variability and correctness. This shows the presence of 3 regions: easy-to-learn, ambiguous and hard-to-learn examples. We used a classic CNN architecture and found out that training the model only on a subset of ambiguous examples improves the model's out-of-distribution generalization.
OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
Masala, Mihai, Ilie-Ablachim, Denis C., Corlatescu, Dragos, Zavelca, Miruna, Leordeanu, Marius, Velicu, Horia, Popescu, Marius, Dascalu, Mihai, Rebedea, Traian
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English. Hence, their performance in English greatly exceeds their performance in other languages. This document presents our approach to training and evaluating the first foundational and chat LLM specialized for Romanian.