big data and data science
From User Preferences to Optimization Constraints Using Large Language Models
Sanguinetti, Manuela, Perniciano, Alessandra, Zedda, Luca, Loddo, Andrea, Di Ruberto, Cecilia, Atzori, Maurizio
This work explores using Large Language Models (LLMs) to translate user preferences into energy optimization constraints for home appliances. We describe a task where natural language user utterances are converted into formal constraints for smart appliances, within the broader context of a renewable energy community (REC) and in the Italian scenario. We evaluate the effectiveness of various LLMs currently available for Italian in translating these preferences resorting to classical zero-shot, one-shot, and few-shot learning settings, using a pilot dataset of Italian user requests paired with corresponding formal constraint representation. Our contributions include establishing a baseline performance for this task, publicly releasing the dataset and code for further research, and providing insights on observed best practices and limitations of LLMs in this particular domain.
A Methodology to extract Geo-Referenced Standard Routes from AIS Data
Corvino, Michela, Daffinà, Filippo, Francalanci, Chiara, Giacomazzi, Paolo, Magliani, Martina, Ravanelli, Paolo, Stahl, Torbjorn
Maritime AIS (Automatic Identification Systems) data serve as a valuable resource for studying vessel behavior. This study proposes a methodology to analyze route between maritime points of interest and extract geo-referenced standard routes, as maritime patterns of life, from raw AIS data. The underlying assumption is that ships adhere to consistent patterns when travelling in certain maritime areas due to geographical, environmental, or economic factors. Deviations from these patterns may be attributed to weather conditions, seasonality, or illicit activities. This enables maritime surveillance authorities to analyze the navigational behavior between ports, providing insights on vessel route patterns, possibly categorized by vessel characteristics (type, flag, or size). Our methodological process begins by segmenting AIS data into distinct routes using a finite state machine (FSM), which describes routes as seg-ments connecting pairs of points of interest. The extracted segments are ag-gregated based on their departure and destination ports and then modelled using iterative density-based clustering to connect these ports. The cluster-ing parameters are assigned manually to sample and then extended to the en-tire dataset using linear regression. Overall, the approach proposed in this paper is unsupervised and does not require any ground truth to be trained. The approach has been tested on data on the on a six-year AIS dataset cover-ing the Arctic region and the Europe, Middle East, North Africa areas. The total size of our dataset is 1.15 Tbytes. The approach has proved effective in extracting standard routes, with less than 5% outliers, mostly due to routes with either their departure or their destination port not included in the test areas.
Active Data Sampling and Generation for Bias Remediation
Adequate sampling space coverage is the keystone to effectively train trustworthy Machine Learning models. Unfortunately, real data do carry several inherent risks due to the many potential biases they exhibit when gathered without a proper random sampling over the reference population, and most of the times this is way too expensive or time consuming to be a viable option. Depending on how training data have been gathered, unmitigated biases can lead to harmful or discriminatory consequences that ultimately hinders large scale applicability of pre-trained models and undermine their truthfulness or fairness expectations. In this paper, a mixed active sampling and data generation strategy -- called samplation -- is proposed as a mean to compensate during fine-tuning of a pre-trained classifer the unfair classifications it produces, assuming that the training data come from a non-probabilistic sampling schema. Given a pre-trained classifier, first a fairness metric is evaluated on a test set, then new reservoirs of labeled data are generated and finally a number of reversely-biased artificial samples are generated for the fine-tuning of the model. Using as case study Deep Models for visual semantic role labeling, the proposed method has been able to fully cure a simulated gender bias starting from a 90/10 imbalance, with only a small percentage of new data and with a minor effect on accuracy.
Symbolic Audio Classification via Modal Decision Tree Learning
Marzano, Enrico, Pagliarini, Giovanni, Pasini, Riccardo, Sciavicco, Guido, Stan, Ionel Eduard
The range of potential applications of acoustic analysis is wide. Classification of sounds, in particular, is a typical machine learning task that received a lot of attention in recent years. The most common approaches to sound classification are sub-symbolic, typically based on neural networks, and result in black-box models with high performances but very low transparency. In this work, we consider several audio tasks, namely, age and gender recognition, emotion classification, and respiratory disease diagnosis, and we approach them with a symbolic technique, that is, (modal) decision tree learning. We prove that such tasks can be solved using the same symbolic pipeline, that allows to extract simple rules with very high accuracy and low complexity. In principle, all such tasks could be associated to an autonomous conversation system, which could be useful in different contexts, such as an automatic reservation agent for an hospital or a clinic.
Interpretable Machine Learning for Oral Lesion Diagnosis through Prototypical Instances Identification
Cascione, Alessio, Setzu, Mattia, Galatolo, Federico A., Cimino, Mario G. C. A., Guidotti, Riccardo
Decision-making processes in healthcare can be highly complex and challenging. Machine Learning tools offer significant potential to assist in these processes. However, many current methodologies rely on complex models that are not easily interpretable by experts. This underscores the need to develop interpretable models that can provide meaningful support in clinical decision-making. When approaching such tasks, humans typically compare the situation at hand to a few key examples and representative cases imprinted in their memory. Using an approach which selects such exemplary cases and grounds its predictions on them could contribute to obtaining high-performing interpretable solutions to such problems. To this end, we evaluate PivotTree, an interpretable prototype selection model, on an oral lesion detection problem, specifically trying to detect the presence of neoplastic, aphthous and traumatic ulcerated lesions from oral cavity images. We demonstrate the efficacy of using such method in terms of performance and offer a qualitative and quantitative comparison between exemplary cases and ground-truth prototypes selected by experts.
From Text to Talent: A Pipeline for Extracting Insights from Candidate Profiles
Frazzetto, Paolo, Haq, Muhammad Uzair Ul, Fabris, Flavia, Sperduti, Alessandro
The recruitment process is undergoing a significant transformation with the increasing use of machine learning and natural language processing techniques. While previous studies have focused on automating candidate selection, the role of multiple vacancies in this process remains understudied. This paper addresses this gap by proposing a novel pipeline that leverages Large Language Models and graph similarity measures to suggest ideal candidates for specific job openings. Our approach represents candidate profiles as multimodal embeddings, enabling the capture of nuanced relationships between job requirements and candidate attributes. The proposed approach has significant implications for the recruitment industry, enabling companies to streamline their hiring processes and identify top talent more efficiently. Our work contributes to the growing body of research on the application of machine learning in human resources, highlighting the potential of LLMs and graph-based methods in revolutionizing the recruitment landscape.
Multivariate Time Series Anomaly Detection in Industry 5.0
Colombi, Lorenzo, Vespa, Michela, Belletti, Nicolas, Brina, Matteo, Dahdal, Simon, Tabanelli, Filippo, Bellodi, Elena, Tortonesi, Mauro, Stefanelli, Cesare, Vignoli, Massimiliano
Industry 5.0 environments present a critical need for effective anomaly detection methods that can indicate equipment malfunctions, process inefficiencies, or potential safety hazards. The ever-increasing sensorization of manufacturing lines makes processes more observable, but also poses the challenge of continuously analyzing vast amounts of multivariate time series data. These challenges include data quality since data may contain noise, be unlabeled or even mislabeled. A promising approach consists of combining an embedding model with other Machine Learning algorithms to enhance the overall performance in detecting anomalies. Moreover, representing time series as vectors brings many advantages like higher flexibility and improved ability to capture complex temporal dependencies. We tested our solution in a real industrial use case, using data collected from a Bonfiglioli plant. The results demonstrate that, unlike traditional reconstruction-based autoencoders, which often struggle in the presence of sporadic noise, our embedding-based framework maintains high performance across various noise conditions.
Visually Linking AI, Machine Learning, Deep Learning, Big Data and Data Science
Over the past few years AI has exploded, and especially since 2015. Much of that has to do with the wide availability of GPUs that make parallel processing ever faster, cheaper, and more powerful. It also has to do with the simultaneous one-two punch of practically infinite storage and a flood of data of every stripe (that whole Big Data movement) – images, text, transactions, mapping data, you name it.
Quantum Computing - How it may Revolutionize Big Data?
Quantum computing is round the corner and uses in machine learning are already being hypothesised [2]. Big Data is involved in virtually every industry, from health care to finance. Each industry and task requires different approaches to how data is utilised. One thing that remains consistent is the increasing volume, velocity and variety of data (the 3V's of Big Data) involved and the challenge to put as much data to use for real time, effective solutions [3]. The involvement of machine learning has undoubtedly revolutionised this process, leading to far more accurate models and solutions to challenges in more unique ways [4].
Visually Linking AI, Machine Learning, Deep Learning, Big Data and Data Science
Over the past few years AI has exploded, and especially since 2015. Much of that has to do with the wide availability of GPUs that make parallel processing ever faster, cheaper, and more powerful. It also has to do with the simultaneous one-two punch of practically infinite storage and a flood of data of every stripe (that whole Big Data movement) – images, text, transactions, mapping data, you name it.