Clustering
Using Backbone Foundation Model for Evaluating Fairness in Chest Radiography Without Demographic Data
Queiroz, Dilermando, Anjos, André, Berton, Lilian
Ensuring consistent performance across diverse populations and incorporating fairness into machine learning models are crucial for advancing medical image diagnostics and promoting equitable healthcare. However, many databases do not provide protected attributes or contain unbalanced representations of demographic groups, complicating the evaluation of model performance across different demographics and the application of bias mitigation techniques that rely on these attributes. This study aims to investigate the effectiveness of using the backbone of Foundation Models as an embedding extractor for creating groups that represent protected attributes, such as gender and age. We propose utilizing these groups in different stages of bias mitigation, including pre-processing, in-processing, and evaluation. Using databases in and out-of-distribution scenarios, it is possible to identify that the method can create groups that represent gender in both databases and reduce in 4.44% the difference between the gender attribute in-distribution and 6.16% in out-of-distribution. However, the model lacks robustness in handling age attributes, underscoring the need for more fundamentally fair and robust Foundation models. These findings suggest a role in promoting fairness assessment in scenarios where we lack knowledge of attributes, contributing to the development of more equitable medical diagnostics.
Urban context and delivery performance: Modelling service time for cargo bikes and vans across diverse urban environments
Schrader, Maxwell, Kumar, Navish, Sørig, Esben, Yoon, Soonmyeong, Srivastava, Akash, Xu, Kai, Astefanoaei, Maria, Collignon, Nicolas
Light goods vehicles (LGV) used extensively in the last mile of delivery are one of the leading polluters in cities. Cargo-bike logistics and Light Electric Vehicles (LEVs) have been put forward as a high impact candidate for replacing LGVs. Studies have estimated over half of urban van deliveries being replaceable by cargo-bikes, due to their faster speeds, shorter parking times and more efficient routes across cities. However, the logistics sector suffers from a lack of publicly available data, particularly pertaining to cargo-bike deliveries, thus limiting the understanding of their potential benefits. Specifically, service time (which includes cruising for parking, and walking to destination) is a major, but often overlooked component of delivery time modelling. The aim of this study is to establish a framework for measuring the performance of delivery vehicles, with an initial focus on modelling service times of vans and cargo-bikes across diverse urban environments. We introduce two datasets that allow for in-depth analysis and modelling of service times of cargo bikes and use existing datasets to reason about differences in delivery performance across vehicle types. We introduce a modelling framework to predict the service times of deliveries based on urban context. We employ Uber's H3 index to divide cities into hexagonal cells and aggregate OpenStreetMap tags for each cell, providing a detailed assessment of urban context. Leveraging this spatial grid, we use GeoVex to represent micro-regions as points in a continuous vector space, which then serve as input for predicting vehicle service times. We show that geospatial embeddings can effectively capture urban contexts and facilitate generalizations to new contexts and cities. Our methodology addresses the challenge of limited comparative data available for different vehicle types within the same urban settings.
Interactive dense pixel visualizations for time series and model attribution explanations
Schlegel, Udo, Keim, Daniel A.
The field of Explainable Artificial Intelligence (XAI) for Deep Neural Network models has developed significantly, offering numerous techniques to extract explanations from models. However, evaluating explanations is often not trivial, and differences in applied metrics can be subtle, especially with non-intelligible data. Thus, there is a need for visualizations tailored to explore explanations for domains with such data, e.g., time series. We propose DAVOTS, an interactive visual analytics approach to explore raw time series data, activations of neural networks, and attributions in a dense-pixel visualization to gain insights into the data, models' decisions, and explanations. To further support users in exploring large datasets, we apply clustering approaches to the visualized data domains to highlight groups and present ordering strategies for individual and combined data exploration to facilitate finding patterns. We visualize a CNN trained on the FordA dataset to demonstrate the approach.
Fast and Modular Autonomy Software for Autonomous Racing Vehicles
Saba, Andrew, Adetunji, Aderotimi, Johnson, Adam, Kothari, Aadi, Sivaprakasam, Matthew, Spisak, Joshua, Bharatia, Prem, Chauhan, Arjun, Duff, Brendan Jr., Gasparro, Noah, King, Charles, Larkin, Ryan, Mao, Brian, Nye, Micah, Parashar, Anjali, Attias, Joseph, Balciunas, Aurimas, Brown, Austin, Chang, Chris, Gao, Ming, Heredia, Cindy, Keats, Andrew, Lavariega, Jose, Muckelroy, William III, Slavescu, Andre, Stathas, Nickolas, Suvarna, Nayana, Zhang, Chuan Tian, Scherer, Sebastian, Ramanan, Deva
Autonomous motorsports aim to replicate the human racecar driver with software and sensors. As in traditional motorsports, Autonomous Racing Vehicles (ARVs) are pushed to their handling limits in multi-agent scenarios at extremely high ($\geq 150mph$) speeds. This Operational Design Domain (ODD) presents unique challenges across the autonomy stack. The Indy Autonomous Challenge (IAC) is an international competition aiming to advance autonomous vehicle development through ARV competitions. While far from challenging what a human racecar driver can do, the IAC is pushing the state of the art by facilitating full-sized ARV competitions. This paper details the MIT-Pitt-RW Team's approach to autonomous racing in the IAC. In this work, we present our modular and fast approach to agent detection, motion planning and controls to create an autonomy stack. We also provide analysis of the performance of the software stack in single and multi-agent scenarios for rapid deployment in a fast-paced competition environment. We also cover what did and did not work when deployed on a physical system the Dallara AV-21 platform and potential improvements to address these shortcomings. Finally, we convey lessons learned and discuss limitations and future directions for improvement.
Time Series Analysis for Education: Methods, Applications, and Future Directions
Mao, Shengzhong, Zhang, Chaoli, Song, Yichi, Wang, Jindong, Zeng, Xiao-Jun, Xu, Zenglin, Wen, Qingsong
Recent advancements in the collection and analysis of sequential educational data have brought time series analysis to a pivotal position in educational research, highlighting its essential role in facilitating data-driven decision-making. However, there is a lack of comprehensive summaries that consolidate these advancements. To the best of our knowledge, this paper is the first to provide a comprehensive review of time series analysis techniques specifically within the educational context. We begin by exploring the landscape of educational data analytics, categorizing various data sources and types relevant to education. We then review four prominent time series methods-forecasting, classification, clustering, and anomaly detection-illustrating their specific application points in educational settings. Subsequently, we present a range of educational scenarios and applications, focusing on how these methods are employed to address diverse educational tasks, which highlights the practical integration of multiple time series methods to solve complex educational problems. Finally, we conclude with a discussion on future directions, including personalized learning analytics, multimodal data fusion, and the role of large language models (LLMs) in educational time series. The contributions of this paper include a detailed taxonomy of educational data, a synthesis of time series techniques with specific educational applications, and a forward-looking perspective on emerging trends and future research opportunities in educational analysis. The related papers and resources are available and regularly updated at the project page.
Provable Imbalanced Point Clustering
Denisov, David, Feldman, Dan, Dolev, Shlomi, Segal, Michael
We suggest efficient and provable methods to compute an approximation for imbalanced point clustering, that is, fitting $k$-centers to a set of points in $\mathbb{R}^d$, for any $d,k\geq 1$. To this end, we utilize \emph{coresets}, which, in the context of the paper, are essentially weighted sets of points in $\mathbb{R}^d$ that approximate the fitting loss for every model in a given set, up to a multiplicative factor of $1\pm\varepsilon$. We provide [Section 3 and Section E in the appendix] experiments that show the empirical contribution of our suggested methods for real images (novel and reference), synthetic data, and real-world data. We also propose choice clustering, which by combining clustering algorithms yields better performance than each one separately.
Contrastive Learning Subspace for Text Clustering
Yong, Qian, Chen, Chen, Zhou, Xiabing
Contrastive learning has been frequently investigated to learn effective representations for text clustering tasks. While existing contrastive learning-based text clustering methods only focus on modeling instance-wise semantic similarity relationships, they ignore contextual information and underlying relationships among all instances that needs to be clustered. In this paper, we propose a novel text clustering approach called Subspace Contrastive Learning (SCL) which models cluster-wise relationships among instances. Specifically, the proposed SCL consists of two main modules: (1) a self-expressive module that constructs virtual positive samples and (2) a contrastive learning module that further learns a discriminative subspace to capture task-specific cluster-wise relationships among texts. Experimental results show that the proposed SCL method not only has achieved superior results on multiple task clustering datasets but also has less complexity in positive sample construction.
HBIC: A Biclustering Algorithm for Heterogeneous Datasets
José-García, Adán, Jacques, Julie, Chauvet, Clément, Sobanski, Vincent, Dhaenens, Clarisse
Biclustering is an unsupervised machine-learning approach aiming to cluster rows and columns simultaneously in a data matrix. Several biclustering algorithms have been proposed for handling numeric datasets. However, real-world data mining problems often involve heterogeneous datasets with mixed attributes. To address this challenge, we introduce a biclustering approach called HBIC, capable of discovering meaningful biclusters in complex heterogeneous data, including numeric, binary, and categorical data. The approach comprises two stages: bicluster generation and bicluster model selection. In the initial stage, several candidate biclusters are generated iteratively by adding and removing rows and columns based on the frequency of values in the original matrix. In the second stage, we introduce two approaches for selecting the most suitable biclusters by considering their size and homogeneity. Through a series of experiments, we investigated the suitability of our approach on a synthetic benchmark and in a biomedical application involving clinical data of systemic sclerosis patients. The evaluation comparing our method to existing approaches demonstrates its ability to discover high-quality biclusters from heterogeneous data. Our biclustering approach is a starting point for heterogeneous bicluster discovery, leading to a better understanding of complex underlying data structures.
NeurCAM: Interpretable Neural Clustering via Additive Models
Interpretable clustering algorithms aim to group similar data points while explaining the obtained groups to support knowledge discovery and pattern recognition tasks. While most approaches to interpretable clustering construct clusters using decision trees, the interpretability of trees often deteriorates on complex problems where large trees are required. In this work, we introduce the Neural Clustering Additive Model (NeurCAM), a novel approach to the interpretable clustering problem that leverages neural generalized additive models to provide fuzzy cluster membership with additive explanations of the obtained clusters. To promote sparsity in our model's explanations, we introduce selection gates that explicitly limit the number of features and pairwise interactions leveraged. Additionally, we demonstrate the capacity of our model to perform text clustering that considers the contextual representation of the texts while providing explanations for the obtained clusters based on uni- or bi-word terms. Extensive experiments show that NeurCAM achieves performance comparable to black-box methods on tabular datasets while remaining interpretable. Additionally, our approach significantly outperforms other interpretable clustering approaches when clustering on text data.
Accelerating the k-means++ Algorithm by Using Geometric Information
Corominas, Guillem Rodríguez, Blesa, Maria J., Blum, Christian
The k-means clustering is a widely used method in data clustering and unsupervised machine learning, aiming to divide a given dataset into k distinct, non-overlapping clusters. This division seeks to minimize the within-cluster variance. The k-means clustering problem becomes NP-hard when extended beyond a single dimension [3]. Despite this complexity, there are algorithms designed to find sufficiently good solutions within a reasonable amount of time. Among these, Lloyd's algorithm, also referred to as the standard algorithm or batch k-means, is the most renowned [42]. The k-means algorithm is one of the most popular algorithms in data mining [58, 32], mainly due to its simplicity, scalability, and guaranteed termination. However, its performance is highly sensible to the initial placement of the centers [5]. In fact, there is no general approximation expectation for Lloyd's algorithm that applies to all scenarios, i.e., an arbitrary initialization may lead to an arbitrarily bad clustering. Therefore, it is crucial to employ effective initialization methods [24].