Goto

Collaborating Authors

 Clustering


Claim Detection for Automated Fact-checking: A Survey on Monolingual, Multilingual and Cross-Lingual Research

arXiv.org Artificial Intelligence

Automated fact-checking has drawn considerable attention over the past few decades due to the increase in the diffusion of misinformation on online platforms. This is often carried out as a sequence of tasks comprising (i) the detection of sentences circulating in online platforms which constitute claims needing verification, followed by (ii) the verification process of those claims. This survey focuses on the former, by discussing existing efforts towards detecting claims needing fact-checking, with a particular focus on multilingual data and methods. This is a challenging and fertile direction where existing methods are yet far from matching human performance due to the profoundly challenging nature of the issue. Especially, the dissemination of information across multiple social platforms, articulated in multiple languages and modalities demands more generalized solutions for combating misinformation. Focusing on multilingual misinformation, we present a comprehensive survey of existing multilingual claim detection research. We present state-of-the-art multilingual claim detection research categorized into three key factors of the problem, verifiability, priority, and similarity. Further, we present a detailed overview of the existing multilingual datasets along with the challenges and suggest possible future advancements.


Unsupervised Machine Learning for the Classification of Astrophysical X-ray Sources

arXiv.org Artificial Intelligence

The automatic classification of X-ray detections is a necessary step in extracting astrophysical information from compiled catalogs of astrophysical sources. Classification is useful for the study of individual objects, statistics for population studies, as well as for anomaly detection, i.e., the identification of new unexplored phenomena, including transients and spectrally extreme sources. Despite the importance of this task, classification remains challenging in X-ray astronomy due to the lack of optical counterparts and representative training sets. We develop an alternative methodology that employs an unsupervised machine learning approach to provide probabilistic classes to Chandra Source Catalog sources with a limited number of labeled sources, and without ancillary information from optical and infrared catalogs. We provide a catalog of probabilistic classes for 8,756 sources, comprising a total of 14,507 detections, and demonstrate the success of the method at identifying emission from young stellar objects, as well as distinguishing between small-scale and large-scale compact accretors with a significant level of confidence. We investigate the consistency between the distribution of features among classified objects and well-established astrophysical hypotheses such as the unified AGN model. This provides interpretability to the probabilistic classifier. Code and tables are available publicly through GitHub. We provide a web playground for readers to explore our final classification at https://umlcaxs-playground.streamlit.app.


Semi-supervised segmentation of land cover images using nonlinear canonical correlation analysis with multiple features and t-SNE

arXiv.org Artificial Intelligence

Image segmentation is a clustering task whereby each pixel is assigned a cluster label. Remote sensing data usually consists of multiple bands of spectral images in which there exist semantically meaningful land cover subregions, co-registered with other source data such as LIDAR (LIght Detection And Ranging) data, where available. This suggests that, in order to account for spatial correlation between pixels, a feature vector associated with each pixel may be a vectorized tensor representing the multiple bands and a local patch as appropriate. Similarly, multiple types of texture features based on a pixel's local patch would also be beneficial for encoding locally statistical information and spatial variations, without necessarily labelling pixel-wise a large amount of ground truth, then training a supervised model, which is sometimes impractical. In this work, by resorting to label only a small quantity of pixels, a new semi-supervised segmentation approach is proposed. Initially, over all pixels, an image data matrix is created in high dimensional feature space. Then, t-SNE projects the high dimensional data onto 3D embedding. By using radial basis functions as input features, which use the labelled data samples as centres, to pair with the output class labels, a modified canonical correlation analysis algorithm, referred to as RBF-CCA, is introduced which learns the associated projection matrix via the small labelled data set. The associated canonical variables, obtained for the full image, are applied by k-means clustering algorithm. The proposed semi-supervised RBF-CCA algorithm has been implemented on several remotely sensed multispectral images, demonstrating excellent segmentation results.


Double-Bounded Optimal Transport for Advanced Clustering and Classification

arXiv.org Artificial Intelligence

Optimal transport (OT) is attracting increasing attention in machine learning. It aims to transport a source distribution to a target one at minimal cost. In its vanilla form, the source and target distributions are predetermined, which contracts to the real-world case involving undetermined targets. In this paper, we propose Doubly Bounded Optimal Transport (DB-OT), which assumes that the target distribution is restricted within two boundaries instead of a fixed one, thus giving more freedom for the transport to find solutions. Based on the entropic regularization of DB-OT, three scaling-based algorithms are devised for calculating the optimal solution. We also show that our DB-OT is helpful for barycenter-based clustering, which can avoid the excessive concentration of samples in a single cluster. Then we further develop DB-OT techniques for long-tailed classification which is an emerging and open problem. We first propose a connection between OT and classification, that is, in the classification task, training involves optimizing the Inverse OT to learn the representations, while testing involves optimizing the OT for predictions. With this OT perspective, we first apply DB-OT to improve the loss, and the Balanced Softmax is shown as a special case. Then we apply DB-OT for inference in the testing process. Even with vanilla Softmax trained features, our extensive experimental results show that our method can achieve good results with our improved inference scheme in the testing stage.


Multicollinearity Resolution Based on Machine Learning: A Case Study of Carbon Emissions in Sichuan Province

arXiv.org Artificial Intelligence

This study preprocessed 2000-2019 energy consumption data for 46 key Sichuan industries using matrix normalization. DBSCAN clustering identified 16 feature classes to objectively group industries. Penalized regression models were then applied for their advantages in overfitting control, high-dimensional data processing, and feature selection - well-suited for the complex energy data. Results showed the second cluster around coal had highest emissions due to production needs. Emissions from gasoline-focused and coke-focused clusters were also significant. Based on this, emission reduction suggestions included clean coal technologies, transportation management, coal-electricity replacement in steel, and industry standardization. The research introduced unsupervised learning to objectively select factors and aimed to explore new emission reduction avenues. In summary, the study identified industry groupings, assessed emissions drivers, and proposed scientific reduction strategies to better inform decision-making using algorithms like DBSCAN and penalized regression models.


Spatial-temporal-demand clustering for solving large-scale vehicle routing problems with time windows

arXiv.org Artificial Intelligence

Several metaheuristics use decomposition and pruning strategies to solve large-scale instances of the vehicle routing problem (VRP). Those complexity reduction techniques often rely on simple, problem-specific rules. However, the growth in available data and advances in computer hardware enable data-based approaches that use machine learning (ML) to improve scalability of solution algorithms. We propose a decompose-route-improve (DRI) framework that groups customers using clustering. Its similarity metric incorporates customers' spatial, temporal, and demand data and is formulated to reflect the problem's objective function and constraints. The resulting sub-routing problems can independently be solved using any suitable algorithm. We apply pruned local search (LS) between solved subproblems to improve the overall solution. Pruning is based on customers' similarity information obtained in the decomposition phase. In a computational study, we parameterize and compare existing clustering algorithms and benchmark the DRI against the Hybrid Genetic Search (HGS) of Vidal et al. (2013). Results show that our data-based approach outperforms classic cluster-first, route-second approaches solely based on customers' spatial information. The newly introduced similarity metric forms separate sub-VRPs and improves the selection of LS moves in the improvement phase. Thus, the DRI scales existing metaheuristics to achieve high-quality solutions faster for large-scale VRPs by efficiently reducing complexity. Further, the DRI can be easily adapted to various solution methods and VRP characteristics, such as distribution of customer locations and demands, depot location, and different time window scenarios, making it a generalizable approach to solving routing problems.


Enabling clustering algorithms to detect clusters of varying densities through scale-invariant data preprocessing

arXiv.org Artificial Intelligence

In this paper, we show that preprocessing data using a variant of rank transformation called 'Average Rank over an Ensemble of Sub-samples (ARES)' makes clustering algorithms robust to data representation and enable them to detect varying density clusters. Our empirical results, obtained using three most widely used clustering algorithms-namely KMeans, DBSCAN, and DP (Density Peak)-across a wide range of real-world datasets, show that clustering after ARES transformation produces better and more consistent results.


Polytopic Autoencoders with Smooth Clustering for Reduced-order Modelling of Flows

arXiv.org Artificial Intelligence

With the advancement of neural networks, there has been a notable increase, both in terms of quantity and variety, in research publications concerning the application of autoencoders to reduced-order models. We propose a polytopic autoencoder architecture that includes a lightweight nonlinear encoder, a convex combination decoder, and a smooth clustering network. Supported by several proofs, the model architecture ensures that all reconstructed states lie within a polytope, accompanied by a metric indicating the quality of the constructed polytopes, referred to as polytope error. Additionally, it offers a minimal number of convex coordinates for polytopic linear-parameter varying systems while achieving acceptable reconstruction errors compared to proper orthogonal decomposition (POD). To validate our proposed model, we conduct simulations involving two flow scenarios with the incompressible Navier-Stokes equation. Numerical results demonstrate the guaranteed properties of the model, low reconstruction errors compared to POD, and the improvement in error using a clustering network.


Empowering Aggregators with Practical Data-Driven Tools: Harnessing Aggregated and Disaggregated Flexibility for Demand Response

arXiv.org Artificial Intelligence

This study explores the crucial interplay between aggregators and building occupants in activating flexibility through Demand Response (DR) programs, with a keen focus on achieving robust decarbonization and fortifying the resilience of the energy system amidst the uncertainties presented by Renewable Energy Sources (RES). Firstly, it introduces a methodology of optimizing aggregated flexibility provision strategies in environments with limited data, utilizing Discrete Fourier Transformation (DFT) and clustering techniques to identify building occupant's activity patterns. Secondly, the study assesses the disaggregated flexibility provision of Heating Ventilation and Air Conditioning (HVAC) systems during DR events, employing machine learning and optimization techniques for precise, device-level analysis. The first approach offers a non-intrusive pathway for aggregators to provide flexibility services in environments of a single smart meter for the whole building's consumption, while the second approach carefully considers building occupants' thermal comfort profiles, while maximizing flexibility in case of existence of dedicated smart meters to the HVAC systems. Through the application of data-driven techniques and encompassing case studies from both industrial and residential buildings, this paper not only unveils pivotal opportunities for aggregators in the balancing and emerging flexibility markets but also successfully develops end-to-end practical tools for aggregators. Furthermore, the efficacy of this tool is validated through detailed case studies, substantiating its operational capability and contributing to the evolution of a resilient and efficient energy system.


Classification with neural networks with quadratic decision functions

arXiv.org Artificial Intelligence

Neural network with quadratic decision functions have been introduced as alternatives to standard neural networks with affine linear one. They are advantageous when the objects to be identified are of compact basic geometries like circles, ellipsis etc. In this paper we investigate the use of such ansatz functions for classification. In particular we test and compare the algorithm on the MNIST dataset for classification of handwritten digits and for classification of subspecies. We also show, that the implementation can be based on the neural network structure in the software Tensorflow and Keras, respectively.