Support Vector Machines
Fast Epigraphical Projection-based Incremental Algorithms for Wasserstein Distributionally Robust Support Vector Machine
Wasserstein Distributionally Robust Optimization (DRO) is concerned with finding decisions that perform well on data that are drawn from the worst-case probability distribution within a Wasserstein ball centered at a certain nominal distribution. In recent years, it has been shown that various DRO formulations of learning models admit tractable convex reformulations. However, most existing works propose to solve these convex reformulations by general-purpose solvers, which are not well-suited for tackling large-scale problems. In this paper, we focus on a family of Wasserstein distributionally robust support vector machine (DRSVM) problems and propose two novel epigraphical projection-based incremental algorithms to solve them. The updates in each iteration of these algorithms can be computed in a highly efficient manner. Moreover, we show that the DRSVM problems considered in this paper satisfy a Hölderian growth condition with explicitly determined growth exponents. Consequently, we are able to establish the convergence rates of the proposed incremental algorithms. Our numerical results indicate that the proposed methods are orders of magnitude faster than the state-of-the-art, and the performance gap grows considerably as the problem size increases.
Calibrated Structured Prediction
Volodymyr Kuleshov, Percy S. Liang
In user-facing applications, displaying calibrated confidence measures (probabilities that correspond to true frequency) can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
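The recalibration idea described above (train a binary classifier to predict the probability that a queried event is correct) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the single feature (the model's raw confidence), the synthetic data, the plain logistic-regression recalibrator, and the function names `fit_recalibrator` and `recalibrated_probability`. It is not the authors' implementation or feature set.

```python
import numpy as np

def fit_recalibrator(features, correct, lr=0.5, steps=2000):
    """Fit a logistic-regression recalibrator by gradient descent.

    features: (n, d) array of recalibration features (hypothetical choice).
    correct:  (n,) array of 0/1 labels saying whether each prediction was right.
    Returns the weight vector and bias.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(correct, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # current predicted probabilities
        grad = p - y                            # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def recalibrated_probability(features, w, b):
    """Map features to a recalibrated probability of correctness."""
    X = np.asarray(features, dtype=float)
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic example: the raw model claims more confidence than it delivers.
rng = np.random.default_rng(0)
raw_conf = rng.uniform(0.5, 1.0, size=2000)
true_prob = 0.5 + 0.5 * (raw_conf - 0.5)        # actual accuracy is lower than claimed
correct = rng.uniform(size=2000) < true_prob

w, b = fit_recalibrator(raw_conf[:, None], correct)
calibrated = recalibrated_probability(raw_conf[:, None], w, b)

print("mean raw confidence:       ", raw_conf.mean())
print("mean recalibrated estimate:", calibrated.mean())
print("empirical accuracy:        ", correct.mean())
```

In this toy setting the average recalibrated probability should sit much closer to the empirical accuracy than the average raw confidence does, which is the property calibration asks for in aggregate.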
Fair CCA for Fair Representation Learning: An ADNI Study
Hou, Bojian, Wang, Zhanliang, Zhou, Zhuoping, Tong, Boning, Wang, Zexuan, Bao, Jingxuan, Duong-Tran, Duy, Long, Qi, Shen, Li
Canonical correlation analysis (CCA) is a technique for finding correlations between different data modalities and learning low-dimensional representations. As fairness becomes crucial in machine learning, fair CCA has gained attention. However, previous approaches often overlook the impact on downstream classification tasks, limiting applicability. We propose a novel fair CCA method for fair representation learning, ensuring the projected features are independent of sensitive attributes, thus enhancing fairness without compromising accuracy. We validate our method on synthetic data and real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), demonstrating its ability to maintain high correlation analysis performance while improving fairness in classification tasks. Our work enables fair machine learning in neuroimaging studies where unbiased analysis is essential. Code is available at https://github.com/ZhanliangAaronWang/FR-CCA-ADNI.
BoMGene: Integrating Boruta-mRMR feature selection for enhanced Gene expression classification
Phan, Bich-Chung, Ma, Thanh, Nguyen, Huu-Hoa, Do, Thanh-Nghi
Feature selection is a crucial step in analyzing gene expression data, enhancing classification performance, and reducing computational costs for high-dimensional datasets. This paper proposes BoMGene, a hybrid feature selection method that effectively integrates two popular techniques: Boruta and Minimum Redundancy Maximum Relevance (mRMR). The method aims to optimize the feature space and enhance classification accuracy. Experiments were conducted on 25 publicly available gene expression datasets, employing widely used classifiers such as Support Vector Machine (SVM), Random Forest, XGBoost (XGB), and Gradient Boosting Machine (GBM). The results show that the Boruta-mRMR combination selects fewer features than mRMR alone, which shortens training time while maintaining or even improving classification accuracy relative to the individual feature selection methods. The proposed approach demonstrates clear advantages in accuracy, stability, and practical applicability for multi-class gene expression data analysis.
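To make the "minimum redundancy, maximum relevance" idea concrete, here is a small sketch of a greedy mRMR-style selection step. It is an illustration under stated assumptions, not BoMGene itself: the Boruta stage is not shown, absolute Pearson correlation stands in for the mutual-information scores used in standard mRMR, the data are synthetic, and the function name `mrmr_select` is hypothetical.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedily pick k feature indices with high relevance and low redundancy.

    Relevance:  |corr(feature, label)|.
    Redundancy: mean |corr(feature, already selected features)|.
    Assumes no feature is constant (the correlation would be undefined).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-feature correlations
    selected = [int(np.argmax(relevance))]       # start with the most relevant feature
    while len(selected) < k:
        remaining = [j for j in range(n_features) if j not in selected]
        scores = [relevance[j] - corr[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

# Synthetic data: features 0 and 1 are near-duplicates, feature 2 carries
# independent signal, feature 3 is pure noise.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
X = np.column_stack([a, a + 0.05 * rng.normal(size=n), b, rng.normal(size=n)])
y = (a + b > 0).astype(float)

selected = mrmr_select(X, y, k=2)
print(selected)
```

The redundancy penalty is what keeps the two near-duplicate features from being chosen together: once one of them is selected, the other scores poorly even though its relevance is high.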
Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models
Zanotto, Sergio E., Aroyehun, Segun
The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality, and we use them for characterizing human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human- and machine-generated texts show stylistic diversity across domains, with human-written texts displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.
Planning Jerk-Optimized Trajectory with Discrete-Time Constraints for Redundant Robots
Dai, Chengkai, Lefebvre, Sylvain, Yu, Kai-Ming, Geraedts, Jo M. P., Wang, Charlie C. L.
We present a method for effectively planning the motion trajectory of robots in manufacturing tasks, the tool-paths of which are usually complex and have a large number of discrete-time constraints as waypoints. Kinematic redundancy also exists in these robotic systems. Our trajectory planning method optimizes the jerk of motion during the fabrication process to improve the quality of fabrication. Our method is based on a sampling strategy and consists of two major parts. After determining an initial path by graph-search, a greedy algorithm is adopted to optimize the path by locally applying adaptive filters in the regions with large jerks. The filtered result is obtained by numerical optimization. In order to achieve efficient computation, an adaptive sampling method is developed for learning a collision-indication function that is represented as a support-vector machine. Applications in robot-assisted 3D printing are given in this paper to demonstrate the functionality of our approach. In robot-assisted manufacturing applications, robotic arms are employed to realize the motion of workpieces (or machining tools) specified as a sequence of waypoints with the positions of the tool tip and the tool orientations constrained. The required degrees of freedom (DOF) are often fewer than those provided by the robotic hardware system (e.g., a robotic arm has 6 DOF). Specifically, rotations of the workpiece around the axis of a tool can be arbitrary (see Figure 1 for an example). By exploiting this redundancy (i.e., there are many possible poses of a robotic arm that realize a given waypoint), the trajectory of the robot can be optimized with respect to velocity, acceleration, and jerk in the joint space. In addition, when fabricating complex models, each tool-path can have a large number of waypoints. It is crucial for a motion planning algorithm to compute a smooth and collision-free trajectory for the robot to improve fabrication quality.
The time taken by the planning algorithm should not significantly lengthen the total manufacturing time; ideally it would remain hidden as computing motions for a layer can be done while the previous layer is printing. The method presented in this paper provides an efficient framework to tackle this problem. The framework has been well tested on our robot-assisted additive manufacturing system to demonstrate its effectiveness and can be generally applied to other robot-assisted manufacturing systems.
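As a rough illustration of what "jerk in the joint space" means for a trajectory given at discrete time steps, the sketch below estimates jerk with a third-order finite difference of the joint positions. This is a generic numerical estimate under the assumption of uniformly spaced waypoints; the function names `discrete_jerk` and `jerk_cost` and the toy trajectory are hypothetical, and the paper's actual objective and filtering scheme are not reproduced here.

```python
import numpy as np

def discrete_jerk(joint_positions, dt):
    """Third-order finite-difference estimate of jerk.

    joint_positions: (n_waypoints, n_joints) array sampled every dt seconds.
    Returns an (n_waypoints - 3, n_joints) array.
    """
    q = np.asarray(joint_positions, dtype=float)
    return np.diff(q, n=3, axis=0) / dt**3

def jerk_cost(joint_positions, dt):
    """Sum of squared jerk over all joints and time steps (one possible cost)."""
    return float((discrete_jerk(joint_positions, dt) ** 2).sum())

# Toy trajectory with two joints: joint 0 follows t^3 (constant jerk of 6),
# joint 1 moves at constant velocity (zero jerk).
dt = 0.1
t = np.arange(10) * dt
trajectory = np.stack([t**3, 2.0 * t], axis=1)

jerk = discrete_jerk(trajectory, dt)
print(jerk[:, 0])  # approximately 6 for every step
print(jerk[:, 1])  # approximately 0 for every step
print(jerk_cost(trajectory, dt))
```

A planner that exploits kinematic redundancy could compare candidate joint-space paths through the same waypoints by a cost of this kind and prefer the smoother one.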
From Fragile to Certified: Wasserstein Audits of Group Fairness Under Distribution Shift
Ehyaei, Ahmad-Reza, Farnadi, Golnoosh, Samadi, Samira
Group-fairness metrics (e.g., equalized odds) can vary sharply across resamples and are especially brittle under distribution shift, undermining reliable audits. We propose a Wasserstein distributionally robust framework that certifies worst-case group fairness over a ball of plausible test distributions centered at the empirical law. Our formulation unifies common group fairness notions via a generic conditional-probability functional and defines $\varepsilon$-Wasserstein Distributional Fairness ($\varepsilon$-WDF) as the audit target. Leveraging strong duality, we derive tractable reformulations and an efficient estimator (DRUNE) for $\varepsilon$-WDF. We prove feasibility and consistency and establish finite-sample certification guarantees for auditing fairness, along with quantitative bounds under smoothness and margin conditions. Across standard benchmarks and classifiers, $\varepsilon$-WDF delivers stable fairness assessments under distribution shift, providing a principled basis for auditing and certifying group fairness beyond observational data.
Bayesian Nonlinear Support Vector Machines and Discriminative Factor Modeling
A new Bayesian formulation is developed for nonlinear support vector machines (SVMs), based on a Gaussian process and with the SVM hinge loss expressed as a scaled mixture of normals. We then integrate the Bayesian SVM into a factor model, in which feature learning and nonlinear classifier design are performed jointly; almost all previous work on such discriminative feature learning has assumed a linear classifier. Inference is performed with expectation conditional maximization (ECM) and Markov Chain Monte Carlo (MCMC).
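For readers unfamiliar with the phrase "scaled mixture of normals", one standard identity of this type is shown below. It is given only as an illustration: the abstract does not spell out the representation, so the exact scaling used by the authors may differ. Here $y \in \{-1, +1\}$ is the label, $f$ is the classifier output, and $\lambda$ is an auxiliary latent scale (notation assumed for this sketch).

$$
\exp\{-2\max(1 - y f,\, 0)\} \;=\; \int_0^{\infty} \frac{1}{\sqrt{2\pi\lambda}} \exp\left\{-\frac{(1 + \lambda - y f)^2}{2\lambda}\right\} d\lambda
$$

The left-hand side is the pseudo-likelihood induced by the hinge loss. Conditioned on $\lambda$, the integrand is Gaussian in $f$, which is what makes conditionally conjugate updates possible in schemes such as ECM and MCMC.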
Object Localization based on Structural SVM using Privileged Information
We propose a structured prediction algorithm for object localization based on Support Vector Machines (SVMs) using privileged information. Privileged information provides useful high-level knowledge for image understanding and facilitates learning a reliable model even with a small number of training examples. In our setting, we assume that such information is available only at training time since it may be difficult to obtain from visual data accurately without human supervision. Our goal is to improve performance by incorporating privileged information into an ordinary learning framework and adjusting model parameters for better generalization. We tackle the object localization problem with a novel structural SVM using privileged information, where an alternating loss-augmented inference procedure is employed to handle the term in the objective function corresponding to privileged information. We apply the proposed algorithm to the Caltech-UCSD Birds 200-2011 dataset, and obtain encouraging results suggesting further investigation into the benefit of privileged information in structured prediction.
Latent Support Measure Machines for Bag-of-Words Data Classification
In many classification problems, the input is represented as a set of features, e.g., the bag-of-words (BoW) representation of documents. Support vector machines (SVMs) are widely used tools for such classification problems. The performance of SVMs is generally determined by whether kernel values between data points can be defined properly. However, SVMs for BoW representations have a major weakness in that the co-occurrence of different but semantically similar words cannot be reflected in the kernel calculation. To overcome this weakness, we propose a kernel-based discriminative classifier for BoW data, which we call the latent support measure machine (latent SMM). With the latent SMM, a latent vector is associated with each vocabulary term, and each document is represented as a distribution of the latent vectors for words appearing in the document. To represent the distributions efficiently, we use the kernel embeddings of distributions that hold high-order moment information about distributions. Then the latent SMM finds a separating hyperplane that maximizes the margins between distributions of different classes while estimating latent vectors for words to improve the classification performance. In the experiments, we show that the latent SMM achieves state-of-the-art accuracy for BoW text classification, is robust with respect to its own hyper-parameters, and is useful to visualize words.
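The kernel-embedding step described above can be sketched as follows: if each document is treated as an empirical distribution over the latent vectors of its words, the inner product of two kernel mean embeddings is the average of the pairwise kernel values between the two sets of vectors. The sketch makes several assumptions for illustration: the word vectors are fixed by hand (in the latent SMM they are estimated during training), an RBF kernel is used, and the names `rbf_kernel` and `embedding_kernel` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel values between rows of A and rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def embedding_kernel(doc_a, doc_b, word_vectors, gamma=1.0):
    """Inner product of the kernel mean embeddings of two documents.

    doc_a, doc_b: lists of word indices (repeats allowed).
    word_vectors: (vocab_size, dim) array of latent vectors, one per word.
    """
    A = word_vectors[doc_a]
    B = word_vectors[doc_b]
    return float(rbf_kernel(A, B, gamma).mean())

# Hand-set latent vectors: words 0 and 1 are "similar", words 2 and 3 are "similar".
word_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

doc1 = [0]   # contains only word 0
doc2 = [1]   # shares no word with doc1, but its word is close to word 0
doc3 = [2]   # contains a dissimilar word

k12 = embedding_kernel(doc1, doc2, word_vectors)
k13 = embedding_kernel(doc1, doc3, word_vectors)
print(k12, k13)
```

A plain BoW linear kernel would give zero similarity between `doc1` and `doc2` because they share no words, whereas the embedding kernel assigns them a high value, which is the weakness the abstract sets out to address.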