
Collaborating Authors: Sohn, Jinwon


Monotone Curve Estimation via Convex Duality

arXiv.org Machine Learning

A principal curve serves as a powerful tool for uncovering underlying structures of data through 1-dimensional smooth and continuous representations. Building on optimal transport theory, this paper introduces a novel principal curve framework constrained by monotonicity, with rigorous theoretical justification. We establish statistical guarantees for our monotone curve estimate, including expected empirical and generalized mean squared errors, while proving the existence of such estimates. These statistical foundations justify adopting the popular early-stopping procedure in machine learning to implement our numerical algorithm with neural networks. Comprehensive simulation studies reveal that the proposed monotone curve estimate outperforms competing methods in accuracy when the data exhibit a monotonic structure. Moreover, through two real-world applications, one on futures prices of copper, gold, and silver, and one on avocado prices and sales volume, we underline the robustness of our curve estimate against variable transformation, further confirming its applicability to noisy and complex data sets. We believe this monotone curve-fitting framework offers significant potential for numerous applications where monotonic relationships are intrinsic or need to be imposed.
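The abstract does not spell out the estimator, but the core idea of a monotonicity-constrained curve can be illustrated generically: a curve t -> (x(t), y(t)) is coordinatewise nondecreasing whenever each coordinate is a cumulative sum of nonnegative increments. The sketch below is only an illustration of that constraint, not the paper's optimal-transport-based method; all function names and the nearest-knot projection are assumptions for the example.

```python
import numpy as np

def monotone_curve(raw_dx, raw_dy, x0=0.0, y0=0.0):
    """Build a monotone piecewise-linear curve from unconstrained parameters.

    exp() maps raw parameters to positive increments, so both coordinates
    are strictly increasing along the curve by construction.
    """
    x = x0 + np.concatenate([[0.0], np.cumsum(np.exp(raw_dx))])
    y = y0 + np.concatenate([[0.0], np.cumsum(np.exp(raw_dy))])
    return np.stack([x, y], axis=1)  # (num_knots + 1, 2) array of knots

def projection_mse(data, knots):
    """Mean squared distance from each point to its nearest knot
    (a crude stand-in for projecting onto the continuous curve)."""
    d2 = ((data[:, None, :] - knots[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, 200))
data = np.stack([t, t ** 2], axis=1) + 0.01 * rng.normal(size=(200, 2))
knots = monotone_curve(np.full(50, -4.0), np.full(50, -4.0))
mse = projection_mse(data, knots)
print(mse)
```

In a fitting procedure, the raw increments would be optimized (e.g., by gradient descent with early stopping, as the abstract suggests) to minimize such a projection error.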


Parallelly Tempered Generative Adversarial Networks

arXiv.org Machine Learning

The rising demand for large-scale data with privacy protection has led to the widespread adoption of data generators (or synthesizers) across various domains (Jordon et al., 2022). For instance, the European General Data Protection Regulation mandates data deletion after its primary purpose is fulfilled and restricts sharing due to ownership and privacy concerns. As a promising solution, Mottini et al. (2018) employed a generative model to synthesize Passenger-Name-Record (PNR) data while discarding the original data to comply with such a data-protection policy. Meanwhile, large-scale AI models, which devour massive training datasets, exploit a generative model's capability to produce an arbitrary number of data points for efficient training (Hwang et al., 2023). In the literature, the generative adversarial network (GAN; Goodfellow et al., 2014) particularly stands out as a versatile data synthesizer, demonstrating exceptional capabilities in reconstructing diverse datasets, such as images (Kang et al., 2023), text (de Rosa and Papa, 2021), and tabular data (Zhao et al., 2021), and even in estimating model parameters (Wang and Ročková, 2022). The GAN framework consists of two competing networks D (i.e., the critic) and G (i.e., the generator), where D and G are chosen from neural-network families.
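The competition between D and G can be made concrete with the original GAN value function from Goodfellow et al. (2014), V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))], where D is trained to increase V and G to decrease it. A minimal numpy sketch, with toy closed-form stand-ins for the networks (the linear critic and generator below are assumptions, not the paper's architecture):

```python
import numpy as np

def gan_value(D, G, real, noise):
    """Empirical GAN value: E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(D(real))) + np.mean(np.log(1.0 - D(G(noise))))

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
D = lambda x: sigmoid(2.0 * x - 1.0)  # toy critic: outputs in (0, 1)
G = lambda z: 0.5 * z                 # toy generator

rng = np.random.default_rng(0)
real = rng.normal(1.0, 0.1, 1000)     # samples from the data distribution
noise = rng.normal(0.0, 1.0, 1000)    # latent noise fed to the generator
v = gan_value(D, G, real, noise)
print(v)  # the critic ascends this value, the generator descends it
```

In practice both players are neural networks updated alternately by stochastic gradients on this objective (or a variant of it).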


Fair Supervised Learning with A Simple Random Sampler of Sensitive Attributes

arXiv.org Machine Learning

As data-driven decision processes become dominant in industrial applications, fairness-aware machine learning has attracted great attention in various areas. This work proposes fairness penalties learned by neural networks with a simple random sampler of sensitive attributes for non-discriminatory supervised learning. In contrast to many existing works that critically rely on the discreteness of sensitive attributes and response variables, the proposed penalty can handle versatile formats of sensitive attributes, making it more broadly applicable in practice than many existing algorithms. This penalty enables us to build a computationally efficient group-level in-processing fairness-aware training framework. Empirical evidence shows that our framework enjoys better utility and fairness measures on popular benchmark data sets than competing methods. We also theoretically characterize the estimation error and loss of utility of the proposed neural-penalized risk minimization problem.
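The abstract describes penalties learned by neural networks; as a much simpler illustration of the general in-processing idea (utility loss plus a group-level fairness penalty), one can penalize the demographic-parity gap for a discrete sensitive attribute. This sketch is an assumption-laden simplification, not the paper's learned penalty, which notably also handles non-discrete sensitive attributes:

```python
import numpy as np

def dp_gap(predictions, sensitive):
    """Demographic-parity gap: largest difference in mean prediction
    across groups defined by a discrete sensitive attribute."""
    rates = [predictions[sensitive == g].mean() for g in np.unique(sensitive)]
    return max(rates) - min(rates)

def penalized_risk(utility_loss, predictions, sensitive, lam=1.0):
    """In-processing objective: utility loss plus a weighted fairness penalty."""
    return utility_loss + lam * dp_gap(predictions, sensitive)

rng = np.random.default_rng(0)
s = rng.integers(0, 2, 500)                                  # binary group
yhat = (rng.uniform(size=500) < 0.5 + 0.2 * s).astype(float)  # biased scores
gap = dp_gap(yhat, s)
print(gap)
```

Minimizing `penalized_risk` during training trades prediction accuracy against the fairness penalty via the multiplier `lam`.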


Differentially Private Topological Data Analysis

arXiv.org Machine Learning

This paper is the first to attempt differentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used Čech complex has sensitivity that does not decrease as the sample size $n$ increases. This makes it challenging for the persistence diagrams of Čech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the $L^1$-distance to measure (DTM) has sensitivity $O(1/n)$. Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is defined in terms of the bottleneck distance of the $L^1$-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real dataset tracking human movement.
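To give a feel for the quantity driving the $O(1/n)$ sensitivity: the $L^1$-distance to measure at a query point averages the distances to its $k = \lceil m n \rceil$ nearest sample points, so swapping a single sample perturbs at most one of the $k$ averaged terms, and $k$ grows linearly with $n$ for a fixed mass parameter $m$. The 1-dimensional numpy sketch below is only a plausible illustration of the DTM, not the paper's private mechanism; the function name and parameter choices are assumptions.

```python
import numpy as np

def dtm_l1(sample, query, m=0.1):
    """L1-style distance to measure: at each query point, average the
    distances to the k = ceil(m * n) nearest sample points."""
    n = len(sample)
    k = max(1, int(np.ceil(m * n)))
    dists = np.abs(sample[None, :] - query[:, None])  # 1-D points for brevity
    knn = np.sort(dists, axis=1)[:, :k]               # k smallest per query
    return knn.mean(axis=1)

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, 500)      # point cloud
query = np.array([0.0, 3.0])
vals = dtm_l1(sample, query)
print(vals)  # DTM is small near the mass of the sample, larger far from it
```

The paper then privatizes the persistence diagram built from such a DTM via the exponential mechanism, with utility measured by the bottleneck distance.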