Goto

Collaborating Authors

 Banerjee, Arindam


An Improved Analysis of Alternating Minimization for Structured Multi-Response Regression

Neural Information Processing Systems

Multi-response linear models aggregate a set of vanilla linear models by assuming correlated noise across them, which has an unknown covariance structure. To find the coefficient vector, estimators with a joint approximation of the noise covariance are often preferred than the simple linear regression in view of their superior empirical performance, which can be generally solved by alternating-minimization type procedures. Due to the non-convex nature of such joint estimators, the theoretical justification of their efficiency is typically challenging. The existing analyses fail to fully explain the empirical observations due to the assumption of resampling on the alternating procedures, which requires access to fresh samples in each iteration. In this work, we present a resampling-free analysis for the alternating minimization algorithm applied to the multi-response regression. In particular, we focus on the high-dimensional setting of multi-response linear models with structured coefficient parameter, and the statistical error of the parameter can be expressed by the complexity measure, Gaussian width, which is related to the assumed structure. More importantly, to the best of our knowledge, our result reveals for the first time that the alternating minimization with random initialization can achieve the same performance as the well-initialized one when solving this multi-response regression problem. Experimental results support our theoretical developments.


An Improved Analysis of Alternating Minimization for Structured Multi-Response Regression

Neural Information Processing Systems

Multi-response linear models aggregate a set of vanilla linear models by assuming correlated noise across them, which has an unknown covariance structure. To find the coefficient vector, estimators with a joint approximation of the noise covariance are often preferred than the simple linear regression in view of their superior empirical performance, which can be generally solved by alternating-minimization type procedures. Due to the non-convex nature of such joint estimators, the theoretical justification of their efficiency is typically challenging. The existing analyses fail to fully explain the empirical observations due to the assumption of resampling on the alternating procedures, which requires access to fresh samples in each iteration. In this work, we present a resampling-free analysis for the alternating minimization algorithm applied to the multi-response regression. In particular, we focus on the high-dimensional setting of multi-response linear models with structured coefficient parameter, and the statistical error of the parameter can be expressed by the complexity measure, Gaussian width, which is related to the assumed structure. More importantly, to the best of our knowledge, our result reveals for the first time that the alternating minimization with random initialization can achieve the same performance as the well-initialized one when solving this multi-response regression problem. Experimental results support our theoretical developments.


DAPPER: Scaling Dynamic Author Persona Topic Model to Billion Word Corpora

arXiv.org Machine Learning

Extracting common narratives from multi-author dynamic text corpora requires complex models, such as the Dynamic Author Persona (DAP) topic model. However, such models are complex and can struggle to scale to large corpora, often because of challenging non-conjugate terms. To overcome such challenges, in this paper we adapt new ideas in approximate inference to the DAP model, resulting in the DAP Performed Exceedingly Rapidly (DAPPER) topic model. Specifically, we develop Conjugate-Computation Variational Inference (CVI) based variational Expectation-Maximization (EM) for learning the model, yielding fast, closed form updates for each document, replacing iterative optimization in earlier work. Our results show significant improvements in model fit and training time without needing to compromise the model's temporal structure or the application of Regularized Variation Inference (RVI). We demonstrate the scalability and effectiveness of the DAPPER model by extracting health journeys from the CaringBridge corpus --- a collection of 9 million journals written by 200,000 authors during health crises.


Adversarial Recommendation: Attack of the Learned Fake Users

arXiv.org Machine Learning

Can machine learning models for recommendation be easily fooled? While the question has been answered for hand-engineered fake user profiles, it has not been explored for machine learned adversarial attacks. This paper attempts to close this gap. We propose a framework for generating fake user profiles which, when incorporated in the training of a recommendation system, can achieve an adversarial intent, while remaining indistinguishable from real user profiles. We formulate this procedure as a repeated general-sum game between two players: an oblivious recommendation system $R$ and an adversarial fake user generator $A$ with two goals: (G1) the rating distribution of the fake users needs to be close to the real users, and (G2) some objective $f_A$ encoding the attack intent, such as targeting the top-K recommendation quality of $R$ for a subset of users, needs to be optimized. We propose a learning framework to achieve both goals, and offer extensive experiments considering multiple types of attacks highlighting the vulnerability of recommendation systems.


High Dimensional Data Enrichment: Interpretable, Fast, and Data-Efficient

arXiv.org Machine Learning

High dimensional structured data enriched model describes groups of observations by shared and per-group individual parameters, each with its own structure such as sparsity or group sparsity. In this paper, we consider the general form of data enrichment where data comes in a fixed but arbitrary number of groups G. Any convex function, e.g., norms, can characterize the structure of both shared and individual parameters. We propose an estimator for high dimensional data enriched model and provide conditions under which it consistently estimates both shared and individual parameters. We also delineate sample complexity of the estimator and present high probability non-asymptotic bound on estimation error of all parameters. Interestingly the sample complexity of our estimator translates to conditions on both per-group sample sizes and the total number of samples. We propose an iterative estimation algorithm with linear convergence rate and supplement our theoretical analysis with synthetic and real experimental results. Particularly, we show the predictive power of data-enriched model along with its interpretable results in anticancer drug sensitivity analysis.


Topic Modeling on Health Journals With Regularized Variational Inference

AAAI Conferences

Topic modeling enables exploration and compact representation of a corpus. The CaringBridge (CB) dataset is a massive collection of journals written by patients and caregivers during a health crisis. Topic modeling on the CB dataset, however, is challenging due to the asynchronous nature of multiple authors writing about their health journeys. To overcome this challenge we introduce the Dynamic Author-Persona topic model (DAP), a probabilistic graphical model designed for temporal corpora with multiple authors. The novelty of the DAP model lies in its representation of authors by a persona---where personas capture the propensity to write about certain topics over time. Further, we present a regularized variational inference (RVI) algorithm, which we use to encourage the DAP model's personas to be distinct. Our results show significant improvements over competing topic models---particularly after regularization, and highlight the DAP model's unique ability to capture common journeys shared by different authors.


Topic Modeling on Health Journals with Regularized Variational Inference

arXiv.org Machine Learning

Topic modeling enables exploration and compact representation of a corpus. The CaringBridge (CB) dataset is a massive collection of journals written by patients and caregivers during a health crisis. Topic modeling on the CB dataset, however, is challenging due to the asynchronous nature of multiple authors writing about their health journeys. To overcome this challenge we introduce the Dynamic Author-Persona topic model (DAP), a probabilistic graphical model designed for temporal corpora with multiple authors. The novelty of the DAP model lies in its representation of authors by a persona --- where personas capture the propensity to write about certain topics over time. Further, we present a regularized variational inference algorithm, which we use to encourage the DAP model's personas to be distinct. Our results show significant improvements over competing topic models --- particularly after regularization, and highlight the DAP model's unique ability to capture common journeys shared by different authors.


Alternating Estimation for Structured High-Dimensional Multi-Response Models

Neural Information Processing Systems

We consider the problem of learning high-dimensional multi-response linear models with structured parameters. By exploiting the noise correlations among different responses, we propose an alternating estimation (AltEst) procedure to estimate the model parameters based on the generalized Dantzig selector (GDS). Under suitable sample size and resampling assumptions, we show that the error of the estimates generated by AltEst, with high probability, converges linearly to certain minimum achievable level, which can be tersely expressed by a few geometric measures, such as Gaussian width of sets related to the parameter structure. To the best of our knowledge, this is the first non-asymptotic statistical guarantee for such AltEst-type algorithm applied to estimation with general structures.


Theory-guided Data Science: A New Paradigm for Scientific Discovery from Data

arXiv.org Artificial Intelligence

Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling scientific discovery. The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models. Further, by producing scientifically interpretable models, TGDS aims to advance our scientific understanding by discovering novel domain insights. Indeed, the paradigm of TGDS has started to gain prominence in a number of scientific disciplines such as turbulence modeling, material discovery, quantum chemistry, bio-medical science, bio-marker discovery, climate science, and hydrology. In this paper, we formally conceptualize the paradigm of TGDS and present a taxonomy of research themes in TGDS. We describe several approaches for integrating domain knowledge in different research themes using illustrative examples from different disciplines. We also highlight some of the promising avenues of novel research for realizing the full potential of theory-guided data science.


Sparse Linear Isotonic Models

arXiv.org Machine Learning

In machine learning and data mining, linear models have been widely used to model the response as parametric linear functions of the predictors. To relax such stringent assumptions made by parametric linear models, additive models consider the response to be a summation of unknown transformations applied on the predictors; in particular, additive isotonic models (AIMs) assume the unknown transformations to be monotone. In this paper, we introduce sparse linear isotonic models (SLIMs) for highdimensional problems by hybridizing ideas in parametric sparse linear models and AIMs, which enjoy a few appealing advantages over both. In the high-dimensional setting, a two-step algorithm is proposed for estimating the sparse parameters as well as the monotone functions over predictors. Under mild statistical assumptions, we show that the algorithm can accurately estimate the parameters. Promising preliminary experiments are presented to support the theoretical results.