A Label model and illustrations
A.1 Majority Voting
Majority Voting (MV) is the most intuitive algorithm for aggregating LFs' annotations. We omit this case for simplicity.
A.3 Snorkel MeTaL
For the parameters µ of Snorkel MeTaL [31], by Bayes' theorem we have: p_µ(y = c, λ = m) = p_µ(λ = m | y = c) p(y = c) =
Consider a label model g(L(x), x) ∈ F in an arbitrary functional class F, e.g., neural networks, that has an additional dependency on the data feature x. We can still approximate such a complicated function with an identity function-based label model g_W(x)(L(x)), similar to the aforementioned one except that W(x): X → R^(M×(C+1)×C) is a similarly complicated function, e.g., a neural network, that maps each data point x ∈ X to a unique label model parameter W(x). We leave the exploration of more complicated forms of label models to future work.
B.1 Case 1: identity function
We define the loss with reweighted sample as, Instead of employing the decomposed loss function, we introduce a more general influence estimation method, weight-moving influence, which gets rid of the loss decomposition and approximation and is agnostic to the selection of the σ() function.
Understanding Programmatic Weak Supervision via Source-aware Influence Function
Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels, which are in turn used to train an end model. With its increasing popularity, it is critical to have tools for users to understand the influence of each component (e.g., the source vote or training data) in the pipeline and to interpret the end model's behavior. To achieve this, we build on the Influence Function (IF) and propose source-aware IF, which leverages the generation process of the probabilistic labels to decompose the end model's training objective and then calculates the influence associated with each (data, source, class) tuple. These primitive influence scores can then be used to estimate the influence of individual components of PWS, such as source votes, supervision sources, and training data. On datasets of diverse domains, we demonstrate multiple use cases: (1) interpreting incorrect predictions from multiple angles that reveal insights for debugging the PWS pipeline, (2) identifying mislabeling of sources with a gain of 9%-37% over baselines, and (3) improving the end model's generalization performance by removing harmful components in the training objective (13%-24% better than ordinary IF).
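The aggregation step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the simplest label model, a soft variant of majority voting in which each example's probabilistic label is the normalized count of non-abstaining source votes. The `ABSTAIN = -1` convention, the function name `majority_vote_probs`, and the toy `label_matrix` are hypothetical choices made for illustration and are not taken from the paper.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: a source that casts no vote


def majority_vote_probs(votes, num_classes):
    """Turn one example's source votes into a probabilistic label.

    Each class probability is its share of the non-abstaining votes.
    If every source abstains, fall back to a uniform distribution.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    total = sum(counts.values())
    if total == 0:
        return [1.0 / num_classes] * num_classes
    return [counts.get(c, 0) / total for c in range(num_classes)]


# Toy label matrix: rows are examples, columns are weak supervision sources.
label_matrix = [
    [0, 0, 1],                    # two votes for class 0, one for class 1
    [1, ABSTAIN, 1],              # one source abstains
    [ABSTAIN, ABSTAIN, ABSTAIN],  # no source votes
]

soft_labels = [majority_vote_probs(row, num_classes=2) for row in label_matrix]
for row, probs in zip(label_matrix, soft_labels):
    print(row, "->", probs)
```

In this sketch each entry of `label_matrix` is one source vote, and `soft_labels` plays the role of the probabilistic training labels that the end model would be trained on.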
Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization
Zhang, Yaoyu, Zhang, Leyang, Zhang, Zhongwang, Bai, Zhiwei
Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term "local linear recovery" (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense of LLR, we prove that functions expressible by narrower DNNs are guaranteed to be recoverable from fewer samples than model parameters. Specifically, we establish upper limits on the optimistic sample sizes, defined as the smallest sample size necessary to guarantee LLR, for functions in the space of a given DNN. Furthermore, we prove that these upper bounds are achieved in the case of two-layer tanh neural networks. Our research lays a solid groundwork for future investigations into the recovery capabilities of DNNs in overparameterized scenarios.
HeMPPCAT: Mixtures of Probabilistic Principal Component Analysers for Data with Heteroscedastic Noise
Xu, Alec S., Balzano, Laura, Fessler, Jeffrey A.
Mixtures of probabilistic principal component analysis (MPPCA) is a well-known mixture model extension of principal component analysis (PCA). Similar to PCA, MPPCA assumes the data samples in each mixture contain homoscedastic noise. However, datasets with heterogeneous noise across samples are becoming increasingly common, as larger datasets are generated by collecting samples from several sources with varying noise profiles. The performance of MPPCA is suboptimal for data with heteroscedastic noise across samples. This paper proposes a heteroscedastic mixtures of probabilistic PCA technique (HeMPPCAT) that uses a generalized expectation-maximization (GEM) algorithm to jointly estimate the unknown underlying factors, means, and noise variances under a heteroscedastic noise setting. Simulation results illustrate the improved factor estimates and clustering accuracies of HeMPPCAT compared to MPPCA.
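To make the heteroscedastic setting concrete, the following sketch simulates two groups of samples that share one underlying factor but come from sources with different noise levels. It only generates data and is not an implementation of HeMPPCAT or its GEM algorithm. The one-factor model, the function names `sample_group` and `noise_variance`, and all numeric values are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def sample_group(n, factor, mean, noise_std):
    """Draw n samples x = factor * z + mean + noise, with z ~ N(0, 1).

    Every sample in the group shares the same noise standard deviation.
    """
    samples = []
    for _ in range(n):
        z = random.gauss(0.0, 1.0)
        x = [f * z + m + random.gauss(0.0, noise_std)
             for f, m in zip(factor, mean)]
        samples.append(x)
    return samples


def noise_variance(samples):
    """Estimate the noise variance from the second coordinate.

    The factor below has a zero second entry, so that coordinate
    contains only the mean plus noise.
    """
    return statistics.pvariance([x[1] for x in samples])


factor = [1.0, 0.0]  # assumed factor loading, shared by both groups
mean = [0.0, 0.0]

clean = sample_group(2000, factor, mean, noise_std=0.1)  # low-noise source
noisy = sample_group(2000, factor, mean, noise_std=1.0)  # high-noise source

print("estimated noise variance, clean source:", noise_variance(clean))
print("estimated noise variance, noisy source:", noise_variance(noisy))
```

A homoscedastic model would fit a single noise variance to the pooled data, whereas the two groups here clearly call for separate estimates, which is the situation the abstract describes.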