Causal Discovery from Event Sequences by Local Cause-Effect Attribution
Sequences of events, such as crashes in the stock market or outages in a network, contain strong temporal dependencies, and understanding these dependencies is crucial for reacting to and influencing future events. In this paper, we study the problem of discovering the underlying causal structure from event sequences. To this end, we introduce a new causal model in which individual events of the cause trigger events of the effect with dynamic delays. We show that, in contrast to existing methods based on Granger causality, our model is identifiable for both instantaneous and delayed effects. We base our approach on the Algorithmic Markov Condition, by which we identify the true causal network as the one that minimizes Kolmogorov complexity. As Kolmogorov complexity is not computable, we instantiate our model using the Minimum Description Length principle and show that the resulting score identifies the causal direction.
A Optimal K-priors for GLMs
We present theoretical results to show that K-priors with limited memory can achieve low gradient-reconstruction error. We will discuss the optimal K-prior, which can in theory achieve perfect gradient reconstruction. Note that this prior is difficult to realize in practice since it requires all past training-data inputs X. Our goal here is to establish a theoretical limit, not to give practical choices. Our key idea is to choose a few input locations that provide a good representation of the training-data inputs X.
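As a concrete illustration of what gradient reconstruction means in the GLM case, here is a minimal numerical sketch for logistic regression (our own illustration, using randomly chosen memory points rather than the carefully chosen input locations discussed above; all variable names are ours): the K-prior gradient over a memory set replaces past labels by the old model's predictions, and with full memory it matches the past-data gradient up to the optimality gap of the old solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))                        # past training-data inputs
w_true = rng.normal(size=d)
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

# w_star: (approximate) minimizer of the past logistic loss
w_star = np.zeros(d)
for _ in range(5000):
    w_star -= 0.5 * X.T @ (sigmoid(X @ w_star) - y) / n

w_new = w_star + 0.3 * rng.normal(size=d)          # some candidate new weights

# Exact (mean) gradient of the past-data loss at w_new
g_true = X.T @ (sigmoid(X @ w_new) - y) / n

def kprior_grad(idx):
    """K-prior gradient using only the memory points: past labels are
    replaced by the old model's predictions at those points."""
    Xm = X[idx]
    return Xm.T @ (sigmoid(Xm @ w_new) - sigmoid(Xm @ w_star)) / len(idx)

# Full memory reconstructs the gradient up to the (near-zero) optimality gap
# of w_star; a small random memory incurs a larger reconstruction error.
for m in (n, 50, 10):
    idx = np.arange(n) if m == n else rng.choice(n, size=m, replace=False)
    err = np.linalg.norm(kprior_grad(idx) - g_true)
    print(f"memory size {m:3d}: reconstruction error {err:.4f}")
```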
Invariant and Transportable Representations for Anti-Causal Domain Shifts
Victor Veitch (Department of Computer Science and Department of Statistics, University of Chicago)
Real-world classification problems must contend with domain shift: the (potential) mismatch between the domain where a model is deployed and the domain(s) where the training data were gathered. Methods to handle such problems must specify what structure is shared across domains and what varies. A natural assumption is that causal (structural) relationships are invariant in all domains. It is then tempting to learn a predictor for the label Y that depends only on its causal parents. However, many real-world problems are "anti-causal" in the sense that Y is a cause of the covariates X; in this case, Y has no causal parents and the naive causal invariance is useless.
A The Embeddings
In this section, we briefly introduce the four kinds of embeddings that constitute the fusion embedding. The goal of the position embedding module is to calibrate the position of each time point in the sequence so that the self-attention mechanism can recognize the relative positions between different time points in the input sequence. We design the token embedding module to enrich the features of each time point by fusing features from adjacent time points within a certain interval. The role of the spatial embedding is to locate and encode the spatial locations of different nodes, so that each node possesses a unique spatial embedding for its location. This enables the model to identify nodes in different spatial and temporal planes after the dimensionality is compressed in later computations.
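As a rough sketch of how such a fusion embedding could be assembled (a PyTorch-style illustration; the module names, shapes, and the additive combination are assumptions of this sketch, not the exact design of the model), the three components described above can be combined as follows:

```python
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    """Illustrative fusion of position, token, and spatial embeddings."""
    def __init__(self, seq_len, num_nodes, in_features, d_model, kernel_size=3):
        super().__init__()
        # Position embedding: one learned vector per time step, so that
        # self-attention can recover relative positions.
        self.position = nn.Embedding(seq_len, d_model)
        # Token embedding: 1-D convolution over time, fusing each time point
        # with its neighbours within a small interval.
        self.token = nn.Conv1d(in_features, d_model, kernel_size,
                               padding=kernel_size // 2)
        # Spatial embedding: one learned vector per node location.
        self.spatial = nn.Embedding(num_nodes, d_model)

    def forward(self, x, node_idx):
        # x: (batch, seq_len, in_features), node_idx: (batch,)
        b, t, _ = x.shape
        pos = self.position(torch.arange(t, device=x.device))   # (t, d_model)
        tok = self.token(x.transpose(1, 2)).transpose(1, 2)     # (b, t, d_model)
        spa = self.spatial(node_idx).unsqueeze(1)                # (b, 1, d_model)
        return tok + pos.unsqueeze(0) + spa                      # (b, t, d_model)
```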
Supplementary Material
We provide more details of training the teacher network in Section A, more experimental results on synthetic functions in Section B, and the hyperparameter settings for benchmark datasets in Section C. Here, we omit the iteration subscript t for simplicity. To solve Eq. (10), we obtain the hypergradient and backpropagate it to the teacher's weights. As shown in Algorithm 1, we train the teacher network for one step each time it is called by an underperforming student model, where one step refers to one iteration on the synthetic functions and one epoch over the validation set on the benchmark datasets. In Section 4.1, we showed the experimental results of HPM on two popular synthetic functions, the Branin and Hartmann6D functions. In the following, we provide more details about the synthetic functions and the implementation, as well as more results on the other two functions. We used the Branin and Hartmann6D functions in Section 4.1.
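For reference, the Branin function mentioned above is the standard two-dimensional benchmark; a minimal implementation (our own, for illustration) is:

```python
import numpy as np

def branin(x1, x2):
    """Standard Branin(-Hoo) benchmark, typically evaluated on
    x1 in [-5, 10], x2 in [0, 15]; global minimum value ~ 0.397887."""
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 \
        + s * (1 - t) * np.cos(x1) + s

print(branin(np.pi, 2.275))   # one of the three global minimizers
```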
We thank all reviewers for their time and constructive comments
We thank all reviewers for their time and constructive comments. We first address concerns that were brought up by multiple reviewers. NMODE is more sample efficient than other methods (Appendix C.2, first paragraph), so for density estimation ... The quantifier for Prop. 5.1 should be "for some"; this will be fixed. Note that for small dimensions (e.g. ...) ... Riemannian metric, and are thus Riemannian.
Instability and Local Minima in GAN Training with Kernel Discriminators
Generative Adversarial Networks (GANs) are a widely used tool for generative modeling of complex data. Despite their empirical success, the training of GANs is not fully understood due to the min-max optimization of the generator and discriminator. This paper analyzes these joint dynamics when the true samples as well as the generated samples are discrete, finite sets, and the discriminator is kernel-based. A simple yet expressive framework for analyzing training, called the Isolated Points Model, is introduced. In the proposed model, the distance between true samples greatly exceeds the kernel width, so each generated point is influenced by at most one true point. Our model enables precise characterization of the conditions for convergence, both to good and to bad minima. In particular, the analysis explains two common failure modes: (i) approximate mode collapse and (ii) divergence. Numerical simulations are provided that replicate these behaviors as predicted by the analysis.
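To make the locality assumption concrete, here is a small self-contained numeric sketch (our own illustration, not the authors' model or code): with a Gaussian kernel whose width is much smaller than the spacing between true samples, the kernel-gradient pull on each generated point comes almost entirely from its nearest true point.

```python
import numpy as np

true_pts = np.array([-3.0, 0.0, 3.0])   # true samples, spaced far apart
gen_pts = np.array([-2.7, 0.3, 3.3])    # generated samples, each near one true sample
h = 0.1                                 # kernel width << spacing of the true samples

def kernel_grad(x, c):
    """d/dx of a Gaussian kernel k(x, c) = exp(-(x - c)^2 / (2 h^2))."""
    return -(x - c) / h**2 * np.exp(-(x - c) ** 2 / (2 * h**2))

# Pull exerted by each true point (columns) on each generated point (rows)
# through the gradient of a kernel-based discriminator.
pull = np.array([[kernel_grad(g, t) for t in true_pts] for g in gen_pts])
print(np.round(pull, 4))
# Each row has a single non-negligible entry: every generated point is
# effectively influenced by at most one true point, as in the Isolated Points Model.
```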
Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity
Two of the most prominent algorithms for solving unconstrained smooth games are the classical stochastic gradient descent-ascent (SGDA) and the recently introduced stochastic consensus optimization (SCO) [Mescheder et al., 2017]. SGDA is known to converge to a stationary point for specific classes of games, but current convergence analyses require a bounded-variance assumption. SCO is used successfully for solving large-scale adversarial problems, but its convergence guarantees are limited to its deterministic variant. In this work, we introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees for SGDA and SCO under this condition for solving a class of stochastic variational inequality problems that are potentially non-monotone. We prove linear convergence of both methods to a neighborhood of the solution when they use a constant step size, and we propose insightful step-size switching rules to guarantee convergence to the exact solution. In addition, our convergence guarantees hold under the arbitrary sampling paradigm, and as such, we give insights into the complexity of minibatching.
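For intuition, the following toy experiment (our own construction, not the paper's setup) illustrates the constant step-size behavior described above for SGDA on a simple strongly-convex-strongly-concave quadratic game with additive gradient noise: the iterates settle in a neighborhood of the solution whose size shrinks with the step size.

```python
import numpy as np

# Toy game: f(x, y) = 0.5*x^2 + x*y - 0.5*y^2 with unique saddle point (0, 0).
# Stochasticity enters as additive noise on the gradient estimates.
rng = np.random.default_rng(0)

def stoch_grads(x, y, sigma=1.0):
    gx = x + y + sigma * rng.normal()      # noisy estimate of df/dx
    gy = x - y + sigma * rng.normal()      # noisy estimate of df/dy
    return gx, gy

def sgda(steps, eta):
    x, y = 5.0, -5.0
    for _ in range(steps):
        gx, gy = stoch_grads(x, y)
        x, y = x - eta * gx, y + eta * gy  # descent in x, ascent in y
    return np.hypot(x, y)                  # distance to the saddle point

# A constant step size yields convergence to a noise-dominated neighborhood;
# a smaller step size shrinks that neighborhood.
for eta in (0.1, 0.01):
    dists = [sgda(5000, eta) for _ in range(20)]
    print(f"eta={eta}: mean distance to saddle ~ {np.mean(dists):.3f}")
```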