Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping
Marina Danilova
In this paper, we propose a new accelerated stochastic first-order method called clipped-SSTM for smooth convex stochastic optimization with heavy-tailed noise in the stochastic gradients, and we derive the first high-probability complexity bounds for this method, closing a gap in the theory of stochastic optimization with heavy-tailed noise. Our method is based on a special variant of accelerated Stochastic Gradient Descent (SGD) combined with clipping of the stochastic gradients. We extend the method to the strongly convex case and prove new complexity bounds that outperform the state-of-the-art results in this setting. Finally, we extend our proof technique and derive the first non-trivial high-probability complexity bounds for SGD with clipping without a light-tails assumption on the noise.
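To make the clipping operation concrete, here is a minimal sketch of one SGD step with clipped stochastic gradients in Python; the `stochastic_grad` oracle, step size `gamma`, and clipping level `lam` are illustrative placeholders, not the paper's clipped-SSTM parameter schedule.

```python
import numpy as np

def clip(g, lam):
    """Clipping operator: rescale g so its norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

def clipped_sgd_step(x, stochastic_grad, gamma, lam):
    """One SGD step with a clipped stochastic gradient.

    x               -- current iterate (np.ndarray)
    stochastic_grad -- oracle returning a (possibly heavy-tailed) gradient sample
    gamma           -- step size
    lam             -- clipping level
    """
    g = stochastic_grad(x)
    return x - gamma * clip(g, lam)
```

Clipping caps the influence of any single heavy-tailed gradient sample, which is what makes high-probability guarantees attainable without a light-tails assumption.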
Figure 1: (a) Some of the features extracted by LFOICA.
We would like to thank all reviewers for the constructive comments. The citation indices below are consistent with those in the main paper. To R#1: MoG and sparsity have been studied. We will further discuss the suggested references in the "practical consideration" subsection. With additional constraints on A, the permutation and scaling ambiguity can be further reduced.
Learning with Fitzpatrick Losses
Fenchel-Young losses are a family of loss functions, encompassing the squared, logistic and sparsemax losses, among others. They are convex w.r.t. the model output and the target, separately. Each Fenchel-Young loss is implicitly associated with a link function, that maps model outputs to predictions. For instance, the logistic loss is associated with the soft argmax link function. Can we build new loss functions associated with the same link function as Fenchel-Young losses?
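As a concrete instance of the construction described above, here is a minimal sketch of the multiclass logistic Fenchel-Young loss and its soft argmax link, using the standard form L(theta, y) = Omega*(theta) + Omega(y) - <theta, y> with Omega the negative Shannon entropy; the function names are ours.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def fy_logistic_loss(theta, y):
    """Multiclass logistic loss written in Fenchel-Young form.

    theta -- model scores (logits), shape (k,)
    y     -- target distribution (e.g. one-hot), shape (k,)

    L(theta, y) = Omega*(theta) + Omega(y) - <theta, y>, where Omega is the
    negative Shannon entropy, so Omega* is log-sum-exp.  For a one-hot y this
    reduces to the usual cross-entropy: logsumexp(theta) - theta[class].
    """
    omega_y = np.sum(y[y > 0] * np.log(y[y > 0]))  # negative entropy of y
    return logsumexp(theta) + omega_y - theta @ y

def link(theta):
    """The associated link function: gradient of Omega*, i.e. the soft argmax."""
    return softmax(theta)
```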
Explaining Datasets in Words: Statistical Models with Natural Language Parameters
To make sense of massive data, we often first fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional and hard to interpret. To make model parameters directly interpretable, we introduce a family of statistical models--including clustering, time series, and classification models--parameterized by natural language predicates. For example, a cluster of text about COVID could be parameterized by the predicate "discusses COVID". To learn these statistical models effectively, we develop a model-agnostic algorithm that optimizes continuous relaxations of predicate parameters with gradient descent and discretizes them by prompting language models (LMs). Finally, we apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other, clustering math problems based on subareas, and explaining visual features in memorable images. Our framework is highly versatile, applicable to both textual and visual domains, and can be easily steered to focus on specific properties.
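A rough sketch of the two-phase recipe, with a plain soft-clustering loop standing in for the paper's gradient-based continuous relaxation, and a hypothetical `prompt_lm` callable standing in for the LM discretization step:

```python
import numpy as np

def fit_predicate_clusters(embeddings, texts, k, prompt_lm, iters=50):
    """Soft-cluster texts, then ask an LM to name each cluster with a predicate.

    embeddings -- (n, d) array of (ideally unit-norm) text embeddings
    texts      -- list of n raw texts
    k          -- number of clusters
    prompt_lm  -- hypothetical callable: list of example texts -> predicate string
    """
    rng = np.random.default_rng(0)
    centers = embeddings[rng.choice(len(texts), k, replace=False)]

    for _ in range(iters):
        sims = embeddings @ centers.T                        # (n, k) similarities
        soft = np.exp(5.0 * (sims - sims.max(axis=1, keepdims=True)))
        soft /= soft.sum(axis=1, keepdims=True)              # relaxed (soft) assignments
        centers = (soft.T @ embeddings) / soft.sum(axis=0)[:, None]

    predicates = []
    for j in range(k):
        top = np.argsort(embeddings @ centers[j])[-5:]       # most representative texts
        predicates.append(prompt_lm([texts[i] for i in top]))
    return predicates
```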
Assessing Social and Intersectional Biases in Contextualized Word Representations
Social bias in machine learning has drawn significant attention, with work ranging from demonstrations of bias in a multitude of applications, to curated definitions of fairness for different contexts, to algorithms that mitigate bias. In natural language processing, gender bias has been shown to exist in context-free word embeddings. Recently, contextual word representations have outperformed word embeddings in several downstream NLP tasks. These word representations are conditioned on their context within a sentence, and can also be used to encode the entire sentence. In this paper, we analyze the extent to which state-of-the-art models for contextual word representations, such as BERT and GPT-2, encode biases with respect to gender, race, and intersectional identities. To this end, we propose assessing bias at the contextual word level. This novel approach captures the contextual effects of bias missing in context-free word embeddings, yet avoids confounding effects that underestimate bias at the sentence encoding level. We demonstrate evidence of bias at the corpus level, find varying evidence of bias in embedding association tests, show in particular that racial bias is strongly encoded in contextual word models, and observe that bias effects for intersectional minorities are exacerbated beyond their constituent minority identities. Further, evaluating bias effects at the contextual word level captures biases that are not captured at the sentence level, confirming the need for our novel approach.
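For reference, the embedding association tests mentioned above compute a Cohen's-d-style effect size between two target word sets and two attribute word sets; a minimal sketch follows, where applying it at the contextual word level (our reading of the setup) means the inputs are contextual token embeddings from a model such as BERT.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    """WEAT-style effect size between target sets X, Y and attribute sets A, B.

    Each argument is a list of embedding vectors.  Returns a Cohen's-d-style
    statistic; larger |d| means a stronger differential association.
    """
    def s(w):  # association of one word with A versus B
        return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

    sx = [s(x) for x in X]
    sy = [s(y) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)
```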
Reviewer 1: Thank you for the detailed, thought-provoking comments, which will help us improve the final work. We will include a test associating M/F names and occupation words in the final version. Indeed, the method uses the contextual representation, so there is no pooling involved. Reviewer 2: Thank you for the incisive comments, which help us improve our discussion and interpretation of results. Indeed, our word lists were constructed in prior work (Caliskan et al. [6] and May et al. [23]); for the extension of the Heilman double-bind tests to race, we kept the word lists consistent with this prior work. Some prior work [23] also found negative effect sizes for BERT and GPT (for sentence encodings).
Dendritic Integration Inspired Artificial Neural Networks Capture Data Correlation
Douglas Zhou
Incorporating biological neuronal properties into Artificial Neural Networks (ANNs) to enhance computational capabilities is under active investigation in the field of deep learning. Inspired by recent findings indicating that dendrites adhere to a quadratic integration rule for synaptic inputs, this study explores the computational benefits of quadratic neurons. We theoretically demonstrate that quadratic neurons inherently capture correlation within structured data, a feature that grants them superior generalization abilities over traditional neurons; this is substantiated by few-shot learning experiments. Furthermore, we integrate the quadratic rule into Convolutional Neural Networks (CNNs) in a biologically plausible way, resulting in innovative architectures--Dendritic integration inspired CNNs (Dit-CNNs). Our Dit-CNNs compete favorably with state-of-the-art models across multiple classification benchmarks, e.g., ImageNet-1K, while retaining the simplicity and efficiency of traditional CNNs.
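A minimal sketch of what a quadratic neuron can look like, using the common multiplicative form y = (W1 x) * (W2 x) + W3 x; this PyTorch layer is illustrative and not necessarily the exact Dit-CNN block.

```python
import torch
import torch.nn as nn

class QuadraticNeuron(nn.Module):
    """Quadratic neuron layer: y = (W1 x) * (W2 x) + W3 x + b.

    The multiplicative term makes the output depend on products of input
    pairs, so a single layer can respond to correlation structure that a
    purely linear neuron cannot.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w1 = nn.Linear(in_features, out_features, bias=False)
        self.w2 = nn.Linear(in_features, out_features, bias=False)
        self.w3 = nn.Linear(in_features, out_features)  # carries the bias term

    def forward(self, x):
        return self.w1(x) * self.w2(x) + self.w3(x)
```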
Humanoid Locomotion as Next Token Prediction
We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor sequences. To account for the multimodal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, such as videos without actions. We train our model on a dataset of sequences from a prior neural network policy, a model-based controller, motion capture, and YouTube videos of humans. We show that our model enables a real humanoid robot to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor sequences.
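One simple way to realize "leverage data with missing modalities" is to mask the autoregressive loss at positions whose modality is absent; the sketch below assumes flattened (T, V) logits and a boolean presence mask, both our assumptions about tensor layout rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_next_token_loss(logits, targets, present):
    """Autoregressive loss that skips tokens of missing modalities.

    logits  -- (T, V) next-token predictions from the causal transformer
    targets -- (T,) ids of the interleaved sensorimotor tokens
    present -- (T,) bool; False marks slots whose modality is missing,
               e.g. action tokens for YouTube clips that have no actions
    """
    loss = F.cross_entropy(logits, targets, reduction="none")  # (T,)
    present = present.float()
    return (loss * present).sum() / present.sum().clamp(min=1.0)
```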
We thank the reviewers for their thoughtful feedback. Below, we provide clarifications to their concerns. Reviewer R1. (Sufficient conditions for learnability) As noted on Line 72, these online-learning-with-dynamics instances comprise the broad class for which our learnability characterization is tight. We will elaborate further in our revised draft. For example, consider linear dynamical systems parameterized by matrices (A, B).
Chirality Nets for Human Pose Regression
Raymond Yeh, Yuan-Ting Hu, Alexander Schwing
We propose Chirality Nets, a family of deep nets that is equivariant to the "chirality transform," i.e., the transformation that creates a chiral pair. Through parameter sharing and odd and even symmetry, we propose and prove variants of standard deep net building blocks that satisfy the equivariance property, including fully connected layers, convolutional layers, batch normalization, and LSTM/GRU cells. The proposed layers lead to a more data-efficient representation and a reduction in computation by exploiting symmetry. We evaluate chirality nets on the task of human pose regression, which naturally exhibits the left/right mirroring of the human body. We study three pose regression tasks: 3D pose estimation from video, 2D pose forecasting, and skeleton-based activity recognition. Our approach achieves or matches state-of-the-art results, with larger gains on small datasets and in limited-data settings.
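To illustrate the parameter-sharing idea, here is a minimal chirality-equivariant fully connected layer for the simplified case where the chirality transform just swaps the left and right halves of the feature vector (the full transform also negates mirrored coordinates); the class name and layout are ours.

```python
import torch
import torch.nn as nn

class ChiralLinear(nn.Module):
    """Fully connected layer equivariant to swapping left/right feature halves.

    The weight has block structure [[A, B], [B, A]], which guarantees
    layer(swap(x)) == swap(layer(x)).
    """

    def __init__(self, half_in, half_out):
        super().__init__()
        self.A = nn.Parameter(torch.randn(half_out, half_in) * 0.01)
        self.B = nn.Parameter(torch.randn(half_out, half_in) * 0.01)

    def forward(self, x):                     # x: (..., 2 * half_in)
        xl, xr = x.chunk(2, dim=-1)
        yl = xl @ self.A.T + xr @ self.B.T    # left-half output
        yr = xl @ self.B.T + xr @ self.A.T    # right-half output
        return torch.cat([yl, yr], dim=-1)
```

Because A and B are shared across the two halves, the layer uses half the free parameters of an unconstrained linear layer of the same size, consistent with the data-efficiency claim above.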