Accelerating Transformers with Spectrum-Preserving Token Merging
Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVA), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top-k most similar tokens. However, these methods have significant drawbacks, such as sensitivity to the token-splitting strategy and damage to informative tokens in later layers.
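To make the BSM step concrete, the sketch below gives a minimal, hypothetical PyTorch version of the bipartite matching-and-merging idea; the alternating split, the cosine-similarity score, and the averaging rule are illustrative assumptions rather than any specific paper's implementation.

```python
import torch

def bipartite_soft_matching(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Sketch of Bipartite Soft Matching (BSM) token merging.

    tokens: (N, D) token embeddings from one Transformer layer.
    r:      number of tokens to merge away (output has N - r tokens).
    """
    # Split tokens into two disjoint sets by alternating indices.
    a, b = tokens[0::2], tokens[1::2]

    # Cosine similarity between every token in A and every token in B.
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.t()                      # shape (|A|, |B|)

    # Each A token proposes its most similar partner in B.
    best_val, best_idx = scores.max(dim=-1)
    order = best_val.argsort(descending=True)
    merge_src, keep_src = order[:r], order[r:]  # r most similar pairs merge

    # Average each merged A token into its B partner (simple mean here).
    acc = b.clone()
    cnt = torch.ones(b.size(0), 1, device=tokens.device)
    acc.index_add_(0, best_idx[merge_src], a[merge_src])
    cnt.index_add_(0, best_idx[merge_src], torch.ones(r, 1, device=tokens.device))
    merged_b = acc / cnt

    # Surviving A tokens plus (possibly merged) B tokens form the new sequence.
    return torch.cat([a[keep_src], merged_b], dim=0)
```

For example, calling `bipartite_soft_matching(torch.randn(16, 64), r=4)` shortens a 16-token sequence to 12 tokens; the sensitivity to how tokens are split into the two sets is exactly the drawback mentioned above.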
A Gradient Sampling Method With Complexity Guarantees for Lipschitz Functions in High and Low Dimensions
Prior work introduced a novel modification of Goldstein's classical subgradient method. That work, however, relies on a nonstandard subgradient oracle model and requires the function to be directionally differentiable. Our first contribution in this paper is to show that both of these assumptions can be dropped by simply adding a small random perturbation in each step of the algorithm. The resulting method works on any Lipschitz function whose value and gradient can be evaluated at points of differentiability.
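The perturbation idea can be illustrated in isolation. The sketch below is a deliberately simplified NumPy step, not the full Goldstein-style descent procedure with its complexity guarantees; its only purpose is to show how a small random offset moves the gradient query to a point where a Lipschitz function is differentiable almost surely (by Rademacher's theorem). The step size, perturbation radius, and test function are illustrative assumptions.

```python
import numpy as np

def perturbed_gradient_step(f_grad, x, step, delta, rng):
    """One illustrative step of a perturbed (sub)gradient method.

    The point being illustrated: a small random perturbation lets us query
    the gradient at a location where a Lipschitz function is differentiable
    (true almost everywhere), removing the need for a special oracle.
    """
    # Sample a perturbation uniformly from a ball of radius delta.
    u = rng.standard_normal(x.shape)
    u *= delta * rng.uniform() ** (1 / x.size) / np.linalg.norm(u)

    # Query the gradient at the perturbed point, not at x itself.
    g = f_grad(x + u)

    # Plain descent step along that gradient.
    return x - step * g


# Example on the nonsmooth Lipschitz function f(x) = ||x||_1.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad_l1 = lambda x: np.sign(x)   # defined wherever no coordinate is zero
    x = np.array([1.0, -2.0])
    for _ in range(200):
        x = perturbed_gradient_step(grad_l1, x, step=0.02, delta=1e-6, rng=rng)
    print(x)  # approaches the minimizer at the origin (up to the step size)
```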
On the Parameter Identifiability of Partially Observed Linear Causal Models
Linear causal models are important tools for modeling causal dependencies, yet in practice only a subset of the variables can be observed. In this paper, we examine the parameter identifiability of these models by investigating whether the edge coefficients can be recovered given the causal structure and partially observed data. Our setting is more general than that of prior research: we allow all variables, including both observed and latent ones, to be flexibly related, and we consider the coefficients of all edges, whereas most existing works focus only on the edges between observed variables. Theoretically, we identify three types of indeterminacy for the parameters in partially observed linear causal models. We then provide graphical conditions that are sufficient for all parameters to be identifiable and show that some of them are provably necessary. Methodologically, we propose a novel likelihood-based parameter estimation method that addresses the variance indeterminacy in a specific way and can asymptotically recover the underlying parameters up to trivial indeterminacy. Empirical studies on both synthetic and real-world datasets validate our identifiability theory and the effectiveness of the proposed method in the finite-sample regime.
OctField: Hierarchical Implicit Functions for 3D Modeling (Supplemental Material)
In this supplemental material, we provide more details on the network architecture and more visualization results, including shape reconstruction and comparison, shape generation, and shape interpolation. Furthermore, we present results on scene reconstruction and a comparison with Local Implicit Grid [3] to demonstrate the advantage of our approach on large-scale data, owing to the hierarchical tree structure of the proposed OctField representation. The sections are organized as follows: Section 1 provides the details of the network architecture and training. Sections 2, 3, and 4 provide more visualization results on a number of 3D modeling tasks, including shape reconstruction, generation, and interpolation. Section 5 presents four ablation studies: with or without overlap between adjacent octants, the training strategy, the separation of latent codes, and the subdivision parameter.
A Multi-Implicit Neural Representation for Fonts
In our experiments, we train an auto-decoder-based network, an 8-layer MLP in which each hidden layer contains 384 neurons. We use the LeakyReLU activation function as the non-linearity. The latent embedding z is a 128-D vector. For better convergence, following the spirit of [4], a skip connection is built between the inputs and the third hidden layer, i.e., the inputs are concatenated to the output of the third hidden layer. Rather than following the training routine used for the reconstruction and interpolation tasks, the training strategy for the generation task is to freeze the learned latent embedding weights after 1000 epochs, so that training is more stable across glyphs of the same font family.
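A minimal PyTorch sketch of the decoder as described (8 linear layers, 384 hidden units, LeakyReLU, a 128-D latent code, and the skip connection into the third hidden layer) might look as follows; the 2-D query coordinate and the single-channel output are assumptions, not details given above.

```python
import torch
import torch.nn as nn

class FontAutoDecoder(nn.Module):
    """Sketch of the described auto-decoder MLP; the 2-D query coordinate
    and the output dimensionality are illustrative assumptions."""

    def __init__(self, latent_dim: int = 128, hidden: int = 384,
                 coord_dim: int = 2, out_dim: int = 1):
        super().__init__()
        in_dim = latent_dim + coord_dim
        self.act = nn.LeakyReLU()
        # First three hidden layers, before the skip connection.
        self.pre = nn.ModuleList([
            nn.Linear(in_dim, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden),
        ])
        # Remaining hidden layers; the third layer's output is concatenated
        # with the original inputs before entering this stack.
        self.post = nn.ModuleList([
            nn.Linear(hidden + in_dim, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden),
        ])
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, z: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        x = torch.cat([z, coords], dim=-1)
        h = x
        for layer in self.pre:
            h = self.act(layer(h))
        h = torch.cat([h, x], dim=-1)           # skip connection
        for layer in self.post:
            h = self.act(layer(h))
        return self.out(h)
```

Under this auto-decoder reading, the latent codes would live in an nn.Embedding table optimized jointly with the network, and the freezing strategy mentioned above would amount to setting requires_grad to False on that table after 1000 epochs.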
A Multi-Implicit Neural Representation for Fonts
Fonts are ubiquitous across documents and come in a variety of styles. They are either represented in a native vector format or rasterized to produce fixed-resolution images. In the first case, the non-standard representation prevents benefiting from the latest network architectures for neural representations; in the latter case, the rasterized representation, when encoded via networks, results in a loss of data fidelity, as font-specific discontinuities like edges and corners are difficult to represent using neural networks. Based on the observation that complex fonts can be represented by a superposition of a set of simpler occupancy functions, we introduce multi-implicits to represent fonts as a permutation-invariant set of learned implicit functions, without losing features (e.g., edges and corners). However, while multi-implicits locally preserve font features, obtaining supervision in the form of ground-truth multi-channel signals is a problem in itself. Instead, we show how to train such a representation with only local supervision, while the proposed neural architecture directly finds globally consistent multi-implicits for font families. We extensively evaluate the proposed representation on various tasks, including reconstruction, interpolation, and synthesis, to demonstrate clear advantages over existing alternatives. Additionally, the representation naturally enables glyph completion, wherein a single characteristic font is used to synthesize a whole font family in the target style.
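As a toy illustration of why a set of simple occupancy functions combined by a symmetric (permutation-invariant) reduction can keep corners sharp, consider the sketch below: each channel alone is a smooth (here, linear) function that a network can represent easily, yet the combination recovers an exact 90-degree corner. The choice of channels and of the min reduction is purely illustrative and not taken from the paper.

```python
import torch

def corner_occupancy(points: torch.Tensor) -> torch.Tensor:
    """Toy example: a sharp corner as a combination of two simple channels.

    points: (..., 2) query coordinates. Each channel is a half-plane
    occupancy; the order-independent reduction over channels (min here)
    yields a glyph-like region with an exactly sharp corner at the origin.
    """
    x, y = points[..., 0], points[..., 1]
    channels = torch.stack([x, y], dim=-1)   # two smooth per-channel fields
    # Permutation-invariant combination: inside iff inside every channel.
    return channels.min(dim=-1).values


pts = torch.tensor([[0.5, 0.5], [0.5, -0.1], [-0.1, 0.5]])
print(corner_occupancy(pts))   # positive (inside) only for the first point
```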
Language Without Borders: A Dataset and Benchmark for Code-Switching Lip Reading Supplementary Material
This supplement to our main paper, "Language Without Borders: A Dataset and Benchmark for Code-Switching Lip Reading," includes detailed descriptions of the dataset collection methods, a comprehensive data card, and datasheets. Additionally, we provide licensing information for the dataset, along with an author statement affirming adherence to the license. Further discussions on the societal impact are included, covering cultural context and privacy considerations. Implementation details of the methods applied to the dataset are also provided. The recording application, illustrated in Figure 3, not only makes participation easier but also ensures the integrity and uniformity of the collected data. Before recording begins, participants are briefed on the entire data collection process and all necessary precautions. This includes detailed instructions for downloading and installing our application, as well as prerequisites for successful data collection, such as securing a quiet recording environment. Participants are also instructed to keep their face fully within the video frame and directly facing the camera, and to avoid having additional faces appear in the frame. During recording, participants are advised to hold the phone with one hand while maintaining an appropriate distance from the camera, so that the video is clear and properly framed. To avoid distractions or interruptions during the recording session, participants are asked to disable notifications from apps such as WeChat that could obstruct the recording interface's prompts.
Language Without Borders: A Dataset and Benchmark for Code-Switching Lip Reading
Lip reading aims to transform videos of continuous lip movement into textual content, and has achieved significant progress over the past decade. It serves as critical and practical assistance for speech-impaired individuals, and is more practical than speech recognition in noisy environments. With the increase in interpersonal communication on social media owing to globalization, the existing monolingual datasets for lip reading may not be sufficient to serve the rapidly growing population of bilingual and even multilingual users. However, to the best of our knowledge, code-switching has only been explored in speech recognition, while it remains largely neglected in lip reading. To bridge this gap, we have collected a bilingual code-switching lip reading benchmark composed of Chinese and English, dubbed CSLR.
Delving into the Reversal Curse: How Far Can Large Language Models Generalize?
A prime example is the recently debated "reversal curse", which surfaces when models, having been trained on the fact "A is B", struggle to generalize this knowledge to infer that "B is A". In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to "B is A" when both A and B are presented in context, as in the case of a multiple-choice question.