Should We Really Edit Language Models? On the Evaluation of Edited Language Models
Model editing has become an increasingly popular method for efficiently updating knowledge within language models. Current approaches primarily focus on reliability, generalization, and locality, with many methods excelling across these criteria. Some recent studies have highlighted potential pitfalls of these editing methods, such as knowledge distortion and conflicts. However, the general capabilities of post-edited language models remain largely unexplored. In this paper, we conduct a comprehensive evaluation of various editing methods across different language models and report the following findings.
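For concreteness, the three criteria named above are usually scored as simple accuracies over probe prompts: reliability on the edited fact itself, generalization on paraphrases of it, and locality on unrelated facts that should stay unchanged. The minimal sketch below illustrates that scoring; the checkpoint name, probe prompts, and greedy top-1 check are illustrative assumptions, not the paper's evaluation protocol.

# Minimal sketch of the three standard edit metrics; the checkpoint name
# and probe prompts are hypothetical stand-ins, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for an already-edited checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def top1_completion(prompt: str) -> str:
    # Greedy next-token prediction used as a crude correctness check.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return tok.decode(int(logits.argmax())).strip()

def accuracy(cases):
    # Fraction of (prompt, expected next token) pairs answered correctly.
    return sum(top1_completion(p) == t for p, t in cases) / len(cases)

# Hypothetical probes for a single edit "The capital of France is Rome".
reliability = accuracy([("The capital of France is", "Rome")])
generalization = accuracy([("France's capital city is", "Rome")])
locality = accuracy([("The capital of Germany is", "Berlin")])
print(reliability, generalization, locality)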
A FineWeb Datasheet
Dataset Details
Purpose of the dataset
We released FineWeb to make large language model training more accessible to the machine learning community at large. The dataset was curated by Hugging Face. The dataset was funded by Hugging Face. The dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0 license. The use of this dataset is also subject to Common Crawl's Terms of Use.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb.
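FineWeb is published on the Hugging Face Hub and can be streamed with the datasets library, as in the minimal sketch below. The config name "sample-10BT" is assumed from the dataset card and should be treated as an assumption; individual Common Crawl dumps are exposed as separate configs.

# Minimal sketch: stream FineWeb text from the Hugging Face Hub.
# The config name "sample-10BT" is assumed from the dataset card;
# per-dump configs can be used instead if needed.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb",
                  name="sample-10BT",
                  split="train",
                  streaming=True)

for i, doc in enumerate(fw):
    print(doc["text"][:200])  # each record carries the cleaned web text
    if i == 2:
        break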
The Inductive Bias of Quantum Kernels
It has been hypothesized that quantum computers may lend themselves well to applications in machine learning. In the present work, we analyze function classes defined via quantum kernels. Quantum computers offer the possibility to efficiently compute inner products of exponentially large density operators that are classically hard to compute.
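For concreteness, the kernels analyzed are typically of the fidelity type: a data point is encoded into a quantum state and the kernel is the inner product of the resulting density operators. The notation below is a common convention and is assumed here rather than taken from the paper.

% Fidelity-type quantum kernel (notation assumed, not the paper's):
% x is encoded by a circuit U(x) into a density operator rho(x), and the
% kernel is the Hilbert-Schmidt inner product of two such encodings.
\[
  \rho(x) = U(x)\,\lvert 0\rangle\langle 0\rvert\,U(x)^{\dagger},
  \qquad
  k(x, x') = \operatorname{Tr}\!\bigl[\rho(x)\,\rho(x')\bigr]
           = \bigl\lvert \langle 0\rvert\, U(x)^{\dagger} U(x')\,\lvert 0\rangle \bigr\rvert^{2}.
\]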
Accelerating Transformers with Spectrum-Preserving Token Merging
Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVa), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top k similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers.
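As background for the merging step these methods share, the sketch below shows a plain bipartite soft matching pass in PyTorch: tokens are split into two alternating sets, each token in one set proposes its most similar partner in the other, and the r best-matched tokens are folded into their partners by averaging. The function name and the averaging rule are illustrative assumptions; this is the baseline BSM idea, not the spectrum-preserving variant proposed here.

import torch

def bipartite_soft_match(x: torch.Tensor, r: int) -> torch.Tensor:
    # x: (N, d) token features for one sequence. Split tokens into two
    # alternating sets A (even indices) and B (odd indices), score each
    # A token against all B tokens by cosine similarity, merge the r
    # highest-scoring A tokens into their best partners, keep the rest.
    a, b = x[0::2], x[1::2]                      # bipartite split
    a_n = torch.nn.functional.normalize(a, dim=-1)
    b_n = torch.nn.functional.normalize(b, dim=-1)
    scores = a_n @ b_n.T                         # cosine similarity (|A| x |B|)
    best_val, best_idx = scores.max(dim=-1)      # best partner in B per A token
    merge_order = best_val.argsort(descending=True)
    merged_a, kept_a = merge_order[:r], merge_order[r:]

    b = b.clone()
    # Average each merged A token into its chosen partner in B
    # (duplicate partners are handled naively here for brevity).
    b[best_idx[merged_a]] = (b[best_idx[merged_a]] + a[merged_a]) / 2
    return torch.cat([a[kept_a], b], dim=0)      # N - r tokens remain

# Example: merge 16 of 197 ViT tokens with 384-dim features.
tokens = torch.randn(197, 384)
print(bipartite_soft_match(tokens, r=16).shape)  # torch.Size([181, 384])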
A Gradient Sampling Method With Complexity Guarantees for Lipschitz Functions in High and Low Dimensions
Their method is a novel modification of Goldstein's classical subgradient method. Their work, however, makes use of a nonstandard subgradient oracle model and requires the function to be directionally differentiable. Our first contribution in this paper is to show that both of these assumptions can be dropped by simply adding a small random perturbation in each step of their algorithm. The resulting method works on any Lipschitz function whose value and gradient can be evaluated at points of differentiability.
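The perturbation idea can be made concrete with a short sketch: before every gradient query, the current iterate is nudged by a small random offset so that, almost surely, the objective is differentiable at the queried point. The sketch below is a bare-bones illustration of that single ingredient, not the full Goldstein-style algorithm analyzed in the paper; the test function, step size, and perturbation scale are assumptions.

import numpy as np

def perturbed_subgradient_descent(f_grad, x0, step=1e-2, sigma=1e-6,
                                  iters=500, seed=0):
    # Subgradient descent with a small random perturbation before each
    # gradient query, so the gradient is (almost surely) evaluated at a
    # point of differentiability of the Lipschitz objective.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        y = x + sigma * rng.standard_normal(x.shape)  # random perturbation
        x = x - step * f_grad(y)                      # step along gradient at y
    return x

# Example on a nonsmooth Lipschitz function: f(x) = ||x||_1.
grad_l1 = lambda x: np.sign(x)          # gradient wherever all coords are nonzero
x_final = perturbed_subgradient_descent(grad_l1, x0=np.ones(5))
print(np.abs(x_final).max())            # should end up close to 0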