

Sequoia: Scalable and Robust Speculative Decoding

Neural Information Processing Systems

As the usage of large language models (LLMs) grows, it becomes increasingly important to serve them quickly and efficiently. While speculative decoding has recently emerged as a promising direction for accelerating LLM serving, existing methods are limited in their ability to scale to larger speculation budgets and adapt to different hyperparameters. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. To improve scalability, Sequoia introduces a dynamic programming algorithm to find an optimal tree structure for the speculated tokens. To achieve robust speculative decoding, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU by up to $4.04\times$, $3.73\times$, and $2.27\times$. To serve Llama3-70B-Instruct on a single L40 GPU through offloading, Sequoia reduces the per-token decoding latency to 0.60 s/token, $9.5\times$ faster than DeepSpeed-Zero-Inference.
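The verification step the abstract refers to builds on the standard speculative-sampling accept/reject rule. The sketch below shows that baseline rule for a single drafted token over a toy vocabulary; it is not Sequoia's tree-specific verification algorithm, and all names here are illustrative.

```python
import random

def speculative_verify(p, q, drafted, rng=random.random):
    """Accept or reject a token proposed by a draft model.

    p, q: target and draft distributions over a toy vocabulary (lists of floats).
    drafted: index of the token proposed by the draft model.
    Returns (accepted, token): the drafted token if accepted, otherwise a
    token resampled from the residual distribution max(p - q, 0), normalized.
    """
    accept_prob = min(1.0, p[drafted] / q[drafted])
    if rng() < accept_prob:
        return True, drafted
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    # If the draft exactly matches the target, the residual is all zeros;
    # fall back to sampling from the target distribution itself.
    weights = residual if total > 0 else p
    return False, random.choices(range(len(p)), weights=weights)[0]
```

This rule guarantees the output token is distributed exactly according to the target model `p`, which is why speculative decoding is lossless.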




Reinforcement learning with combinatorial actions for coupled restless bandits

Xu, Lily, Wilder, Bryan, Khalil, Elias B., Tambe, Milind

arXiv.org Artificial Intelligence

Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision-making. We consider coupled restless bandits, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. Our approach significantly outperforms existing methods--which cannot address sequential planning and combinatorial selection simultaneously--by an average of 24.8% on these difficult instances. Reinforcement learning (RL) has made tremendous progress in recent years to solve a wide range of practical problems (Treloar et al., 2020; Marot et al., 2021; Silvestro et al., 2022; Degrave et al., 2022). While successful at dealing with large or infinite state spaces, RL struggles with discrete, combinatorial action spaces. This limitation is pertinent to many real-world sequential decision-making problems, where resource constraints frequently lead to combinatorial action spaces (Dulac-Arnold et al., 2020). Consider, for example, a sequential resource allocation problem in which public health workers are dispatched to visit patients.
The workers each have a limited daily budget to maximize patient well-being. These requirements give rise to an exponentially large combinatorial action space to optimize over, even when the number of workers and patients is small.
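To make the combinatorial-action bottleneck concrete, the sketch below scores every budget-feasible subset of arms with a Q-value oracle by brute force. This is only a tractable stand-in for the paper's mixed-integer-program solve (which scales to exponentially large action spaces); the function and parameter names are illustrative, not the paper's API.

```python
from itertools import combinations

def best_combinatorial_action(q_value, n_arms, budget):
    """Exhaustively score every budget-feasible subset of arms and return
    the best one. Tractable only for tiny instances; SEQUOIA instead embeds
    the Q-network in a mixed-integer program to search this space.

    q_value: callable mapping a tuple of selected arm indices to a float
             (e.g. a trained Q-network evaluated on the joint action).
    """
    best_action, best_q = (), float("-inf")
    for k in range(budget + 1):
        for subset in combinations(range(n_arms), k):
            q = q_value(subset)
            if q > best_q:
                best_action, best_q = subset, q
    return best_action, best_q
```

Even with 3 arms and a budget of 2 there are 7 feasible actions; with 50 arms and a budget of 10 there are over 10 billion, which is why a direct optimization formulation is needed.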


Google Maps changed the way we get around. It all began in a spare bedroom in Sydney

The Guardian

Stephen Ma has every right to claim bragging rights for helping to hatch the world's most popular online mapping platform. Instead, for the past two decades Ma, one of the four co-founders of Google Maps, has buried himself in a big black hole of anonymity. But not because of any shame or regret – it's just that he isn't one to blow his own trumpet. "I tend to be a very private person," Ma says in a rare interview. "I find the limelight uncomfortable."


DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure

Xiong, Yunfan, Zhang, Ruoyu, Li, Yanzeng, Wu, Tianhao, Zou, Lei

arXiv.org Artificial Intelligence

While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods usually organize predicted tokens as independent chains or fixed token trees, which fail to generalize to diverse query distributions. In this paper, we propose DySpec, a faster speculative decoding algorithm with a novel dynamic token tree structure. We begin by bridging the draft distribution and acceptance rate from intuitive and empirical clues, and successfully show that the two variables are strongly correlated. Based on this, we employ a greedy strategy to dynamically expand the token tree at runtime. Theoretically, we show that our method can achieve optimal results under mild assumptions. Empirically, DySpec yields a higher acceptance rate and speedup than fixed trees. DySpec can drastically improve the throughput and reduce the latency of token generation across various data distributions and model sizes, significantly outperforming strong competitors, including Specinfer and Sequoia. Under a low-temperature setting, DySpec can improve the throughput by up to 9.1$\times$ and reduce the latency by up to 9.4$\times$ on Llama2-70B. Under a high-temperature setting, DySpec can still improve the throughput by up to 6.21$\times$, despite the increasing difficulty of speculating more than one token per step for the draft model.
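The greedy expansion strategy described above can be sketched with a priority queue: repeatedly expand the frontier node whose path probability under the draft model (the proxy for acceptance rate) is highest. This is a minimal sketch under assumed interfaces, not DySpec's actual implementation; `draft_children` is a hypothetical draft-model hook.

```python
import heapq

def expand_token_tree(draft_children, root, budget):
    """Greedily grow a speculation tree of at most `budget` nodes.

    draft_children(token): hypothetical hook returning (child_token, prob)
    pairs from the draft model's next-token distribution.
    Returns a list of (path, token, path_prob) for every node added.
    """
    # Max-heap keyed on negated path probability; a counter breaks ties
    # so heapq never compares tokens directly.
    heap = [(-1.0, 0, root, ())]
    counter = 1
    tree = []
    while heap and len(tree) < budget:
        neg_p, _, token, path = heapq.heappop(heap)
        path_prob = -neg_p
        tree.append((path, token, path_prob))
        for child, prob in draft_children(token):
            heapq.heappush(heap, (-(path_prob * prob), counter, child, path + (token,)))
            counter += 1
    return tree
```

Because expansion is driven by run-time draft probabilities rather than a fixed template, a confident draft yields a deep chain while an uncertain one yields a wide tree.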


LinkedIn Has Answers to Questions You've Never Had

Slate

"What does a teacher do?" "What does a barber do?" "What are recent developments in Swiftonomics?" I pondered these questions only after LinkedIn prompted me to do so. Suddenly, I found myself contemplating the very essence of my own reality. How did I learn what I know? How does my hair go from long to short every five weeks?


All the Top New Features Coming to MacOS Sequoia

WIRED

Apple has officially unveiled the latest version of its operating system for Mac. This time around, Apple stuck to its "California places" naming convention and went with macOS Sequoia. Also known as macOS 15, the new OS packs a ton of new capabilities onto the desktop, including a password management app, video conferencing tools, and updates to Safari, as well as all the features that come with Apple Intelligence--the company's new artificial intelligence–powered system. Below, we break down all these new features that will become available in macOS Sequoia when it ships this fall. Be sure to also check out our iOS 18 and iPadOS 18 feature roundup for all the new features coming to your iPhone and iPad, and our look at what's new in watchOS 11.


Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Chen, Zhuoming, May, Avner, Svirschevski, Ruslan, Huang, Yuhsun, Ryabinin, Max, Jia, Zhihao, Chen, Beidi

arXiv.org Artificial Intelligence

As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets, and to adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to $4.04\times$, $3.73\times$, and $2.27\times$. In the offloading setting on an L40, Sequoia achieves as low as 0.56 s/token for exact Llama2-70B inference, which is $9.96\times$ faster than our optimized offloading system (5.6 s/token), $9.7\times$ faster than DeepSpeed-Zero-Inference, and $19.5\times$ faster than Huggingface Accelerate.
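The hardware-aware tree optimizer can be pictured as a search over candidate (size, depth) configurations, scoring each by expected accepted tokens per unit of measured verification time. The sketch below is a simple grid search under assumed profiling hooks; `hardware_cost` and `expected_accepted` are illustrative stand-ins for Sequoia's profiler and acceptance model, not its actual interfaces.

```python
def choose_tree(hardware_cost, expected_accepted, sizes, depths):
    """Pick the (size, depth) pair maximizing speculated tokens per second.

    hardware_cost(size): measured wall-clock cost of verifying a tree of
    `size` tokens on the target hardware (assumed profiling hook).
    expected_accepted(size, depth): predicted accepted tokens per step.
    """
    best, best_rate = None, 0.0
    for size in sizes:
        for depth in depths:
            rate = expected_accepted(size, depth) / hardware_cost(size)
            if rate > best_rate:
                best, best_rate = (size, depth), rate
    return best, best_rate
```

The key design point is that the same tree shape is not optimal everywhere: on hardware where verification cost grows slowly with tree size, larger trees pay off, while on memory-bound setups (such as offloading) the optimizer favors smaller, deeper trees.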


Amazon says its new AI-powered robots reduce fulfilment time by 25 percent

Engadget

Amazon is integrating a new robotics system into its warehouses to improve delivery times, safety and general operations. The AI-powered technology, known as Sequoia, could improve the speed of finding and storing products by up to 75 percent and order fulfillment by up to 25 percent, the Wall Street Journal reports. The system was already introduced in one of Amazon's Houston-based warehouses. Sequoia involves vehicles transporting totes of products to a sorting machine. It uses robotic arms and computer vision to identify the inventory before sending it to employees for delivery.


Sequoia: A Software Framework to Unify Continual Learning Research

Normandin, Fabrice, Golemo, Florian, Ostapenko, Oleksiy, Rodriguez, Pau, Riemer, Matthew D, Hurtado, Julio, Khetarpal, Khimya, Lindeborg, Ryan, Cecchi, Lucas, Lesort, Timothée, Charlin, Laurent, Rish, Irina, Caccia, Massimo

arXiv.org Artificial Intelligence

The field of Continual Learning (CL) seeks to develop algorithms that accumulate knowledge and skills over time through interaction with non-stationary environments. In practice, a plethora of evaluation procedures (settings) and algorithmic solutions (methods) exist, each with their own potentially disjoint set of assumptions. This variety makes measuring progress in CL difficult. We propose a taxonomy of settings, where each setting is described as a set of assumptions. A tree-shaped hierarchy emerges from this view, where more general settings become the parents of those with more restrictive assumptions. This makes it possible to use inheritance to share and reuse research, as developing a method for a given setting also makes it directly applicable to any of its children. We instantiate this idea as a publicly available software framework called Sequoia, which features a wide variety of settings from both the Continual Supervised Learning (CSL) and Continual Reinforcement Learning (CRL) domains. Sequoia also includes a growing suite of methods which are easy to extend and customize, in addition to more specialized methods from external libraries. We hope that this new paradigm and its first implementation can help unify and accelerate research in CL. You can help us grow the tree by visiting www.github.com/lebrice/Sequoia.
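The inheritance idea maps naturally onto a class hierarchy: each subclass adds assumptions, and a setting's full assumption set is the union along its ancestry, so a method written against a parent class applies to every child. The sketch below illustrates the pattern only; the class names loosely mirror the taxonomy and are not Sequoia's actual API.

```python
class Setting:
    """Root of the settings taxonomy. Each subclass adds assumptions,
    so a method developed for a parent applies to all of its children.
    """
    assumptions: set = set()

    @classmethod
    def all_assumptions(cls):
        # Union of assumptions declared along the inheritance chain.
        out = set()
        for klass in cls.__mro__:
            out |= getattr(klass, "assumptions", set())
        return out

class ContinualSupervisedLearning(Setting):
    assumptions = {"supervised labels"}

class TaskIncremental(ContinualSupervisedLearning):
    assumptions = {"task identity at test time"}
```

Under this scheme, a method benchmarked on `ContinualSupervisedLearning` can be run unchanged on `TaskIncremental`, since the child only restricts, never relaxes, the parent's assumptions.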