Sequential Memory with Temporal Predictive Coding Supplementary Materials

Neural Information Processing Systems

In Algorithm 1 we present the memorizing and recalling procedures of the single-layer tPC (Algorithm 1: Memorizing and recalling with single-layer tPC). Here we present the proof of Property 1 in the main text, that the single-layer tPC can be viewed as a "whitened" version of the AHN: when applied to the data sequence, it whitens the data (Eq. 16 in the main text). These observations are consistent with our numerical results shown in Figure 1: MCAHN has a much larger MSE than the tPC because of its entirely wrong recalls. In Figure 1 we also present the online recall results of the models on MovingMNIST, CIFAR10, and UCF101. In Figure 4 we show a natural example of aliased sequences, in which a movie of a human doing push-ups is memorized and recalled by the model.
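The memorize-then-recall procedure described above can be illustrated with a minimal single-layer predictive-coding memory: learn a weight matrix whose product with the current state predicts the next state, then replay the sequence from a cue. This is only a sketch under a plain error-driven outer-product update; the function names and hyperparameters are illustrative, not taken from Algorithm 1 itself.

```python
import numpy as np

def memorize(seq, lr=0.1, epochs=100):
    """Learn W so that W @ x_t predicts x_{t+1}, via repeated
    error-driven (predictive-coding style) outer-product updates."""
    d = seq[0].shape[0]
    W = np.zeros((d, d))
    for _ in range(epochs):
        for t in range(len(seq) - 1):
            err = seq[t + 1] - W @ seq[t]     # temporal prediction error
            W += lr * np.outer(err, seq[t])   # update W to reduce the error
    return W

def recall(W, cue, steps):
    """Replay the sequence from an initial cue by iterating the prediction."""
    states = [np.asarray(cue, dtype=float)]
    for _ in range(steps):
        states.append(W @ states[-1])
    return states
```

On a short sequence of orthogonal (one-hot) patterns this recovers each successive item almost exactly; aliased sequences, where the same pattern recurs in different contexts, are precisely where a single-layer model of this kind struggles.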



Length-MAX Tokenizer for Language Models

Dong, Dong, Su, Weijie

arXiv.org Artificial Intelligence

We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14--18\% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0\% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5\%, 17.2\%, and 18.5\% fewer steps, respectively, to reach a fixed validation loss, and 13.7\%, 12.7\%, and 13.7\% lower inference latency, together with a 16\% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7\% and enhancing HellaSwag accuracy by 4.3\%. Moreover, the Length-MAX tokenizer achieves 99.62\% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12\% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18\% at inference.
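The core idea, fewer and longer tokens per character, can be illustrated with a greedy longest-match encoder over a fixed vocabulary. This sketch only shows the encoding side; the paper's actual contribution is how the vocabulary itself is chosen (the length-weighted graph-partitioning objective), which is not reproduced here.

```python
def tokenize_longest_match(text, vocab):
    """Greedy longest-match encoding: at each position, consume the longest
    vocabulary entry, so fewer tokens cover the same number of characters.
    Single characters are always accepted as a fallback."""
    max_len = max(map(len, vocab))
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in vocab:
                tokens.append(piece)
                i += l
                break
    return tokens
```

For example, with a toy vocabulary `{"token", "izer", "to", "ke"}`, the string `"tokenizer"` encodes as two tokens rather than the four or more a frequency-only merge order might produce.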



Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Oldfield, James, Torr, Philip, Patras, Ioannis, Bibi, Adel, Barez, Fazl

arXiv.org Artificial Intelligence

Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term by term. At test time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts.
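The term-by-term cascade can be sketched as a score accumulated over polynomial orders with an early exit once the running score is confidently far from the decision boundary. The feature map below (elementwise powers of the activation vector) and the margin rule are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def tpc_score(x, weights, biases, margin=1.0):
    """Evaluate polynomial terms order by order; stop as soon as the
    running score is outside the ambiguity margin (a clear case)."""
    s = 0.0
    for k, (w, b) in enumerate(zip(weights, biases), start=1):
        s += w @ (x ** k) + b          # add the order-k term
        if abs(s) > margin:            # confident: skip higher-order checks
            return s, k
    return s, len(weights)             # ambiguous input used every term
```

The returned order `k` is what makes monitoring cost adaptive: easy inputs pay for one term, ambiguous ones pay for the full polynomial.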



LithOS: An Operating System for Efficient Machine Learning on GPUs

Coppock, Patrick H., Zhang, Brian, Solomon, Eliot H., Kypriotis, Vasilis, Yang, Leon, Sharma, Bikash, Schatzberg, Dan, Mowry, Todd C., Skarlatos, Dimitrios

arXiv.org Artificial Intelligence

The surging demand for GPUs in datacenters for machine learning (ML) has made efficient GPU utilization crucial. However, meeting the diverse needs of ML models while optimizing resource usage is challenging. To enable transparent, fine-grained GPU management that maximizes utilization and energy efficiency while maintaining strong isolation, an operating system (OS) approach is needed. This paper introduces LithOS, a first step toward a GPU OS. LithOS includes the following new abstractions and mechanisms for efficient GPU resource management: (i) a novel TPC Scheduler that supports spatial scheduling at the granularity of individual TPCs, unlocking efficient TPC stealing between workloads; (ii) transparent kernel atomization to reduce head-of-line blocking and enable dynamic resource reallocation mid-execution; (iii) a lightweight hardware right-sizing mechanism that determines the minimal TPC resources needed per atom; and (iv) a transparent power management mechanism that reduces power consumption based on in-flight work behavior. We implement LithOS in Rust and evaluate its performance across extensive ML environments, comparing it to state-of-the-art solutions from NVIDIA and prior research. For inference stacking, LithOS reduces tail latencies by 13x compared to MPS; compared to the best SotA, it reduces tail latencies by 3x while improving aggregate throughput by 1.6x. In hybrid inference-training stacking, LithOS reduces tail latencies by 4.7x compared to MPS; compared to the best SotA, it reduces tail latencies by 1.18x while improving aggregate throughput by 1.35x. Finally, for a modest performance hit under 4%, LithOS's right-sizing saves a quarter of GPU capacity on average, while for a 7% hit, its power management saves a quarter of a GPU's energy. Overall, LithOS increases GPU efficiency, establishing a foundation for future OS research on GPUs.
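The idea behind TPC stealing in (i) can be illustrated with a toy allocator: each workload reserves a share of TPCs, and a backlogged workload may borrow TPCs that a peer has reserved but currently leaves idle. This is only an accounting sketch (LithOS is written in Rust and its real scheduler also atomizes kernels and can reclaim stolen TPCs mid-execution); the class and method names are invented for illustration.

```python
class TpcScheduler:
    """Toy spatial scheduler at TPC granularity with stealing."""

    def __init__(self, quotas):
        self.quotas = dict(quotas)             # workload -> reserved TPCs
        self.running = {w: 0 for w in quotas}  # TPCs currently held
        self.total = sum(quotas.values())      # physical TPC count

    def grant(self, w, want):
        """Grant up to `want` TPCs: own headroom first, then peers' idle TPCs,
        never exceeding the physically free TPCs."""
        own = max(0, self.quotas[w] - self.running[w])
        idle = sum(max(0, self.quotas[p] - self.running[p])
                   for p in self.quotas if p != w)
        free = self.total - sum(self.running.values())
        got = min(want, own + idle, free)
        self.running[w] += got
        return got

    def finish(self, w, n):
        """Return n TPCs when the workload's kernels complete."""
        self.running[w] -= n
```

In this sketch a quota owner whose TPCs were stolen simply waits for them to be returned; preempting the thief, as a real scheduler would, is the part that requires kernel atomization.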


TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction

Wang, Chao, Fu, Weiwei, Zhou, Yang

arXiv.org Artificial Intelligence

Vision-language models (VLMs) have achieved remarkable advancements, capitalizing on the impressive capabilities of large language models (LLMs) across diverse tasks. Despite this, a critical challenge known as hallucination occurs when models overconfidently describe objects or attributes absent from the image, a problem exacerbated by the tendency of VLMs to rely on linguistic priors. This limitation reduces model reliability in high-stakes applications. In this work, we observe that enhancing the continuity and consistency of logits across timesteps improves generation, and we introduce a straightforward and efficient method, Cross-Temporal Prediction Connection (TPC), designed to enhance the semantic consistency of logits by connecting them temporally across timesteps. TPC amplifies information flow and improves coherence, effectively reducing hallucination. Extensive experiments show that TPC surpasses existing representative methods, delivering superior performance in both accuracy and efficiency while maintaining robustness in open-ended text generation tasks.
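The temporal connection of logits can be pictured, in its simplest form, as blending each decoding step's logits with the connected logits of the previous step. The exponential-moving-average rule and the `alpha` parameter below are illustrative assumptions; the paper's exact connection rule may differ.

```python
import numpy as np

def connect_logits(logit_seq, alpha=0.3):
    """Blend each step's logits with the previous step's connected logits,
    smoothing semantic drift across timesteps (toy EMA-style connection)."""
    connected = [np.asarray(logit_seq[0], dtype=float)]
    for z in logit_seq[1:]:
        connected.append((1 - alpha) * np.asarray(z, dtype=float)
                         + alpha * connected[-1])
    return connected
```

The smoothed logits would then replace the raw logits before sampling, so that a token contradicting the immediately preceding context needs stronger evidence to be emitted.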