Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Li, Xiangyu, Yin, Chengyu, Wang, Weijun, Wei, Jianyu, Cao, Ting, Liu, Yunxin

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. With lookup table (LUT)-based inference, CPUs can run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To address this issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens and performs a single $1 \rightarrow N$ lookup per index. To realize it efficiently, we further introduce (1) a Vector LUT-Centric Tensor Layout and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\times$. Our implementation is integrated into llama.cpp. The code is available at https://github.com/Cipherxzc/vlut.cpp.
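
A minimal numpy sketch of the two lookup paradigms the abstract contrasts, assuming a toy setting where quantized weight groups index per-token tables of partial dot products; the shapes, group count, and table size are illustrative assumptions, not the paper's actual SIMD kernel.

```python
import numpy as np

# Illustrative sketch only: contrast the scalar LUT paradigm with the vector
# LUT paradigm for N parallel tokens. Weights are assumed to be packed into
# 4-bit indices; each index selects a precomputed partial dot product from a
# per-token lookup table.

N = 4            # parallel tokens
TABLE = 16       # 2^4 entries for 4-bit packed weight groups
G = 1024         # number of weight groups contributing to one output element

rng = np.random.default_rng(0)
indices = rng.integers(0, TABLE, size=G)                     # quantized weight indices (shared by all tokens)
luts = rng.standard_normal((N, TABLE)).astype(np.float32)    # per-token lookup tables

def scalar_lut(indices, luts):
    # Scalar paradigm: each token replays the same index stream, issuing one
    # small, non-contiguous 1 -> 1 lookup per (token, index) pair.
    out = np.zeros(N)
    for t in range(N):
        for idx in indices:
            out[t] += luts[t, idx]
    return out

def vector_lut(indices, luts):
    # Vector paradigm: build one unified LUT whose entries hold the values for
    # all N tokens, so each index triggers a single 1 -> N lookup.
    unified = np.ascontiguousarray(luts.T)   # shape (TABLE, N): one contiguous row per index
    out = np.zeros(N)
    for idx in indices:
        out += unified[idx]
    return out

assert np.allclose(scalar_lut(indices, luts), vector_lut(indices, luts))
```

In the unified table the N values for a given index are contiguous, which is the kind of access pattern the paper's tensor layout and streamed lookup techniques appear designed to exploit.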


Search Is Not Retrieval: Decoupling Semantic Matching from Contextual Assembly in RAG

Nainwani, Harshit, Baban, Hediyeh

arXiv.org Artificial Intelligence

Retrieval systems are essential to contemporary AI pipelines, yet most conflate two separate processes: finding relevant information and assembling enough context for reasoning. We introduce the Search-Is-Not-Retrieve (SINR) framework, a dual-layer architecture that distinguishes between fine-grained search representations and coarse-grained retrieval contexts. SINR enhances the composability, scalability, and context fidelity of retrieval systems by directly connecting small, semantically accurate search chunks to larger, contextually complete retrieve chunks, all without incurring extra processing costs. This design turns retrieval from a passive step into an active one, making the system architecture closer to how people process information. We discuss the SINR framework's conceptual foundation, formal structure, implementation issues, and qualitative outcomes, providing a practical foundation for the next generation of AI systems that use retrieval.
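
A brief sketch of the dual-layer idea as described above, assuming hypothetical SearchChunk/RetrieveChunk types and a toy bag-of-words embedding; a real deployment would use a dense encoder and a vector index, but the search-then-assemble split is the same.

```python
# Small "search chunks" are matched semantically, but their larger parent
# "retrieve chunks" are what gets assembled into the LLM context.
from dataclasses import dataclass
import math

@dataclass
class SearchChunk:
    text: str          # small, semantically focused span used for matching
    parent_id: int     # link to the coarse-grained retrieve chunk

@dataclass
class RetrieveChunk:
    text: str          # larger, contextually complete unit returned to the LLM

def embed(text: str) -> dict:
    # Toy bag-of-words "embedding"; stands in for a dense encoder.
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sinr_retrieve(query, search_chunks, retrieve_chunks, k=2):
    q = embed(query)
    ranked = sorted(search_chunks, key=lambda c: cosine(q, embed(c.text)), reverse=True)
    # Search happens on fine-grained chunks; retrieval returns their parents,
    # deduplicated so each context block appears at most once.
    seen, context = set(), []
    for chunk in ranked[: k * 2]:
        if chunk.parent_id not in seen:
            seen.add(chunk.parent_id)
            context.append(retrieve_chunks[chunk.parent_id])
        if len(context) == k:
            break
    return context
```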



Learned LSM-trees: Two Approaches Using Learned Bloom Filters

Fidalgo, Nicholas, Ye, Puyuan

arXiv.org Artificial Intelligence

Modern key-value stores rely heavily on Log-Structured Merge (LSM) trees for write optimization, but this design introduces significant read amplification. Auxiliary structures like Bloom filters help, but impose memory costs that scale with tree depth and dataset size. Recent advances in learned data structures suggest that machine learning models can augment or replace these components, trading handcrafted heuristics for data-adaptive behavior. In this work, we explore two approaches for integrating learned predictions into the LSM-tree lookup path. The first uses a classifier to selectively bypass Bloom filter probes for irrelevant levels, aiming to reduce average-case query latency. The second replaces traditional Bloom filters with compact learned models and small backup filters, targeting memory footprint reduction without compromising correctness. We implement both methods atop a Monkey-style LSM-tree with leveled compaction, per-level Bloom filters, and realistic workloads. Our experiments show that the classifier reduces GET latency by up to 2.28x by skipping over 30% of Bloom filter checks with high precision, though it incurs a modest false-negative rate. The learned Bloom filter design achieves zero false negatives and retains baseline latency while cutting memory usage per level by 70-80%. Together, these designs illustrate complementary trade-offs between latency, memory, and correctness, and highlight the potential of learned index components in write-optimized storage systems.
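
A hedged sketch of the second design (compact learned model plus small backup Bloom filter), assuming an illustrative stand-in score() model and toy filter parameters; the point is that any true key the model rejects is inserted into the backup filter, which is how zero false negatives are preserved.

```python
import hashlib

class Bloom:
    """Plain Bloom filter used as the small backup structure."""
    def __init__(self, m_bits: int, k: int):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8 + 1)
    def _hashes(self, key: str):
        for i in range(self.k):
            h = int(hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).hexdigest(), 16)
            yield h % self.m
    def add(self, key: str):
        for h in self._hashes(key):
            self.bits[h // 8] |= 1 << (h % 8)
    def might_contain(self, key: str) -> bool:
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(key))

def score(key: str) -> float:
    # Stand-in for a compact learned model scoring membership likelihood.
    return 1.0 if key[-1].isdigit() else 0.0

class LearnedBloom:
    def __init__(self, keys, threshold=0.5, backup_bits=1 << 12):
        self.threshold = threshold
        self.backup = Bloom(backup_bits, k=4)
        for key in keys:
            if score(key) < threshold:   # the model would miss this true key,
                self.backup.add(key)     # so the backup filter must cover it
    def might_contain(self, key: str) -> bool:
        # Zero false negatives: either the model or the backup filter says yes.
        return score(key) >= self.threshold or self.backup.might_contain(key)
```

The memory savings reported above come from the backup filter only having to cover the model's false negatives rather than every key in the level.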


Learning to Communicate in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence

Contractor, Faizan, Li, Li, Mallah, Ranwa Al

arXiv.org Artificial Intelligence

Popular methods in cooperative Multi-Agent Reinforcement Learning with partially observable environments typically allow agents to act independently during execution, which may limit the coordinated effect of the trained policies. However, by sharing information such as known or suspected ongoing threats, effective communication can lead to improved decision-making in the cyber battle space. We propose a game design where defender agents learn to communicate and defend against imminent cyber threats by playing training games in the Cyber Operations Research Gym, using the Differentiable Inter-Agent Learning (DIAL) algorithm adapted to the cyber operational environment. The tactical policies learned by these autonomous agents are akin to those of human experts during incident response to avert cyber threats. In addition, the agents simultaneously learn minimal-cost communication messages while learning their defence tactical policies.
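
A minimal sketch, assuming PyTorch, of the DIAL-style agent structure the abstract describes: each defender network produces Q-values over defence actions plus a message for teammates, with a discretize/regularize unit keeping the channel differentiable during training. Layer sizes, noise level, and the toy shapes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DefenderAgent(nn.Module):
    def __init__(self, obs_dim=32, msg_dim=4, n_actions=10, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)    # Q-values over defence actions
        self.msg_head = nn.Linear(hidden, msg_dim)    # outgoing message logits

    def forward(self, obs, incoming_msg):
        h = self.body(torch.cat([obs, incoming_msg], dim=-1))
        return self.q_head(h), self.msg_head(h)

def dru(msg_logits, training: bool, noise_std: float = 2.0):
    # Discretize/Regularize Unit: noisy sigmoid while training (gradients flow
    # between agents through the channel), hard binary message at execution.
    if training:
        return torch.sigmoid(msg_logits + noise_std * torch.randn_like(msg_logits))
    return (msg_logits > 0).float()

# One communication round between two defenders (toy shapes).
obs = torch.randn(2, 32)                     # observations for agents A and B
msg = torch.zeros(2, 4)                      # no messages received yet
agent = DefenderAgent()
q_values, msg_logits = agent(obs, msg)
next_msgs = dru(msg_logits, training=True)   # passed to teammates at the next step
```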


DCN^2: Interplay of Implicit Collision Weights and Explicit Cross Layers for Large-Scale Recommendation

Škrlj, Blaž, Karni, Yonatan, Gašperšič, Grega, Mramor, Blaž, Stolin, Yulia, Jakomin, Martin, Urbančič, Jasna, Dishi, Yuval, Silberstein, Natalia, Friedler, Ophir, Klein, Assaf

arXiv.org Artificial Intelligence

The Deep and Cross architecture (DCNv2) is a robust production baseline and is integral to numerous real-life recommender systems. Its inherent efficiency and ability to model interactions often result in models that are both simpler and highly competitive compared to more computationally demanding alternatives, such as Deep FFMs. In this work, we introduce three significant algorithmic improvements to the DCNv2 architecture, detailing their formulation and behavior at scale. The enhanced architecture, which we refer to as DCN^2, is actively used in a live recommender system, processing over 0.5 billion predictions per second across diverse use cases, where it outperformed DCNv2 both offline and online (A/B tests). These improvements address key limitations observed in DCNv2: they mitigate information loss in Cross layers, manage collisions implicitly through learnable lookup-level weights, and model pairwise similarities explicitly with a custom layer that emulates FFMs' behavior. The superior performance of DCN^2 is also demonstrated on four publicly available benchmark datasets.
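
For context, a short PyTorch sketch of the standard DCNv2 cross layer ($x_{l+1} = x_0 \odot (W x_l + b) + x_l$) that DCN^2 builds on; the DCN^2-specific additions (learnable collision weights, the FFM-style pairwise layer) are not reproduced here, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=True)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # Element-wise gating by x0 builds explicit feature crosses of growing
        # order, while the residual term preserves the previous layer's output.
        return x0 * self.linear(xl) + xl

# Stack a few cross layers over a concatenated embedding vector.
x0 = torch.randn(8, 64)       # batch of 8, 64-dim concatenated embeddings
x = x0
for layer in [CrossLayerV2(64) for _ in range(3)]:
    x = layer(x0, x)
```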


NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

Neural Information Processing Systems

Large Language Model (LLM) inference on Central Processing Units (CPUs) is challenging due to the vast quantities of Multiply-Add (MAD) matrix operations in the attention computations. CPUs, however, support fast batched lookups through their SIMD registers. We leverage this unique capability to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention achieves the computation of attention scores using repeated fast accesses to SIMD registers. NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Extensive empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well and speeds up the 4-bit quantized LLaMA-7B-based model by up to $2\times$ at 16k context length.
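
A conceptual numpy sketch of trading multiply-adds for lookups in attention scoring, assuming a product-quantization-style codebook over key sub-vectors; the dimensions and codebook sizes are illustrative assumptions, and the actual method keeps its tables in SIMD registers rather than in ordinary arrays.

```python
import numpy as np

# Keys are quantized per subspace into small codebooks. For each query, a small
# table of partial dot products is precomputed once; scoring any cached key then
# becomes a handful of table lookups and additions instead of a full MAD pass.

d, n_sub, codes = 64, 8, 16          # head dim, subspaces, centroids per subspace
sub = d // n_sub
rng = np.random.default_rng(0)

keys = rng.standard_normal((128, d)).astype(np.float32)          # cached keys
codebooks = rng.standard_normal((n_sub, codes, sub)).astype(np.float32)

# Offline: assign each key sub-vector to its nearest centroid (its code).
key_codes = np.empty((128, n_sub), dtype=np.int64)
for s in range(n_sub):
    dists = ((keys[:, s*sub:(s+1)*sub, None] - codebooks[s].T[None]) ** 2).sum(axis=1)
    key_codes[:, s] = dists.argmin(axis=1)

# Online, per query: one small MAD pass builds the lookup tables...
query = rng.standard_normal(d).astype(np.float32)
luts = np.stack([codebooks[s] @ query[s*sub:(s+1)*sub] for s in range(n_sub)])  # (n_sub, codes)

# ...then every attention score is just n_sub lookups and adds per key.
scores = luts[np.arange(n_sub), key_codes].sum(axis=1)    # approximate q . k for all keys

exact = keys @ query
assert scores.shape == exact.shape    # scores approximate the exact dot products
```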