AITopics | Yao, Zhewei

Collaborating Authors

Yao, Zhewei

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ReFoRCE: A Text-to-SQL Agent with Self-Refinement, Format Restriction, and Column Exploration

Deng, Minghang, Ramachandran, Ashwin, Xu, Canwen, Hu, Lanxiang, Yao, Zhewei, Datta, Anupam, Zhang, Hao

arXiv.org Artificial IntelligenceFeb-14-2025

Text-to-SQL systems have unlocked easier access to critical data insights by enabling natural language queries over structured databases. However, deploying such systems in enterprise environments remains challenging due to factors such as large, complex schemas (> 3000 columns), diverse SQL dialects (e.g., BigQuery, Snowflake) and sophisticated query requirements (e.g., transformation, analytics). Current state-of-the-art performance on the Spider 2.0 dataset -- a benchmark built to mimic such complex environments -- remains limited at 20%. Key limitations include inadequate instruction-following, poor long-context comprehension, weak self-refinement, and insufficient dialect-specific knowledge. To address these gaps, we propose ReFoRCE (Self-Refinement Agent with Format Restriction and Column Exploration) which introduces (1) table compression to mitigate longcontext limitations (2) format restriction to ensure accurate answer format, and (3) iterative column exploration for enhanced schema understanding. Additionally, it employs self-refinement pipeline consisting of (1) parallelized workflows with voting mechanisms and (2) a Common Table Expression (CTE) based refinement approach to handle unresolved cases. ReFoRCE achieves state-of-the-art results scoring 31.26 on the Spider 2.0-Snow and scoring 30.35 on the Spider 2.0-Lite tasks. Text-to-SQL converts natural language queries into SQL queries, serving as a key technology for lowering the barrier to accessing relational databases (Zelle & Mooney, 1996; Zettlemoyer & Collins, 2012; Zhong et al., 2017; Yu et al., 2018; Wang et al., 2019; Gao et al., 2023a; Lei et al., 2024).

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.00675

Country:

Asia (0.28)
North America > United States (0.28)

Genre:

Workflow (0.69)
Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Agentic Verification for Ambiguous Query Disambiguation

Lee, Youngwon, Hwang, Seung-won, Wu, Ruofan, Yan, Feng, Xu, Danmei, Akkad, Moutasem, Yao, Zhewei, He, Yuxiong

arXiv.org Artificial IntelligenceFeb-14-2025

In this work, we tackle the challenge of disambiguating queries in retrieval-augmented generation (RAG) to diverse yet answerable interpretations. State-of-the-arts follow a Diversify-then-Verify (DtV) pipeline, where diverse interpretations are generated by an LLM, later used as search queries to retrieve supporting passages. Such a process may introduce noise in either interpretations or retrieval, particularly in enterprise settings, where LLMs -- trained on static data -- may struggle with domain-specific disambiguations. Thus, a post-hoc verification phase is introduced to prune noises. Our distinction is to unify diversification with verification by incorporating feedback from retriever and generator early on. This joint approach improves both efficiency and robustness by reducing reliance on multiple retrieval and inference steps, which are susceptible to cascading errors. We validate the efficiency and effectiveness of our method, Verified-Diversification with Consolidation (VERDICT), on the widely adopted ASQA benchmark to achieve diverse yet verifiable interpretations. Empirical results show that VERDICT improves grounding-aware F1 score by an average of 23% over the strongest baseline across different backbone LLMs.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2502.10352

Country:

Asia (0.68)
North America > United States (0.28)
North America > Mexico (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation

Lee, Youngwon, Hwang, Seung-won, Campos, Daniel, Graliński, Filip, Yao, Zhewei, He, Yuxiong

arXiv.org Artificial IntelligenceDec-19-2024

With the adoption of retrieval-augmented generation (RAG), large language models (LLMs) are expected to ground their generation to the retrieved contexts. Yet, this is hindered by position bias of LLMs, failing to evenly attend to all contexts. Previous work has addressed this by synthesizing contexts with perturbed positions of gold segment, creating a position-diversified train set. We extend this intuition to propose consistency regularization with augmentation and distillation. First, we augment each training instance with its position perturbation to encourage consistent predictions, regardless of ordering. We also distill behaviors of this pair, although it can be counterproductive in certain RAG scenarios where the given order from the retriever is crucial for generation quality. We thus propose CORD, balancing COnsistency and Rank Distillation. CORD adaptively samples noise-controlled perturbations from an interpolation space, ensuring both consistency and respect for the rank prior. Empirical results show this balance enables CORD to outperform consistently in diverse RAG benchmarks.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2412.14581

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

Add feedback

Inference Scaling for Bridging Retrieval and Augmented Generation

Lee, Youngwon, Hwang, Seung-won, Campos, Daniel, Graliński, Filip, Yao, Zhewei, He, Yuxiong

arXiv.org Artificial IntelligenceDec-14-2024

Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work observed the generator bias, such that improving the retrieval results may negatively affect the outcome. In this work, we show such bias can be mitigated, from inference scaling, aggregating inference calls from the permuted order of retrieved contexts. The proposed Mixture-of-Intervention (MOI) explicitly models the debiased utility of each passage with multiple forward passes to construct a new ranking. We also show that MOI can leverage the retriever's prior knowledge to reduce the computational cost by minimizing the number of permutations considered and lowering the cost per LLM call. We showcase the effectiveness of MOI on diverse RAG tasks, improving ROUGE-L on MS MARCO and EM on HotpotQA benchmarks by ~7 points.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.10684

Country:

Asia (0.93)
North America > United States (0.46)
Europe > Austria (0.28)
North America > Mexico (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Add feedback

SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

Qiao, Aurick, Yao, Zhewei, Rajbhandari, Samyam, He, Yuxiong

arXiv.org Artificial IntelligenceDec-5-2024

LLM inference for popular enterprise use cases, such as summarization, RAG, and code-generation, typically observes orders of magnitude longer prompt lengths than generation lengths. This characteristic leads to high cost of prefill and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving high quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers' KV cache using a much earlier layer's output, allowing prompt tokens to skip much of the model computation, ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch size for higher throughput, and iii) a knowledge-preserving distillation procedure that can adapt existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirement. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimum quality degradation across a wide range of tasks. In the end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2 higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4 H100 GPUs. While it is clear that LLMs can add value to these applications, the cost and speed of inference determine their practicality. Therefore, improving the aggregate throughput and reducing latency of LLM inference has become an increasingly important topic of interest, with various efforts (Sec. In this paper, we take a unique approach to improving LLM inference for enterprise applications based on the key observation that typical enterprise workloads process many more input tokens than output tokens. For example, tasks like code completion, text-to-SQL, summarization and RAG each submit long prompts but produce a small number of generated tokens, and a majority of enterprise LLM use cases in Snowflake incur a 10:1 ratio between prompt and generated tokens. After distillation, the KV cache of layers 5-8 can all be populated using the hidden state outputs of layer 4. For prefill tokens, the query, attention, and MLP operations of layers 5-8 may be skipped altogether, while decode tokens complete all layers. Existing models may be efficiently adapted for SwiftKV by distilling from the original unmodified model using a small dataset.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2410.0396

Country: North America > United States > California (0.14)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AI and Memory Wall

Gholami, Amir, Yao, Zhewei, Kim, Sehoon, Hooper, Coleman, Mahoney, Michael W., Keutzer, Kurt

arXiv.org Artificial IntelligenceMar-21-2024

The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2403.14123

Country: North America > United States > California (0.14)

Genre: Research Report (0.82)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding

Zhang, Zhenyu, Chen, Runjin, Liu, Shiwei, Yao, Zhewei, Ruwase, Olatunji, Chen, Beidi, Wu, Xiaoxia, Wang, Zhangyang

arXiv.org Artificial IntelligenceMar-4-2024

This paper aims to overcome the "lost-in-the-middle" challenge of large language models (LLMs). While recent advancements have successfully enabled LLMs to perform stable language modeling with up to 4 million tokens, the persistent difficulty faced by most LLMs in identifying relevant information situated in the middle of the context has not been adequately tackled. To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE) which is a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle the relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead. Ms-PoE leverages the position indice rescaling to relieve the long-term decay effect introduced by RoPE, while meticulously assigning distinct scaling ratios to different attention heads to preserve essential knowledge learned during the pre-training step, forming a multi-scale context fusion from short to long distance. Extensive experiments with a wide range of LLMs demonstrate the efficacy of our approach. Notably, Ms-PoE achieves an average accuracy gain of up to 3.8 on the Zero-SCROLLS benchmark over the original LLMs. Code are available at https://github.com/VITA-Group/Ms-PoE.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2403.04797

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Xia, Haojun, Zheng, Zhen, Wu, Xiaoxia, Chen, Shiyang, Yao, Zhewei, Youn, Stephen, Bakhtiari, Arash, Wyatt, Michael, Zhuang, Donglin, Zhou, Zhongzhu, Ruwase, Olatunji, He, Yuxiong, Song, Shuaiwen Leon

arXiv.org Artificial IntelligenceJan-25-2024

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendly memory access of model weights with irregular bit-width and (2) high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support of float-point weights for various quantization bit-width. We integrate TC-FPx kernel into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved. Experiments show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. The source code will be publicly available soon.

large language model, machine learning, quantization, (21 more...)

arXiv.org Artificial Intelligence

2401.14112

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks

Wu, Xiaoxia, Xia, Haojun, Youn, Stephen, Zheng, Zhen, Chen, Shiyang, Bakhtiari, Arash, Wyatt, Michael, Aminabadi, Reza Yazdani, He, Yuxiong, Ruwase, Olatunji, Song, Leon, Yao, Zhewei

arXiv.org Machine LearningDec-18-2023

This study examines 4-bit quantization methods like GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited enhancement in Zero-Shot tasks. While prior works merely focusing on zero-shot measurement, we extend task scope to more generative categories such as code generation and abstractive summarization, in which we found that INT4 quantization can significantly underperform. However, simply shifting to higher precision formats like FP6 has been particularly challenging, thus overlooked, due to poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with the FP6 quantization, \codestar-15B model performs comparably to its FP16 counterpart in code generation, and for smaller models like the 406M it closely matches their baselines in summarization. Neither can be achieved by INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization. With our design, FP6 can become a promising solution to the current 4-bit quantization methods used in LLMs.

machine learning, natural language, quantization, (16 more...)

arXiv.org Machine Learning

2312.08583

Country:

Oceania > Australia (0.14)
North America > United States (0.14)

Genre:

Research Report > New Finding (0.86)
Research Report > Promising Solution (0.66)

Industry: Information Technology > Hardware (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

Yao, Zhewei, Wu, Xiaoxia, Li, Conglong, Zhang, Minjia, Qin, Heyang, Ruwase, Olatunji, Awan, Ammar Ahmad, Rajbhandari, Samyam, He, Yuxiong

arXiv.org Artificial IntelligenceNov-29-2023

Most of the existing multi-modal models, hindered by their incapacity to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in resource allocation for training and data accessibility, impacting their adaptability and scalability across varied interaction realms. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability up to 70B parameter language model size, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2309.14327

Country: North America > Canada (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)

Add feedback