Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators
The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate the exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. Three open-source models installed on CPU-based platforms are evaluated and compared against commercially available online chatbots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, with a specific focus on the needs of individual translators and small businesses.
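As a concrete illustration of the kind of local deployment the abstract discusses, the sketch below sends a translation prompt to a locally running Ollama server. The endpoint, model name, and prompt wording are illustrative assumptions, not the setup used in the study.

```python
import json
import urllib.request

# Minimal sketch: query a locally running Ollama server (default port 11434)
# for a translation. Model name and prompt wording are illustrative assumptions.
OLLAMA_URL = "http://localhost:11434/api/generate"

def translate_locally(text: str, source: str, target: str,
                      model: str = "mistral") -> str:
    prompt = (f"Translate the following {source} text into {target}. "
              f"Return only the translation.\n\n{text}")
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

if __name__ == "__main__":
    print(translate_locally("Der Vertrag tritt am 1. Januar in Kraft.",
                            "German", "English"))
```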
- Europe > Austria > Tyrol > Innsbruck (0.04)
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (2 more...)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services (0.88)
Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology
Felde, Sabine, Buchkremer, Rüdiger, Chehab, Gamal, Thielscher, Christian, Distler, Jörg HW, Schneider, Matthias, Richter, Jutta G.
Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.
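The core recipe the abstract points to, a small local model augmented with retrieved guideline passages, can be sketched roughly as follows. The embed() helper, the guideline snippets, and the prompt format are placeholders for whatever embedding model and guideline source a real system would use, not the authors' pipeline.

```python
import numpy as np

# Rough sketch of retrieval-augmented generation with a small local model.
# embed() stands in for a real embedding model; the snippets are placeholders.
GUIDELINES = [
    "Methotrexate is a first-line csDMARD for rheumatoid arthritis.",
    "Urate-lowering therapy is indicated for recurrent gout flares.",
    "ANA testing supports, but does not confirm, an SLE diagnosis.",
]

def embed(texts):
    # Placeholder: hashed bag-of-words vectors so the sketch runs end to end.
    vecs = []
    for t in texts:
        v = np.zeros(64)
        for tok in t.lower().split():
            v[hash(tok) % 64] += 1.0
        vecs.append(v / (np.linalg.norm(v) + 1e-9))
    return np.stack(vecs)

def retrieve(question, corpus, k=2):
    sims = embed(corpus) @ embed([question])[0]
    return [corpus[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(question, snippets):
    context = "\n".join(f"- {s}" for s in snippets)
    return (f"Use the guideline excerpts below to answer.\n{context}\n\n"
            f"Question: {question}\nAnswer:")

if __name__ == "__main__":
    q = "Which drug is typically started first in rheumatoid arthritis?"
    print(build_prompt(q, retrieve(q, GUIDELINES)))
```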
- Europe > Germany > North Rhine-Westphalia > Düsseldorf Region > Düsseldorf (0.17)
- North America > United States (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
Utility-Driven Speculative Decoding for Mixture-of-Experts
Saxena, Anish, Tsai, Po-An, Taneja, Hritvik, Jaleel, Aamer, Qureshi, Moinuddin
GPU memory bandwidth is the main bottleneck for low-latency Large Language Model (LLM) inference. Speculative decoding leverages idle GPU compute by using a lightweight drafter to propose K tokens, which the LLM verifies in parallel, boosting token throughput. In conventional dense LLMs, all model weights are fetched each iteration, so speculation adds no latency overhead. Emerging Mixture of Experts (MoE) models activate only a subset of weights per token, greatly reducing data movement. However, we show that speculation is ineffective for MoEs: draft tokens collectively activate more weights, increasing data movement and verification time by 2-3x. When token throughput gains fail to offset this overhead, speculation causes slowdowns up to 1.5x, making it infeasible. Even when useful, the optimal K varies by task, model, and even between requests and iterations. Thus, despite widespread use in dense LLMs, speculation remains impractical in leading MoEs. We present Cascade, a utility-driven framework that selectively enables speculation to avoid slowdowns and dynamically tunes K to accelerate MoE serving. Cascade uses a lightweight metric, speculation utility, the ratio of token gains to verification cost, which shows iteration-level locality, enabling periodic decisions via short test and longer set phases. For each request, Cascade disables speculation if utility drops below one during testing, and when utility exceeds one, tests multiple K-values to choose the utility-maximizing K for the set phase. We implement Cascade in vLLM and evaluate it on five popular MoEs with workloads spanning code, math, extraction, and mixed tasks. Cascade limits slowdown to 5% (vs. 1.5x) and improves throughput by 7-14% over static K, making speculative decoding practical for MoEs.
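The utility test described in the abstract can be written down compactly. The sketch below is a schematic reconstruction based only on the abstract: the phase lengths, candidate K values, and the measurement hook are assumptions, not Cascade's actual implementation.

```python
from dataclasses import dataclass

# Schematic sketch of utility-driven speculation control: utility is the
# ratio of token gains to verification-cost overhead; speculation is disabled
# for a request when no candidate K pushes utility above 1.
CANDIDATE_K = [1, 2, 4, 8]   # illustrative assumption
TEST_ITERS = 4               # short "test" phase per candidate
SET_ITERS = 64               # longer "set" phase with the chosen setting

@dataclass
class IterStats:
    accepted_tokens: int   # tokens committed this iteration
    verify_cost: float     # relative cost of the verification pass
    baseline_cost: float   # cost of a plain non-speculative iteration

def utility(stats: list[IterStats]) -> float:
    gain = sum(s.accepted_tokens for s in stats)
    overhead = sum(s.verify_cost / s.baseline_cost for s in stats)
    return gain / max(overhead, 1e-9)

def choose_speculation(measure):
    """measure(k, iters) -> list[IterStats]; returns (use_speculation, best_k)."""
    best_k, best_u = None, 1.0
    for k in CANDIDATE_K:
        u = utility(measure(k, TEST_ITERS))
        if u > best_u:
            best_k, best_u = k, u
    return (best_k is not None), best_k

if __name__ == "__main__":
    import random
    def fake_measure(k, iters):
        # Toy stand-in: larger K accepts more tokens but costs more to verify.
        return [IterStats(accepted_tokens=1 + random.randint(0, k),
                          verify_cost=1.0 + 0.4 * k, baseline_cost=1.0)
                for _ in range(iters)]
    print(choose_speculation(fake_measure))
```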
Cash or Comfort? How LLMs Value Your Inconvenience
Cedro, Mateusz, Ichmoukhamedov, Timour, Goethals, Sofie, He, Yifan, Hinns, James, Martens, David
Large Language Models (LLMs) are increasingly proposed as near-autonomous artificial intelligence (AI) agents capable of making everyday decisions on behalf of humans. Although LLMs perform well on many technical tasks, their behaviour in personal decision-making remains less understood. Previous studies have assessed their rationality and moral alignment with human decisions. However, the behaviour of AI assistants in scenarios where financial rewards are at odds with user comfort has not yet been thoroughly explored. In this paper, we tackle this problem by quantifying the prices assigned by multiple LLMs to a series of user discomforts: additional walking, waiting, hunger and pain. We uncover several key concerns that strongly question the prospect of using current LLMs as decision-making assistants: (1) a large variance in responses between LLMs, (2) within a single LLM, responses show fragility to minor variations in prompt phrasing (e.g., reformulating the question in the first person can considerably alter the decision), (3) LLMs can accept unreasonably low rewards for major inconveniences (e.g., 1 Euro to wait 10 hours), and (4) LLMs can reject monetary gains where no discomfort is imposed (e.g., 1,000 Euro to wait 0 minutes). These findings emphasize the need for scrutiny of how LLMs value human inconvenience, particularly as we move toward applications where such cash-versus-comfort trade-offs are made on users' behalf.
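A minimal version of the elicitation described above could be set up as follows. The scenario wording, phrasing variants, and the ask_llm() hook are illustrative assumptions rather than the paper's actual protocol.

```python
import itertools

# Illustrative sketch of eliciting cash-vs-comfort trade-offs from an LLM.
INCONVENIENCES = ["walk an extra 30 minutes", "wait 10 hours",
                  "skip one meal", "endure mild pain for an hour"]
REWARDS_EUR = [1, 10, 100, 1000]
PHRASINGS = [
    "Should the user accept {reward} Euro to {burden}? Answer yes or no.",
    "Should I accept {reward} Euro to {burden}? Answer yes or no.",  # first person
]

def ask_llm(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion call; returns a stub here.
    return "no"

def run_grid():
    results = {}
    for burden, reward, phrasing in itertools.product(
            INCONVENIENCES, REWARDS_EUR, PHRASINGS):
        prompt = phrasing.format(reward=reward, burden=burden)
        results[(burden, reward, phrasing)] = ask_llm(prompt)
    return results

if __name__ == "__main__":
    print(len(run_grid()), "scenario responses collected")
```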
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > Belgium > Flanders > Antwerp Province > Antwerp (0.04)
Collaboration among Multiple Large Language Models for Medical Question Answering
Shang, Kexin, Chang, Chia-Hsuan, Yang, Christopher C.
Empowered by vast internal knowledge reservoirs, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, little effort has been made to harness a synergistic effect from the expertise and backgrounds of multiple LLMs. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice question dataset. Through post-hoc analysis of three pre-trained LLM participants, our framework is shown to boost the reasoning ability of all LLMs and to reduce their divergence across questions. We also measure an LLM's confidence when it is confronted with adversarial opinions from other LLMs and observe that an LLM's confidence aligns with its prediction accuracy.
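The abstract does not spell out the collaboration mechanism, so the sketch below shows only a generic way to aggregate multiple-choice answers from several LLMs and measure their divergence; it is not the authors' framework.

```python
from collections import Counter

# Generic sketch: majority aggregation of multiple-choice answers from
# several LLMs and a simple divergence measure across models.
def majority_answer(answers: dict[str, str]) -> str:
    """answers maps model name -> chosen option, e.g. {'llm_a': 'B', ...}."""
    return Counter(answers.values()).most_common(1)[0][0]

def divergence(answers: dict[str, str]) -> float:
    """Fraction of models that disagree with the majority choice."""
    winner = majority_answer(answers)
    return sum(a != winner for a in answers.values()) / len(answers)

if __name__ == "__main__":
    votes = {"llm_a": "B", "llm_b": "B", "llm_c": "D"}
    print(majority_answer(votes), divergence(votes))
```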
- North America > United States (0.68)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Health & Medicine (1.00)
- Government > Regional Government (0.46)
Bot Wars Evolved: Orchestrating Competing LLMs in a Counterstrike Against Phone Scams
Basta, Nardine, Atkins, Conor, Kaafar, Dali
We present "Bot Wars," a framework using Large Language Models (LLMs) scam-baiters to counter phone scams through simulated adversarial dialogues. Our key contribution is a formal foundation for strategy emergence through chain-of-thought reasoning without explicit optimization. Through a novel two-layer prompt architecture, our framework enables LLMs to craft demographically authentic victim personas while maintaining strategic coherence. We evaluate our approach using a dataset of 3,200 scam dialogues validated against 179 hours of human scam-baiting interactions, demonstrating its effectiveness in capturing complex adversarial dynamics. Our systematic evaluation through cognitive, quantitative, and content-specific metrics shows that GPT-4 excels in dialogue naturalness and persona authenticity, while Deepseek demonstrates superior engagement sustainability.
- Information Technology > Security & Privacy (1.00)
- Government (0.68)
ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
Ai, Mengting, Wei, Tianxin, Chen, Yifan, Zeng, Zhichen, Zhao, Ritchie, Varatkar, Girish, Rouhani, Bita Darvish, Tang, Xianfeng, Tong, Hanghang, He, Jingrui
Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models.
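The decomposition the abstract describes, one shared expert plus per-expert residuals, can be sketched as below. Using a plain element-wise mean in place of the Wasserstein barycenter and truncated SVD for the residuals are simplifying assumptions made here, not the paper's exact construction.

```python
import numpy as np

# Simplified sketch of the ResMoE idea: keep one shared "barycenter" expert
# and store only compressed residuals for each original expert.
def compress_experts(experts: list[np.ndarray], rank: int = 8):
    barycenter = np.mean(experts, axis=0)                 # shared expert (simplified)
    residuals = []
    for w in experts:
        u, s, vt = np.linalg.svd(w - barycenter, full_matrices=False)
        residuals.append((u[:, :rank] * s[:rank], vt[:rank]))  # low-rank factors
    return barycenter, residuals

def restore_expert(barycenter, residual):
    us, vt = residual
    return barycenter + us @ vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    experts = [rng.standard_normal((64, 64)) for _ in range(4)]
    bc, res = compress_experts(experts, rank=16)
    print(np.linalg.norm(restore_expert(bc, res[0]) - experts[0]))
```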
- North America > Canada > Ontario > Toronto (0.05)
- North America > United States > Illinois > Champaign County > Champaign (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (12 more...)
- Food & Agriculture (0.46)
- Government (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.68)
eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
Tairin, Suraiya, Mahmud, Shohaib, Shen, Haiying, Iyer, Anand
In recent years, Mixture-of-Experts (MoE) has emerged as an effective approach for enhancing the capacity of deep neural networks (DNNs) at sub-linear computational cost. However, storing all experts on GPUs incurs significant memory overhead, increasing the monetary cost of MoE-based inference. To address this, we propose eMoE, a memory-efficient inference system for MoE-based large language models (LLMs) that leverages observations from experimental measurements. eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. To reduce loading latency while maintaining accuracy, and because we found that reusing the same experts for subsequent prompts has minimal impact on perplexity, eMoE invokes the expert predictor every few prompts rather than for each prompt. In addition, it skips predictions for tasks that are less sensitive to routing accuracy. Finally, it uses task-aware scheduling to minimize inference latency by considering Service Level Objectives (SLOs), task-specific output lengths, and expert loading latencies. Experimental results show that, compared to existing systems, eMoE reduces memory consumption by up to 80% while maintaining accuracy and reduces inference latency by up to 17%. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.
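The prefetching idea in the abstract, predicting which experts a prompt will route to and re-running the predictor only every few prompts, can be sketched schematically as follows. The predictor, cache size, and prediction interval are illustrative assumptions, not eMoE's actual components.

```python
# Schematic sketch of periodic expert prediction and loading for MoE inference.
PREDICT_EVERY = 4          # invoke the predictor every N prompts (assumption)
GPU_EXPERT_SLOTS = 8       # how many experts fit in GPU memory (assumption)

class ExpertCache:
    def __init__(self):
        self.loaded: set[int] = set()

    def load(self, expert_ids: set[int]) -> None:
        # Evict experts not in the prediction, then load missing ones up to capacity.
        self.loaded &= expert_ids
        for eid in expert_ids:
            if len(self.loaded) >= GPU_EXPERT_SLOTS:
                break
            self.loaded.add(eid)

def predict_experts(prompt: str) -> set[int]:
    # Placeholder predictor: a real system learns recurring routing patterns.
    return {hash(tok) % 64 for tok in prompt.split()[:GPU_EXPERT_SLOTS]}

def serve(prompts: list[str]) -> None:
    cache = ExpertCache()
    for i, prompt in enumerate(prompts):
        if i % PREDICT_EVERY == 0:          # skip prediction for most prompts
            cache.load(predict_experts(prompt))
        # ... run inference with cache.loaded, loading misses on demand ...
        print(f"prompt {i}: experts on GPU -> {sorted(cache.loaded)}")

if __name__ == "__main__":
    serve(["translate this sentence", "summarize the report",
           "translate another sentence", "write a short poem",
           "translate one more sentence"])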
- North America > United States > Virginia (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > Dominican Republic (0.04)
- (3 more...)
SAGE: Steering and Refining Dialog Generation with State-Action Augmentation
Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level.
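The abstract does not specify how the latent state and action variables are serialized, so the sketch below shows one plausible format for augmenting dialogue turns with discrete state/action tags before each response; the tag names and layout are assumptions, not the paper's scheme.

```python
# Illustrative State-Action-Chain-style augmentation of dialogue turns.
dialogue = [
    {"user": "I bombed my presentation today.",
     "state": "<emotion=discouraged>", "action": "<strategy=validate>",
     "assistant": "That sounds really rough. Do you want to talk it through?"},
    {"user": "Yeah. I froze halfway in.",
     "state": "<emotion=anxious>", "action": "<strategy=reframe>",
     "assistant": "Freezing once doesn't erase the prep you did. "
                  "What part went better than you expected?"},
]

def to_training_text(turns):
    # Serialize turns so the model learns to emit state and action tokens
    # before generating each response.
    lines = []
    for t in turns:
        lines.append(f"User: {t['user']}")
        lines.append(f"{t['state']} {t['action']}")
        lines.append(f"Assistant: {t['assistant']}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(to_training_text(dialogue))
```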
- Personal > Interview (0.95)
- Research Report > New Finding (0.87)
- Media (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.93)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.34)
On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems
Vladika, Juraj, Matthes, Florian
Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs) by reducing their reliance on static knowledge and improving answer factuality. RAG retrieves relevant context snippets and generates an answer based on them. Despite its increasing industrial adoption, systematic exploration of RAG components is lacking, particularly regarding the ideal size of the provided context and the choice of base LLM and retrieval method. To help guide the development of robust RAG systems, we evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs. Moving away from the usual RAG evaluation with short answers, we explore the more challenging task of long-form question answering in two domains, where a good answer has to utilize the entire context. Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that. Finally, we show that the general-purpose LLMs that excel in the biomedical domain differ from those that excel in the encyclopedic one, and that open-domain evidence retrieval in large corpora remains challenging.
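A minimal version of the context-size sweep described above might look like the sketch below, with BM25 as the retriever via the rank_bm25 package. The toy corpus, question, and the answer hook are illustrative placeholders, not the paper's benchmark setup.

```python
from rank_bm25 import BM25Okapi  # assumed dependency: pip install rank-bm25

# Sketch: sweep the number of retrieved snippets in a RAG pipeline with BM25.
corpus = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Insulin therapy is used when oral agents fail to control glucose.",
    "The Eiffel Tower was completed in 1889.",
    "Type 2 diabetes is characterized by insulin resistance.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve(question: str, k: int) -> list[str]:
    return bm25.get_top_n(question.lower().split(), corpus, n=k)

def answer_with_context(question: str, snippets: list[str]) -> str:
    # Placeholder for the base-LLM call that would generate the long-form answer.
    return f"[answer grounded in {len(snippets)} snippets]"

if __name__ == "__main__":
    question = "How is type 2 diabetes usually treated first?"
    for k in (1, 2, 4):   # sweep context sizes (the paper goes up to 15 and beyond)
        print(k, answer_with_context(question, retrieve(question, k)))
```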
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (9 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.94)
- Health & Medicine > Therapeutic Area > Immunology (0.68)