Goto

Collaborating Authors

 optimization strategy




Efficient-Husformer: Efficient Multimodal Transformer Hyperparameter Optimization for Stress and Cognitive Loads

arXiv.org Artificial Intelligence

Transformer-based models have gained considerable attention in the field of physiological signal analysis. They leverage long-range dependencies and complex patterns in temporal signals, allowing them to achieve performance superior to traditional RNN and CNN models. However, they require high computational intensity and memory demands. In this work, we present Efficient-Husformer, a novel Transformer-based architecture developed with hyperparameter optimization (HPO) for multi-class stress detection across two multimodal physiological datasets (WESAD and CogLoad). The main contributions of this work are: (1) the design of a structured search space, targeting effective hyperparameter optimization; (2) a comprehensive ablation study evaluating the impact of architectural decisions; (3) consistent performance improvements over the original Husformer, with the best configuration achieving an accuracy of 88.41 and 92.61 (improvements of 13.83% and 6.98%) on WESAD and CogLoad datasets, respectively. The best-performing configuration is achieved with the (L + dm) or (L + FFN) modality combinations, using a single layer, 3 attention heads, a model dimension of 18/30, and FFN dimension of 120/30, resulting in a compact model with only about 30k parameters.


QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

arXiv.org Artificial Intelligence

Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate the entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation codes. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs for high-performance GPU kernel generation. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves near 100% and 70% accuracy at Levels 1-2 and 3, over 50% than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs, and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.


Computational Turing Test Reveals Systematic Differences Between Human and AI Language

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.


Noise-Aware Optimization in Nominally Identical Manufacturing and Measuring Systems for High-Throughput Parallel Workflows

arXiv.org Artificial Intelligence

Device-to-device variability in experimental noise critically impacts reproducibility, especially in automated, high-throughput systems like additive manufacturing farms. While manageable in small labs, such variability can escalate into serious risks at larger scales, such as architectural 3D printing, where noise may cause structural or economic failures. This contribution presents a noise-aware decision-making algorithm that quantifies and models device-specific noise profiles to manage variability adap-tively. It uses distributional analysis and pairwise divergence metrics with clustering to choose between single-device and robust multi-device Bayesian optimization strategies. Unlike conventional methods that assume homogeneous devices or generic robustness, this framework explicitly leverages inter-device differences to enhance performance, reproducibility, and efficiency. An experimental case study involving three nominally identical 3D printers (same brand, model, and close serial numbers) demonstrates reduced redundancy, lower resource usage, and improved reliability. Overall, this framework establishes a paradigm for precision-and resource-aware optimization in scalable, automated experimental platforms. Introduction Recent advances in automation technologies have revolutionized scientific research, particularly in fields that rely on high-throughput experimentation.




EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models

arXiv.org Artificial Intelligence

CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promise of Large Language Models (LLMs) for automating kernel optimization, this field suffers from a fragmented ecosystem of isolated and incomparable approaches with unclear problem formulations. Furthermore, general-purpose LLM code evolution methods cannot meet strict correctness requirements of CUDA kernel optimization. We address these fundamental challenges by first formalizing CUDA kernel optimization as a code optimization task with a clear objective, constraints, and evaluation metrics. We then establish the first systematic LLM-based code evolution framework, EvoEngineer, that provides guidance for designing and adapting optimization strategies to achieve a balance between performance and correctness. Finally, we implement a kernel optimization system based on this framework and conduct extensive experiments on 91 real-world CUDA kernels. Our results demonstrate that EvoEngineer achieves a principled balance between performance and correctness, with the highest averaged median speedup of 2.72 over baseline CUDA kernels and a code validity rate of 69.8%, outperforming existing methods on both dimensions. Our method achieves a maximum speedup of 36.75 among all operations over PyTorch kernels and delivers the highest speedup on 28 (56.0%) of 50 operations that achieve over 2 acceleration. CUDA kernel performance has become the critical bottleneck constraining the efficiency of AI training and inference. As foundation models continue scaling to unprecedented sizes (Guo et al., 2025; Jaech et al., 2024), computational demands necessitate maximum GPU utilization efficiency, where even marginal improvements in kernel performance can yield substantial reductions in computational costs. However, manual kernel optimization requires deep expertise across GPU architectures, memory hierarchies, parallelization patterns, and hardware-specific features (Navarro et al., 2020; Hennessy & Patterson, 2011), constituting a major obstacle to scaling AI systems. The kernel code optimization landscape presents extreme complexity, involving intricate tradeoffs between memory coalescing, thread divergence, occupancy optimization, and register usage (Ujald on, 2016; Huang et al., 2021; Zhao et al., 2022).


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. This work address the problem of learning a ranking prediction function that optimizes (N)DCG. The authors propose a surrogate loss based on a non-convex upper bound of the DCG, inspired from robust classification losses. The difference with other existing non-convex upper-bound resides in the fact that the authors introduce the non-convexity at the context level (on a whole query) and not at the pair of items level (see [8]). Then, the authors propose two applications of their algorithm with experimental studies: one is learning a prediction model for a search engine problem, the other to learn a representation for collaborative filtering.