Goto

Collaborating Authors

 penguin


A huge iceberg becomes a deadly trap for penguins

Popular Science

An iceberg sealed the penguin colony's entrance, triggering a 70% survival drop. A group of Emperor penguin chicks is walking on the fast ice at the Emperor penguin colony at Snow Hill Island in the Weddell Sea in Antarctica. Breakthroughs, discoveries, and DIY tips sent six days a week. A massive iceberg has triggered a catastrophic die-off of Emperor Penguin chicks in Antarctica, blocking thousands of parents from reaching their young. The event claimed the lives of approximately 14,000 chicks at the Coulman Island colony in the Ross Sea, the region's largest breeding ground.


Penguin: P arallel-Packed Homomorphic Encryption for Fast Graph Convolutional Network Inference Anonymous Author(s) Affiliation Address email

Neural Information Processing Systems

HE operations (e.g., ciphertext (ct) rotations/multiplications, additions), which could be orders of For example, a GCN layer's computation is dominated by the special consecutive HE operations are defined in Sec. 2. For generality, we assume both feature matrix and adjacency Parallel-Packing (see Sec. 3.2), the ciphertext size is fully exploited, and the total HE operation count We adopt a threat model setting consistent with prior works [9, 14, 3, 7, 18, 22, 27]. The cloud server is semi-honest (e.g.



Tree Search for LLM Agent Reinforcement Learning

Ji, Yuxiang, Ma, Ziyu, Wang, Yong, Chen, Guanhua, Chu, Xiangxiang, Wu, Liaoni

arXiv.org Artificial Intelligence

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.Figure 1: Comparison of chain-based and tree-based sampling strategies in LLM multi-turn agent RL. The tree structure brings two major advantages: (i) less rollout budget (both on tokens and tool-calls); (ii) higher performance. Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm for Large Language Models (LLMs), catalyzing the development of several frontier models (DeepSeek-AI Team, 2025; Y ang et al., 2025a; OpenAI, 2024). RL-tuned LLMs trained only with outcome rewards acquire complex reasoning abilities and achieve remarkable gains in single-turn tasks, such as mathematical proof and code generation (Team et al., 2025b; Y u et al., 2025; Chu et al., 2025a; Shao et al., 2024; Xin et al., 2024). This suggests that LLMs can learn not only through static imitation, but also by actively interacting with dynamic environments. Guided by this prospect, recent works have extended this RL paradigm to more complex agent settings involving dynamic, multi-turn interactions (Feng et al., 2025b; Singh et al., 2025; Wang et al., 2025b; Qian et al., 2025; Feng et al., Work done during internship at AMAP, Alibaba Group. Right (Ours): Tree search with nodes corresponding to complete agent step.


Penguin: P arallel-Packed Homomorphic Encryption for Fast Graph Convolutional Network Inference

Neural Information Processing Systems

HE operations (e.g., ciphertext (ct) rotations/multiplications, additions), which could be orders of For example, a GCN layer's computation is dominated by the special consecutive HE operations are defined in Sec. 2. For generality, we assume both feature matrix and adjacency Parallel-Packing (see Sec. 3.2), the ciphertext size is fully exploited, and the total HE operation count We adopt a threat model setting consistent with prior works [9, 14, 3, 7, 18, 22, 27]. The cloud server is semi-honest (e.g.



Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning

Racharak, Teeradaj, Ragkhitwetsagul, Chaiyong, Sontesadisai, Chommakorn, Sunetnanta, Thanwadee

arXiv.org Artificial Intelligence

In-context learning (ICL) has emerged as a powerful capability of large language models (LLMs), enabling them to perform new tasks based on a few provided examples without explicit fine-tuning. Despite their impressive adaptability, these models remain vulnerable to subtle adversarial perturbations and exhibit unpredictable behavior when faced with linguistic variations. Inspired by software testing principles, we introduce a software testing-inspired framework, called MMT4NL, for evaluating the trustworthiness of in-context learning by utilizing adversarial perturbations and software testing techniques. It includes diverse evaluation aspects of linguistic capabilities for testing the ICL capabilities of LLMs. MMT4NL is built around the idea of crafting metamorphic adversarial examples from a test set in order to quantify and pinpoint bugs in the designed prompts of ICL. Our philosophy is to treat any LLM as software and validate its functionalities just like testing the software. Finally, we demonstrate applications of MMT4NL on the sentiment analysis and question-answering tasks. Our experiments could reveal various linguistic bugs in state-of-the-art LLMs.


PENGUIN: Enhancing Transformer with Periodic-Nested Group Attention for Long-term Time Series Forecasting

Sun, Tian, Chen, Yuqi, Sun, Weiwei

arXiv.org Artificial Intelligence

Long-term time series forecasting (LTSF) is a fundamental task with wide-ranging applications. Although Transformer-based models have made significant breakthroughs in forecasting, their effectiveness for time series forecasting remains debatable. In this paper, we revisit the significance of self-attention and propose a simple yet effective mechanism, Periodic-Nested Group Attention, namely PENGUIN. Our approach highlights the importance of explicitly modeling periodic patterns and incorporating relative attention bias for effective time series modeling. To this end, we introduce a periodic-nested relative attention bias that captures periodic structures directly. To handle multiple coexisting periodicities (e.g., daily and weekly cycles), we design a grouped attention mechanism, where each group targets a specific periodicity using a multi-query attention mechanism. Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models.


Can LLMs Solve ASP Problems? Insights from a Benchmarking Study (Extended Version)

Ren, Lin, Xiao, Guohui, Qi, Guilin, Geng, Yishuai, Xue, Haohan

arXiv.org Artificial Intelligence

Answer Set Programming (ASP) is a powerful paradigm for non-monotonic reasoning. Recently, large language models (LLMs) have demonstrated promising capabilities in logical reasoning. Despite this potential, current evaluations of LLM capabilities in ASP are often limited. Existing works normally employ overly simplified ASP programs, do not support negation, disjunction, or multiple answer sets. Furthermore, there is a lack of benchmarks that introduce tasks specifically designed for ASP solving. To bridge this gap, we introduce ASPBench, a comprehensive ASP benchmark, including three ASP specific tasks: ASP entailment, answer set verification, and answer set computation. Our extensive evaluations on ASPBench reveal that while 14 state-of-the-art LLMs, including \emph{deepseek-r1}, \emph{o4-mini}, and \emph{gemini-2.5-flash-thinking}, perform relatively well on the first two simpler tasks, they struggle with answer set computation, which is the core of ASP solving. These findings offer insights into the current limitations of LLMs in ASP solving. This highlights the need for new approaches that integrate symbolic reasoning capabilities more effectively. The code and dataset are available at https://github.com/HomuraT/ASPBench.


Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs

Yang, Hongming, Lin, Shi, Shao, Jun, Lin, Changting, Zhu, Donghai, Han, Meng, Kong, Qinglei

arXiv.org Artificial Intelligence

Lightweight Large Language Models (LwLLMs) are reduced-parameter, optimized models designed to run efficiently on consumer-grade hardware, offering significant advantages in resource efficiency, cost-effectiveness, and data privacy. However, these models often struggle with limited inference and reasoning capabilities, which restrict their performance on complex tasks and limit their practical applicability. Moreover, existing prompt optimization methods typically rely on extensive manual effort or the meta-cognitive abilities of state-of-the-art LLMs, making them less effective for LwLLMs. To address these challenges, we introduce DeBoP, a new Direct Behavior Optimization Paradigm, original from the Chain-of-Thought (CoT) prompting technique. Unlike CoT Prompting, DeBoP is an automatic optimization method, which focuses on the optimization directly on the behavior of LwLLMs. In particular, DeBoP transforms the optimization of complex prompts into the optimization of discrete, quantifiable execution sequences using a gradient-free Monte Carlo Tree Search. We evaluate DeBoP on seven challenging tasks where state-of-the-art LLMs excel but LwLLMs generally underperform. Experimental results demonstrate that DeBoP significantly outperforms recent prompt optimization methods on most tasks. In particular, DeBoP-optimized LwLLMs surpass GPT-3.5 on most tasks while reducing computational time by approximately 60% compared to other automatic prompt optimization methods.