Xiong, Haoyi
Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey
Guan, Shengyue, Xiong, Haoyi, Wang, Jindong, Bian, Jiang, Zhu, Bin, Lou, Jian-guang
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines \emph{what to evaluate} and another that explains \emph{how to evaluate}. The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues.
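The "automated metrics" category above can be illustrated with a minimal stdlib sketch of unigram-overlap scores in the spirit of BLEU and ROUGE; real evaluations use the full metrics (brevity penalty, higher n-gram orders, stemming), so treat this as a toy illustration only:

```python
from collections import Counter

def ngram_overlap(candidate, reference, n=1):
    """Count clipped n-gram matches between candidate and reference tokens."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    return sum(min(count, ref[g]) for g, count in cand.items())

def bleu1(candidate, reference):
    """Unigram precision: matched candidate tokens / candidate length."""
    if not candidate:
        return 0.0
    return ngram_overlap(candidate, reference) / len(candidate)

def rouge1_recall(candidate, reference):
    """Unigram recall: matched reference tokens / reference length."""
    if not reference:
        return 0.0
    return ngram_overlap(candidate, reference) / len(reference)

ref = "the agent booked a table for two".split()
hyp = "the agent booked a table".split()
print(bleu1(hyp, ref))          # 1.0 -- every candidate token appears in the reference
print(rouge1_recall(hyp, ref))  # ~0.714 -- 5 of 7 reference tokens covered
```

Such surface-overlap scores are exactly what the survey argues must be complemented by interaction-aware methods in multi-turn settings, since a response can score highly against a reference while derailing the dialogue.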
Interpretable Feature Interaction via Statistical Self-supervised Learning on Tabular Data
Zhang, Xiaochen, Xiong, Haoyi
In high-dimensional and high-stakes contexts, ensuring both rigorous statistical guarantees and interpretability in feature extraction from complex tabular data remains a formidable challenge. Traditional methods such as Principal Component Analysis (PCA) reduce dimensionality and identify key features that explain the most variance, but are constrained by their reliance on linear assumptions. In contrast, neural networks offer assumption-free feature extraction through self-supervised learning techniques such as autoencoders, though their interpretability remains a challenge in fields requiring transparency. To address this gap, this paper introduces Spofe, a novel self-supervised machine learning pipeline that marries the power of kernel principal components for capturing nonlinear dependencies with a sparse and principled polynomial representation to achieve clear interpretability with statistical rigor. Underpinning our approach is a robust theoretical framework that delivers precise error bounds and rigorous false discovery rate (FDR) control via a multi-objective knockoff selection procedure; it effectively bridges the gap between data-driven complexity and statistical reliability via three stages: (1) generating self-supervised signals using kernel principal components to model complex patterns, (2) distilling these signals into sparse polynomial functions for improved interpretability, and (3) applying a multi-objective knockoff selection procedure with significance testing to rigorously identify important features. Extensive experiments on diverse real-world datasets demonstrate the effectiveness of Spofe, consistently surpassing KPCA, SKPCA, and other methods in feature selection for regression and classification tasks. Visualization and case studies highlight its ability to uncover key insights, enhancing interpretability and practical utility.
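Stage (3) of the pipeline, knockoff-based selection with FDR control, can be sketched with stdlib Python. This toy uses permutation copies as knockoffs and absolute correlation as the feature statistic, which is far simpler than Spofe's multi-objective procedure; the threshold rule is the standard knockoff-filter form:

```python
import random
import statistics

def knockoff_select(X, y, q=0.2, seed=0):
    """Toy knockoff filter: for each feature, compare its |correlation| with y
    against that of a permuted (knockoff) copy, then pick the smallest
    threshold whose estimated false discovery proportion is below q."""
    rng = random.Random(seed)
    def abs_corr(col):
        mx, my = statistics.fmean(col), statistics.fmean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(col, y))
        dx = sum((a - mx) ** 2 for a in col) ** 0.5
        dy = sum((b - my) ** 2 for b in y) ** 0.5
        return abs(num / (dx * dy)) if dx and dy else 0.0
    W = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        knock = col[:]
        rng.shuffle(knock)                       # permutation breaks the link to y
        W.append(abs_corr(col) - abs_corr(knock))
    for t in sorted({abs(w) for w in W if w}):   # candidate thresholds
        neg = sum(1 for w in W if w <= -t)       # proxy for false discoveries
        pos = sum(1 for w in W if w >= t)
        if pos and neg / pos <= q:
            return [j for j, w in enumerate(W) if w >= t]
    return []

rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(200)]
X = [[yi + 0.1 * rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(4)] for yi in y]
picked = knockoff_select(X, y)
print(picked)  # contains feature 0, the truly relevant one
```

The intuition carries over to Spofe: knockoff statistics calibrate how large a feature's importance must be before it is declared significant, which is what delivers the FDR guarantee.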
SOLA-GCL: Subgraph-Oriented Learnable Augmentation Method for Graph Contrastive Learning
Peng, Tianhao, Li, Xuhong, Yuan, Haitao, Li, Yuchen, Xiong, Haoyi
Graph contrastive learning has emerged as a powerful technique for learning graph representations that are robust and discriminative. However, traditional approaches often neglect the critical role of subgraph structures, particularly the intra-subgraph characteristics and inter-subgraph relationships, which are crucial for generating informative and diverse contrastive pairs. These subgraph features matter because they vary significantly across different graph types: in social networks they represent communities, while in biochemical networks they symbolize molecular interactions. To address this issue, our work proposes a novel subgraph-oriented learnable augmentation method for graph contrastive learning, termed SOLA-GCL, which centers around subgraphs and takes full advantage of subgraph information for data augmentation. Specifically, SOLA-GCL initially partitions a graph into multiple densely connected subgraphs based on their intrinsic properties. To preserve and enhance the unique characteristics inherent to subgraphs, a graph view generator optimizes augmentation strategies for each subgraph, thereby generating tailored views for graph contrastive learning. This generator uses a combination of intra-subgraph and inter-subgraph augmentation strategies, including node dropping, feature masking, intra-edge perturbation, inter-edge perturbation, and subgraph swapping. Extensive experiments have been conducted on various graph learning applications, ranging from social networks to molecules, under semi-supervised, unsupervised, and transfer learning settings, demonstrating the superiority of our proposed approach over the state of the art in GCL.
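Two of the named augmentations (node dropping and intra-/inter-edge perturbation) can be sketched over a toy graph; the partition, edge list, and drop ratios below are illustrative assumptions, not SOLA-GCL's learned per-subgraph strategies:

```python
import random

def node_drop(nodes, edges, ratio, rng):
    """Drop a fraction of nodes and every edge touching a dropped node."""
    keep = set(rng.sample(sorted(nodes), k=max(1, int(len(nodes) * (1 - ratio)))))
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]

def edge_perturb(edges, is_intra, ratio, rng):
    """Drop a fraction of edges of one kind (intra- or inter-subgraph),
    leaving edges of the other kind untouched."""
    chosen = [e for e in edges if is_intra(e)]
    others = [e for e in edges if not is_intra(e)]
    kept = rng.sample(chosen, k=int(len(chosen) * (1 - ratio))) if chosen else []
    return others + kept

# Two dense subgraphs {0,1,2} and {3,4,5} joined by one inter-subgraph edge.
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
rng = random.Random(0)
view = edge_perturb(edges, lambda e: part[e[0]] == part[e[1]], ratio=0.3, rng=rng)
# view keeps the inter-subgraph edge (2, 3) and 4 of the 6 intra-subgraph edges
```

In SOLA-GCL the choice of which augmentation to apply, and how strongly, is optimized per subgraph by the view generator rather than fixed as here.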
Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection
Zhang, Xiaochen, Cai, Yunfeng, Xiong, Haoyi
Variable selection plays a crucial role in enhancing modeling effectiveness across diverse fields, addressing the challenges posed by high-dimensional datasets of correlated variables. This work introduces a novel approach, namely Knockoff with over-parameterization (Knoop), to enhance Knockoff filters for variable selection. Specifically, Knoop first generates multiple knockoff variables for each original variable and integrates them with the original variables into an over-parameterized Ridgeless regression model. For each original variable, Knoop evaluates the coefficient distribution of its knockoffs and compares these with the original coefficients to conduct an anomaly-based significance test, ensuring robust variable selection. Extensive experiments demonstrate superior performance compared to existing methods on both simulated and real-world datasets. Knoop achieves a notably higher Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve in identifying relevant variables against the ground truth in controlled simulations, while showcasing enhanced predictive accuracy across diverse regression and classification tasks. The analytical results further back up our observations.
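The anomaly-based test can be sketched in stdlib Python. For brevity this toy uses permutation knockoffs and univariate correlation in place of Knoop's multiple knockoffs inside a Ridgeless regression, so the z-scores below only illustrate the "compare against the knockoff distribution" idea:

```python
import random
import statistics

def _abs_corr(a, b):
    """|Pearson correlation| computed with the stdlib only."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return abs(num / (da * db)) if da and db else 0.0

def knoop_score(x, y, n_knockoffs=50, seed=0):
    """Anomaly-style significance score: compare a variable's association with
    y against the distribution of associations of its permuted knockoff copies."""
    rng = random.Random(seed)
    original = _abs_corr(x, y)
    knock = []
    for _ in range(n_knockoffs):
        k = x[:]
        rng.shuffle(k)                  # knockoff: same marginal, no link to y
        knock.append(_abs_corr(k, y))
    mu, sd = statistics.fmean(knock), statistics.stdev(knock)
    return (original - mu) / (sd or 1e-12)  # large score => likely relevant

rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(300)]
signal = [yi + 0.2 * rng.gauss(0, 1) for yi in y]  # truly relevant variable
noise = [rng.gauss(0, 1) for _ in range(300)]      # irrelevant variable
print(knoop_score(signal, y) > knoop_score(noise, y))  # True
```

Generating many knockoffs per variable, as Knoop does, is what makes the per-variable null distribution stable enough for a significance test.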
EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents
Yun, Yuhui, Ye, Huilong, Li, Xinru, Li, Ruojia, Deng, Jingfeng, Li, Li, Xiong, Haoyi
The paper introduces EICopilot, a novel agent-based solution enhancing search and exploration of enterprise registration data within extensive online knowledge graphs, such as those detailing legal entities, registered capital, and major shareholders. Traditional methods necessitate text-based queries and manual subgraph explorations, often resulting in time-consuming processes. EICopilot, deployed as a chatbot via Baidu Enterprise Search, improves this landscape by utilizing Large Language Models (LLMs) to interpret natural language queries. This solution automatically generates and executes Gremlin scripts, providing efficient summaries of complex enterprise relationships. Distinctive features include a data pre-processing pipeline that compiles and annotates representative queries into a vector database of examples for in-context learning (ICL), a comprehensive reasoning pipeline combining Chain-of-Thought with ICL to enhance Gremlin script generation for knowledge graph search and exploration, and a novel query masking strategy that improves intent recognition for heightened script accuracy. Empirical evaluations demonstrate the superior performance of EICopilot, including speed and accuracy, over baseline methods, with the \emph{Full Mask} variant achieving a syntax error rate reduction to as low as 10.00% and an execution correctness of up to 82.14%. These components collectively contribute to superior querying capabilities and summarization of intricate datasets, positioning EICopilot as a groundbreaking tool in the exploration and exploitation of large-scale knowledge graphs for enterprise information search.
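The interplay between query masking and ICL example retrieval can be sketched as follows; the example store, Gremlin snippets, and token-overlap "retrieval" are hypothetical stand-ins for EICopilot's annotated vector database and embedding lookup:

```python
# Toy in-context example store: (masked natural-language query, Gremlin script)
# pairs. The scripts are illustrative stand-ins, not EICopilot's templates.
EXAMPLES = [
    ("who are the shareholders of [ENT]",
     "g.V().has('company','name',ENT).in('holds_share')"),
    ("what is the registered capital of [ENT]",
     "g.V().has('company','name',ENT).values('registered_capital')"),
]

def mask_query(query, entities):
    """Query masking: replace entity mentions with a placeholder so retrieval
    and intent recognition depend on query structure, not specific names."""
    for ent in entities:
        query = query.replace(ent, "[ENT]")
    return query

def retrieve_examples(masked, k=1):
    """Rank stored examples by token overlap with the masked query
    (a crude stand-in for the vector-database similarity search)."""
    q = set(masked.split())
    ranked = sorted(EXAMPLES, key=lambda ex: -len(q & set(ex[0].split())))
    return ranked[:k]

masked = mask_query("who are the shareholders of Acme Ltd", ["Acme Ltd"])
print(masked)  # who are the shareholders of [ENT]
print(retrieve_examples(masked)[0][1])
```

The retrieved pairs would then be placed in the LLM prompt alongside Chain-of-Thought instructions, and the real entity name substituted back into the generated script before execution.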
Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection
Bai, Lichen, Shao, Shitong, Zhou, Zikai, Qi, Zipeng, Xu, Zhiqiang, Xiong, Haoyi, Xie, Zeke
Figure 1: The qualitative results of Z-Sampling demonstrate the effectiveness of our method in various aspects, such as style, position, color, counting, text rendering, and object co-occurrence.

Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain high image quality and high prompt-image alignment for challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we make three main contributions in this paper. First, we propose diffusion self-reflection, which alternately performs denoising and inversion, and demonstrate with theoretical and empirical evidence that such self-reflection can leverage the guidance gap between denoising and inversion to capture prompt-related semantic information. Second, motivated by this theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel self-reflection-based diffusion sampling method that leverages the guidance gap between denoising and inversion to accumulate semantic information step by step along the sampling path, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be generally applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding and computational costs. Third, our extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. Moreover, Z-Sampling can further enhance existing diffusion models when combined with other orthogonal methods, including Diffusion-DPO.
One key ability of diffusion models is to guide the sampling path based on some conditions (e.g., texts), leading to conditional or controllable generation (Ho & Salimans, 2022). However, while strong guidance may improve semantic alignment with challenging prompts, it often causes a significant decline in image fidelity, leading to mode collapse and resulting in an inevitable accumulation of errors during the sampling process (Chung et al., 2024). To mitigate this issue, some studies apply additional manifold constraints to the sampling paths (Chung et al., 2024; Yang et al.;
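The zigzag schedule of strongly guided denoising, weakly guided inversion, and a second denoising pass can be sketched with a toy one-dimensional "denoiser". The update rules and the target value are illustrative assumptions, not the paper's actual diffusion model:

```python
def z_sampling(x, steps, denoise, invert, s_denoise=5.0, s_invert=0.0):
    """Zigzag schedule sketch: denoise with strong guidance, re-noise (invert)
    with weak guidance, then denoise again, so the guidance gap accumulates
    prompt-related signal along the sampling path."""
    for t in range(steps, 0, -1):
        x = denoise(x, t, s_denoise)   # step toward the data manifold
        x = invert(x, t, s_invert)     # zig back up with weaker guidance
        x = denoise(x, t, s_denoise)   # zag down again, keeping the gap's gain
    return x

TARGET = 1.0  # stands in for the prompt-aligned mode

def denoise(x, t, s):
    # toy guided denoiser: pulls x toward TARGET, harder with larger guidance s
    return x + 0.1 * (1 + 0.1 * s) * (TARGET - x)

def invert(x, t, s):
    # toy inversion: pushes x back toward noise (away from TARGET)
    return x - 0.1 * (1 + 0.1 * s) * (TARGET - x)

def plain_sampling(x, steps):
    for t in range(steps, 0, -1):
        x = denoise(x, t, 5.0)
    return x

z = z_sampling(0.0, 10, denoise, invert)
p = plain_sampling(0.0, 10)
print(abs(TARGET - z) < abs(TARGET - p))  # True: zigzagging converges faster
```

In this caricature each zigzag step shrinks the error by 0.85 * 1.1 * 0.85 ≈ 0.79 versus 0.85 for a plain step, mirroring the claim that the gap between the denoising and inversion guidance scales is what yields the net gain.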
Pre-trained Molecular Language Models with Random Functional Group Masking
Peng, Tianhao, Li, Yuchen, Li, Xuhong, Bian, Jiang, Xie, Zeke, Sui, Ning, Mumtaz, Shahid, Xu, Yanwu, Kong, Linghe, Xiong, Haoyi
Recent advancements in computational chemistry have leveraged the power of transformer-based language models, such as MoLFormer, pre-trained on vast amounts of simplified molecular-input line-entry system (SMILES) sequences, to understand and predict molecular properties and activities, a critical step in fields like drug discovery and materials science. To further improve performance, researchers have introduced graph neural networks with graph-based molecular representations, such as GEM, incorporating the topology, geometry, and 2D or even 3D structures of molecules into pre-training. Since most molecular graphs in existing studies were automatically converted from SMILES sequences, it is reasonable to assume that transformer-based language models might be able to implicitly learn structure-aware representations from SMILES sequences alone. In this paper, we propose \ours{} -- a SMILES-based \underline{\em M}olecular \underline{\em L}anguage \underline{\em M}odel that randomly masks SMILES subsequences corresponding to specific molecular \underline{\em F}unctional \underline{\em G}roups to incorporate structural information about atoms during the pre-training phase. This technique compels the model to better infer molecular structures and properties, thus enhancing its predictive capabilities. Extensive experimental evaluations across 11 benchmark classification and regression tasks in the chemical domain demonstrate the robustness and superiority of \ours{}. Our findings reveal that \ours{} outperforms existing pre-training models, whether based on SMILES or graphs, in 9 of the 11 downstream tasks, ranking a close second in the remaining two.
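The functional-group masking step can be sketched with stdlib string matching. The group patterns below are a tiny illustrative set; real pipelines would match chemically meaningful groups (e.g. via SMARTS patterns in a toolkit like RDKit) rather than raw substrings:

```python
import random
import re

# Illustrative SMILES substrings for a few groups, longest patterns first so a
# carboxyl "C(=O)O" is not shadowed by its constituent "O".
GROUPS = ["C(=O)O", "C(=O)N", "O", "N"]

def mask_functional_groups(smiles, ratio=0.5, seed=0):
    """Randomly mask SMILES subsequences that correspond to functional groups,
    leaving the rest of the string intact."""
    rng = random.Random(seed)
    spans = []
    for g in GROUPS:
        for m in re.finditer(re.escape(g), smiles):
            # skip matches overlapping an already-claimed (longer) group
            if not any(s < m.end() and m.start() < e for s, e in spans):
                spans.append((m.start(), m.end()))
    chosen = [sp for sp in spans if rng.random() < ratio]
    for s, e in sorted(chosen, reverse=True):   # replace right-to-left
        smiles = smiles[:s] + "[MASK]" + smiles[e:]
    return smiles

# Aspirin: both carboxyl-like groups get masked when ratio=1.0.
print(mask_functional_groups("CC(=O)Oc1ccccc1C(=O)O", ratio=1.0))
# C[MASK]c1ccccc1[MASK]
```

Pre-training then asks the model to reconstruct the masked group tokens, which is what pushes it toward structure-aware representations of SMILES.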
IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis
Shao, Shitong, Zhou, Zikai, Bai, Lichen, Xiong, Haoyi, Xie, Zeke
The multi-step sampling mechanism, a key feature of visual diffusion models, has significant potential to replicate the success of OpenAI's Strawberry in enhancing performance by increasing the inference computational cost. Prior studies have demonstrated that correctly scaling up computation in the sampling process can lead to improved generation quality, enhanced image editing, and compositional generalization. While there have been rapid advancements in developing inference-heavy algorithms for improved image generation, relatively little work has explored inference scaling laws in video diffusion models (VDMs). Furthermore, existing research shows only minimal performance gains that are perceptible to the naked eye. To address this, we design a novel training-free algorithm, IV-Mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help VDMs surpass their current capabilities. The core of IV-Mixed Sampler is to use IDMs to significantly enhance the quality of each video frame while VDMs ensure the temporal coherence of the video during the sampling process. Our experiments demonstrate that IV-Mixed Sampler achieves state-of-the-art performance on 4 benchmarks: UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649. For example, the open-source Animatediff with IV-Mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, approaching the 223.1 achieved by the closed-source Pika-2.0.
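The alternation of per-frame image refinement and joint temporal denoising can be sketched with scalar "frames"; the two toy update rules stand in for real IDM and VDM denoising steps and are purely illustrative:

```python
def iv_mixed_step(frames, idm_step, vdm_step):
    """One IV-mixed sampling step: an image model refines each frame
    independently, then a video model smooths the frames jointly to
    restore temporal coherence."""
    frames = [idm_step(f) for f in frames]  # image model: per-frame quality boost
    return vdm_step(frames)                 # video model: joint temporal smoothing

TARGET = 1.0  # stands in for a clean, high-quality frame

def idm_step(f):
    # toy image denoiser: halves each frame's distance to the clean target
    return f + 0.5 * (TARGET - f)

def vdm_step(frames):
    # toy video denoiser: averages each frame with its neighbors for coherence
    n = len(frames)
    return [(frames[max(i - 1, 0)] + frames[i] + frames[min(i + 1, n - 1)]) / 3
            for i in range(n)]

frames = [0.0, 0.4, 0.8]  # three noisy frames at different quality levels
for _ in range(5):
    frames = iv_mixed_step(frames, idm_step, vdm_step)
# all frames end up close to TARGET and close to each other
```

The sketch captures the division of labor: the IDM term drives each frame toward high quality, while the VDM term keeps the frames mutually consistent across time.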
Pre-trained Graphformer-based Ranking at Web-scale Search (Extended Abstract)
Li, Yuchen, Xiong, Haoyi, Kong, Linghe, Sun, Zeyi, Chen, Hongyang, Wang, Shuaiqiang, Yin, Dawei
Both Transformers and Graph Neural Networks (GNNs) have been employed in the domain of learning to rank (LTR). However, these approaches adhere to two distinct yet complementary problem formulations: ranking score regression based on query-webpage pairs, and link prediction within query-webpage bipartite graphs, respectively. While it is possible to pre-train GNNs or Transformers on source datasets and subsequently fine-tune them on sparsely annotated LTR datasets, the distributional shifts between the pair-based and bipartite-graph domains present significant challenges in integrating these heterogeneous models into a unified LTR framework at web scale. To address this, we introduce the novel MPGraf model, which leverages a modular and capsule-based pre-training strategy, aiming to cohesively integrate the regression capabilities of Transformers with the link prediction capabilities of GNNs.

Although Graphformer [Yang et al., 2021] has been proposed to combine the advantages of GNNs and Transformers for representation learning with textual graphs, there is still a lack of joint efforts from the two domains (i.e., query-webpage pairs and graphs) in LTR. To improve the performance of over-parameterized models like Transformers or GNNs, the paradigm of pre-training and fine-tuning has been extensively employed [Liao et al., 2024; Chen et al., 2024g; Chen et al., 2022; Song et al., 2024; Lyu et al., 2023]. This involves first training the models on large-scale source datasets in an unsupervised or self-supervised manner to develop their core representation learning capabilities [Qiang et al., 2023; Xiong et al., 2024a; Xiong et al., 2024b; Lyu et al., 2020]. Subsequently, the pre-trained models can be fine-tuned using a small number of annotated samples from the target datasets [Kirichenko et al., 2022; Huang et al., 2021; Chen et al., 2023e; Chen et al., 2023d; Chen et al., 2023b]. However, such a paradigm cannot be easily followed by LTR models that leverage both query-webpage pairs and graphs together.
Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale (Extended Abstract)
Li, Yuchen, Xiong, Haoyi, Kong, Linghe, Bian, Jiang, Wang, Shuaiqiang, Chen, Guihai, Yin, Dawei
The optimization of the user experience, achieved by catering to information needs, largely depends on the effective sorting of retrieved content. In this realm, Learning to Rank (LTR) becomes instrumental, requiring a considerable amount of query-webpage pairings with relevancy scores for effective supervised LTR [Li et al., 2023b; Qin and Liu, 2013; Li et al., 2023c; Lyu et al., 2020; Peng et al., 2024; Wang et al., 2024b]. Nevertheless, the commonplace scarcity of well-described query-webpage pairings often compels semi-supervised LTR, harnessing both labeled and unlabeled samples for the process [Szummer and Yilmaz, 2011; Zhang et al., 2016; Zhu et al., 2023; Peng et al., 2023].

Learning to rank (LTR) is widely employed in web searches to prioritize pertinent webpages from retrieved content based on input queries. However, traditional LTR models encounter two principal obstacles that lead to suboptimal performance: (1) the lack of well-annotated query-webpage pairs with ranking scores covering a diverse range of search query popularities, which hampers their ability to address queries across the popularity spectrum, and (2) inadequately trained models that fail to induce generalized representations for LTR, resulting in overfitting. To address these challenges, we propose