Zhang, Jian
Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting
Yang, Maochen, Li, Zekun, Zhang, Jian, Qi, Lei, Shi, Yinghuan
Semi-supervised crowd counting is crucial for addressing the high annotation costs of densely populated scenes. Although several methods based on pseudo-labeling have been proposed, it remains challenging to effectively and accurately utilize unlabeled data. In this paper, we propose a novel framework called Taste More Taste Better (TMTB), which emphasizes both data and model aspects. Firstly, we explore a data augmentation technique well-suited for the crowd counting task. By inpainting the background regions, this technique can effectively enhance data diversity while preserving the fidelity of the entire scenes. Secondly, we introduce the Visual State Space Model as backbone to capture the global context information from crowd scenes, which is crucial for extremely crowded, low-light, and adverse weather scenarios. In addition to the traditional regression head for exact prediction, we employ an Anti-Noise classification head to provide less exact but more accurate supervision, since the regression head is sensitive to noise in manual annotations. We conduct extensive experiments on four benchmark datasets and show that our method outperforms state-of-the-art methods by a large margin. Code is publicly available on https://github.com/syhien/taste_more_taste_better.
Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization
Wang, Xiran, Zhang, Jian, Qi, Lei, Shi, Yinghuan
Domain generalization is proposed to address distribution shift, arising from statistical disparities between training source and unseen target domains. The widely used first-order meta-learning algorithms demonstrate strong performance for domain generalization by leveraging the gradient matching theory, which aims to establish balanced parameters across source domains to reduce overfitting to any particular domain. However, our analysis reveals that there are actually numerous directions to achieve gradient matching, with current methods representing just one possible path. These methods actually overlook another critical factor that the balanced parameters should be close to the centroid of optimal parameters of each source domain. T o address this, we propose a simple yet effective arithmetic meta-learning with arithmetic-weighted gradients. This approach, while adhering to the principles of gradient matching, promotes a more precise balance by estimating the centroid between domain-specific optimal parameters.
MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
Zhang, Jian, Wang, Zhiyuan, Wang, Zhangqi, Zhang, Xinyu, Xu, Fangzhi, Lin, Qika, Mao, Rui, Cambria, Erik, Liu, Jun
Multimodal scientific problems (MSPs) involve complex issues that require the integration of multiple modalities, such as text and diagrams, presenting a significant challenge in artificial intelligence. While progress has been made in addressing traditional scientific problems, MSPs still face two primary issues: the challenge of multi-modal comprehensive reasoning in scientific problem-solving and the lack of reflective and rethinking capabilities. To address these issues, we introduce a Multi-Agent framework based on the Big Seven Personality and Socratic guidance (MAPS). This framework employs seven distinct agents that leverage feedback mechanisms and the Socratic method to guide the resolution of MSPs. To tackle the first issue, we propose a progressive four-agent solving strategy, where each agent focuses on a specific stage of the problem-solving process. For the second issue, we introduce a Critic agent, inspired by Socratic questioning, which prompts critical thinking and stimulates autonomous learning. We conduct extensive experiments on the EMMA, Olympiad, and MathVista datasets, achieving promising results that outperform the current SOTA model by 15.84% across all tasks. Meanwhile, the additional analytical experiments also verify the model's progress as well as generalization ability.
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
Zhang, Jian, Wang, Zhangqi, Zhu, Haiping, Liu, Jun, Lin, Qika, Cambria, Erik
The basic question-answering format of large language models involves inputting a prompt and receiving a response, and the quality of the prompt directly impacts the effectiveness of the response. Automated Prompt Optimization (APO) aims to break free from the cognitive biases of manually designed prompts and explores a broader design space for prompts. However, existing APO methods suffer from limited flexibility of fixed templates and inefficient search in prompt spaces as key issues. To this end, we propose a Multi-Agent framework Incorporating Socratic guidance (MARS), which utilizes multi-agent fusion technology for automatic planning, with gradual continuous optimization and evaluation. Specifically, MARS comprises seven agents, each with distinct functionalities, which autonomously use the Planner to devise an optimization path that ensures flexibility. Additionally, it employs a Teacher-Critic-Student Socratic dialogue pattern to iteratively optimize the prompts while conducting effective search. We conduct extensive experiments on various datasets to validate the effectiveness of our method, and perform additional analytical experiments to assess the model's advancement as well as the interpretability.
Multi-Agent Image Restoration
Jiang, Xu, Li, Gehui, Chen, Bin, Zhang, Jian
Image restoration (IR) is challenging due to the complexity of real-world degradations. While many specialized and all-in-one IR models have been developed, they fail to effectively handle complex, mixed degradations. Recent agentic methods RestoreAgent and AgenticIR leverage intelligent, autonomous workflows to alleviate this issue, yet they suffer from suboptimal results and inefficiency due to their resource-intensive finetunings, and ineffective searches and tool execution trials for satisfactory outputs. In this paper, we propose MAIR, a novel Multi-Agent approach for complex IR problems. We introduce a real-world degradation prior, categorizing degradations into three types: (1) scene, (2) imaging, and (3) compression, which are observed to occur sequentially in real world, and reverse them in the opposite order. Built upon this three-stage restoration framework, MAIR emulates a team of collaborative human specialists, including a "scheduler" for overall planning and multiple "experts" dedicated to specific degradations. This design minimizes search space and trial efforts, improving image quality while reducing inference costs. In addition, a registry mechanism is introduced to enable easy integration of new tools. Experiments on both synthetic and real-world datasets show that proposed MAIR achieves competitive performance and improved efficiency over the previous agentic IR system. Code and models will be made available.
GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction
Zhang, Jian, Wei, Bifan, Qi, Shihao, Zhu, haiping, Liu, Jun, Lin, Qika
The construction of Generalized Knowledge Graph (GKG), including knowledge graph, event knowledge graph and commonsense knowledge graph, is fundamental for various natural language processing tasks. Current studies typically construct these types of graph separately, overlooking holistic insights and potential unification that could be beneficial in computing resources and usage perspectives. However, a key challenge in developing a unified framework for GKG is obstacles arising from task-specific differences. In this study, we propose a unified framework for constructing generalized knowledge graphs to address this challenge. First, we collect data from 15 sub-tasks in 29 datasets across the three types of graphs, categorizing them into in-sample, counter-task, and out-of-distribution (OOD) data. Then, we propose a three-stage curriculum learning fine-tuning framework, by iteratively injecting knowledge from the three types of graphs into the Large Language Models. Extensive experiments show that our proposed model improves the construction of all three graph types across in-domain, OOD and counter-task data.
DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models
Shen, Yi, Zhang, Jian, Huang, Jieyun, Shi, Shuming, Zhang, Wenjing, Yan, Jiangze, Wang, Ning, Wang, Kai, Lian, Shiguo
Recent advancements in slow-thinking reasoning models have shown exceptional performance in complex reasoning tasks. However, these models often exhibit overthinking-generating redundant reasoning steps for simple problems, leading to excessive computational resource usage. While current mitigation strategies uniformly reduce reasoning tokens, they risk degrading performance on challenging tasks that require extended reasoning. This paper introduces Difficulty-Adaptive Slow-Thinking (DAST), a novel framework that enables models to autonomously adjust the length of Chain-of-Thought(CoT) based on problem difficulty. We first propose a Token Length Budget (TLB) metric to quantify difficulty, then leveraging length-aware reward shaping and length preference optimization to implement DAST. DAST penalizes overlong responses for simple tasks while incentivizing sufficient reasoning for complex problems. Experiments on diverse datasets and model scales demonstrate that DAST effectively mitigates overthinking (reducing token usage by over 30\% on average) while preserving reasoning accuracy on complex problems.
EPBC-YOLOv8: An efficient and accurate improved YOLOv8 underwater detector based on an attention mechanism
Jiang, Xing, Zhuang, Xiting, Chen, Jisheng, Zhang, Jian
In this study, we enhance underwater target detection by integrating channel and spatial attention into YOLOv8's backbone, applying Pointwise Convolution in FasterNeXt for the FasterPW model, and leveraging Weighted Concat in a BiFPN-inspired WFPN structure for improved cross-scale connections and robustness. Utilizing CARAFE for refined feature reassembly, our framework addresses underwater image degradation, achieving mAP at 0.5 scores of 76.7 percent and 79.0 percent on URPC2019 and URPC2020 datasets, respectively. These scores are 2.3 percent and 0.7 percent higher than the original YOLOv8, showcasing enhanced precision in detecting marine organisms.
DA-LIF: Dual Adaptive Leaky Integrate-and-Fire Model for Deep Spiking Neural Networks
Zhang, Tianqing, Yu, Kairong, Zhang, Jian, Wang, Hongwei
Spiking Neural Networks (SNNs) are valued for their ability to process spatio-temporal information efficiently, offering biological plausibility, low energy consumption, and compatibility with neuromorphic hardware. However, the commonly used Leaky Integrate-and-Fire (LIF) model overlooks neuron heterogeneity and independently processes spatial and temporal information, limiting the expressive power of SNNs. In this paper, we propose the Dual Adaptive Leaky Integrate-and-Fire (DA-LIF) model, which introduces spatial and temporal tuning with independently learnable decays. Evaluations on both static (CIFAR10/100, ImageNet) and neuromorphic datasets (CIFAR10-DVS, DVS128 Gesture) demonstrate superior accuracy with fewer timesteps compared to state-of-the-art methods. Importantly, DA-LIF achieves these improvements with minimal additional parameters, maintaining low energy consumption. Extensive ablation studies further highlight the robustness and effectiveness of the DA-LIF model.
ILETIA: An AI-enhanced method for individualized trigger-oocyte pickup interval estimation of progestin-primed ovarian stimulation protocol
Wu, Binjian, Li, Qian, Kuang, Zhe, Gao, Hongyuan, Liu, Xinyi, Guo, Haiyan, Chen, Qiuju, Liu, Xinyi, Jiang, Yangruizhe, Zhang, Yuqi, Zha, Jinyin, Li, Mingyu, Ren, Qiuhan, Feng, Sishuo, Zhang, Haicang, Lu, Xuefeng, Zhang, Jian
In vitro fertilization-embryo transfer (IVF-ET) stands as one of the most prevalent treatments for infertility. During an IVF-ET cycle, the time interval between trigger shot and oocyte pickup (OPU) is a pivotal period for follicular maturation, which determines mature oocytes yields and impacts the success of subsequent procedures. However, accurately predicting this interval is severely hindered by the variability of clinicians'experience that often leads to suboptimal oocyte retrieval rate. To address this challenge, we propose ILETIA, the first machine learning-based method that could predict the optimal trigger-OPU interval for patients receiving progestin-primed ovarian stimulation (PPOS) protocol. Specifically, ILETIA leverages a Transformer to learn representations from clinical tabular data, and then employs gradient-boosted trees for interval prediction. For model training and evaluating, we compiled a dataset PPOS-DS of nearly ten thousand patients receiving PPOS protocol, the largest such dataset to our knowledge. Experimental results demonstrate that our method achieves strong performance (AUROC = 0.889), outperforming both clinicians and other widely used computational models. Moreover, ILETIA also supports premature ovulation risk prediction in a specific OPU time (AUROC = 0.838). Collectively, by enabling more precise and individualized decisions, ILETIA has the potential to improve clinical outcomes and lay the foundation for future IVF-ET research.