AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)

Neural Information Processing SystemsFeb-18-2026, 07:42:43 GMT

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation.

diffusion model, large language model, machine learning, (18 more...)

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsFeb-11-2026, 12:45:21 GMT

PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph

Yikang LI, Tao Ma, Yeqi Bai, Nan Duan, Sining Wei, Xiaogang Wang

Among these forms, scene graphs are powerful structured representations of the images that encode objects and their interactions.

artificial intelligence, machine learning, scene graph, (17 more...)

Country: North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

arXiv.org Artificial IntelligenceDec-3-2025

Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Xiong, Yuan, Miao, Ziqi, Li, Lijun, Qian, Chen, Li, Jie, Shao, Jing

While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporate contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31\% and 91.07\%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

harmful query, large language model, machine learning, (19 more...)

2512.02973

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (0.67)
Law Enforcement & Public Safety (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceDec-2-2025

LLM-based Automated Grading with Human-in-the-Loop

Chu, Yucheng, Li, Hang, Yang, Kaiqi, Copur-Gencturk, Yasemin, Tang, Jiliang

The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined "golden" answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.

large language model, machine learning, natural language, (19 more...)

2504.05239

Country: North America > United States > California (0.46)

Genre: Research Report > New Finding (0.68)

Industry: Education > Educational Technology > Educational Software > Computer-Aided Assessment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceOct-24-2025

CE-Nav: Flow-Guided Reinforcement Refinement for Cross-Embodiment Local Navigation

Yang, Kai, Zhang, Tianlin, Wang, Zhengbo, Chu, Zedong, Wu, Xiaolong, Cai, Yang, Xu, Mu

Generalizing local navigation policies across diverse robot morphologies is a critical challenge. Progress is often hindered by the need for costly and embodiment-specific data, the tight coupling of planning and control, and the "disastrous averaging" problem where deterministic models fail to capture multi-modal decisions (e.g., turning left or right). We introduce CE-Nav, a novel two-stage (IL-then-RL) framework that systematically decouples universal geometric reasoning from embodiment-specific dynamic adaptation. First, we train an embodiment-agnostic General Expert offline using imitation learning. This expert, a conditional normalizing flow model named VelFlow, learns the full distribution of kinematically-sound actions from a large-scale dataset generated by a classical planner, completely avoiding real robot data and resolving the multi-modality issue. Second, for a new robot, we freeze the expert and use it as a guiding prior to train a lightweight, Dynamics-Aware Refiner via online reinforcement learning. This refiner rapidly learns to compensate for the target robot's specific dynamics and controller imperfections with minimal environmental interaction. Extensive experiments on quadrupeds, bipeds, and quadrotors show that CE-Nav achieves state-of-the-art performance while drastically reducing adaptation cost. Successful real-world deployments further validate our approach as an efficient and scalable solution for building generalizable navigation systems. Code is available at https://github.com/amap-cvlab/CE-Nav.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2509.23203

Genre:

Research Report (0.82)
Instructional Material (0.66)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.35)

Neural Information Processing SystemsOct-10-2025, 18:02:35 GMT

d68c1d10957c8d21ed9dea209533c5a4-Paper-Conference.pdf

arxiv preprint arxiv, diffusion model, llm, (14 more...)

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Xu, Xiangyang, Gao, Hongyang

Flow-Matching Based Refiner for Molecular Conformer Generation

arXiv.org Artificial IntelligenceOct-7-2025

Low-energy molecular conformers generation (MCG) is a foundational yet challenging problem in drug discovery. Denoising-based methods include diffusion and flow-matching methods that learn mappings from a simple base distribution to the molecular conformer distribution. However, these approaches often suffer from error accumulation during sampling, especially in the low SNR steps, which are hard to train. To address these challenges, we propose a flow-matching refiner for the MCG task. The proposed method initializes sampling from mixed-quality outputs produced by upstream denoising models and reschedules the noise scale to bypass the low-SNR phase, thereby improving sample quality. On the GEOM-QM9 and GEOM-Drugs benchmark datasets, the generator-refiner pipeline improves quality with fewer total denoising steps while preserving diversity.

artificial intelligence, conformer, machine learning, (14 more...)

2510.04878

Genre: Research Report (1.00)

Industry: Health & Medicine (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceFeb-20-2025

OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework

Chen, Junliang, Xu, Huaiyuan, Wang, Yi, Chau, Lap-Pui

Predicting variations in complex traffic environments is crucial for the safety of autonomous driving. Recent advancements in occupancy forecasting have enabled forecasting future 3D occupied status in driving environments by observing historical 2D images. However, high computational demands make occupancy forecasting less efficient during training and inference stages, hindering its feasibility for deployment on edge agents. In this paper, we propose a novel framework, i.e., OccProphet, to efficiently and effectively learn occupancy forecasting with significantly lower computational requirements while improving forecasting accuracy. OccProphet comprises three lightweight components: Observer, Forecaster, and Refiner. The Observer extracts spatio-temporal features from 3D multi-frame voxels using the proposed Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster and Refiner conditionally predict and refine future occupancy inferences. Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets demonstrate that OccProphet is both training-and inference-friendly. OccProphet reduces 58% 78% of the computational cost with a 2.6 speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves 4% 18% relatively higher forecasting accuracy. OccProphet only receives multi-camera video input and produces future occupancies. Autonomous driving holds significant promise for reshaping transportation and urban mobility.

forecasting, occprophet, occupancy forecasting, (16 more...)

2502.1518

Country:

Asia > China > Hong Kong (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report (0.50)

Industry: Transportation > Ground > Road (0.77)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.70)

arXiv.org Artificial IntelligenceDec-16-2024

SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

Cheng, Jiale, Liu, Xiao, Wang, Cunxiang, Gu, Xiaotao, Lu, Yida, Zhang, Dan, Dong, Yuxiao, Tang, Jie, Wang, Hongning, Huang, Minlie

Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference learning. Such practice can introduce content variations irrelevant to whether the instruction is precisely followed (e.g., different expressions about the same semantic), interfering with the goal of teaching models to recognize the key differences that lead to improved instruction following. R, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. R demonstrates promising scalability and transferability, greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search would impact model performance. Our code and data are publicly available at https://github.com/thu-coai/SPaR. Write a story and end it with "The devil is in the details." The is in the details. Figure 1: An example of the interfering factors (story content) in independently sampled multiple responses (Left). Refined response pairs exclude these factors, highlight the key difference (ending sentence), and lead to improved performance on iteratively trained LLaMA3-8B-Instruct (Right).

arxiv preprint arxiv, large language model, machine learning, (18 more...)

2412.11605

Genre: Research Report > New Finding (0.46)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)