Goto

Collaborating Authors

 gao


Automated Hazard Detection in Construction Sites Using Large Language and Vision-Language Models

Sahraoui, Islem

arXiv.org Artificial Intelligence

This thesis explores a multimodal AI framework for enhancing construction safety through the combined analysis of textual and visual data. In safety-critical environments such as construction sites, accident data often exists in multiple formats, such as written reports, inspection records, and site imagery, making it challenging to synthesize hazards using traditional approaches. To address this, this thesis proposed a multimodal AI framework that combines text and image analysis to assist in identifying safety hazards on construction sites. Two case studies were consucted to evaluate the capabilities of large language models (LLMs) and vision-language models (VLMs) for automated hazard identification.The first case study introduces a hybrid pipeline that utilizes GPT 4o and GPT 4o mini to extract structured insights from a dataset of 28,000 OSHA accident reports (2000-2025). The second case study extends this investigation using Molmo 7B and Qwen2 VL 2B, lightweight, open-source VLMs. Using the public ConstructionSite10k dataset, the performance of the two models was evaluated on rule-level safety violation detection using natural language prompts. This experiment served as a cost-aware benchmark against proprietary models and allowed testing at scale with ground-truth labels. Despite their smaller size, Molmo 7B and Quen2 VL 2B showed competitive performance in certain prompt configurations, reinforcing the feasibility of low-resource multimodal systems for rule-aware safety monitoring.


OpenAI's new LLM exposes the secrets of how AI really works

MIT Technology Review

The experimental model won't compete with the biggest and best, but it could tell us why they behave in weird ways--and how trustworthy they really are. ChatGPT maker OpenAI has built an experimental large language model that is far easier to understand than typical models. That's a big deal, because today's LLMs are black boxes: Nobody fully understands how they do what they do. Building a model that is more transparent sheds light on how LLMs work in general, helping researchers figure out why models hallucinate, why they go off the rails, and just how far we should trust them with critical tasks. "As these AI systems get more powerful, they're going to get integrated more and more into very important domains," Leo Gao, a research scientist at OpenAI, told in an exclusive preview of the new work. "It's very important to make sure they're safe."


Real-Time Crowd Counting for Embedded Systems with Lightweight Architecture

Zhao, Zhiyuan, Wen, Yubin, Yang, Siyu, Ning, Lichen, Liu, Yuandong, Gao, Junyu

arXiv.org Artificial Intelligence

Crowd counting is a task of estimating the number of the crowd through images, which is extremely valuable in the fields of intelligent security, urban planning, public safety management, and so on. However, the existing counting methods have some problems in practical application on embedded systems for these fields, such as excessive model parameters, abundant complex calculations, etc. The practical application of embedded systems requires the model to be real-time, which means that the model is fast enough. Considering the aforementioned problems, we design a super real-time model with a stem-encoder-decoder structure for crowd counting tasks, which achieves the fastest inference compared with state-of-the-arts. Firstly, large convolution kernels in the stem network are used to enlarge the receptive field, which effectively extracts detailed head information. Then, in the encoder part, we use conditional channel weighting and multi-branch local fusion block to merge multi-scale features with low computational consumption. This part is crucial to the super real-time performance of the model. Finally, the feature pyramid networks are added to the top of the encoder to alleviate its incomplete fusion problems. Experiments on three benchmarks show that our network is suitable for super real-time crowd counting on embedded systems, ensuring competitive accuracy. At the same time, the proposed network reasoning speed is the fastest. Specifically, the proposed network achieves 381.7 FPS on NVIDIA GTX 1080Ti and 71.9 FPS on NVIDIA Jetson TX1.


Explore the Reinforcement Learning for the LLM based ASR and TTS system

Gao, Changfeng, Li, Yabin, An, Keyu, Gao, Zhifu, Du, Zhihao, Zhao, Han, Li, Xiangang

arXiv.org Artificial Intelligence

In recent years, large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. While reinforcement learning (RL) has significantly enhanced LLM performance in text-based tasks, its application to ASR and TTS remains underexplored due to the complexity of training audio-based models. In this study, we propose a lightweight RL framework tailored for audio-based LLMs that can process audio inputs and generate audio outputs. Based on this framework, we evaluate the effectiveness of reinforcement learning on both ASR and TTS tasks. For the ASR task, we experiment with different rule-based reward functions within the Group Relative Policy Optimization (GRPO) framework and investigate the impact of RL data construction. For the TTS task, we compare GRPO with Differentiable Reward Optimization (DiffRO) and further combine the two approaches to achieve improved performance. Our experiments demonstrate that RL can significantly enhance the performance of both ASR and TTS systems, even with limited training data and a small number of optimization steps.


LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer

Dong, Yaoxian, Gao, Yifan, Li, Haoyue, Cui, Yanfen, Gao, Xin

arXiv.org Artificial Intelligence

Accurate preoperative assessment of lymph node (LN) metastasis in rectal cancer guides treatment decisions, yet conventional MRI evaluation based on morphological criteria shows limited diagnostic performance. While some artificial intelligence models have been developed, they often operate as black boxes, lacking the interpretability needed for clinical trust. Moreover, these models typically evaluate nodes in isolation, overlooking the patient-level context. To address these limitations, we introduce LRMR, an LLM-Driven Relational Multi-node Ranking framework. This approach reframes the diagnostic task from a direct classification problem into a structured reasoning and ranking process. The LRMR framework operates in two stages. First, a multimodal large language model (LLM) analyzes a composite montage image of all LNs from a patient, generating a structured report that details ten distinct radiological features. Second, a text-based LLM performs pairwise comparisons of these reports between different patients, establishing a relative risk ranking based on the severity and number of adverse features. We evaluated our method on a retrospective cohort of 117 rectal cancer patients. LRMR achieved an area under the curve (AUC) of 0.7917 and an F1-score of 0.7200, outperforming a range of deep learning baselines, including ResNet50 (AUC 0.7708). Ablation studies confirmed the value of our two main contributions: removing the relational ranking stage or the structured prompting stage led to a significant performance drop, with AUCs falling to 0.6875 and 0.6458, respectively. Our work demonstrates that decoupling visual perception from cognitive reasoning through a two-stage LLM framework offers a powerful, interpretable, and effective new paradigm for assessing lymph node metastasis in rectal cancer.


Differentiable Reward Optimization for LLM based TTS system

Gao, Changfeng, Du, Zhihao, Zhang, Shiliang

arXiv.org Artificial Intelligence

This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly compute the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereby streamlining the RLHF training process. Additionally, we introduce a multi-task reward (MTR) model which can provide feedback from different perspectives and find that it can augment the system's capability to follow instructions effectively. Experimental results indicate that DiffRO significantly improves the pronunciation accuracy of the TTS system, achieving state-of-the-art (SOT A) WER results on the seed-tts-eval benchmark. Moreover, with the integration of the MTR model, we demonstrate the ability to control emotional and quality attributes in a zero-shot manner.


Review for NeurIPS paper: Noise-Contrastive Estimation for Multivariate Point Processes

Neural Information Processing Systems

The paper derives a new estimation method for multi-variate point processes that is based on the'ranking'-variant of NCE. The paper is borderline: two reviewers think that the difference to previous work by Gao (who use NCE to estimate point-processes) and the empirical comparison is not sufficient. Two other reviewers disagree, with one in particular arguing that the paper should be accepted. The meta-reviewer thinks that the theory in the paper is sufficiently different from Gao's work, and that the theoretical aspects of the paper are deeper and more rigorous. The results do not follow directly from previous work by Gutmann & Hyvarinen (2012) or Ma & Collins (2018). The empirical results are good and the method should be useful in practice.


Quantum-inspired Reinforcement Learning for Synthesizable Drug Design

Wang, Dannong, Chen, Jintai, Liang, Zhiding, Fu, Tianfan, Liu, Xiao-Yang

arXiv.org Artificial Intelligence

Synthesizable molecular design (also known as synthesizable molecular optimization) is a fundamental problem in drug discovery, and involves designing novel molecular structures to improve their properties according to drug-relevant oracle functions (i.e., objective) while ensuring synthetic feasibility. However, existing methods are mostly based on random search. To address this issue, in this paper, we introduce a novel approach using the reinforcement learning method with quantum-inspired simulated annealing policy neural network to navigate the vast discrete space of chemical structures intelligently. Specifically, we employ a deterministic REINFORCE algorithm using policy neural networks to output transitional probability to guide state transitions and local search using genetic algorithm to refine solutions to a local optimum within each iteration. Our methods are evaluated with the Practical Molecular Optimization (PMO) benchmark framework with a 10K query budget. We further showcase the competitive performance of our method by comparing it against the state-of-the-art genetic algorithms-based method.


Rotations of G\"odel algebras with modal operators

Flaminio, Tommaso, Godo, Lluis, Menchón, Paula, Rodriguez, Ricardo O.

arXiv.org Artificial Intelligence

The present paper is devoted to study the effect of connected and disconnected rotations of G\"odel algebras with operators grounded on directly indecomposable structures. The structures resulting from this construction we will present are nilpotent minimum (with or without negation fixpoint, depending on whether the rotation is connected or disconnected) with special modal operators defined on a directly indecomposable algebra. In this paper we will present a (quasi-)equational definition of these latter structures. Our main results show that directly indecomposable nilpotent minimum algebras (with or without negation fixpoint) with modal operators are fully characterized as connected and disconnected rotations of directly indecomposable G\"odel algebras endowed with modal operators.


Researchers build AI-driven sarcasm detector

The Guardian

Never mind that it can pass the bar exam, ace medical tests and read bedtime stories with emotion, artificial intelligence will never match the marvel of the human mind without first mastering the art of sarcasm. But that art, it seems, may be next on the list of the technology's dizzying capabilities. Researchers in the Netherlands have built an AI-driven sarcasm detector that can spot when the lowest form of wit, and the highest form of intelligence, is being deployed. "We are able to recognise sarcasm in a reliable way, and we're eager to grow that," said Matt Coler at the University of Groningen's speech technology lab. "We want to see how far we can push it."