Dual-branch Prompting for Multimodal Machine Translation
Wang, Jie, Yang, Zhendong, Zong, Liansong, Zhang, Xiaobo, Wang, Dexian, Zhang, Ji
Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
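The abstract does not give the exact form of the distributional alignment loss; a minimal sketch, assuming a symmetric KL divergence between the two branches' categorical output distributions (all function names here are illustrative, not from the paper):

```python
import math

def softmax(logits):
    # Convert raw logits into a probability distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two categorical distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def alignment_loss(logits_authentic, logits_reconstructed):
    # Symmetric KL keeps the two branch distributions close, so the
    # reconstructed-image branch used alone at inference behaves like
    # the authentic-image branch seen during training.
    p = softmax(logits_authentic)
    q = softmax(logits_reconstructed)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

In practice this term would be added, with a weighting coefficient, to the per-branch translation losses.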
Concise and Sufficient Sub-Sentence Citations for Retrieval-Augmented Generation
Chen, Guo, Li, Qiuyuan, Li, Qiuxian, Dai, Hongliang, Chen, Xiang, Li, Piji
In retrieval-augmented generation (RAG) question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. However, we observe two problems in the citations produced by existing attribution methods. First, the citations are typically provided at the sentence or even paragraph level. Long sentences or paragraphs may include a substantial amount of irrelevant content. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. In this paper, we propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output. To this end, we first develop annotation guidelines for such citations and construct a corresponding dataset. Then, we propose an attribution framework for generating citations that adhere to our standards. This framework leverages LLMs to automatically generate fine-tuning data for our task and employs a credit model to filter out low-quality examples. Our experiments on the constructed dataset demonstrate that the proposed approach can generate high-quality and more readable citations.
Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution
Li, Kun, Zhang, Tianhua, Li, Yunxiang, Luo, Hongyin, Moustafa, Abdalla, Wu, Xixin, Glass, James, Meng, Helen
Improving context faithfulness in large language models is essential for developing trustworthy retrieval augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene in LLMs only at inference, without addressing their inherent limitations, or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines both generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities to facilitate iterative self-evolution. This supports both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches that optimize at the holistic answer level, which may miss unfaithful details. Experiments on ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance for domain adaptation.
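The score-guided, sentence-level search described above can be sketched as a greedy loop in which the model proposes candidate next sentences and its own discriminative head picks the best one. This is a simplified illustration, not the paper's actual algorithm; `candidates_per_step` and `score_fn` are hypothetical stand-ins for the model's generation and self-scoring components:

```python
def score_guided_generate(question, candidates_per_step, score_fn, num_sentences=3):
    # Build the answer one sentence at a time: at each step, generate
    # candidate continuations, then keep the one the self-scoring head
    # judges most faithful to the retrieved context.
    answer = []
    for _ in range(num_sentences):
        candidates = candidates_per_step(question, answer)
        if not candidates:
            break  # generator has nothing further to add
        best = max(candidates, key=lambda s: score_fn(question, answer, s))
        answer.append(best)
    return " ".join(answer)
```

Because each sentence is scored independently, an unfaithful detail can be rejected without discarding an otherwise good answer prefix.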
Detection, Retrieval, and Explanation Unified: A Violence Detection System Based on Knowledge Graphs and GAT
Jiang, Wen-Dong, Chang, Chih-Yung, Roy, Diptendu Sinha
Recently, violence detection systems developed using unified multimodal models have achieved significant success and attracted widespread attention. However, most of these systems face two critical challenges: the lack of interpretability as black-box models and limited functionality, offering only classification or retrieval capabilities. To address these challenges, this paper proposes a novel interpretable violence detection system, termed the Three-in-One (TIO) System. The TIO system integrates knowledge graphs (KG) and graph attention networks (GAT) to provide three core functionalities: detection, retrieval, and explanation. Specifically, the system processes each video frame along with text descriptions generated by a large language model (LLM) for videos containing potential violent behavior. It employs ImageBind to generate high-dimensional embeddings for constructing a knowledge graph, uses GAT for reasoning, and applies lightweight time series modules to extract video embedding features. The final step connects a classifier and retriever for multi-functional outputs. The interpretability of KG enables the system to verify the reasoning process behind each output. Additionally, the paper introduces several lightweight methods to reduce the resource consumption of the TIO system and enhance its efficiency. Extensive experiments conducted on the XD-Violence and UCF-Crime datasets validate the effectiveness of the proposed system. A case study further reveals an intriguing phenomenon: as the number of bystanders increases, the occurrence of violent behavior tends to decrease.
Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems
Jiang, Wen-Dong, Chang, Chih-Yung, Chang, Hsiang-Chuan, Chen, Ji-Yuan, Roy, Diptendu Sinha
Weakly Supervised Monitoring Anomaly Detection (WSMAD) utilizes weak supervision learning to identify anomalies, a critical task for smart city monitoring. However, existing multimodal approaches often fail to meet the real-time and interpretability requirements of edge devices due to their complexity. This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning to enable efficient, accurate, and interpretable anomaly detection on edge devices. TCVADS operates in two stages: coarse-grained rapid classification and fine-grained detailed analysis. In the first stage, TCVADS extracts features from video frames and inputs them into a time series analysis module, which acts as the teacher model. Insights are then transferred via knowledge distillation to a simplified convolutional network (student model) for binary classification. Upon detecting an anomaly, the second stage is triggered, employing a fine-grained multi-class classification model. This stage uses CLIP for cross-modal contrastive learning with text and images, enhancing interpretability and achieving refined classification through specially designed triplet textual relationships. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability, offering valuable contributions to smart city monitoring applications.
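The teacher-to-student transfer described above follows the standard knowledge-distillation recipe: the student is trained to match the teacher's temperature-softened output distribution. A minimal sketch of that loss (the paper's exact formulation and hyperparameters are not given here, so the temperature value is illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperature flattens the
    # distribution and exposes the teacher's relative class preferences.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy of the student's softened distribution against the
    # teacher's softened distribution: minimized when the lightweight
    # student reproduces the teacher's soft predictions.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(teacher, student))
```

In a full training loop this term is typically mixed with an ordinary cross-entropy loss on the ground-truth labels.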
Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability
Jiang, Wen-Dong, Chang, Chih-Yung, Yen, Show-Jane, Roy, Diptendu Sinha
Thanks to the rapid advancement of computer hardware, deep learning has made significant progress in applications to unstructured data such as images (Cao & Chen, 2025) and text (Li et al., 2024). In particular, representation learning (Wang & Lian, 2025; Zhang et al., 2025) has gradually replaced earlier approaches that transformed unstructured data into structured formats. The key to the success of representation learning lies in leveraging a large number of parameters trained by backpropagation, enabling the model to adapt to data with non-normal distributions. Although models based on backpropagation neural networks (Yang et al., 2019; Banerjee et al., 2023) have achieved significant technical advancements, their application in many sensitive domains, such as medicine (Zhang et al., 2025) and industrial inspection (Rathee et al., 2021), still faces considerable challenges due to the difficulty of understanding the basis of their decision-making. Explainable Artificial Intelligence (XAI) aims to reveal the inner mechanisms of neural network decisions, thereby making these models more reliable for applications in sensitive domains. In recent years, several studies (Li et al., 2025; Jing et al., 2025; Liu et al., 2024; Guan et al., 2024) have focused on injecting explainability into deep learning models and using various visualization techniques to explain the decisions of these "black box" models. While these models have achieved a certain level of interpretability, two pressing issues remain (Huang & Marques, 2023; Huang & Marques, 2024): first, whether the correlations between different attributes are correctly evaluated, and second, whether the model's decision-making pathway truly aligns with human reasoning, even when the model's outputs appear consistent with user expectations.
From Explicit Rules to Implicit Reasoning in an Interpretable Violence Monitoring System
Jiang, Wen-Dong, Chang, Chih-Yung, Kuai, Ssu-Chi, Roy, Diptendu Sinha
Recently, research based on pre-trained models has demonstrated outstanding performance in violence surveillance tasks. However, most of them are black-box systems that face challenges regarding explainability during training and inference. An important question is how to incorporate explicit knowledge into these implicit models, thereby designing expert-driven and interpretable violence surveillance systems. This paper proposes a new paradigm for weakly supervised violence monitoring (WSVM) called Rule base Violence Monitoring (RuleVM). The proposed RuleVM uses a dual-branch structure with different designs for images and text. One branch, called the implicit branch, uses only visual features for coarse-grained binary classification. In this branch, image feature extraction is divided into two channels: one responsible for extracting scene frames and the other focusing on extracting actions. The other branch, called the explicit branch, utilizes language-image alignment to perform fine-grained classification. For the language channel in the explicit branch, RuleVM uses the state-of-the-art YOLO-World model to detect objects in video frames, and association rules are identified through data mining methods as descriptions of the video. Leveraging the dual-branch architecture, RuleVM achieves interpretable coarse-grained and fine-grained violence surveillance. Extensive experiments were conducted on two commonly used benchmarks, and the results show that RuleVM achieved the best performance in both coarse-grained and fine-grained monitoring, significantly outperforming existing state-of-the-art methods. Moreover, interpretability experiments uncovered some interesting rules, such as the observation that as the number of people increases, the risk level of violent behavior also rises.
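The association-rule mining step above can be illustrated with a classic support/confidence computation over per-frame object detections. This is a generic sketch, not the paper's miner: each "transaction" is the set of object labels detected in one frame, and a rule {A} → {B} is kept only if the pair is frequent and B usually co-occurs with A:

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.5, min_confidence=0.8):
    # transactions: list of sets of detected object labels, one per frame.
    n = len(transactions)

    def support(itemset):
        # Fraction of frames containing every label in the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = set().union(*transactions)
    rules = []
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in ((a, b), (b, a)):
            sup = support({lhs, rhs})
            if sup >= min_support and support({lhs}) > 0:
                conf = sup / support({lhs})
                if conf >= min_confidence:
                    rules.append((lhs, rhs, round(conf, 2)))
    return rules
```

Rules mined this way are human-readable by construction, which is what gives the explicit branch its interpretability.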
Deep Reinforcement Learning with Explicit Context Representation
Munguia-Galeano, Francisco, Tan, Ah-Hwee, Ji, Ze
Reinforcement learning (RL) has shown an outstanding capability for solving complex computational problems. However, most RL algorithms lack an explicit method that would allow learning from contextual information. Humans use context to identify patterns and relations among elements in the environment, along with how to avoid making wrong actions. On the other hand, what may seem like an obviously wrong decision from a human perspective could take hundreds of steps for an RL agent to learn to avoid. This paper proposes a framework for discrete environments called Iota explicit context representation (IECR). The framework involves representing each state using contextual key frames (CKFs), which can then be used to extract a function that represents the affordances of the state; in addition, two loss functions are introduced with respect to the affordances of the state. The novelty of the IECR framework lies in its capacity to extract contextual information from the environment and learn from the CKFs' representation. We validate the framework by developing four new algorithms that learn using context: Iota deep Q-network (IDQN), Iota double deep Q-network (IDDQN), Iota dueling deep Q-network (IDuDQN), and Iota dueling double deep Q-network (IDDDQN). Furthermore, we evaluate the framework and the new algorithms in five discrete environments. We show that all the algorithms, which use contextual information, converge in around 40,000 training steps of the neural networks, significantly outperforming their state-of-the-art equivalents.
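The affordance idea above, that contextually invalid actions should never be considered, can be sketched as a masked greedy action selection over Q-values. This is an illustrative simplification of how an affordance function might plug into the DQN family, not the IECR implementation itself:

```python
def masked_greedy_action(q_values, affordance_mask):
    # Affordances mark which actions are contextually valid in a state;
    # masking invalid actions spares the agent the thousands of steps it
    # would otherwise spend learning what a human rules out immediately.
    best_action, best_q = None, float("-inf")
    for action, (q, allowed) in enumerate(zip(q_values, affordance_mask)):
        if allowed and q > best_q:
            best_action, best_q = action, q
    return best_action
```

The same mask can also be applied when computing the target `max_a Q(s', a)`, so invalid actions never inflate the bootstrap target.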
AIDPS:Adaptive Intrusion Detection and Prevention System for Underwater Acoustic Sensor Networks
Das, Soumadeep, Pasikhani, Aryan Mohammadi, Gope, Prosanta, Clark, John A., Patel, Chintan, Sikdar, Biplab
Underwater Acoustic Sensor Networks (UW-ASNs) are predominantly used for underwater environments and find applications in many areas. However, a lack of security considerations, the unstable and challenging nature of the underwater environment, and the resource-constrained nature of the sensor nodes used for UW-ASNs (which makes them incapable of adopting security primitives) make the UW-ASN prone to vulnerabilities. This paper proposes an Adaptive decentralised Intrusion Detection and Prevention System called AIDPS for UW-ASNs. The proposed AIDPS can improve the security of the UW-ASNs so that they can efficiently detect underwater-related attacks (e.g., blackhole, grayhole and flooding attacks). To determine the most effective configuration of the proposed construction, we conduct a number of experiments using several state-of-the-art machine learning algorithms (e.g., Adaptive Random Forest (ARF), light gradient-boosting machine, and K-nearest neighbours) and concept drift detection algorithms (e.g., ADWIN, kdqTree, and Page-Hinkley). Our experimental results show that incremental ARF using ADWIN provides optimal performance when implemented with One-class support vector machine (SVM) anomaly-based detectors. Furthermore, our extensive evaluation results also show that the proposed scheme outperforms state-of-the-art benchmarking methods while providing a wider range of desirable features such as scalability and complexity.
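Of the drift detectors the abstract compares, Page-Hinkley is simple enough to sketch in full: it flags a sustained upward shift in a stream's mean once the cumulative deviation from the running mean exceeds a threshold. A minimal version (parameter values illustrative, not from the paper):

```python
class PageHinkley:
    # Minimal Page-Hinkley test for detecting an increase in the mean
    # of a data stream, as used for concept drift detection.
    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerance around the running mean
        self.threshold = threshold  # detection threshold (lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0              # cumulative deviation m_t
        self.cum_min = 0.0          # running minimum M_t

    def update(self, x):
        # Returns True when drift is detected at this observation.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold
```

In a system like the one described, a detected drift would trigger retraining or adaptation of the incremental classifier on the new data distribution.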
Deep Learning for Diverse Data Types Steganalysis: A Review
Kheddar, Hamza, Hemis, Mustapha, Himeur, Yassine, Megías, David, Amira, Abbes
Steganography and steganalysis are two interrelated aspects of the field of information security. Steganography seeks to conceal communications, whereas steganalysis aims to detect them or, where possible, recover the data they contain. Steganography and steganalysis have attracted a great deal of interest, particularly from law enforcement: steganography is often used by cybercriminals and even terrorists to avoid being caught in possession of incriminating evidence, even in encrypted form, since cryptography is prohibited or restricted in many countries. Therefore, knowledge of cutting-edge techniques to uncover concealed information is crucial in exposing illegal acts. Over the last few years, a number of strong and reliable steganography and steganalysis techniques have been introduced in the literature. This review paper provides a comprehensive overview of deep learning-based steganalysis techniques used to detect hidden information within digital media. The paper covers all types of cover media in steganalysis, including image, audio, and video, and discusses the most commonly used deep learning techniques. In addition, the paper explores the use of more advanced deep learning techniques, such as deep transfer learning (DTL) and deep reinforcement learning (DRL), to enhance the performance of steganalysis systems. The paper provides a systematic review of recent research in the field, including the data sets and evaluation metrics used in recent studies. It also presents a detailed analysis of DTL-based steganalysis approaches and their performance on different data sets. The review concludes with a discussion of the current state of deep learning-based steganalysis, challenges, and future research directions.