contextualization
Towards Contextual Sensitive Data Detection
Telkamp, Liang, Hulsebos, Madelon
The emergence of open data portals necessitates more attention to protecting sensitive data before datasets get published and exchanged. While an abundance of methods for suppressing sensitive data exist, the conceptualization of sensitive data and methods to detect it, focus particularly on personal data that, if disclosed, may be harmful or violate privacy. We observe the need for refining and broadening our definitions of sensitive data, and argue that the sensitivity of data depends on its context. Based on this definition, we introduce two mechanisms for contextual sensitive data detection that consider the broader context of a dataset at hand. First, we introduce type contextualization, which first detects the semantic type of particular data values, then considers the overall context of the data values within the dataset or document. Second, we introduce domain contextualization which determines sensitivity of a given dataset in the broader context based on the retrieval of relevant rules from documents that specify data sensitivity (e.g., data topic and geographic origin). Experiments with these mechanisms, assisted by large language models (LLMs), confirm that: 1) type-contextualization significantly reduces the number of false positives for type-based sensitive data detection and reaches a recall of 94% compared to 63% with commercial tools, and 2) domain-contextualization leveraging sensitivity rule retrieval is effective for context-grounded sensitive data detection in non-standard data domains such as humanitarian datasets. Evaluation with humanitarian data experts also reveals that context-grounded LLM explanations provide useful guidance in manual data auditing processes, improving consistency. We open-source mechanisms and annotated datasets for contextual sensitive data detection at https://github.com/trl-lab/sensitive-data-detection.
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
ContextNav: Towards Agentic Multimodal In-Context Learning
Fu, Honghao, Ouyang, Yuan, Chang, Kai-Wei, Wang, Yiwei, Huang, Zi, Cai, Yujun
Recent advances demonstrate that multimodal large language models (MLLMs) exhibit strong multimodal in-context learning (ICL) capabilities, enabling them to adapt to novel vision-language tasks from a few contextual examples. However, existing ICL approaches face challenges in reconciling scalability with robustness across diverse tasks and noisy contextual examples: manually selecting examples produces clean contexts but is labor-intensive and task-specific, while similarity-based retrieval improves scalability but could introduce irrelevant or structurally inconsistent samples that degrade ICL performance. To address these limitations, we propose ContextNav, the first agentic framework that integrates the scalability of automated retrieval with the quality and adaptiveness of human-like curation, enabling noise-robust and dynamically optimized contextualization for multimodal ICL. ContextNav unifies context management and noise-robust contextualization within a closed-loop workflow driven by graph-based orchestration. Specifically, it builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval and structural alignment to construct noise-resilient contexts. An Operational Grammar Graph (OGG) further supports adaptive workflow planning and optimization, enabling the agent to refine its operational strategies based on downstream ICL feedback. Experimental results demonstrate that ContextNav achieves state-of-the-art performance across various datasets, underscoring the promise of agentic workflows for advancing scalable and robust contextualization in multimodal ICL.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Austria > Vienna (0.14)
- Oceania > Australia > Queensland (0.04)
- (2 more...)
- Workflow (1.00)
- Research Report > New Finding (0.48)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence
Meng, Yuqiao, Tang, Luoxi, Yu, Feiyang, Jia, Jinyuan, Yan, Guanhua, Yang, Ping, Xi, Zhaohan
Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats, wherein LLMs offer cyber threat intelligence (CTI) to support vulnerability assessment and incident response. While recent work has shown that LLMs can support a wide range of CTI tasks such as threat analysis, vulnerability detection, and intrusion defense, significant performance gaps persist in practical deployments. In this paper, we investigate the intrinsic vulnerabilities of LLMs in CTI, focusing on challenges that arise from the nature of the threat landscape itself rather than the model architecture. Using large-scale evaluations across multiple CTI benchmarks and real-world threat reports, we introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision to reliably analyze failure instances. Through extensive experiments and human inspections, we reveal three fundamental vulnerabilities: spurious correlations, contradictory knowledge, and constrained generalization, that limit LLMs in effectively supporting CTI. Subsequently, we provide actionable insights for designing more robust LLM-powered CTI systems to facilitate future research.
- North America > United States > New York > Broome County > Binghamton (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > Hawaii (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
ExBigBang: A Dynamic Approach for Explainable Persona Classification through Contextualized Hybrid Transformer Analysis
Afzoon, Saleh, Beheshti, Amin, Rezvani, Nabi, Khunjush, Farshad, Naseem, Usman, McMahon, John, Fathollahi, Zahra, Labani, Mahdieh, Mansoor, Wathiq, Zhang, Xuyun
In user-centric design, persona development plays a vital role in understanding user behaviour, capturing needs, segmenting audiences, and guiding design decisions. However, the growing complexity of user interactions calls for a more contextualized approach to ensure designs align with real user needs. While earlier studies have advanced persona classification by modelling user behaviour, capturing contextual information, especially by integrating textual and tabular data, remains a key challenge. These models also often lack explainability, leaving their predictions difficult to interpret or justify. To address these limitations, we present ExBigBang (Explainable BigBang), a hybrid text-tabular approach that uses transformer-based architectures to model rich contextual features for persona classification. ExBigBang incorporates metadata, domain knowledge, and user profiling to embed deeper context into predictions. Through a cyclical process of user profiling and classification, our approach dynamically updates to reflect evolving user behaviours. Experiments on a benchmark persona classification dataset demonstrate the robustness of our model. An ablation study confirms the benefits of combining text and tabular data, while Explainable AI techniques shed light on the rationale behind the model's predictions.
- North America (0.93)
- Asia > Middle East (0.93)
- Information Technology > Security & Privacy (0.93)
- Education (0.93)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)
HypKG: Hypergraph-based Knowledge Graph Contextualization for Precision Healthcare
Xie, Yuzhang, Han, Xu, Xu, Ran, Hu, Xiao, Lu, Jiaying, Yang, Carl
Knowledge graphs (KGs) are important products of the semantic web, which are widely used in various application domains. Healthcare is one of such domains where KGs are intensively used, due to the high requirement for knowledge accuracy and interconnected nature of healthcare data. However, KGs storing general factual information often lack the ability to account for important contexts of the knowledge such as the status of specific patients, which are crucial in precision healthcare. Meanwhile, electronic health records (EHRs) provide rich personal data, including various diagnoses and medications, which provide natural contexts for general KGs. In this paper, we propose HypKG, a framework that integrates patient information from EHRs into KGs to generate contextualized knowledge representations for accurate healthcare predictions. Using advanced entity-linking techniques, we connect relevant knowledge from general KGs with patient information from EHRs, and then utilize a hypergraph model to "contextualize" the knowledge with the patient information. Finally, we employ hypergraph transformers guided by downstream prediction tasks to jointly learn proper con-textualized representations for both KGs and patients, fully leveraging existing knowledge in KGs and patient contexts in EHRs. In experiments using a large biomedical KG and two real-world EHR datasets, HypKG demonstrates significant improvements in healthcare prediction tasks across multiple evaluation metrics. Additionally, by integrating external contexts, HypKG can learn to adjust the representations of entities and relations in KG, potentially improving the quality and real-world utility of knowledge.
DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition
Sudo, Yui, Fukumoto, Yosuke, Shakeel, Muhammad, Peng, Yifan, Lin, Chyi-Jiunn, Watanabe, Shinji
Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but with slow inference speed. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. Conditioning the encoder on dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces RTF by 81% with a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.
- Asia > Japan (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Effective Context in Neural Speech Models
Meng, Yen, Goldwater, Sharon, Tang, Hao
Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use. However, few have attempted to measure how much context these models actually use, i.e., the effective context. Here, we propose two approaches to measuring the effective context, and use them to analyze different speech Transformers. For supervised models, we find that the effective context correlates well with the nature of the task, with fundamental frequency tracking, phone classification, and word classification requiring increasing amounts of effective context. For self-supervised models, we find that effective context increases mainly in the early layers, and remains relatively short -- similar to the supervised phone model. Given that these models do not use a long context during prediction, we show that HuBERT can be run in streaming mode without modification to the architecture and without further fine-tuning.
Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
Bakalova, Aleksandra, Veitsman, Yana, Huang, Xinting, Hahn, Michael
In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a fewshot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual fewshot examples, which are contextualized by preceding examples through connections between fewshot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.
- Europe > Moldova (0.04)
- Europe > Germany > Saarland (0.04)
- Europe > Montenegro (0.04)
- (4 more...)
Regularization, Semi-supervision, and Supervision for a Plausible Attention-Based Explanation
Nguyen, Duc Hau, Mallart, Cyrielle, Gravier, Guillaume, Sébillot, Pascale
Attention mechanism is contributing to the majority of recent advances in machine learning for natural language processing. Additionally, it results in an attention map that shows the proportional influence of each input in its decision. Empirical studies postulate that attention maps can be provided as an explanation for model output. However, it is still questionable to ask whether this explanation helps regular people to understand and accept the model output (the plausibility of the explanation). Recent studies show that attention weights in the RNN encoders are hardly plausible because they spread on input tokens. We thus propose 3 additional constraints to the learning objective function to improve the plausibility of the attention map: regularization to increase the attention weight sparsity, semi-supervision to supervise the map by a heuristic and supervision by human annotation. Results show that all techniques can improve the attention map plausibility at some level. We also observe that specific instructions for human annotation might have a negative effect on classification performance. Beyond the attention map, the result of experiments on text classification tasks also shows that no matter how the constraint brings the gain, the contextualization layer plays a crucial role in finding the right space for finding plausible tokens.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > France > Brittany > Ille-et-Vilaine > Rennes (0.05)
A Metasemantic-Metapragmatic Framework for Taxonomizing Multimodal Communicative Alignment
Drawing on contemporary pragmatist philosophy and linguistic theories on cognition, meaning, and communication, this paper presents a dynamic, metasemantic-metapragmatic taxonomy for grounding and conceptualizing human-like multimodal communicative alignment. The framework is rooted in contemporary developments of the three basic communicative capacities initially identified by American logician and pragmatist philosopher Charles Sanders Peirce: iconic (sensory and perceptual qualities), indexical (contextual and sociocultural associations), and rule-like (symbolic and intuitive reasoning). Expanding on these developments, I introduce the concept of indexical contextualization and propose the principle of "contextualization directionality" for characterizing the crucial metapragmatic capacity for maintaining, navigating, or transitioning between semantic and pragmatic modes of multimodal communication. I contend that current cognitive-social computational and engineering methodologies disproportionately emphasize the semantic/metasemantic domain, overlooking the pivotal role of metapragmatic indexicality in traversing the semantic-pragmatic spectrum of communication. The framework's broader implications for intentionality, identity, affect, and ethics in within-modal and cross-modal human-machine alignment are also discussed.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
- (6 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)