CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
Zhang, Ruoxuan, Wen, Bin, Xie, Hongxia, Yao, Yi, Zuo, Songhan, Jiang-Lin, Jian-Yu, Shuai, Hong-Han, Cheng, Wen-Huang
Cooking is a sequential and visually grounded activity, where each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle with structured multi-step scenarios such as recipe illustration. Moreover, current recipe illustration methods cannot adapt to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything outperforms existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for applications in instructional media and procedural content creation.
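The abstract gives no implementation details, but the Flexible RoPE idea admits a compact illustration: give every token a positional offset determined by its step index, so one positional axis covers recipes of any number of steps. The sketch below is a minimal stand-in assuming a standard rotary embedding and an invented per-step stride; none of the names come from the paper.

```python
# A minimal sketch (assumed, not the authors' code) of a step-aware rotary
# encoding: tokens get a positional offset proportional to their step index,
# so one positional axis covers recipes of any length. "step_stride" and all
# function names are invented for illustration.
import torch

def rotary_angles(positions: torch.Tensor, dim: int, base: float = 10000.0):
    """Standard RoPE angles for integer positions; returns (n, dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (n, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def flexible_rope(x, token_pos, step_ids, step_stride=64):
    """Offset each token's position by its step index, so steps stay
    spatially distinct while tokens within a step remain close."""
    positions = token_pos + step_ids * step_stride
    return apply_rope(x, rotary_angles(positions, x.shape[-1]))

# toy usage: 2 recipe steps, 4 tokens each, 8-dim features
x = torch.randn(8, 8)
y = flexible_rope(x, token_pos=torch.arange(8) % 4, step_ids=torch.arange(8) // 4)
```

Because the offset is computed from the step index rather than a fixed sequence template, the same encoding covers a three-step recipe or a thirteen-step one, which is the flexibility the abstract emphasizes.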
Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
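As a point of reference for the architectures the survey categorizes, a retrieval-augmented QA pipeline reduces to three stages: embed and retrieve multimedia items, fuse them with the query, and generate an answer. The sketch below is generic, assuming placeholder embeddings and a caption-based late-fusion prompt; it is not drawn from any specific surveyed system.

```python
# A generic retrieve-fuse-answer baseline. Real systems plug CLIP-style
# encoders and an LLM into the placeholders used here.
import numpy as np

def top_k_retrieve(query_vec, item_vecs, k=3):
    """Cosine-similarity retrieval over a bank of multimedia embeddings
    (images, audio clips, video segments) projected to a shared space."""
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

def build_prompt(question, captions, idx):
    """Late-fusion baseline: verbalize retrieved items into the QA prompt."""
    context = "\n".join(f"- {captions[i]}" for i in idx)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# toy usage with random 512-d embeddings and caption stand-ins
rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 512))
captions = [f"media item {i}" for i in range(100)]
idx, _ = top_k_retrieve(rng.normal(size=512), bank, k=3)
print(build_prompt("What instrument is playing?", captions, idx))
```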
Teaching AI to Feel: A Collaborative, Full-Body Exploration of Emotive Communication
Tütüncü, Esen K., Lemus, Lissette, Pilcher, Kris, Sprengel, Holger, Sabater-Mir, Jordi
Commonaiverse is an interactive installation that explores human emotions through full-body motion tracking and real-time AI feedback. Participants engage in three phases (Teaching, Exploration, and the Cosmos Phase), collaboratively expressing and interpreting emotions with the system. The installation integrates MoveNet for precise motion tracking and a multi-recommender AI system that analyzes emotional states dynamically, responding with adaptive audiovisual outputs. By shifting from top-down emotion classification to participant-driven, culturally diverse definitions, we highlight new pathways for inclusive, ethical affective computing. We discuss how this collaborative, out-of-the-box approach pushes multimedia research beyond single-user facial analysis toward a more embodied, co-created paradigm of emotional AI. Finally, we reflect on how this reimagined framework fosters user agency, reduces bias, and opens avenues for advanced interactive applications.
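The abstract does not specify how taught emotions are represented. Purely as illustration, the sketch below assumes MoveNet-style keypoints are reduced to a few geometric features and later poses are scored against participant-defined exemplars; all features, labels, and numbers are invented.

```python
# An illustrative stand-in (not the installation's code) for the mapping the
# Teaching phase implies: participants label their own poses, and the system
# scores later poses against those participant-defined exemplars.
import numpy as np

def pose_features(keypoints):
    """keypoints: (17, 2) MoveNet-style (x, y); crude expansion/height cues."""
    spread = keypoints[:, 0].max() - keypoints[:, 0].min()
    height = keypoints[:, 1].max() - keypoints[:, 1].min()
    center = keypoints.mean(axis=0)
    return np.array([spread, height, *center])

def score_emotions(keypoints, taught):
    """Nearest-exemplar scoring against emotions taught by the participant."""
    f = pose_features(keypoints)
    return {label: -np.linalg.norm(f - ex) for label, ex in taught.items()}

# toy usage: two participant-taught exemplars, one new pose
taught = {"joy": np.array([1.8, 1.7, 0.5, 0.9]),
          "calm": np.array([0.6, 1.6, 0.5, 0.8])}
pose = np.random.default_rng(1).uniform(0, 2, size=(17, 2))
print(score_emotions(pose, taught))
```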
Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance
Fan, Hongxing, Wang, Lipeng, Chen, Haohua, Huang, Zehuan, Wu, Jiangtao, Sheng, Lu
Amodal completion, the task of generating the invisible parts of occluded objects, is vital for applications such as image editing and AR. Prior methods struggle with data requirements, limited generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine the necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate that our framework achieves state-of-the-art visual quality.
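The control flow the abstract describes (reason about occlusion, boundary expansion, and semantics up front, then synthesize once) can be sketched with trivial stand-in agents. Everything below is illustrative: real agents would be MLLM calls and the inpainter a Diffusion Transformer, whereas here they are stubs that only show the single-pass structure.

```python
# A hedged structural sketch of the upfront-reasoning pipeline. The "agents"
# are trivial stand-ins; only the reason-once-then-synthesize flow is the point.
import numpy as np

def occlusion_agent(mask: np.ndarray) -> np.ndarray:
    """Stand-in: grow the visible mask toward plausibly occluded pixels."""
    grown = mask.copy()
    for shift in (1, -1):
        grown |= np.roll(mask, shift, axis=0) | np.roll(mask, shift, axis=1)
    return grown

def boundary_agent(mask: np.ndarray, pad: int = 8):
    """Stand-in: expand the bounding box so the completion has room."""
    ys, xs = np.nonzero(mask)
    return (max(ys.min() - pad, 0), max(xs.min() - pad, 0),
            ys.max() + pad, xs.max() + pad)

def describer_agent(label: str) -> str:
    """Stand-in for the agent writing fine-grained semantic guidance."""
    return f"a complete {label}, no occluding objects, plain background"

def complete(image, visible_mask, label, inpaint):
    inpaint_mask = occlusion_agent(visible_mask) & ~visible_mask
    box = boundary_agent(inpaint_mask)
    prompt = describer_agent(label)  # steers the model away from occluders
    return inpaint(image, inpaint_mask, box, prompt)  # layered RGBA out

# toy run with a stub inpainter
mask = np.zeros((32, 32), dtype=bool); mask[10:20, 10:20] = True
out = complete(np.zeros((32, 32, 3)), mask, "mug",
               lambda img, m, b, p: {"mask": m, "box": b, "prompt": p})
```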
Pre-Forgettable Models: Prompt Learning as a Native Mechanism for Unlearning
Hendrix, Rutger, Patanè, Giovanni, Russo, Leonardo G., Carnemolla, Simone, Bellitto, Giovanni, Salanitri, Federica Proietto, Spampinato, Concetto, Pennisi, Matteo
Foundation models have transformed multimedia analysis by enabling robust and transferable representations across diverse modalities and tasks. However, their static deployment conflicts with growing societal and regulatory demands -- particularly the need to unlearn specific data upon request, as mandated by privacy frameworks such as the GDPR. Traditional unlearning approaches, including retraining, activation editing, or distillation, are often computationally expensive, fragile, and ill-suited for real-time or continuously evolving systems. In this paper, we propose a paradigm shift: rethinking unlearning not as a retroactive intervention but as a built-in capability. We introduce a prompt-based learning framework that unifies knowledge acquisition and removal within a single training phase. Rather than encoding information in model weights, our approach binds class-level semantics to dedicated prompt tokens. This design enables instant unlearning simply by removing the corresponding prompt -- without retraining, model modification, or access to original data. Experiments demonstrate that our framework preserves predictive performance on retained classes while effectively erasing forgotten ones. Beyond utility, our method exhibits strong privacy and security guarantees: it is resistant to membership inference attacks, and prompt removal prevents any residual knowledge extraction, even under adversarial conditions. This ensures compliance with data protection principles and safeguards against unauthorized access to forgotten information, making the framework suitable for deployment in sensitive and regulated environments. Overall, by embedding removability into the architecture itself, this work establishes a new foundation for designing modular, scalable and ethically responsive AI models.
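The central mechanism, binding class-level knowledge to dedicated prompt tokens so that removing a token is the unlearning operation, can be sketched in a few lines. The scoring rule below (cosine similarity between frozen features and a per-class prompt vector) is an assumption for illustration, not the paper's head; the point is that forgetting touches no shared weights.

```python
# A minimal sketch, under assumptions, of prompt-based unlearning: class
# knowledge lives in per-class prompt vectors, not shared weights, so
# deleting a prompt is the whole unlearning step.
import torch

class PromptPoolClassifier:
    def __init__(self):
        self.prompts: dict[str, torch.Tensor] = {}  # class -> prompt vector

    def learn_class(self, name: str, feats: torch.Tensor):
        """Bind a class to a prompt: here, the mean feature of its examples."""
        self.prompts[name] = feats.mean(dim=0)

    def forget_class(self, name: str):
        """Instant unlearning: drop the prompt; no retraining, no weight edit."""
        self.prompts.pop(name, None)

    def predict(self, feat: torch.Tensor) -> str:
        names = list(self.prompts)
        scores = torch.stack([torch.cosine_similarity(feat, self.prompts[n], dim=0)
                              for n in names])
        return names[int(scores.argmax())]

clf = PromptPoolClassifier()
clf.learn_class("cat", torch.randn(8, 16) + 1.0)
clf.learn_class("dog", torch.randn(8, 16) - 1.0)
print(clf.predict(torch.ones(16)))   # likely "cat"
clf.forget_class("cat")              # "cat" leaves no trace in shared state
print(clf.predict(torch.ones(16)))   # falls back to remaining classes
```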
TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
Serving large language models (LLMs) efficiently remains challenging due to the high memory and latency overhead of key-value (KV) cache access during autoregressive decoding. We present TinyServe, a lightweight and extensible serving system for deploying tiny LLMs (e.g., TinyLLaMA, GPT2-345M) with support for structured KV sparsity, plugin-based token selection, and hardware-efficient attention kernels. Unlike prior simulation frameworks, TinyServe executes real-time decoding with configurable sparsity strategies and fine-grained instrumentation. To reduce decoding cost, we introduce a query-aware page selection mechanism that leverages bounding-box metadata to estimate attention relevance between the query and KV cache blocks. This enables selective KV loading with minimal overhead and no model modifications. Our fused CUDA kernel integrates page scoring, sparse memory access, and masked attention in a single pass. Experiments show that TinyServe achieves up to 3.4x speedup and over 2x memory savings with negligible accuracy drop. Additional analysis of cache reuse, page hit rate, and multi-GPU scaling confirms its practicality as an efficient system-level design for LLM training and inference research on resource-constrained hardware.
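The query-aware page selection mechanism is reminiscent of bounds-based KV pruning: if each page stores the per-dimension min and max of its keys, then sum_i max(q_i * kmin_i, q_i * kmax_i) upper-bounds any query-key dot product within the page, and only the highest-bound pages need to be fetched. The sketch below illustrates that scoring rule in PyTorch; the metadata layout and top-k policy are assumptions, and TinyServe fuses the equivalent logic into a single CUDA kernel.

```python
# A sketch of query-aware page selection under assumed QUEST-style bounds.
# Names and the top-k policy are illustrative, not TinyServe's API.
import torch

def page_metadata(keys: torch.Tensor, page_size: int):
    """keys: (seq, d) -> per-page (n_pages, d) min and max bounds."""
    pages = keys[: keys.shape[0] // page_size * page_size]
    pages = pages.view(-1, page_size, keys.shape[1])
    return pages.min(dim=1).values, pages.max(dim=1).values

def select_pages(query, kmin, kmax, top_k):
    """Score each page by its attainable dot-product upper bound."""
    bound = torch.maximum(query * kmin, query * kmax).sum(dim=-1)
    return bound.topk(top_k).indices

# toy usage: 8 pages of 16 keys; load only the 2 most promising pages
keys = torch.randn(128, 64)
kmin, kmax = page_metadata(keys, page_size=16)
pages = select_pages(torch.randn(64), kmin, kmax, top_k=2)
print(pages)  # indices of KV pages worth fetching for this query
```

The bound is exact in the sense that no key in a skipped page can score higher than the page's bound, so pruning errs only by loading slightly more than needed.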
SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
Liu, Jiacheng, Zou, Chang, Lyu, Yuanhuiyi, Ren, Fei, Wang, Shaobo, Li, Kaixin, Zhang, Linfeng
Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies that prevent parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel 'forecast-then-verify' acceleration framework that effectively addresses both limitations. SpeCa's core innovation lies in introducing speculative sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation to simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34x acceleration on FLUX with minimal quality degradation (a 5.5% drop), a 7.3x speedup on DiT while preserving generation fidelity, and a 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference cost), establishing a new paradigm for efficient diffusion model inference that maintains generation quality even at aggressive acceleration ratios. Our code has been released on GitHub: https://github.com/Shenyi-Z/Cache4Diffusion
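The forecast-then-verify loop can be illustrated with a deliberately simple stand-in: extrapolate the next timestep's features from two computed references, and fall back to a full forward pass when a parameter-free plausibility check rejects the shortcut. The predictor and the smoothness test below are assumptions, not SpeCa's actual forecaster or verifier.

```python
# A hedged sketch of a forecast-then-verify decoding loop. The first-order
# forecast and the relative-step-size test are stand-ins for illustration.
import torch

def forecast(prev, curr):
    """First-order extrapolation of the next timestep's features."""
    return curr + (curr - prev)

def verify(prev, curr, tol=0.1):
    """Parameter-free check (a stand-in): trust the forecast only when
    features are evolving smoothly, i.e. the last observed step was small
    relative to the current feature magnitude."""
    rel = (curr - prev).norm() / (curr.norm() + 1e-8)
    return rel.item() < tol

def run(model, x, n_steps):
    prev = curr = None
    full_calls = 0
    for t in range(n_steps):
        if prev is not None and verify(prev, curr):
            feat = forecast(prev, curr)      # skip the expensive forward pass
        else:
            feat = model(x, t)               # fully computed reference timestep
            full_calls += 1
        prev, curr = curr, feat
        x = x - 0.1 * feat                   # stand-in denoising update
    return x, full_calls

# toy usage with a dummy "model" whose features settle over time
model = lambda x, t: torch.tanh(x) / (t + 1.0)
x, calls = run(model, torch.randn(4), n_steps=20)
print(calls, "full passes out of 20 steps")
```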
ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding
Wang, Shuai, Najdenkoska, Ivona, Zhu, Hongyi, Rudinac, Stevan, Kackovic, Monika, Wijnberg, Nachoem, Worring, Marcel
Understanding visual art requires reasoning across multiple perspectives -- cultural, historical, and stylistic -- beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.
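The retrieval half of this design is straightforward to sketch: match entities mentioned in a query against the knowledge graph, expand a k-hop neighborhood, and serialize the resulting subgraph into the generation prompt. The toy graph and substring-matching rule below are invented for illustration; ArtRAG's ACKG is built automatically from domain texts and its retriever is multi-granular.

```python
# A small sketch of subgraph retrieval over an art knowledge graph.
# Graph contents and the matching rule are illustrative only.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("The Starry Night", "Vincent van Gogh", relation="painted_by")
kg.add_edge("Vincent van Gogh", "Post-Impressionism", relation="member_of")
kg.add_edge("Post-Impressionism", "Impressionism", relation="reacts_to")

def retrieve_subgraph(graph, mentions, hops=1):
    """Match mentioned entities, then take their k-hop neighborhood."""
    seeds = [n for n in graph if n.lower() in mentions.lower()]
    nodes = set(seeds)
    for _ in range(hops):
        for n in list(nodes):
            nodes |= set(graph.successors(n)) | set(graph.predecessors(n))
    return graph.subgraph(nodes)

def serialize(sub):
    """Flatten the subgraph into triples for the generation prompt."""
    return "; ".join(f"{u} --{d['relation']}--> {v}"
                     for u, v, d in sub.edges(data=True))

sub = retrieve_subgraph(kg, "The Starry Night by Vincent van Gogh")
print(serialize(sub))  # context handed to the MLLM alongside the image
```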
PG-Agent: An Agent Powered by Page Graph
Chen, Weizhi, Wang, Ziwei, Yang, Leyang, Zhou, Sheng, Tang, Xiaoxuan, Bu, Jiajun, Li, Yong, Jiang, Wei
Graphical User Interface (GUI) agents possess significant commercial and social value, and GUI agents powered by advanced multimodal large language models (MLLMs) have demonstrated remarkable potential. Existing GUI agents usually rely on sequential episodes of multi-step operations across pages as their prior GUI knowledge, which fails to capture the complex transition relationships between pages, making it difficult for agents to deeply perceive the GUI environment and generalize to new scenarios. We therefore design an automated pipeline that transforms sequential episodes into page graphs, which explicitly model the graph structure of pages naturally connected by actions. To fully exploit the page graphs, we further introduce Retrieval-Augmented Generation (RAG) to retrieve reliable GUI perception guidelines from them, and we propose PG-Agent, a tailored multi-agent framework with a task decomposition strategy that is injected with these guidelines so that it can generalize to unseen scenarios. Extensive experiments on various benchmarks demonstrate the effectiveness of PG-Agent, even with limited episodes for page graph construction.
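The page-graph construction itself is simple to illustrate: every (page, action, next page) step observed in an episode becomes a directed edge, and the agent's retrieval step becomes a lookup of known transitions from the current page. The sketch below uses a plain adjacency map and invented page names; the actual pipeline and guideline format are the paper's.

```python
# A hedged sketch: sequential GUI episodes -> page graph -> per-page
# perception guidelines. Data shapes and strings are illustrative.
from collections import defaultdict

def build_page_graph(episodes):
    """episodes: lists of (page, action, next_page) steps -> adjacency map."""
    graph = defaultdict(set)
    for episode in episodes:
        for page, action, nxt in episode:
            graph[page].add((action, nxt))
    return graph

def guidelines_for(graph, current_page):
    """RAG-style lookup: which actions are known to lead where from here?"""
    return [f"'{a}' leads to {p}" for a, p in sorted(graph[current_page])]

episodes = [
    [("Home", "open settings", "Settings"), ("Settings", "tap Wi-Fi", "Wi-Fi")],
    [("Home", "open search", "Search"), ("Search", "back", "Home")],
]
graph = build_page_graph(episodes)
print(guidelines_for(graph, "Home"))
# -> ["'open search' leads to Search", "'open settings' leads to Settings"]
```

Note how two separate episodes merge at the shared "Home" node, which is exactly the transition structure a purely sequential episode memory cannot represent.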
Enhancing Diffusion Model Stability for Image Restoration via Gradient Management
Wu, Hongjie, Zhang, Mingqin, He, Linchao, Zhou, Ji-Zhe, Lv, Jiancheng
Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. Code is available at https://github.com/74587887/SPGD.
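Both gradient-management components lend themselves to a compact sketch: a warm-up weight that phases the likelihood gradient in over the trajectory, and exponential momentum that smooths the likelihood gradient's direction. The schedule, momentum form, and the stub prior step below are assumptions for illustration, not SPGD's exact formulation.

```python
# A sketch of likelihood warm-up plus directional momentum in a
# posterior-sampling-style loop. Schedules and the prior step are stand-ins.
import torch

def warmup_weight(step, n_steps, warmup_frac=0.3):
    """Progressive likelihood warm-up: ramps 0 -> 1 over the first fraction
    of the trajectory to avoid early prior/likelihood gradient conflicts."""
    ramp = int(n_steps * warmup_frac)
    return min(1.0, step / max(ramp, 1))

def spgd_restore(prior_step, likelihood_grad, x, n_steps, lr=0.1, beta=0.9):
    m = torch.zeros_like(x)                     # directional momentum buffer
    for t in range(n_steps):
        x = prior_step(x, t)                    # denoising / prior update
        g = likelihood_grad(x)
        m = beta * m + (1 - beta) * g           # smooth the likelihood direction
        x = x - lr * warmup_weight(t, n_steps) * m
    return x

# toy usage: quadratic data-fit "likelihood", shrinkage "prior"
y = torch.ones(4)
out = spgd_restore(lambda x, t: 0.95 * x, lambda x: 2 * (x - y),
                   torch.randn(4), n_steps=50)
print(out)  # pulled toward y once the likelihood term is warmed up and smoothed
```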