Pavlovic, Vladimir
EAZY: Eliminating Hallucinations in LVLMs by Zeroing out Hallucinatory Image Tokens
Che, Liwei, Liu, Tony Qingze, Jia, Jing, Qin, Weiyi, Tang, Ruixiang, Pavlovic, Vladimir
Despite their remarkable potential, Large Vision-Language Models (LVLMs) still face challenges with object hallucination, a problem where their generated outputs mistakenly incorporate objects that do not actually exist. Although most works focus on addressing this issue within the language-model backbone, our work shifts the focus to the image input source, investigating how specific image tokens contribute to hallucinations. Our analysis reveals a striking finding: a small subset of image tokens with high attention scores are the primary drivers of object hallucination. Removing these hallucinatory image tokens (only 1.5% of all image tokens) effectively mitigates the issue, and this finding holds consistently across different models and datasets. Building on this insight, we introduce EAZY, a novel, training-free method that automatically identifies and Eliminates hAllucinations by Zeroing out hallucinatorY image tokens. We utilize EAZY for unsupervised object hallucination detection, achieving a 15% improvement over previous methods. Additionally, EAZY demonstrates remarkable effectiveness in mitigating hallucinations while preserving model utility and seamlessly adapting to various LVLM architectures.
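To make the mechanism above concrete, the minimal sketch below zeroes out the most-attended image tokens before re-decoding; the tensor shapes, the attention source, and the 1.5% ratio are assumptions taken from the abstract for illustration, not the authors' implementation.

```python
# Illustrative only: shapes, the attention source, and the 1.5% ratio are
# assumptions based on the abstract, not the authors' implementation.
import torch

def zero_hallucinatory_image_tokens(image_tokens: torch.Tensor,
                                    attn_to_image: torch.Tensor,
                                    drop_ratio: float = 0.015) -> torch.Tensor:
    """image_tokens: (num_img_tokens, dim) visual embeddings fed to the LVLM.
    attn_to_image: (num_img_tokens,) attention mass that a suspected hallucinated
    object token places on each image token (e.g., averaged over heads/layers)."""
    num_tokens = image_tokens.shape[0]
    k = max(1, round(drop_ratio * num_tokens))        # roughly 1.5% of image tokens
    top_idx = torch.topk(attn_to_image, k).indices    # highest-attention image tokens
    cleaned = image_tokens.clone()
    cleaned[top_idx] = 0.0                            # zero out the suspected drivers
    return cleaned

# Toy usage: 576 image tokens with random attention scores.
tokens = torch.randn(576, 1024)
attn = torch.rand(576).softmax(dim=0)
cleaned = zero_hallucinatory_image_tokens(tokens, attn)
```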
CASIM: Composite Aware Semantic Injection for Text to Motion Generation
Chang, Che-Jui, Liu, Qingze Tony, Zhou, Honglu, Pavlovic, Vladimir, Kapadia, Mubbasir
Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, primarily relying on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model- and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on the HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.
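As a rough illustration of the contrast with fixed-length injection, the sketch below conditions motion tokens on word-level text embeddings through a single cross-attention layer; the dimensions and the residual design are assumptions, not CASIM's actual encoder or aligner.

```python
# Illustrative only: a single cross-attention layer standing in for a
# composite-aware injection mechanism; dimensions are assumed.
import torch
import torch.nn as nn

class TextMotionAligner(nn.Module):
    def __init__(self, motion_dim: int = 256, text_dim: int = 512, heads: int = 4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, motion_dim)
        self.cross_attn = nn.MultiheadAttention(motion_dim, heads, batch_first=True)

    def forward(self, motion_tokens, text_tokens, text_mask=None):
        # motion_tokens: (B, T_motion, motion_dim); text_tokens: (B, T_text, text_dim)
        text = self.text_proj(text_tokens)
        injected, _ = self.cross_attn(query=motion_tokens, key=text, value=text,
                                      key_padding_mask=text_mask)
        return motion_tokens + injected   # residual semantic injection per motion token

# Toy usage: word-level text features instead of one pooled CLIP vector.
aligner = TextMotionAligner()
motion = torch.randn(2, 60, 256)          # 60 motion tokens
text = torch.randn(2, 12, 512)            # 12 word-level embeddings
out = aligner(motion, text)
```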
Physics-Based Dynamic Models Hybridisation Using Physics-Informed Neural Networks
Lalic, Branislava, Cuong, Dinh Viet, Petric, Mina, Pavlovic, Vladimir, Sremac, Ana Firanj, Roantree, Mark
Physics-based dynamic models (PBDMs) are simplified representations of complex dynamical systems. A PBDM isolates specific processes within a complex system and assigns a fragment of variables and an accompanying set of parameters to describe each process. As this often leads to suboptimal parameterisation of the system, a key challenge is to refine the empirical parameters and variables so as to reduce uncertainties while maintaining the model's explainability and enhancing its predictive accuracy. We demonstrate that a hybrid mosquito population dynamics model, which integrates a PBDM with Physics-Informed Neural Networks (PINNs), retains the explainability of the PBDM while replacing its empirical parameterisations with PINN-learned counterparts. Specifically, we address the limitations of traditional PBDMs by modelling the larva and pupa development-rate parameters with a PINN that encodes complex, learned interactions of air temperature, precipitation and humidity. Our results demonstrate improved mosquito population simulations, including the difficult-to-predict mosquito population peaks. This opens the possibility of applying the hybridisation concept to other PBDM-based complex systems, such as cancer growth, where data are scarce and noisy, and to numerical weather prediction and climate modelling, where it could help bridge the gap between physics-based and data-driven models. Keywords: hybridisation, physics-based dynamic models, physics-informed neural networks (PINN), hybrid dynamic model, mosquito population modelling
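The hedged sketch below illustrates the hybridisation idea on a toy larval compartment: an empirical development-rate parameterisation is replaced by a small network of weather drivers and trained with a physics-informed residual loss. The one-equation model and all constants are illustrative assumptions, not the paper's mosquito model.

```python
# Illustrative only: a toy one-compartment larval equation with assumed constants,
# not the paper's mosquito population model.
import torch
import torch.nn as nn

class DevelopmentRateNet(nn.Module):
    """Learned replacement for an empirical larva development-rate curve."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 32), nn.Tanh(),
                                 nn.Linear(32, 1), nn.Softplus())

    def forward(self, temp, precip, humidity):
        x = torch.stack([temp, precip, humidity], dim=-1)
        return self.net(x).squeeze(-1)                   # non-negative rate per day

def physics_residual(L, t, weather, rate_net, mortality=0.1):
    """Residual of dL/dt = -(rate + mortality) * L for a larval pool L(t)."""
    dL_dt = torch.autograd.grad(L.sum(), t, create_graph=True)[0].squeeze(-1)
    return dL_dt + (rate_net(*weather) + mortality) * L

rate_net = DevelopmentRateNet()
larva_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1), nn.Softplus())

t = torch.linspace(0.0, 30.0, 100, requires_grad=True).unsqueeze(-1)   # days
temp = 20.0 + 5.0 * torch.sin(torch.linspace(0.0, 30.0, 100) / 5.0)    # assumed drivers
precip, humid = torch.rand(100), 0.7 * torch.ones(100)
L = larva_net(t).squeeze(-1)
residual = physics_residual(L, t, (temp, precip, humid), rate_net)
loss = (residual ** 2).mean()    # in practice, add a data term on observed counts
loss.backward()
```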
TrajDiffuse: A Conditional Diffusion Model for Environment-Aware Trajectory Prediction
Liu, Qingze, Li, Danrui, Sohn, Samuel S., Yoon, Sejong, Kapadia, Mubbasir, Pavlovic, Vladimir
Accurately predicting human or vehicle trajectories with good diversity that captures their stochastic nature is an essential task for many applications. However, many trajectory prediction models focus on improving diversity or accuracy while producing unreasonable trajectory samples that neglect other key requirements, such as collision avoidance with the surrounding environment. In this work, we propose TrajDiffuse, a planning-based trajectory prediction method using a novel guided conditional diffusion model. We formulate the trajectory prediction problem as a denoising inpainting task and design a map-based guidance term for the diffusion process. TrajDiffuse generates trajectory predictions that match or exceed the accuracy and diversity of the SOTA while adhering almost perfectly to environmental constraints. We demonstrate the utility of our model through experiments on the nuScenes and PFSD datasets and provide an extensive benchmark analysis against the SOTA methods.
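A schematic sketch of the two ingredients named in the abstract, denoising inpainting and map-based guidance, is given below; the dummy denoiser, the bilinear occupancy lookup, and the guidance weight are illustrative assumptions rather than TrajDiffuse's exact formulation.

```python
# Illustrative only: a dummy denoiser and a heavily simplified reverse step,
# not TrajDiffuse's exact sampler.
import torch
import torch.nn.functional as F

def map_cost(traj_xy, occupancy):
    """Differentiable obstacle cost via bilinear lookup of an occupancy grid.
    traj_xy: (T, 2) in [-1, 1] map coordinates; occupancy: (H, W), 1 = obstacle."""
    grid = traj_xy.view(1, 1, -1, 2)                      # (1, 1, T, 2)
    occ = occupancy.view(1, 1, *occupancy.shape)          # (1, 1, H, W)
    return F.grid_sample(occ, grid, align_corners=True).sum()

def guided_inpaint_step(x_t, t, denoiser, observed, obs_mask, occupancy, guide_w=1.0):
    """One simplified reverse step: denoise, push off obstacles, re-impose observations."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                             # predicted clean trajectory
    grad = torch.autograd.grad(map_cost(x0_hat, occupancy), x_t)[0]
    x_prev = x0_hat - guide_w * grad                      # map-based guidance term
    return torch.where(obs_mask, observed, x_prev).detach()   # inpaint observed waypoints

# Toy usage: 20 waypoints, the first 8 observed, one square obstacle.
T = 20
dummy_denoiser = lambda x, t: 0.9 * x                     # stand-in for a trained model
occupancy = torch.zeros(64, 64)
occupancy[20:40, 20:40] = 1.0
observed = torch.zeros(T, 2)
obs_mask = torch.zeros(T, 2, dtype=torch.bool)
obs_mask[:8] = True
x = torch.randn(T, 2)
for step in reversed(range(10)):
    x = guided_inpaint_step(x, step, dummy_denoiser, observed, obs_mask, occupancy)
```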
CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation
Basioti, Kalliopi, Abdelsalam, Mohamed A., Fancellu, Federico, Pavlovic, Vladimir, Fazly, Afsaneh
Controllable Image Captioning (CIC) aims at generating natural language descriptions for an image, conditioned on information provided by end users, e.g., regions, entities or events of interest. However, available image-language datasets mainly contain captions that describe the entirety of an image, making them ineffective for training CIC models that can potentially attend to any subset of regions or relationships. To tackle this challenge, we propose a novel, fully automatic method to sample additional focused and visually grounded captions using a unified structured semantic representation built on top of the existing set of captions associated with an image. We leverage Abstract Meaning Representation (AMR), a cross-lingual graph-based semantic formalism, to encode all possible spatio-semantic relations between entities, beyond the typical spatial-relations-only focus of current methods. We use this Structured Semantic Augmentation (SSA) framework to augment existing image-caption datasets with the grounded controlled captions, increasing their spatial and semantic diversity and focal coverage. We then develop a new model, CIC-BART-SSA, specifically tailored for the CIC task, that sources its control signals from SSA-diversified datasets. We empirically show that, compared to SOTA CIC models, CIC-BART-SSA generates captions that are superior in diversity and text quality, are competitive in controllability, and, importantly, minimize the gap between broad and highly focused controlled captioning performance by efficiently generalizing to the challenging highly focused scenarios. Code is available at https://github.com/SamsungLabs/CIC-BART-SSA.
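Purely as a toy illustration of the subgraph-sampling idea behind SSA (the actual pipeline operates on AMR parses), the snippet below merges two events into one semantic graph and extracts one event-rooted subgraph per event as a focused control signal; the example graph and the event-node convention are assumptions.

```python
# Illustrative only: a hand-built toy semantic graph, not an AMR parse.
import networkx as nx

# Unified semantic graph for a hypothetical image: two events sharing an entity.
g = nx.DiGraph()
g.add_edge("ride-01", "woman", role="ARG0")
g.add_edge("ride-01", "bicycle", role="ARG1")
g.add_edge("hold-01", "woman", role="ARG0")
g.add_edge("hold-01", "umbrella", role="ARG1")

def event_subgraphs(graph):
    """Yield (event, subgraph) pairs, one focused control signal per event node."""
    events = [n for n in graph.nodes if "-" in n]         # toy event-node convention
    for ev in events:
        nodes = {ev} | nx.descendants(graph, ev)
        yield ev, graph.subgraph(nodes)

for event, sub in event_subgraphs(g):
    print(event, sorted(sub.nodes))
# e.g. hold-01 -> ['hold-01', 'umbrella', 'woman']: a caption focused on the holding event.
```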
Learning from Synthetic Human Group Activities
Chang, Che-Jui, Li, Danrui, Patel, Deep, Goel, Parth, Zhou, Honglu, Moon, Seonghyeon, Sohn, Samuel S., Yoon, Sejong, Pavlovic, Vladimir, Kapadia, Mubbasir
The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However, progress in related tasks is often hindered by the challenges of obtaining large-scale labeled datasets from real-world scenarios. To address this limitation, we introduce M3Act, a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities. Powered by the Unity engine, M3Act features multiple semantic groups, highly diverse and photorealistic images, and a comprehensive set of annotations, which facilitates the learning of human-centered tasks across single-person, multi-person, and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments using various input modalities. First, adding our synthetic data significantly improves the performance of MOTRv2 on DanceTrack, leading to a jump from 10th to 2nd place on the leaderboard. With M3Act, we achieve tracking results on par with MOTRv2*, which is trained with 62.5% more real-world data. Second, M3Act improves benchmark performance on CAD2 by 5.59% and 7.43% in group activity and atomic action accuracy, respectively. Moreover, M3Act opens up new research on controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for this novel task.
NP-SemiSeg: When Neural Processes meet Semi-Supervised Semantic Segmentation
Wang, Jianfeng, Massiceti, Daniela, Hu, Xiaolin, Pavlovic, Vladimir, Lukasiewicz, Thomas
Semi-supervised semantic segmentation involves assigning pixel-wise labels to unlabeled images at training time. This is useful in a wide range of real-world applications where collecting pixel-wise labels is not feasible in time or cost. Current approaches to semi-supervised semantic segmentation work by predicting pseudo-labels for each pixel from a class-wise probability distribution output by a model. If the predicted probability distribution is incorrect, however, this leads to poor segmentation results, which can have knock-on consequences in safety-critical systems, such as medical imaging or self-driving cars. It is, therefore, important to understand what a model does not know, which is mainly achieved through uncertainty quantification. Recently, neural processes (NPs) have been explored for semi-supervised image classification, where they have proven to be a computationally efficient and effective method for uncertainty quantification. In this work, we move one step forward by adapting NPs to semi-supervised semantic segmentation, resulting in a new model called NP-SemiSeg. We experimentally evaluate NP-SemiSeg on the public benchmarks PASCAL VOC 2012 and Cityscapes under different training settings, and the results verify its effectiveness.
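The small sketch below shows one way NP-style uncertainty can gate pseudo-labels: draw several predictions by sampling the latent, then keep pseudo-labels only where the samples agree. The sampling interface and the variance threshold are assumptions, not the NP-SemiSeg architecture.

```python
# Illustrative only: an assumed (S, C, H, W) stack of sampled predictions and an
# assumed variance threshold, not the NP-SemiSeg model itself.
import torch

def uncertainty_masked_pseudo_labels(prob_samples: torch.Tensor, threshold: float = 0.1):
    """prob_samples: (S, C, H, W) class probabilities from S latent samples."""
    mean_prob = prob_samples.mean(dim=0)                 # (C, H, W)
    pseudo = mean_prob.argmax(dim=0)                     # (H, W) hard pseudo-labels
    # per-pixel predictive uncertainty: variance of the winning class probability
    var = prob_samples.var(dim=0).gather(0, pseudo.unsqueeze(0)).squeeze(0)
    keep = var < threshold                               # trust only low-variance pixels
    return pseudo, keep

# Toy usage: 8 latent samples, 21 classes (PASCAL VOC-style), 65x65 predictions.
samples = torch.softmax(torch.randn(8, 21, 65, 65), dim=1)
pseudo, keep = uncertainty_masked_pseudo_labels(samples)
print(pseudo.shape, keep.float().mean())                 # fraction of trusted pixels
```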
An Information-Theoretic Approach for Estimating Scenario Generalization in Crowd Motion Prediction
Qiao, Gang, Hu, Kaidong, Moon, Seonghyeon, Sohn, Samuel S., Yoon, Sejong, Kapadia, Mubbasir, Pavlovic, Vladimir
Learning-based approaches to modeling crowd motion have become increasingly successful but require training and evaluation on large datasets, coupled with complex model selection and parameter tuning. To circumvent this tremendously time-consuming process, we propose a novel scoring method, which characterizes the generalization of models trained on source crowd scenarios and applied to target crowd scenarios using a training-free, model-agnostic Interaction + Diversity Quantification score, ISDQ. The Interaction component aims to characterize the difficulty of scenario domains, while the diversity of a scenario domain is captured in the Diversity score. Both scores can be computed in a computationally tractable manner. Our experimental results validate the efficacy of the proposed method on several simulated and real-world (source, target) generalization tasks, demonstrating its potential to select optimal domain pairs before training and testing a model.
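The abstract does not spell out ISDQ's formulas, so the snippet below is only a toy of the same flavour: an interaction proxy based on how closely agents pass one another and a diversity proxy based on the spread of motion directions. Both definitions are assumptions for illustration, not the paper's Interaction or Diversity terms.

```python
# Illustrative only: proxy definitions invented for this sketch, not ISDQ.
import numpy as np

def interaction_proxy(trajs: np.ndarray) -> float:
    """trajs: (num_agents, T, 2). Mean inverse nearest-neighbour distance per frame."""
    scores = []
    for t in range(trajs.shape[1]):
        pos = trajs[:, t]                                    # (num_agents, 2)
        d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        scores.append(np.mean(1.0 / (d.min(axis=1) + 1e-3)))
    return float(np.mean(scores))

def diversity_proxy(trajs: np.ndarray, bins: int = 8) -> float:
    """Entropy of the heading distribution over all displacement steps."""
    steps = np.diff(trajs, axis=1).reshape(-1, 2)
    angles = np.arctan2(steps[:, 1], steps[:, 0])
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    p = hist / max(hist.sum(), 1)
    return float(-(p[p > 0] * np.log(p[p > 0])).sum())

# Toy scenario: 5 agents, 20 timesteps of random-walk motion.
rng = np.random.default_rng(0)
trajs = rng.normal(size=(5, 20, 2)).cumsum(axis=1)
print(interaction_proxy(trajs), diversity_proxy(trajs))
```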
Recursive Inference for Variational Autoencoders
Kim, Minyoung, Pavlovic, Vladimir
Inference networks of traditional Variational Autoencoders (VAEs) are typically amortized, resulting in relatively inaccurate posterior approximation compared to instance-wise variational optimization. Recent semi-amortized approaches were proposed to address this drawback; however, their iterative gradient update procedures can be computationally demanding. To address these issues, in this paper we introduce an accurate amortized inference algorithm. We propose a novel recursive mixture estimation algorithm for VAEs that iteratively augments the current mixture with new components so as to maximally reduce the divergence between the variational and the true posteriors. Using the functional gradient approach, we devise an intuitive learning criterion for selecting a new mixture component: the new component has to improve the data likelihood (lower bound) and, at the same time, be as divergent from the current mixture distribution as possible, thus increasing representational diversity. Unlike the recently proposed boosted variational inference (BVI), which performs non-amortized optimization for each instance, our method relies on amortized inference. A crucial benefit of our approach is that inference at test time requires a single feed-forward pass through the mixture inference network, making it significantly faster than semi-amortized approaches. We show that our approach yields higher test data likelihood than the state of the art on several benchmark datasets.
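A compact sketch of the component-selection criterion described above follows: a candidate posterior component should keep the ELBO high while staying far from the current mixture. The Gaussian parameterisation, the Monte Carlo KL estimate, and the weight beta are assumptions, not the paper's functional-gradient derivation.

```python
# Illustrative only: a simplified selection loss for one new mixture component.
import torch
import torch.distributions as D

def new_component_loss(x, decoder, new_q: D.Normal, mix_q: D.MixtureSameFamily,
                       beta: float = 0.5, n_samples: int = 8):
    z = new_q.rsample((n_samples,))                     # (S, latent_dim)
    log_px_z = decoder(x, z)                            # log p(x|z), shape (S,)
    prior = D.Normal(torch.zeros_like(z), torch.ones_like(z))
    elbo = (log_px_z + prior.log_prob(z).sum(-1) - new_q.log_prob(z).sum(-1)).mean()
    # divergence of the candidate from the current mixture (Monte Carlo estimate)
    kl_to_mix = (new_q.log_prob(z).sum(-1) - mix_q.log_prob(z)).mean()
    return -(elbo + beta * kl_to_mix)                   # minimize: high ELBO, far from mixture

# Toy usage with a 2-D latent, a stand-in decoder, and a 2-component current mixture.
latent_dim = 2
decoder = lambda x, z: -((z - x).pow(2).sum(-1))        # stand-in log-likelihood
mix = D.MixtureSameFamily(
    D.Categorical(torch.tensor([0.5, 0.5])),
    D.Independent(D.Normal(torch.randn(2, latent_dim), torch.ones(2, latent_dim)), 1))
new_q = D.Normal(torch.zeros(latent_dim, requires_grad=True), torch.ones(latent_dim))
loss = new_component_loss(torch.tensor([1.0, -1.0]), decoder, new_q, mix)
loss.backward()
```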
Ordinal-Content VAE: Isolating Ordinal-Valued Content Factors in Deep Latent Variable Models
Kim, Minyoung, Pavlovic, Vladimir
In deep representation learning, it is often desirable to isolate a particular factor (termed content) from other factors (referred to as style). What constitutes the content is typically specified by users through explicit labels in the data, while all unlabeled/unknown factors are regarded as style. Recently, it has been shown that such content-labeled data can be effectively exploited by modifying deep latent factor models (e.g., VAE) such that the style and content are well separated in the latent representations. However, this approach assumes that the content factor is categorical-valued (e.g., subject ID in face image data, or digit class in the MNIST dataset). In certain situations, the content is ordinal-valued, that is, the values the content factor takes are ordered rather than categorical, making content-labeled VAEs, including the latent space they infer, suboptimal. In this paper, we propose a novel extension of VAE that imposes a partially ordered set (poset) structure on the content latent space, while simultaneously aligning it with the ordinal content values. To this end, instead of the iid Gaussian latent prior adopted in prior approaches, we introduce a conditional Gaussian spacing prior model. This model admits a tractable joint Gaussian prior, but also places negligible density on content latent configurations that violate the poset constraint. To evaluate this model, we consider two ordinal structured problems: estimating a subject's age from a face image and estimating the calorie amount from a food meal image. We demonstrate significant improvements in content-style separation over previous non-ordinal approaches.
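As a loose illustration of a spacing-style prior, the sketch below places the content-latent mean for each ordinal level a fixed positive gap above the previous one, so the joint prior stays Gaussian while order-violating configurations get negligible density; the one-dimensional latent, gap size, and noise level are assumptions, not the paper's prior.

```python
# Illustrative only: a 1-D toy spacing prior with assumed gap and noise levels.
import torch
import torch.distributions as D

def spacing_prior(num_levels: int, gap: float = 1.0, noise_std: float = 0.2):
    """Return one Normal prior per ordinal level, with means spaced by `gap`."""
    means = gap * torch.arange(num_levels, dtype=torch.float32)   # 0, gap, 2*gap, ...
    return D.Normal(means, noise_std * torch.ones(num_levels))

# Content latents sampled for ordinal labels (e.g. age bands) stay ordered in
# expectation; a small noise_std makes order violations exponentially unlikely.
prior = spacing_prior(num_levels=5)
labels = torch.tensor([0, 2, 4, 1])
z_content = prior.mean[labels] + prior.stddev[labels] * torch.randn(4)
print(z_content)
```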