
Collaborating Authors

 Hauptmann, Alexander


Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios

arXiv.org Artificial Intelligence

Dataset distillation (DD) has demonstrated strong performance on simple datasets such as CIFAR, MNIST, and TinyImageNet, but struggles to achieve similar results in more complex scenarios. In this paper, we propose EDF (Emphasizing Discriminative Features), a dataset distillation method that enhances key discriminative regions in synthetic images using Grad-CAM activation maps. Our approach is inspired by a key observation: in simple datasets, high-activation areas typically occupy most of the image, whereas in complex scenarios they are much smaller. Unlike previous methods that treat all pixels equally when synthesizing images, EDF uses Grad-CAM activation maps to enhance high-activation areas. From a supervision perspective, we downplay supervision signals with lower losses, as they carry common rather than discriminative patterns. Additionally, to help the DD community better explore complex scenarios, we build the Complex Dataset Distillation (Comp-DD) benchmark by meticulously selecting sixteen subsets, eight easy and eight hard, from ImageNet-1K. EDF consistently outperforms state-of-the-art methods in complex scenarios such as ImageNet-1K subsets. We hope this work will encourage more researchers to improve the practicality and efficacy of DD. Our code and benchmark will be made public at https://github.com/NUS-HPC-AI-Lab/EDF.
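The pixel-weighting idea lends itself to a compact illustration. Below is a minimal PyTorch sketch, not the authors' implementation: a toy convolutional backbone, a coarse Grad-CAM map, and a per-pixel loss reweighted so that high-activation regions dominate the update of the synthetic images. `SmallConvNet`, `grad_cam_map`, and the squared-error matching objective are all illustrative stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallConvNet(nn.Module):
    """Toy stand-in for the distillation backbone (illustrative only)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        feat = self.features(x)                    # (B, 32, H, W)
        logits = self.head(feat.mean(dim=(2, 3)))  # global average pooling
        return logits, feat

def grad_cam_map(model, images, labels):
    """Coarse Grad-CAM: gradient-weighted channel sum on the last conv features."""
    logits, feat = model(images)
    feat.retain_grad()
    logits.gather(1, labels[:, None]).sum().backward()
    weights = feat.grad.mean(dim=(2, 3), keepdim=True)        # (B, C, 1, 1)
    cam = F.relu((weights * feat).sum(dim=1, keepdim=True))   # (B, 1, H, W)
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)   # normalize to [0, 1]
    model.zero_grad()
    return cam.detach()

# Weight a per-pixel matching loss so high-activation regions dominate the
# update of the synthetic images (the squared error is a stand-in objective).
model = SmallConvNet()
syn = torch.randn(8, 3, 32, 32, requires_grad=True)
real = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))

cam = grad_cam_map(model, syn.detach().clone(), labels)
per_pixel = (syn - real).pow(2).mean(dim=1, keepdim=True)     # (B, 1, 32, 32)
(cam * per_pixel).mean().backward()                           # grads flow into syn
```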


Multimodal Reranking for Knowledge-Intensive Visual Question Answering

arXiv.org Artificial Intelligence

Knowledge-intensive visual question answering requires models to effectively use external knowledge to answer visual questions. A typical pipeline includes a knowledge retriever and an answer generator. However, a retriever that relies on local information, such as an image patch, may not provide reliable question-candidate relevance scores. In addition, the two-tower architecture limits how richly a retriever can model relevance when selecting top candidates for answer-generator reasoning. In this paper, we introduce an additional module, a multimodal reranker, to improve the ranking quality of knowledge candidates for answer generation. Our reranking module takes multimodal information from both candidates and questions and performs cross-item interaction for better relevance-score modeling. Experiments on OK-VQA and A-OKVQA show that a multimodal reranker trained with distant supervision provides consistent improvements. We also find a training-testing discrepancy when reranking in answer generation: performance improves when the training knowledge candidates are similar to, or noisier than, those used at test time.
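As a rough illustration of the architectural point, the sketch below contrasts a two-tower scorer, where question and candidate never interact before the final dot product, with a cross-item reranker that fuses their token sequences through joint attention. The module names, dimensions, and the single transformer layer are our illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoTowerScorer(nn.Module):
    """Question and candidate encoded independently; score is a dot product."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.q_enc, self.c_enc = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, q, c):                     # q, c: (B, dim) pooled features
        return (self.q_enc(q) * self.c_enc(c)).sum(-1)

class CrossItemReranker(nn.Module):
    """Question and candidate tokens interact through joint self-attention."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, q_tokens, c_tokens):       # (B, Lq, dim), (B, Lc, dim)
        fused = self.layer(torch.cat([q_tokens, c_tokens], dim=1))
        return self.score(fused.mean(dim=1)).squeeze(-1)

# Rerank five retrieved candidates for one question; in the multimodal setting
# the candidate tokens would mix text and image-patch features.
q_tokens = torch.randn(5, 8, 64)   # question tokens, tiled per candidate
c_tokens = torch.randn(5, 16, 64)
ranking = CrossItemReranker()(q_tokens, c_tokens).argsort(descending=True)
```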


Learning Visual-Semantic Subspace Representations for Propositional Reasoning

arXiv.org Artificial Intelligence

Learning representations that capture rich semantic relationships and accommodate propositional calculus poses a significant challenge. Existing approaches either rely on contrastive objectives that lack theoretical guarantees or fall short of effectively representing the partial orders inherent in rich visual-semantic hierarchies. In this paper, we propose a novel approach for learning visual representations that not only conform to a specified semantic structure but also facilitate probabilistic propositional reasoning. Our approach is based on a new nuclear-norm-based loss. We show that its minimum encodes the spectral geometry of the semantics in a subspace lattice, in which logical propositions can be represented by projection operators.
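A minimal sketch of the two ingredients, under our own simplifying assumptions rather than the paper's exact construction: a nuclear-norm loss over a class's feature matrix, and an orthogonal projector used to answer a membership query.

```python
import torch

def nuclear_norm_loss(features: torch.Tensor) -> torch.Tensor:
    """Sum of singular values of an (n, dim) feature matrix; minimizing it
    pushes a class's features toward a low-dimensional subspace."""
    return torch.linalg.matrix_norm(features, ord="nuc")

def subspace_projector(basis: torch.Tensor) -> torch.Tensor:
    """Orthogonal projector P = U U^T onto the column span of `basis`."""
    u, _, _ = torch.linalg.svd(basis, full_matrices=False)
    return u @ u.T

# A proposition like "x belongs to concept A" maps to a projector test;
# meets in the subspace lattice correspond to subspace intersections.
feats_a = torch.randn(5, 16)                 # features spanning concept A
loss = nuclear_norm_loss(feats_a)            # rank-minimizing training signal
p_a = subspace_projector(feats_a.T)          # (16, 16) rank-5 projector
x = torch.randn(16)
membership = (p_a @ x).norm() / x.norm()     # near 1 iff x lies in the span
```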


Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

arXiv.org Artificial Intelligence

Preference-modeling techniques such as direct preference optimization (DPO) have proven effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses against the corresponding videos has not been conclusively established. This paper introduces a novel framework that uses detailed video captions as a proxy for video content, enabling language models to incorporate this information as supporting evidence when scoring video question-answering (QA) predictions. Our approach aligns robustly with the reward mechanism of OpenAI's GPT-4V model, which takes video frames directly as input. Furthermore, we show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.
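The preference-optimization step itself is the standard DPO objective; the framework's contribution lies in how the caption-conditioned language-model judge produces the preference labels. The sketch below shows that standard objective; the function name and the example log-probabilities are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta: float = 0.1):
    """Direct Preference Optimization: push the policy to prefer the response
    the reward model ranked higher, regularized toward the reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Sequence log-probs for a preferred vs. dispreferred video-QA answer.
logp_w, logp_l = torch.tensor([-12.3]), torch.tensor([-15.8])
ref_w, ref_l = torch.tensor([-13.0]), torch.tensor([-14.9])
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))
```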


VICAN: Very Efficient Calibration Algorithm for Large Camera Networks

arXiv.org Artificial Intelligence

The precise estimation of camera poses within large camera networks is a foundational problem in computer vision and robotics, with broad applications spanning autonomous navigation, surveillance, and augmented reality. In this paper, we introduce a novel methodology that extends state-of-the-art Pose Graph Optimization (PGO) techniques. Departing from the conventional PGO paradigm, which relies primarily on camera-camera edges, our approach centers on a dynamic element (any rigid object free to move in the scene) whose pose can be reliably inferred from a single image. Specifically, we consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step. This shift not only sidesteps the difficulty of directly estimating relative poses between cameras, particularly in adverse environments, but also leverages the many available object poses to distribute and reduce errors, resulting in accurate camera pose estimates. Although our framework remains compatible with traditional PGO solvers, its efficacy benefits from a custom-tailored optimization scheme. To this end, we introduce an iterative primal-dual algorithm capable of handling large graphs. Empirical benchmarks, conducted on a new dataset of simulated indoor environments, substantiate the efficacy and efficiency of our approach.
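To make the bipartite-graph formulation concrete, here is a small NumPy sketch of the residual attached to one camera-object edge, with poses as 4x4 homogeneous matrices. The function names and the specific error metric are our illustrative assumptions, not VICAN's solver.

```python
import numpy as np

def inv_pose(T: np.ndarray) -> np.ndarray:
    """Invert a rigid transform [R t; 0 1]."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def edge_residual(T_cam: np.ndarray, T_obj: np.ndarray,
                  Z_cam_obj: np.ndarray) -> float:
    """How well the measured camera-object transform Z explains the current
    pose estimates: deviation of Z^-1 (T_cam^-1 T_obj) from identity."""
    predicted = inv_pose(T_cam) @ T_obj
    err = inv_pose(Z_cam_obj) @ predicted
    rot_err = np.arccos(np.clip((np.trace(err[:3, :3]) - 1) / 2, -1, 1))
    trans_err = np.linalg.norm(err[:3, 3])
    return rot_err + trans_err

# A PGO solver sums such residuals over all camera-object edges at every
# time step and minimizes over camera poses (objects act as moving anchors).
T_cam, T_obj = np.eye(4), np.eye(4)
T_obj[:3, 3] = [1.0, 0.0, 0.0]
Z = inv_pose(T_cam) @ T_obj               # a perfect measurement
print(edge_residual(T_cam, T_obj, Z))     # ~0.0
```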


Hyperbolic vs Euclidean Embeddings in Few-Shot Learning: Two Sides of the Same Coin

arXiv.org Artificial Intelligence

Recent research in representation learning has shown that hierarchical data lends itself to low-dimensional and highly informative representations in hyperbolic space. However, even though hyperbolic embeddings have gained attention in image recognition, their optimization is prone to numerical hurdles. Moreover, it remains unclear which applications stand to benefit the most from the implicit bias imposed by hyperbolicity compared to traditional Euclidean features. In this paper, we focus on prototypical hyperbolic neural networks, in particular the tendency of hyperbolic embeddings to converge to the boundary of the Poincaré ball in high dimensions and the effect this has on few-shot classification. We show that the best few-shot results are attained when hyperbolic embeddings share a common hyperbolic radius. In contrast to prior benchmark results, we demonstrate that better performance can be achieved by a fixed-radius encoder equipped with the Euclidean metric, regardless of the embedding dimension.
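A small sketch of the two ingredients discussed above, under our own assumptions about notation: the Poincaré-ball geodesic distance used for prototypical classification, and a rescaling step that places all embeddings at a common radius (the value 0.9 is illustrative).

```python
import torch

def poincare_dist(x: torch.Tensor, y: torch.Tensor,
                  eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance in the Poincaré ball model."""
    sq = ((x - y) ** 2).sum(-1)
    denom = ((1 - (x ** 2).sum(-1)).clamp_min(eps)
             * (1 - (y ** 2).sum(-1)).clamp_min(eps))
    return torch.acosh(1 + 2 * sq / denom)

def fix_radius(z: torch.Tensor, r: float = 0.9) -> torch.Tensor:
    """Rescale embeddings to a common norm r < 1, i.e. a shared radius."""
    return r * z / z.norm(dim=-1, keepdim=True).clamp_min(1e-6)

# Prototypical few-shot classification: assign each query to the nearest
# class prototype under the hyperbolic distance.
protos = fix_radius(torch.randn(5, 16))       # one prototype per class
queries = fix_radius(torch.randn(3, 16))
logits = -poincare_dist(queries[:, None, :], protos[None, :, :])  # (3, 5)
pred = logits.argmax(dim=-1)
```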


Video Captioning with Guidance of Multimodal Latent Topics

arXiv.org Artificial Intelligence

The topic diversity of open-domain videos leads to varied vocabularies and linguistic expressions in describing video content, making the video captioning task even more challenging. In this paper, we propose a unified captioning framework, M&M TGM, which mines multimodal topics from data in an unsupervised fashion and guides the caption decoder with these topics. Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent and better reflect the topic distribution of videos. We formulate topic-aware caption generation as a multi-task learning problem, adding a parallel topic-prediction task alongside the captioning task. For topic prediction, we use the mined topics as a teacher to train a student topic-prediction model, which learns to predict the latent topics from the multimodal content of videos; this prediction provides intermediate supervision to the learning process. For captioning, we propose a novel topic-aware decoder that generates more accurate and detailed video descriptions under the guidance of latent topics. The entire learning procedure is end-to-end and optimizes both tasks simultaneously. Extensive experiments on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our model: M&M TGM not only outperforms prior state-of-the-art methods on multiple evaluation metrics and both benchmarks, but also generalizes better.
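The multi-task formulation can be sketched compactly. Assuming a soft teacher topic distribution obtained from the mined topics, the total loss below combines caption cross-entropy with a KL topic-distillation term; the weighting and tensor shapes are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def multitask_loss(caption_logits, caption_targets, topic_logits, teacher_topics,
                   topic_weight: float = 0.5):
    """caption_logits: (B, T, V); caption_targets: (B, T) token ids;
    topic_logits: (B, K); teacher_topics: (B, K) soft distribution."""
    cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    topic = F.kl_div(F.log_softmax(topic_logits, dim=-1), teacher_topics,
                     reduction="batchmean")
    return cap + topic_weight * topic

B, T, V, K = 4, 12, 1000, 20
loss = multitask_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                      torch.randn(B, K), torch.softmax(torch.randn(B, K), dim=-1))
```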


Event-Related Bias Removal for Real-time Disaster Events

arXiv.org Artificial Intelligence

Social media has become an important tool for sharing information about crisis events such as natural disasters and mass attacks. Detecting actionable posts that contain useful information requires rapid analysis of huge volumes of data in real time. This is a complex problem because most posts contain no actionable information. Furthermore, classification in real-time systems requires training on out-of-domain data, since no data is available from a newly emerging crisis. Prior work focuses on models pre-trained on similar event types. However, such models capture unnecessary event-specific biases, like the location of the event, which hurt the generalizability and performance of classifiers on unseen data from a new event. In our work, we train an adversarial neural model to remove latent event-specific biases and improve performance on tweet-importance classification.
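One standard way to train such an adversarial model is a gradient-reversal layer between the shared encoder and an event discriminator; the sketch below shows that mechanism (the paper's exact adversarial setup may differ).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients backward."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# The discriminator learns to identify the source event from the features,
# while the reversed gradients push the encoder to erase event-specific cues.
feats = torch.randn(8, 32, requires_grad=True)
event_logits = torch.nn.Linear(32, 4)(grad_reverse(feats))
loss = torch.nn.functional.cross_entropy(event_logits, torch.randint(0, 4, (8,)))
loss.backward()  # feats.grad points away from event-discriminative directions
```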


Perceiving Physical Equation by Observing Visual Scenarios

arXiv.org Artificial Intelligence

Inferring universal laws of the environment is an important ability of human intelligence as well as a hallmark of general AI. In this paper, we take a step toward this goal by introducing the new and challenging problem of inferring invariant physical equations from visual scenarios, for instance, teaching a machine to automatically derive the gravitational-acceleration formula by watching a free-falling object. To tackle this challenge, we present a novel pipeline comprising an Observer Engine and a Physicist Engine, which respectively imitate the actions of an observer and a physicist in the real world. The Observer Engine watches the visual scenarios and extracts the physical properties of objects; the Physicist Engine analyzes these data and summarizes the inherent laws of object dynamics. The learned laws are expressed as mathematical equations, making them more interpretable than the results of common probabilistic models. Experiments on synthetic videos show that our pipeline is able to discover physical equations in various physical worlds with different visual appearances.
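On the paper's own free-fall example, the Physicist-Engine step reduces to fitting an equation to observed trajectories. A toy sketch, assuming the Observer Engine has already extracted per-frame object heights:

```python
import numpy as np

# Noisy per-frame heights of a free-falling object, as an observer might
# extract them; ground truth follows y = y0 - 0.5 * g * t^2.
t = np.linspace(0, 1, 30)                     # frame timestamps (s)
y = 10.0 - 0.5 * 9.81 * t**2 + np.random.normal(0, 0.02, t.shape)

coeffs = np.polyfit(t, y, deg=2)              # fit y = a*t^2 + b*t + c
g_est = -2 * coeffs[0]                        # recover g from the quadratic term
print(f"recovered: y = {coeffs[2]:.2f} + {coeffs[1]:.2f}*t + {coeffs[0]:.2f}*t^2"
      f"   (g ~ {g_est:.2f} m/s^2)")
```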


An Event Reconstruction Tool for Conflict Monitoring Using Social Media

AAAI Conferences

What happened during the Boston Marathon in 2013? Nowadays, at any major event, many people take videos and share them on social media. To fully understand what happened at such events, researchers and analysts often have to examine thousands of these videos manually. To reduce this manual effort, we present an investigative system that automatically synchronizes these videos on a global timeline and localizes them on a map. In addition to alignment in time and space, our system combines various analysis functions, including gunshot detection, crowd-size estimation, 3D reconstruction, and person tracking. To the best of our knowledge, this is the first unified framework for comprehensive event reconstruction from social media videos.
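For the synchronization step, a common technique, used here purely as an illustrative stand-in for the system's actual method, is to cross-correlate the audio tracks of two clips and read the time offset from the correlation peak:

```python
import numpy as np

def estimate_offset(audio_a: np.ndarray, audio_b: np.ndarray, sr: int) -> float:
    """Return lag (seconds) such that audio_a(t) ~ audio_b(t - lag)."""
    corr = np.correlate(audio_a, audio_b, mode="full")
    lag = np.argmax(corr) - (len(audio_b) - 1)
    return lag / sr

sr = 100                                       # toy rate; use FFT correlation for real audio
shared = np.random.randn(sr * 2)               # a loud shared event (e.g., a gunshot)
a = np.concatenate([np.zeros(sr * 1), shared])  # event at t = 1 s in clip a
b = np.concatenate([np.zeros(sr * 3), shared])  # event at t = 3 s in clip b
print(estimate_offset(a, b, sr))               # ~ -2.0: b hears the event 2 s later
```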