AITopics | Belongie, Serge

Belongie, Serge

Multi-Modal Framing Analysis of News

Arora, Arnav, Yadav, Srishti, Antoniak, Maria, Belongie, Serge, Augenstein, Isabelle

arXiv.org Artificial Intelligence, Apr-3-2025

Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-)language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.

computational linguistic, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.2096

Country:

Europe (1.00)
Asia > Middle East (0.93)
North America > United States > California (0.28)
(2 more...)

Genre: Research Report (0.82)

Industry:

Media > News (1.00)
Leisure & Entertainment (1.00)
Law > Criminal Law (1.00)
(7 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.87)

Gradient Imbalance in Direct Preference Optimization

Ma, Qinwei, Shi, Jingzhe, Jin, Can, Hwang, Jenq-Neng, Belongie, Serge, Li, Lei

arXiv.org Artificial Intelligence, Feb-28-2025

Direct Preference Optimization (DPO) has been proposed as a promising alternative to Proximal Policy Optimization (PPO) based Reinforcement Learning with Human Feedback (RLHF). However, empirical evaluations consistently reveal suboptimal performance in DPO compared to common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance as a critical limitation. We demonstrate theoretically and empirically that this imbalance perturbs optimization trajectories, destabilizes learning, and induces suboptimal convergence. To address this issue, we propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments demonstrate the effectiveness of Balanced-DPO, validating the theoretical findings and confirming that addressing gradient imbalance is key to improving DPO's performance, highlighting a promising direction for future research.
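For orientation, the standard DPO objective (Rafailov et al., 2023) and its gradient can be written as below. This is background only, not the Balanced-DPO objective: the abstract does not state which gradient terms are imbalanced or how the reweighting is defined, so neither is reproduced here. The symbols are the usual ones and are assumptions relative to this listing: $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\beta$ is the temperature.

```latex
% Standard DPO objective; NOT the Balanced-DPO variant described in the abstract.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\big( \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) \big) \right],
\qquad
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% Gradient: one shared scalar weight multiplies the chosen and rejected log-likelihood gradients.
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
  = -\,\beta\, \mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big)
    \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \right]
```

The gradient splits into a term that raises the likelihood of $y_w$ and a term that lowers the likelihood of $y_l$, both scaled by the same sigmoid weight. A reweighting scheme would plausibly act on these terms, but that reading is an assumption rather than something the abstract confirms.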

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2502.20847

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis

Li, Lei, Jia, Sen, Wang, Jianhao, An, Zhaochong, Li, Jiaang, Hwang, Jenq-Neng, Belongie, Serge

arXiv.org Artificial Intelligence, Feb-27-2025

Advancements in Multimodal Large Language Models (MLLMs) have improved human motion understanding. However, these models remain constrained by their "instruct-only" nature, lacking interactivity and adaptability for diverse analytical perspectives. To address these challenges, we introduce ChatMotion, a multimodal multi-agent framework for human motion analysis. ChatMotion dynamically interprets user intent, decomposes complex tasks into meta-tasks, and activates specialized function modules for motion comprehension. It integrates multiple specialized modules, such as the MotionCore, to analyze human motion from various perspectives. Extensive experiments demonstrate ChatMotion's precision, adaptability, and user engagement for human motion understanding.

chatmotion, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2502.1818

Genre:

Research Report (0.64)
Workflow (0.46)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Bayesian Optimization for Controlled Image Editing via LLMs

Cai, Chengkun, Liu, Haoliang, Zhao, Xu, Jiang, Zhongyu, Zhang, Tianfang, Wu, Zongkai, Hwang, Jenq-Neng, Belongie, Serge, Li, Lei

arXiv.org Artificial Intelligence, Feb-26-2025

In the rapidly evolving field of image generation, achieving precise control over generated content and maintaining semantic consistency remain significant limitations, particularly concerning grounding techniques and the necessity for model fine-tuning. To address these challenges, we propose BayesGenie, an off-the-shelf approach that integrates Large Language Models (LLMs) with Bayesian Optimization to facilitate precise and user-friendly image editing. Our method enables users to modify images through natural language descriptions without manual area marking, while preserving the original image's semantic integrity. Unlike existing techniques that require extensive pre-training or fine-tuning, our approach demonstrates remarkable adaptability across various LLMs through its model-agnostic design. BayesGenie employs an adapted Bayesian optimization strategy to automatically refine the inference process parameters, achieving high-precision image editing with minimal user intervention. Through extensive experiments across diverse scenarios, we demonstrate that our framework significantly outperforms existing methods in both editing accuracy and semantic preservation, as validated using different LLMs including Claude3 and GPT-4.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2502.18116

Genre: Research Report > Experimental Study (0.46)

Industry: Media > Photography (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Explaining Context Length Scaling and Bounds for Language Models

Shi, Jingzhe, Ma, Qinwei, Liu, Hongyi, Zhao, Hang, Hwang, Jenq-Neng, Belongie, Serge, Li, Lei

arXiv.org Artificial Intelligence, Feb-9-2025

Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context could harm performance, while some experimentally summarize loss reduction by relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions ...

A wide variety of work has been proposed to discuss the impact of context length: some shows that long irrelevant context would worsen performance for LMs (Xu et al., 2024; Levy et al., 2024); some shows that long context would improve performance in a way summarized as Scaling Laws (Xiong et al., 2024); while work in other domains like time series shows that long relevant context would hurt performance (Shi et al., 2024). This calls for a more thorough understanding of how context length affects Language Models' performance. Previously, theories have been proposed to explain the Scaling Laws with respect to the dataset and the size of the model (Bahri et al., 2024; Sharma & Kaplan, 2020). However, these theories do not study how context length impacts Language Modeling, and thus cannot contribute directly to the problem.

context length, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2502.01481

Country: North America > Mexico > Mexico City (0.14)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Learning to Learn Weight Generation via Trajectory Diffusion

Guan, Yunchuan, Liu, Yu, Zhou, Ke, Shen, Zhiqi, Belongie, Serge, Hwang, Jenq-Neng, Li, Lei

arXiv.org Artificial Intelligence, Feb-3-2025

Diffusion-based algorithms have emerged as promising techniques for weight generation, particularly in scenarios like multi-task learning that require frequent weight updates. However, existing solutions suffer from limited cross-task transferability. In addition, they only utilize optimal weights as training samples, ignoring the value of other weights in the optimization process. To address these issues, we propose Lt-Di, which integrates the diffusion algorithm with meta-learning to generate weights for unseen tasks. Furthermore, we extend the vanilla diffusion algorithm into a trajectory diffusion algorithm to utilize other weights along the optimization trajectory. Trajectory diffusion decomposes the entire diffusion chain into multiple shorter ones, improving training and inference efficiency. We analyze the convergence properties of the weight generation paradigm and improve convergence efficiency without additional time overhead. Our experiments demonstrate Lt-Di's higher accuracy while reducing computational overhead across various tasks, including zero-shot and few-shot learning, multi-domain generalization, and large-scale language model fine-tuning. Our code is released at https://github.com/tuantuange/Lt-Di.

artificial intelligence, conference, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2502.01117

Country:

North America > United States > Hawaii (0.14)
North America > United States > Massachusetts (0.14)
North America > Canada > Ontario > Toronto (0.14)
(2 more...)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes

Deng, Shiling, Belongie, Serge, Christensen, Peter Ebert

arXiv.org Artificial Intelligence, Jan-23-2025

Memes have emerged as a powerful form of communication, integrating visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused primarily on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To address these gaps, this study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline leveraging large vision-language models to produce high-quality image captions, meme captions, and literary device labels, overcoming the labor-intensive demands of manual annotation. Additionally, we propose a meme-text retrieval CLIP model (mtrCLIP) that utilizes cross-modal embedding to enhance meme analysis, significantly improving retrieval performance. Our contributions include: (1) a novel dataset for large-scale meme study, (2) a scalable meme annotation framework, and (3) a fine-tuned CLIP for meme-text retrieval, all aimed at advancing the understanding and analysis of memes at scale.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.13851

Country:

Asia (0.67)
Europe (0.46)
North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Media (0.68)
Education (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Unlearning-based Neural Interpretations

Choi, Ching Lam, Duplessis, Alexandre, Belongie, Serge

arXiv.org Artificial Intelligence, Oct-10-2024

Gradient-based interpretations often require an anchor point of comparison to avoid saturation in computing feature importance. We show that current baselines defined using static functions (constant mapping, averaging or blurring) inject harmful colour, texture or frequency assumptions that deviate from model behaviour. This leads to accumulation of irregular gradients, resulting in attribution maps that are biased, fragile and manipulable. Departing from the static approach, we propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent. Our method discovers reliable baselines and succeeds in erasing salient features, which in turn locally smooths the high-curvature decision boundaries. Our analyses point to unlearning as a promising avenue for generating faithful, efficient and robust interpretations.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.08069

Country:

North America > United States > Massachusetts (0.28)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.64)

Industry:

Government (0.68)
Information Technology (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Do better language models have crisper vision?

Ruthardt, Jona, Burghouts, Gertjan J., Belongie, Serge, Asano, Yuki M.

arXiv.org Artificial Intelligence, Oct-9-2024

How well do text-only Large Language Models (LLMs) grasp the visual world? As LLMs are increasingly used in computer vision, addressing this question becomes both fundamental and pertinent. However, existing studies have primarily focused on limited scenarios, such as their ability to generate visual content or cluster multimodal data. To this end, we propose the Visual Text Representation Benchmark (ViTeRB) to isolate key properties that make language models well-aligned with the visual world. With this, we identify large-scale decoder-based LLMs as ideal candidates for representing text in vision-centric contexts, counter to the current practice of utilizing text encoders. Building on these findings, we propose ShareLock, an ultra-lightweight CLIP-like model. By leveraging precomputable frozen features from strong vision and language models, ShareLock achieves an impressive 51% accuracy on ImageNet despite utilizing just 563k image-caption pairs. Moreover, training requires only 1 GPU hour (or 10 hours including the precomputation of features), orders of magnitude less than prior methods. Code will be released.
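The abstract calls ShareLock "CLIP-like" but does not specify its training loss or architecture. As a rough illustration of what a CLIP-style objective over precomputed features looks like, the sketch below computes a generic symmetric contrastive loss with NumPy. The function names, the temperature value, and the use of raw embedding matrices are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to be a matching
    pair; every other row in the batch acts as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Hypothetical usage with toy features standing in for frozen model outputs.
aligned = clip_style_loss(np.eye(4), np.eye(4))
shuffled = clip_style_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

In a frozen-feature setup, embeddings like these would be computed once and cached, so that only a small trainable head sees gradient updates. How ShareLock divides that work is not stated in the abstract.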

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.07173

Country:

Europe (0.46)
North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Transportation (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Coarse-To-Fine Tensor Trains for Compact Visual Representations

Loeschcke, Sebastian, Wang, Dan, Leth-Espensen, Christian, Belongie, Serge, Kastoryano, Michael J., Benaim, Sagie

arXiv.org Artificial Intelligence, Jun-6-2024

The ability to learn compact, high-quality, and easy-to-optimize representations for visual data is paramount to many applications such as novel view synthesis and 3D reconstruction. Recent work has shown substantial success in using tensor networks to design such compact and high-quality representations. However, the ability to optimize tensor-based representations, and in particular, the highly compact tensor train representation, is still lacking. This has prevented practitioners from deploying the full potential of tensor networks for visual data. To this end, we propose 'Prolongation Upsampling Tensor Train (PuTT)', a novel method for learning tensor train representations in a coarse-to-fine manner. Our method involves the prolonging or 'upsampling' of a learned tensor train representation, creating a sequence of 'coarse-to-fine' tensor trains that are incrementally refined. We evaluate our representation along three axes: (1) compression, (2) denoising capability, and (3) image completion capability. To assess these axes, we consider the tasks of image fitting, 3D fitting, and novel view synthesis, where our method shows an improved performance compared to state-of-the-art tensor-based methods. For full results see our project webpage: https://sebulo.github.io/PuTT_website/
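For readers unfamiliar with the tensor train format mentioned in the abstract, the sketch below shows the classical TT-SVD construction: a dense tensor is factored into a chain of three-way cores by repeated truncated SVDs. This is the textbook decomposition, not PuTT; the prolongation and coarse-to-fine steps are not described in enough detail here to reproduce. The function names and the `max_rank` parameter are illustrative assumptions.

```python
import numpy as np


def tt_svd(tensor, max_rank):
    """Factor a dense tensor into tensor-train cores of shape (r_prev, n_k, r_next)."""
    dims = tensor.shape
    cores = []
    rank_prev = 1
    mat = tensor.reshape(rank_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        rank = min(max_rank, len(s))  # truncate to the requested TT rank
        cores.append(u[:, :rank].reshape(rank_prev, dims[k], rank))
        # Carry the remaining factor forward and expose the next mode.
        mat = (s[:rank, None] * vt[:rank]).reshape(rank * dims[k + 1], -1)
        rank_prev = rank
    cores.append(mat.reshape(rank_prev, dims[-1], 1))
    return cores


def tt_reconstruct(cores):
    """Contract the chain of cores back into a dense tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([c.shape[1] for c in cores])


# Hypothetical usage: a rank-1 tensor is captured exactly with max_rank=1.
a, b, c = np.arange(1.0, 5.0), np.arange(1.0, 6.0), np.arange(1.0, 7.0)
rank_one = np.einsum("i,j,k->ijk", a, b, c)
cores = tt_svd(rank_one, max_rank=1)
approx = tt_reconstruct(cores)
```

Storage grows with the sum of the core sizes rather than the product of the mode sizes, which is the source of the compactness the abstract refers to. Truncating `max_rank` trades reconstruction accuracy for compression.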

artificial intelligence, machine learning, representation, (15 more...)

arXiv.org Artificial Intelligence

2406.04332

Country:

Europe > Austria (0.28)
Asia > Japan > Honshū > Chūbu (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(3 more...)
