AITopics | gpt-4v

Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials

Neural Information Processing Systems | Mar-22-2026, 03:31:50 GMT

Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library.

artificial intelligence, natural language, proceedings, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.97)

MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations

Neural Information Processing Systems | Mar-22-2026, 01:38:22 GMT

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLONGBENCH-DOC, a long-context, multi-modal benchmark comprising 1,082 expert-annotated questions. Distinct from previous datasets, it is constructed upon 135 lengthy PDF-formatted documents with an average of 47.5 pages and 21,214 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e., page number). Moreover, 33.7% of the questions are cross-page questions requiring evidence across multiple pages.

artificial intelligence, natural language, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.84)

DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

Belouadi, Jonas

Neural Information Processing Systems | Feb-16-2026, 22:08:58 GMT

Creating high-quality scientific figures can be time-consuming and challenging, even though sketching ideas on paper is relatively easy. Furthermore, recreating existing figures that are not stored in formats preserving semantic information is equally complex.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > Dominican Republic (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada (0.04)
(15 more...)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.67)
Leisure & Entertainment > Games (0.46)
Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

9a8d52eb05eb7b13f54b3d9eada667b7-Paper-Conference.pdf

Neural Information Processing Systems | Oct-10-2025, 10:59:38 GMT

computational linguistic, proceedings, sketch, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > Dominican Republic (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(17 more...)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.67)
Leisure & Entertainment > Games (0.45)
Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

52764eb83bf0a0bd32766ce5c01612e5-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Oct-10-2025, 02:36:59 GMT

gpt-4v, image 1, symbolize, (11 more...)

Neural Information Processing Systems

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States > New York > Erie County > Amherst (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(6 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Communications (1.00)
(7 more...)

CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model

Li, Jiangtong, Zhu, Yiyun, Cheng, Dawei, Ding, Zhijun, Jiang, Changjun

arXiv.org Artificial Intelligence | Jun-17-2025

Multimodal Large Language Models (MLLMs) have rapidly evolved with the growth of Large Language Models (LLMs) and are now applied in various fields. In finance, the integration of diverse modalities such as text, charts, and tables is crucial for accurate and efficient decision-making. Therefore, an effective evaluation system that incorporates these data types is essential for advancing financial applications. In this paper, we introduce CFBenchmark-MM, a Chinese multimodal financial benchmark with over 9,000 image-question pairs featuring tables, histogram charts, line charts, pie charts, and structural diagrams. Additionally, we develop a staged evaluation system to assess MLLMs in handling multimodal information by providing different visual content step by step. Although MLLMs have inherent financial knowledge, experimental results still show limited efficiency and robustness in handling multimodal financial context. Further analysis of incorrect responses reveals that misinterpretation of visual content and misunderstanding of financial concepts are the primary issues. Our research validates the significant, yet underexploited, potential of MLLMs in financial analysis, highlighting the need for further development and domain-specific optimization to encourage wider use in the financial domain.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.13055

Country: Asia > China (0.68)

Genre: Research Report > New Finding (0.93)

Industry:

Energy (1.00)
Banking & Finance > Trading (1.00)
Transportation > Ground > Road (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Can ChatGPT Perform Image Splicing Detection? A Preliminary Study

Nath, Souradip

arXiv.org Artificial Intelligence | Jun-9-2025

Multimodal Large Language Models (MLLMs) like GPT-4V are capable of reasoning across text and image modalities, showing promise in a variety of complex vision-language tasks. In this preliminary study, we investigate the out-of-the-box capabilities of GPT-4V in the domain of image forensics, specifically, in detecting image splicing manipulations. Without any task-specific fine-tuning, we evaluate GPT-4V using three prompting strategies: Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT), applied over a curated subset of the CASIA v2.0 splicing dataset. Our results show that GPT-4V achieves competitive detection performance in zero-shot settings (more than 85% accuracy), with CoT prompting yielding the most balanced trade-off across authentic and spliced images. Qualitative analysis further reveals that the model not only detects low-level visual artifacts but also draws upon real-world contextual knowledge such as object scale, semantic consistency, and architectural facts, to identify implausible composites. While GPT-4V lags behind specialized state-of-the-art splicing detection models, its generalizability, interpretability, and encyclopedic reasoning highlight its potential as a flexible tool in image forensics.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.05358

Country: North America > United States > Arizona (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials

Neural Information Processing Systems | May-27-2025, 13:23:36 GMT

Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library. Make-it-Real offers a streamlined integration into the 3D content creation workflow, showcasing its utility as an essential tool for developers of 3D assets.

make-it-real, multimodal model, realistic material, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations

Neural Information Processing Systems | May-27-2025, 12:36:51 GMT

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLONGBENCH-DOC, a long-context, multi-modal benchmark comprising 1,082 expert-annotated questions. Distinct from previous datasets, it is constructed upon 135 lengthy PDF-formatted documents with an average of 47.5 pages and 21,214 textual tokens.

lvlm, mmlongbench-doc, visualization, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.90)

Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models

Yuxuan, Cao, Jiayang, Wu, Chuen, Alistair Cheong Liang, Guanrong, Bryan Shan, Jen, Theodore Lee Chong, Shen, Sherman Chann Zhi

arXiv.org Artificial Intelligence | Mar-8-2025

Traditional online content moderation systems struggle to classify modern multimodal means of communication, such as memes, a highly nuanced and information-dense medium. This task is especially hard in a culturally diverse society like Singapore, where low-resource languages are used and extensive knowledge of local context is needed to interpret online content. We curate a large collection of 112K memes labeled by GPT-4V for fine-tuning a VLM to classify offensive memes in the Singapore context. We show the effectiveness of fine-tuned VLMs on our dataset, and propose a pipeline containing OCR, translation and a 7-billion parameter-class VLM. Our solutions reach 80.62% accuracy and 0.8192 AUROC on a held-out test set, and can greatly aid humans in moderating online content. The dataset, code, and model weights have been open-sourced at https://github.com/aliencaocao/vlm-for-memes-aisg.

dataset, meme, singapore, (11 more...)

arXiv.org Artificial Intelligence

2502.18101

Country:

Asia > Singapore > Central Region > Singapore (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Information Technology (0.68)
Government > Regional Government (0.46)
Law > Civil Rights & Constitutional Law (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
