Jin, Peng
MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation
Zhang, Hongyu, Deng, Yufan, Yuan, Shenghai, Jin, Peng, Cheng, Zesen, Zhao, Yian, Liu, Chang, Chen, Jie
Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) during the conditioning stage, we introduce Semantic Anchor Disambiguation, which reinforces subject-specific semantics and resolves inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embeddings; (2) during the denoising stage, we propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach that can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: https://hong-yu-zhang.github.io/MagicComp-Page/.
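As a rough illustration of the masked attention modulation described in the denoising stage, the sketch below biases cross-attention so that image tokens inside a subject's layout region attend preferentially to that subject's text tokens. This is a minimal sketch under assumed shapes, not the authors' implementation; the function name, the `region_masks` format, and the bias strength `lam` are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of masked attention modulation:
# each subject's text tokens are biased toward a spatial region mask so
# attention binds subjects to their layout regions.
import torch
import torch.nn.functional as F

def layout_masked_cross_attention(q, k, v, region_masks, token_to_subject, lam=2.0):
    """
    q: (B, N_img, d)  image-token queries
    k, v: (B, N_txt, d)  text-token keys/values
    region_masks: (B, S, N_img) 0/1 float masks, one per subject, over image tokens
    token_to_subject: (N_txt,) subject id per text token, -1 for shared tokens
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-1, -2) / d**0.5          # (B, N_img, N_txt)
    bias = torch.zeros_like(logits)
    for s in range(region_masks.shape[1]):
        txt_sel = (token_to_subject == s)              # text tokens of subject s
        if txt_sel.any():
            # push image tokens inside region s toward subject-s text tokens
            bias[..., txt_sel] += lam * region_masks[:, s].unsqueeze(-1)
    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v
```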
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Niu, Yuwei, Ning, Munan, Zheng, Mengren, Lin, Bin, Jin, Peng, Liao, Jiaqi, Ning, Kunpeng, Zhu, Bin, Yuan, Li
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose $\textbf{WISE}$, the first benchmark specifically designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic $\textbf{E}$valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce $\textbf{WiScore}$, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) on these prompts, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
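For intuition about what a knowledge-image alignment metric like WiScore aggregates, here is a purely illustrative composite over per-image rubric scores. The rubric names and the weights below are hypothetical placeholders; the actual WiScore formulation is defined in the paper.

```python
# Illustrative sketch only: a generic weighted composite over rubric
# scores shows the shape of such a metric. Rubric names and weights are
# hypothetical, not WiScore's actual definition.
def composite_knowledge_score(consistency, realism, aesthetic,
                              w=(0.7, 0.2, 0.1)):
    """Aggregate per-image rubric scores into a single alignment score."""
    return w[0] * consistency + w[1] * realism + w[2] * aesthetic
```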
MoH: Multi-Head Attention as Mixture-of-Head Attention
Jin, Peng, Zhu, Bo, Yuan, Li, Yan, Shuicheng
In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention even when using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
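The core mechanism lends itself to a compact sketch: a router scores the heads per token, keeps the top-k, and replaces the standard summation over heads with a weighted summation. The sketch below follows that description under simplifying assumptions (single layer, no shared heads, routing reduced to plain top-k), so it should be read as an illustration rather than the paper's exact architecture.

```python
# A minimal sketch of Mixture-of-Head attention: a per-token router
# keeps the top-k heads and combines head outputs with routing weights
# instead of the usual uniform summation. Simplified: no shared heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoHAttention(nn.Module):
    def __init__(self, dim, num_heads=8, top_k=6):
        super().__init__()
        self.h, self.k, self.dh = num_heads, top_k, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, num_heads)   # per-token head scores
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.h, self.dh).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)      # (B, H, N, dh)
        out = out.transpose(1, 2)                          # (B, N, H, dh)

        gate = self.router(x)                              # (B, N, H)
        topv, topi = gate.topk(self.k, dim=-1)
        # zero out non-selected heads; softmax the selected scores
        w = torch.zeros_like(gate).scatter(-1, topi, F.softmax(topv, dim=-1))
        out = out * w.unsqueeze(-1)                        # weighted summation
        return self.proj(out.reshape(B, N, self.h * self.dh))
```

Heads whose routing weight is zero contribute nothing to the output projection, which is where the inference savings come from.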
MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
Jin, Peng, Zhu, Bo, Yuan, Li, Yan, Shuicheng
In this work, we aim to simultaneously enhance the effectiveness and efficiency of Mixture-of-Experts (MoE) methods. To achieve this, we propose MoE++, a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts. Specifically, we introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to discard, skip, and replace operations, respectively. This design offers a key advantage in low computing overhead: unlike the uniform mixing mechanism for all tokens within vanilla MoE, MoE++ allows each token to engage with a dynamic number of FFNs, be adjusted by constant vectors, or even skip the MoE layer entirely. Moreover, we leverage gating residuals, enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts. Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1x-2.1x expert forward throughput compared to vanilla MoE models of the same size, where expert throughput is defined as the throughput of FFN experts and zero-computation experts (if present).
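To make the discard/skip/replace semantics concrete, the following sketch wires the three zero-computation experts next to FFN experts and threads the previous layer's router logits through as a gating residual. Top-1 routing, the dimensions, and the constant expert's parameterization are simplifying assumptions, not the paper's exact design.

```python
# A minimal sketch of zero-computation experts alongside FFN experts,
# plus the gating residual (previous layer's router logits added to the
# current ones). Routing is simplified to top-1 for brevity.
import torch
import torch.nn as nn

class MoEPlusPlusLayer(nn.Module):
    def __init__(self, dim, n_ffn=4, hidden=2048):
        super().__init__()
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_ffn))
        self.const = nn.Parameter(torch.zeros(dim))   # constant expert's vector
        # experts: n_ffn FFNs + zero + copy + constant
        self.router = nn.Linear(dim, n_ffn + 3)

    def forward(self, x, prev_logits=None):           # x: (T, dim)
        logits = self.router(x)
        if prev_logits is not None:                   # gating residual
            logits = logits + prev_logits
        idx = logits.argmax(-1)                       # top-1 expert per token
        out = torch.zeros_like(x)
        n = len(self.ffn)
        for e, ffn in enumerate(self.ffn):
            sel = idx == e
            out[sel] = ffn(x[sel])
        out[idx == n] = 0.0                           # zero expert: discard
        out[idx == n + 1] = x[idx == n + 1]           # copy expert: skip
        out[idx == n + 2] = self.const                # constant expert: replace
        return out, logits                            # logits feed the next layer
```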
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Wan, Zhongwei, Wu, Ziang, Liu, Che, Huang, Jinfa, Zhu, Zhihong, Jin, Peng, Wang, Longyue, Yuan, Li
Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference, as the growth of their multimodal Key-Value (KV) cache with increasing input lengths challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships, together with related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes textual attention over image features; based on this multimodal interaction observation, we explore a new text-prior method to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies based on merging KV pairs. LOOK-M demonstrates that, with a significant reduction in KV cache memory usage (by 80% in some cases), it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long-context multimodal tasks.
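A minimal sketch of the text-prior compression idea follows: image-token KV pairs are ranked by the attention they receive from text queries during prefill, the top fraction is kept, and each evicted pair is merged into its most similar kept pair. The function name, the keep ratio, and averaged merging are illustrative assumptions standing in for the paper's several compensation strategies.

```python
# Sketch (not the released implementation) of text-prior KV compression
# with average merging as one possible compensation strategy.
import torch

def compress_kv(keys, values, text_attn, image_idx, keep_ratio=0.2):
    """
    keys, values: (N, d) cached KV for one head
    text_attn: (N,) total attention mass received from text-token queries
    image_idx: (N,) bool, True where the cache entry is an image token
    """
    img = image_idx.nonzero(as_tuple=True)[0]
    n_keep = max(1, int(keep_ratio * img.numel()))
    order = text_attn[img].argsort(descending=True)
    kept, evicted = img[order[:n_keep]], img[order[n_keep:]]

    # merge each evicted KV pair into its most similar kept pair
    sim = keys[evicted] @ keys[kept].T                # (n_evict, n_keep)
    tgt = kept[sim.argmax(-1)]
    for e, t in zip(evicted.tolist(), tgt.tolist()):
        keys[t] = (keys[t] + keys[e]) / 2
        values[t] = (values[t] + values[e]) / 2

    keep_mask = torch.ones_like(image_idx)            # text tokens all kept
    keep_mask[evicted] = False
    return keys[keep_mask], values[keep_mask]
```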
LLMBind: A Unified Modality-Task Integration Framework
Zhu, Bin, Ning, Munan, Jin, Peng, Lin, Bin, Huang, Jinfa, Song, Qi, Zhang, Junwu, Tang, Zhenyu, Pan, Mingjun, Zhou, Xing, Yuan, Li
In the multi-modal domain, the dependence of various models on specific input formats leads to user confusion and hinders progress. To address this challenge, we introduce \textbf{LLMBind}, a novel framework designed to unify a diverse array of multi-modal tasks. By harnessing a Mixture-of-Experts (MoE) Large Language Model (LLM), LLMBind processes multi-modal inputs and generates task-specific tokens, enabling the invocation of corresponding models to accomplish tasks. This unique approach empowers LLMBind to interpret inputs and generate outputs across various modalities, including image, text, video, and audio. Furthermore, we have constructed an interaction dataset comprising 400k instructions, which unlocks LLMBind's ability to perform interactive visual generation and editing tasks. Extensive experimentation demonstrates that LLMBind achieves superior performance across diverse tasks and outperforms existing models in user evaluations conducted in real-world scenarios. Moreover, the adaptability of LLMBind allows for seamless integration with the latest models and extension to new modality tasks, highlighting its potential to serve as a unified AI agent for modeling universal modalities.
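To illustrate the task-token mechanism, the toy dispatcher below routes LLM output containing a task-specific token to a corresponding expert model. The token names, the registry, and the stub models are invented for this sketch and are not LLMBind's actual vocabulary or API.

```python
# Hypothetical illustration of task-token dispatch: the LLM emits a task
# token plus a payload; a dispatcher routes the payload to an expert model.
TASK_REGISTRY = {
    "<GEN_IMAGE>": lambda prompt: f"[call image generator: {prompt}]",
    "<GEN_VIDEO>": lambda prompt: f"[call video generator: {prompt}]",
    "<SEG>":       lambda prompt: f"[call segmentation model: {prompt}]",
}

def dispatch(llm_output: str) -> str:
    """Route an LLM output containing a task token to its expert model."""
    for token, model in TASK_REGISTRY.items():
        if token in llm_output:
            payload = llm_output.split(token, 1)[1].strip()
            return model(payload)
    return llm_output  # plain text answer, no tool call

print(dispatch("<GEN_IMAGE> a watercolor fox in the snow"))
```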
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Gao, Peng, Zhang, Renrui, Liu, Chris, Qiu, Longtian, Huang, Siyuan, Lin, Weifeng, Zhao, Shitian, Geng, Shijie, Lin, Ziyi, Jin, Peng, Zhang, Kaipeng, Shao, Wenqi, Xu, Chao, He, Conghui, He, Junjun, Shao, Hao, Lu, Pan, Li, Hongsheng, Qiao, Yu
We propose SPHINX-X, an extensive Multi-modal Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs, including TinyLlama-1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral-8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
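The skip-token trick for fully-padded sub-images can be sketched as follows: any sub-image that is entirely padding contributes a single learnable token instead of a full patch sequence. The class, the padding test, and the shapes are illustrative assumptions, not the SPHINX-X code.

```python
# Sketch of the skip-token idea: fully-padded sub-images are replaced by
# one learnable token, shortening the visual token sequence.
import torch
import torch.nn as nn

class SubImageTokenizer(nn.Module):
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder                 # maps (3, H, W) -> (P, dim)
        self.skip_token = nn.Parameter(torch.zeros(1, dim))

    def forward(self, sub_images, pad_value=0.0):
        tokens = []
        for img in sub_images:                 # img: (3, H, W)
            if torch.all(img == pad_value):    # fully-padded sub-image
                tokens.append(self.skip_token) # one token replaces P tokens
            else:
                tokens.append(self.encoder(img))
        return torch.cat(tokens, dim=0)        # variable-length sequence
```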
$\mathbf{\mathbb{E}^{FWI}}$: Multi-parameter Benchmark Datasets for Elastic Full Waveform Inversion of Geophysical Properties
Feng, Shihang, Wang, Hanchen, Deng, Chengyuan, Feng, Yinan, Liu, Yanhua, Zhu, Min, Jin, Peng, Chen, Yinpeng, Lin, Youzuo
Elastic geophysical properties (such as P- and S-wave velocities) are of great importance to various subsurface applications like CO$_2$ sequestration and energy exploration (e.g., hydrogen and geothermal). Elastic full waveform inversion (FWI) is widely applied for characterizing reservoir properties. In this paper, we introduce $\mathbf{\mathbb{E}^{FWI}}$, a comprehensive benchmark dataset that is specifically designed for elastic FWI. $\mathbf{\mathbb{E}^{FWI}}$ encompasses 8 distinct datasets that cover diverse subsurface geologic structures (flat, curved, faulted, etc.). Benchmark results produced by three different deep learning methods are provided. In contrast to our previously presented dataset (pressure recordings) for acoustic FWI (referred to as OpenFWI), the seismic dataset in $\mathbf{\mathbb{E}^{FWI}}$ has both vertical and horizontal components. Moreover, the velocity maps in $\mathbf{\mathbb{E}^{FWI}}$ incorporate both P- and S-wave velocities. While the multicomponent data and the added S-wave velocity make the data more realistic, they introduce additional challenges regarding the convergence and computational cost of the inversion. We conduct comprehensive numerical experiments to explore the relationship between P-wave and S-wave velocities in seismic data. The relation between P- and S-wave velocities provides crucial insights into subsurface properties such as lithology, porosity, and fluid content. We anticipate that $\mathbf{\mathbb{E}^{FWI}}$ will facilitate future research on multi-parameter inversions and stimulate endeavors in several critical research topics of carbon-zero and new energy exploration. All datasets, code, and relevant information can be accessed through our website at https://efwi-lanl.github.io/
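As a concrete example of why the P- to S-wave relation is informative, Poisson's ratio follows directly from the two velocities via the standard elasticity relation $\nu = (V_p^2 - 2V_s^2) / (2(V_p^2 - V_s^2))$. A quick numerical check on typical values:

```python
# Poisson's ratio from P- and S-wave velocities (standard elasticity
# relation); the example velocities are typical textbook values.
import numpy as np

def poissons_ratio(vp, vs):
    vp, vs = np.asarray(vp, float), np.asarray(vs, float)
    return (vp**2 - 2 * vs**2) / (2 * (vp**2 - vs**2))

# e.g. a water-saturated sand with Vp ~ 2.0 km/s, Vs ~ 0.9 km/s
print(poissons_ratio(2.0, 0.9))   # ~0.37, consistent with unconsolidated rock
```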
Does Full Waveform Inversion Benefit from Big Data?
Jin, Peng, Feng, Yinan, Feng, Shihang, Wang, Hanchen, Chen, Yinpeng, Consolvo, Benjamin, Liu, Zicheng, Lin, Youzuo
This paper investigates the impact of big data on deep learning models for full waveform inversion (FWI). While it is well known that big data can boost the performance of deep learning models in many tasks, its effectiveness has not been validated for FWI. To address this gap, we present an empirical study that investigates how deep learning models for FWI behave when trained on OpenFWI, a collection of large-scale, multi-structural datasets published recently. In particular, we train and evaluate the FWI models on a combination of 10 2D subsets in OpenFWI that contain 470K data pairs in total. Our experiments demonstrate that larger datasets lead to better performance and generalization of deep learning models for FWI. We further demonstrate that model capacity needs to scale in accordance with data size for optimal improvement.
OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform Inversion
Deng, Chengyuan, Feng, Shihang, Wang, Hanchen, Zhang, Xitong, Jin, Peng, Feng, Yinan, Zeng, Qili, Chen, Yinpeng, Lin, Youzuo
Full waveform inversion (FWI) is widely used in geophysics to reconstruct high-resolution velocity maps from seismic data. The recent success of data-driven FWI methods results in a rapidly increasing demand for open datasets to serve the geophysics community. We present OpenFWI, a collection of large-scale multi-structural benchmark datasets, to facilitate diversified, rigorous, and reproducible research on FWI. In particular, OpenFWI consists of 12 datasets (2.1TB in total) synthesized from multiple sources. It encompasses diverse domains in geophysics (interface, fault, CO2 reservoir, etc.), covers different geological subsurface structures (flat, curve, etc.), and contains various amounts of data samples (2K-67K). It also includes a dataset for 3D FWI. Moreover, we use OpenFWI to perform benchmarking over four deep learning methods, covering both supervised and unsupervised learning regimes. Along with the benchmarks, we implement additional experiments, including physics-driven methods, complexity analysis, generalization study, uncertainty quantification, and so on, to sharpen our understanding of datasets and methods. The studies either provide valuable insights into the datasets and the performance, or uncover their current limitations. We hope OpenFWI supports prospective research on FWI and inspires future open-source efforts on AI for science. All datasets and related information can be accessed through our website at https://openfwi-lanl.github.io/
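For readers who want to try the benchmark, a minimal loading sketch for one 2D file pair is shown below. The file names and array shapes are assumptions about the published layout (per-sample seismic gathers over sources, time steps, and receivers, paired with 70x70 velocity maps); the website above documents the authoritative format.

```python
# Minimal loading sketch for one 2D OpenFWI file pair; names and shapes
# are assumptions, consult the website for the authoritative format.
import numpy as np

seis = np.load("data1.npy")    # e.g. (500, 5, 1000, 70): sample, source, t, receiver
vel = np.load("model1.npy")    # e.g. (500, 1, 70, 70): sample, channel, z, x
print(seis.shape, vel.shape)

# a typical normalization before training an inversion network
seis = seis / np.abs(seis).max()
vel = (vel - vel.min()) / (vel.max() - vel.min())
```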