10 captivating images from National Geographic's Photo Ark

Popular Science

Since 2006, the project has photographed 17,000 species in the world's zoos, aquariums, and wildlife sanctuaries. Photographs from the Photo Ark will be featured in the inaugural exhibition at the National Geographic Museum of Exploration in Washington D.C. A picture is said to be worth a thousand words, but some photographs are worth 17,000. Well, 17,000 species, that is. For National Geographic's Photo Ark project, photographer Joel Sartore is documenting all species living in the world's zoos, aquariums, and wildlife sanctuaries.


Think How Your Teammates Think: Active Inference Can Benefit Decentralized Execution

Wu, Hao, Song, Shoucheng, Yao, Chang, Han, Sheng, Wan, Huaiyu, Lin, Youfang, Lv, Kai

arXiv.org Artificial Intelligence

In multi-agent systems, explicit cognition of teammates' decision logic is a critical factor in facilitating coordination. Communication (i.e., "Tell") can assist the cognitive development process through information dissemination, yet it is inevitably subject to real-world constraints such as noise, latency, and attacks. Building an understanding of teammates' decisions without communication therefore remains challenging. To address this, we propose a novel non-communication MARL framework that constructs this cognition through local observation-based modeling (i.e., "Think"). Our framework enables agents to model teammates' active inference process. First, the proposed method produces three teammate portraits: perception, belief, and action. Specifically, we model the teammate's decision process as follows: 1) Perception: observing the environment; 2) Belief: forming beliefs; 3) Action: making decisions. Then, we selectively integrate the belief portrait into the decision process based on the accuracy and relevance of the perception portrait. This enables the selection of cooperative teammates and facilitates effective collaboration. Extensive experiments on the SMAC, SMACv2, MPE, and GRF benchmarks demonstrate the superior performance of our method.
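
As a rough illustration of the "Think" idea described in this abstract, the sketch below (not the authors' code; all module names, shapes, and the gating heuristic are assumptions) builds toy perception, belief, and action portraits of a teammate from local observations and down-weights the belief portrait when the perception portrait looks inaccurate.

```python
# Hedged sketch: toy perception/belief/action "portraits" of a teammate,
# with the belief portrait gated by perception accuracy. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, BELIEF_DIM, N_ACTIONS = 8, 4, 5

# Toy linear models standing in for learned networks.
W_perc = rng.normal(size=(OBS_DIM, OBS_DIM))          # own obs -> predicted teammate obs
W_belief = rng.normal(size=(OBS_DIM, BELIEF_DIM))     # predicted obs -> teammate belief
W_action = rng.normal(size=(BELIEF_DIM, N_ACTIONS))   # belief -> teammate action logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def think_about_teammate(own_obs, true_teammate_obs=None):
    pred_obs = np.tanh(own_obs @ W_perc)        # perception portrait
    belief = np.tanh(pred_obs @ W_belief)       # belief portrait
    action_probs = softmax(belief @ W_action)   # action portrait

    # Gate: trust the belief portrait only when perception looks accurate
    # (assumes the true teammate observation is available during training).
    if true_teammate_obs is not None:
        err = np.linalg.norm(pred_obs - true_teammate_obs) / np.sqrt(OBS_DIM)
        gate = np.exp(-err)                     # in (0, 1]; higher means more trust
    else:
        gate = 1.0
    return gate * belief, action_probs

own_obs = rng.normal(size=OBS_DIM)
teammate_obs = rng.normal(size=OBS_DIM)
gated_belief, probs = think_about_teammate(own_obs, teammate_obs)
print(gated_belief.round(3), probs.round(3))
```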


PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation

Pan, Ting, Wang, Ye, Jing, Peiguang, Ma, Rui, Yi, Zili, Liu, Yu

arXiv.org Artificial Intelligence

Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and simultaneously balances personalized person generation and semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at https://github.com/annaoooo/PairHuman.
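
As a hedged illustration only, the snippet below shows roughly how one might iterate over per-image metadata of the kind the abstract lists (descriptions, person localization, keypoints, attribute tags). The field names are hypothetical and are not taken from the PairHuman repository.

```python
# Hedged sketch: iterating over hypothetical per-image metadata records.
# All field names are assumptions, not the actual PairHuman schema.
import json
from pathlib import Path

def load_pairs(metadata_dir: str):
    for path in sorted(Path(metadata_dir).glob("*.json")):
        record = json.loads(path.read_text())
        yield {
            "image": record.get("image_path"),
            "caption": record.get("description"),
            "boxes": record.get("person_boxes"),    # assumed: two [x, y, w, h] boxes
            "keypoints": record.get("keypoints"),   # assumed: per-person keypoint lists
            "tags": record.get("attribute_tags"),
        }

if __name__ == "__main__":
    for sample in load_pairs("pairhuman/metadata"):
        print(sample["caption"])
        break
```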


SCALEX: Scalable Concept and Latent Exploration for Diffusion Models

Zeng, E. Zhixuan, Chen, Yuhao, Wong, Alexander

arXiv.org Artificial Intelligence

Image generation models frequently encode social biases, including stereotypes tied to gender, race, and profession. Existing methods for analyzing these biases in diffusion models either focus narrowly on predefined categories or depend on manual interpretation of latent directions. These constraints limit scalability and hinder the discovery of subtle or unanticipated patterns. We introduce SCALEX, a framework for scalable and automated exploration of diffusion model latent spaces. SCALEX extracts semantically meaningful directions from H-space using only natural language prompts, enabling zero-shot interpretation without retraining or labelling. This allows systematic comparison across arbitrary concepts and large-scale discovery of internal model associations. We show that SCALEX detects gender bias in profession prompts, ranks semantic alignment across identity descriptors, and reveals clustered conceptual structure without supervision. By linking prompts to latent directions directly, SCALEX makes bias analysis in diffusion models more scalable, interpretable, and extensible than prior approaches.
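
A minimal sketch of the general recipe the abstract describes: derive a latent direction from two prompt groups and score other prompts against it. This is not the SCALEX implementation; the H-space activations are mocked with random vectors standing in for the diffusion U-Net bottleneck features that would be collected in practice.

```python
# Hedged sketch: prompt-derived latent directions and projection scores.
# H-space activations are mocked; in practice they would come from the
# diffusion U-Net bottleneck for prompt-conditioned samples.
import numpy as np

H_DIM = 64

def mock_hspace(prompt, n_samples=32):
    """Stand-in for collecting H-space activations of images generated from `prompt`."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).normal(size=(n_samples, H_DIM))

def direction(prompt_a, prompt_b):
    """Difference of mean activations between two prompt groups, normalized."""
    d = mock_hspace(prompt_a).mean(0) - mock_hspace(prompt_b).mean(0)
    return d / np.linalg.norm(d)

def projection_score(prompt, axis):
    """How strongly samples for `prompt` lean along `axis` (signed mean cosine)."""
    h = mock_hspace(prompt)
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    return float((h @ axis).mean())

gender_axis = direction("a portrait of a woman", "a portrait of a man")
for job in ["a portrait of a nurse", "a portrait of a CEO", "a portrait of a teacher"]:
    print(job, round(projection_score(job, gender_axis), 3))
```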


Unsupervised Transformation Learning via Convex Relaxations

Neural Information Processing Systems

Our goal is to extract meaningful transformations from raw images, such as varying the thickness of lines in handwriting or the lighting in a portrait. We propose an unsupervised approach to learn such transformations by attempting to reconstruct an image from a linear combination of transformations of its nearest neighbors. On handwritten digits and celebrity portraits, we show that even with linear transformations, our method generates visually high-quality modified images. Moreover, since our method is semiparametric and does not model the data distribution, the learned transformations extrapolate off the training data and can be applied to new types of images.
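
A toy illustration of the reconstruction objective described above: express an image as a linear combination of transformed nearest neighbors and solve for the combination weights in closed form. The transformations here are random near-identity matrices, not the learned ones from the paper, and the convex relaxation itself is not shown.

```python
# Hedged sketch: reconstruct a target vector from linearly transformed
# nearest neighbors via least squares. Transforms are random stand-ins.
import numpy as np

rng = np.random.default_rng(2)
D, K, N = 16, 3, 5          # pixel dim, number of transformations, number of neighbors

x = rng.normal(size=D)                        # target image (flattened)
neighbors = rng.normal(size=(N, D))           # nearest-neighbor images
transforms = rng.normal(size=(K, D, D)) * 0.1 + np.eye(D)  # near-identity transforms

# Design matrix: each column is one transformation applied to one neighbor.
A = np.stack([T @ n for T in transforms for n in neighbors], axis=1)  # shape (D, K*N)
coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)

recon = A @ coeffs
print("reconstruction error:", round(float(np.linalg.norm(x - recon)), 4))
```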



PuLID: Pure and Lightning ID Customization via Contrastive Alignment

Guo, Zinan, Wu, Yanze, Chen, Zhuowei, Chen, Lang, Zhang, Peng, He, Qian (ByteDance Inc.)

Neural Information Processing Systems

Experiments show that PuLID achieves superior performance in both ID fidelity and editability. Another attractive property of PuLID is that the image elements (e.g., background, lighting, composition, and style) before and after the ID insertion are kept as consistent as possible.


Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Tong, Jingqi, Mou, Yurong, Li, Hangcheng, Li, Mingzhe, Yang, Yongzhuo, Zhang, Ming, Chen, Qiguang, Liang, Tianyi, Hu, Xiaomeng, Zheng, Yining, Chen, Xinchi, Zhao, Jun, Huang, Xuanjing, Qiu, Xipeng

arXiv.org Artificial Intelligence

"Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that the video generation model is the potential unified multimodal understanding and generation model, positions "thinking with video" as a unified multimodal reasoning paradigm.


See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

Wang, Jinting, Wang, Jun, Cheng, Hei Victor, Liu, Li

arXiv.org Artificial Intelligence

Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input. Audio-driven talking face generation aims to animate a target portrait image to create realistic talking videos given a driving audio speech. This technique finds wide application in various practical scenarios, including high-quality film and animation production, virtual assistants, interactive educational content creation, and realistic character animation. Recently, significant advancements have been made in this field with the development of generative models. Existing talking face generation methods mainly focus on creating animated videos from a reference portrait [1]-[5]. Still, there is a dilemma: users are concerned about privacy breaches when using real portrait images [6]. FaceChain [6] made the first attempt to liberate the source face and directly infer a synchronized portrait using disentangled identity features from speech. However, the generated virtual face fails to preserve identity consistency.
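
As a loose illustration of the "statistical facial prior with a sample-adaptive weighting module" mentioned in the abstract, the toy sketch below blends a per-sample prediction with a dataset-level prior using a confidence-dependent weight. The heuristic and all shapes are assumptions, not the paper's method.

```python
# Hedged sketch: blend a per-sample prediction with a statistical prior,
# weighted by an assumed per-sample confidence. Illustrative only.
import numpy as np

rng = np.random.default_rng(3)
FACE_DIM = 32

prior_mean = rng.normal(size=FACE_DIM)                            # statistical facial prior (e.g., mean landmarks)
prediction = prior_mean + rng.normal(scale=0.5, size=FACE_DIM)    # speech-conditioned estimate
pred_uncertainty = 0.4                                            # assumed per-sample uncertainty in [0, 1]

w = 1.0 - pred_uncertainty                                        # sample-adaptive weight
blended = w * prediction + (1.0 - w) * prior_mean                 # fall back to the prior when unsure
print("blend weight:", w, "output norm:", round(float(np.linalg.norm(blended)), 3))
```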