Wang, Shan
Inorganic Catalyst Efficiency Prediction Based on EAPCR Model: A Deep Learning Solution for Multi-Source Heterogeneous Data
Liu, Zhangdi, An, Ling, Song, Mengke, Yu, Zhuohang, Wang, Shan, Qi, Kezhen, Zhang, Zhenyu, Zhou, Chichun
The design of inorganic catalysts and the prediction of their catalytic efficiency are fundamental challenges in chemistry and materials science. Traditional catalyst evaluation methods primarily rely on machine learning techniques; however, these methods often struggle to process multi-source heterogeneous data, limiting both predictive accuracy and generalization. To address these limitations, this study introduces the Embedding-Attention-Permutated CNN-Residual (EAPCR) deep learning model. EAPCR constructs a feature association matrix using embedding and attention mechanisms and enhances predictive performance through permutated CNN architectures and residual connections. This approach enables the model to accurately capture complex feature interactions across various catalytic conditions, leading to precise efficiency predictions. EAPCR serves as a powerful tool for computational researchers while also assisting domain experts in optimizing catalyst design, effectively bridging the gap between data-driven modeling and experimental applications. We evaluate EAPCR on datasets from TiO₂ photocatalysis, thermal catalysis, and electrocatalysis, demonstrating its superiority over traditional machine learning methods (e.g., linear regression, random forest) as well as conventional deep learning models (e.g., ANNs). Across multiple evaluation metrics (MAE, MSE, R², and RMSE), EAPCR consistently outperforms existing approaches. These findings highlight the strong potential of EAPCR in inorganic catalytic efficiency prediction. As a versatile deep learning framework, EAPCR not only improves predictive accuracy but also establishes a solid foundation for future large-scale model development in inorganic catalysis.
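To make the architecture above concrete, here is a minimal PyTorch sketch of an EAPCR-style pipeline: feature embeddings, an attention-derived feature association step, a permutated convolution, and a residual connection. All layer sizes, the feature tokenization, and the permutation scheme are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an EAPCR-style model (embedding -> attention ->
# permutated CNN -> residual head); all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EAPCRSketch(nn.Module):
    def __init__(self, n_features=16, n_bins=32, d_embed=64):
        super().__init__()
        # Assume each tabular feature is discretized upstream into one of
        # n_bins categories, giving an (n_features, d_embed) token grid.
        self.embed = nn.Embedding(n_features * n_bins, d_embed)
        self.attn = nn.MultiheadAttention(d_embed, num_heads=4, batch_first=True)
        # "Permutated" CNN: convolve over a shuffled feature order so the
        # kernel also sees non-adjacent feature pairings.
        self.perm = torch.randperm(n_features)
        self.conv = nn.Conv1d(d_embed, d_embed, kernel_size=3, padding=1)
        self.head = nn.Linear(n_features * d_embed, 1)

    def forward(self, x_tokens):              # x_tokens: (B, n_features) ints
        h = self.embed(x_tokens)              # (B, F, D)
        a, _ = self.attn(h, h, h)             # feature association via attention
        a = a[:, self.perm, :]                # permute the feature axis
        c = self.conv(a.transpose(1, 2)).transpose(1, 2)
        h = a + c                             # residual connection
        return self.head(h.flatten(1))        # predicted catalytic efficiency

model = EAPCRSketch()
pred = model(torch.randint(0, 16 * 32, (8, 16)))  # dummy batch of 8 samples
```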
Increasing SLAM Pose Accuracy by Ground-to-Satellite Image Registration
Zhang, Yanhao, Shi, Yujiao, Wang, Shan, Vora, Ankit, Perincherry, Akhil, Chen, Yongbo, Li, Hongdong
Vision-based localization for autonomous driving has been of great interest among researchers. When a pre-built 3D map is not available, techniques of visual simultaneous localization and mapping (SLAM) are typically adopted. Due to error accumulation, visual SLAM (vSLAM) usually suffers from long-term drift. This paper proposes a framework to increase localization accuracy by fusing vSLAM with a deep-learning-based ground-to-satellite (G2S) image registration method. In this framework, a coarse-to-fine method is designed to select valid G2S predictions: a coarse spatial correlation bound check followed by a fine visual odometry consistency check. The selected prediction is then fused with the SLAM measurement by solving a scaled pose graph problem. To further increase the localization accuracy, we provide an iterative trajectory fusion pipeline. The proposed framework is evaluated on two well-known autonomous driving datasets, and the results demonstrate its accuracy and robustness for vehicle localization.
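A schematic of the coarse-to-fine selection step might look as follows; the 2D pose representation, thresholds, and function names are assumptions for illustration, not the paper's code.

```python
# Illustrative coarse-to-fine gate for G2S predictions (not the paper's code):
# a coarse spatial bound check against the current SLAM estimate, then a fine
# consistency check against the relative motion reported by visual odometry.
import numpy as np

def select_g2s(g2s_xy, slam_xy, prev_xy, vo_delta_xy,
               coarse_radius=10.0, fine_tol=1.0):
    """Return True if the G2S position fix should be fused."""
    # Coarse: the G2S fix must fall within a bound around the SLAM pose.
    if np.linalg.norm(g2s_xy - slam_xy) > coarse_radius:
        return False
    # Fine: the motion implied by consecutive G2S fixes must agree with VO.
    implied_delta = g2s_xy - prev_xy
    return np.linalg.norm(implied_delta - vo_delta_xy) <= fine_tol

accept = select_g2s(np.array([3.2, 1.0]), np.array([3.0, 1.1]),
                    np.array([2.1, 0.4]), np.array([1.0, 0.6]))
```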
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Yu, Tianyu, Hu, Jinyi, Yao, Yuan, Zhang, Haoye, Zhao, Yue, Wang, Chongyi, Wang, Shan, Pan, Yinxu, Xue, Jiao, Li, Dahai, Liu, Zhiyuan, Zheng, Hai-Tao, Sun, Maosong
Recent Multimodal Large Language Models (MLLMs) exhibit impressive abilities to perceive images and follow open-ended instructions. The capabilities of MLLMs depend on two crucial factors: the model architecture that facilitates feature alignment between visual modules and large language models, and the multimodal instruction tuning datasets for human instruction following. (i) For the model architecture, most existing models introduce an external bridge module to connect vision encoders with language models, which requires an additional feature-alignment pre-training stage. In this work, we discover that compact pre-trained vision-language models can inherently serve as ``out-of-the-box'' bridges between vision and language. Based on this, we propose the Muffin framework, which directly employs pre-trained vision-language models as providers of visual signals. (ii) For the multimodal instruction tuning datasets, existing methods overlook the complementary relationship between different datasets and simply mix datasets from different tasks. Instead, we propose the UniMM-Chat dataset, which exploits the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions. We merge information describing the same image from diverse datasets and transform it into more knowledge-intensive conversation data. Experimental results demonstrate the effectiveness of the Muffin framework and the UniMM-Chat dataset. Muffin achieves state-of-the-art performance on a wide range of vision-language tasks, significantly surpassing state-of-the-art models such as LLaVA and InstructBLIP. Our model and dataset are available at https://github.com/thunlp/muffin.
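The ``out-of-the-box'' bridge idea can be illustrated with a short sketch: a pre-trained vision-language model directly supplies fused visual signals to the LLM, with at most a light projection in between. The module names, dimensions, and the stub VLM below are placeholders, not the released Muffin implementation.

```python
# Conceptual sketch of the "out-of-the-box bridge" idea; all names and
# dimensions are placeholders, not the released Muffin code.
import torch
import torch.nn as nn

class PretrainedVLM(nn.Module):
    """Stub standing in for a compact pre-trained vision-language model."""
    def forward(self, image, prompt_ids):
        # A real VLM would return fused vision-language hidden states.
        return torch.randn(image.size(0), 32, 768)

class MuffinStyleModel(nn.Module):
    def __init__(self, llm_dim=4096):
        super().__init__()
        self.vlm = PretrainedVLM()           # already vision-language aligned
        self.proj = nn.Linear(768, llm_dim)  # only a light projection is added
        # self.llm = ...                     # the large language model

    def visual_signals(self, image, prompt_ids):
        # The VLM's fused states feed the LLM as soft prompt tokens, skipping
        # a dedicated feature-alignment pre-training stage for a bridge module.
        return self.proj(self.vlm(image, prompt_ids))

m = MuffinStyleModel()
tokens = m.visual_signals(torch.randn(2, 3, 224, 224), torch.zeros(2, 8).long())
```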
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Hu, Jinyi, Yao, Yuan, Wang, Chongyi, Wang, Shan, Pan, Yinxu, Chen, Qianyu, Yu, Tianyu, Wu, Hanghao, Zhao, Yue, Zhang, Haoye, Han, Xu, Lin, Yankai, Xue, Jiao, Li, Dahai, Liu, Zhiyuan, Sun, Maosong
Recently, there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., the lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in low-resource languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data generalize well to other languages in a zero-shot manner for both image-to-text and text-to-image generation, even surpassing models trained on image-text data in native languages. Taking Chinese as a case study for MPM, we build the large multimodal models VisCPM for image-to-text and text-to-image generation, which achieve state-of-the-art performance among open-source models in Chinese. To facilitate future research, we open-source the code and model weights at https://github.com/OpenBMB/VisCPM.git.
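The pivot paradigm described above can be summarized in a schematic: freeze the multilingual LLM backbone, align the visual module using English-only image-text pairs, and rely on the backbone's shared cross-lingual space at inference. All modules below are toy stand-ins, not the VisCPM code.

```python
# Schematic of an MPM-style training setup as described in the abstract;
# every module here is a toy placeholder.
import torch
import torch.nn as nn

multilingual_llm = nn.Linear(512, 512)    # stand-in for a multilingual LLM
vision_encoder = nn.Linear(512, 512)      # stand-in visual module to be trained

for p in multilingual_llm.parameters():
    p.requires_grad = False               # the multilingual backbone stays fixed

opt = torch.optim.AdamW(vision_encoder.parameters(), lr=1e-4)
for img_feat, en_text_feat in [(torch.randn(4, 512), torch.randn(4, 512))]:
    # Alignment stage: only English image-text pairs are used for training.
    loss = nn.functional.mse_loss(multilingual_llm(vision_encoder(img_feat)),
                                  en_text_feat)
    opt.zero_grad()
    loss.backward()
    opt.step()
# At inference, prompting the same model in Chinese reuses this alignment
# zero-shot, because the LLM already maps languages into a shared space.
```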
Model Calibration in Dense Classification with Adaptive Label Perturbation
Liu, Jiawei, Ye, Changkun, Wang, Shan, Cui, Ruikai, Zhang, Jing, Zhang, Kaihao, Barnes, Nick
For safety-related applications, it is crucial to produce trustworthy deep neural networks whose predictions are associated with confidence values that represent the likelihood of correctness for subsequent decision-making. Existing dense binary classification models are prone to being over-confident. To improve model calibration, we propose Adaptive Stochastic Label Perturbation (ASLP), which learns a unique label perturbation level for each training image. ASLP employs our proposed Self-Calibrating Binary Cross Entropy (SC-BCE) loss, which unifies label perturbation processes, including stochastic approaches (such as DisturbLabel) and label smoothing, to correct calibration while maintaining classification rates. ASLP follows the Maximum Entropy Inference of classic statistical mechanics to maximise prediction entropy with respect to missing information, while either (1) preserving classification accuracy on known data as a conservative solution, or (2) specifically improving model calibration by minimising the gap between prediction accuracy and the expected confidence of the target training label. Extensive results demonstrate that ASLP can significantly improve the calibration of dense binary classification models on both in-distribution and out-of-distribution data. The code is available at https://github.com/Carlisle-Liu/ASLP.
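As a rough illustration of the label-perturbation mechanism, the sketch below applies a learnable per-image perturbation level to binary targets before a BCE loss. The 0.5-centered smoothing target is the standard maximum-entropy form for binary labels; the paper's exact SC-BCE formulation may differ.

```python
# Sketch of a self-calibrating BCE with an adaptive perturbation level alpha
# per image; the exact SC-BCE formulation is given in the paper.
import torch
import torch.nn.functional as F

def sc_bce_sketch(logits, targets, alpha):
    """logits/targets: (B, H, W); alpha: (B,) in [0, 1], one level per image."""
    a = alpha.view(-1, 1, 1)
    # Perturbed label: pull each hard 0/1 target toward the 0.5 entropy maximum.
    soft = targets * (1.0 - a) + 0.5 * a
    return F.binary_cross_entropy_with_logits(logits, soft)

loss = sc_bce_sketch(torch.randn(2, 8, 8),
                     torch.randint(0, 2, (2, 8, 8)).float(),
                     torch.tensor([0.1, 0.3]))   # per-image perturbation levels
```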
Samplable Anonymous Aggregation for Private Federated Data Analysis
Talwar, Kunal, Wang, Shan, McMillan, Audra, Jina, Vojta, Feldman, Vitaly, Basile, Bailey, Cahill, Aine, Chan, Yi Sheng, Chatzidakis, Mike, Chen, Junye, Chick, Oliver, Chitnis, Mona, Ganta, Suman, Goren, Yusuf, Granqvist, Filip, Guo, Kristine, Jacobs, Frederic, Javidbakht, Omid, Liu, Albert, Low, Richard, Mascenik, Dan, Myers, Steve, Park, David, Park, Wonhee, Parsa, Gianni, Pauly, Tommy, Priebe, Christian, Rishi, Rehan, Rothblum, Guy, Scaria, Michael, Song, Linmao, Song, Congzheng, Tarbe, Karl, Vogt, Sebastian, Winstrom, Luke, Zhou, Shundong
Learning aggregate population trends can allow for better data-driven decisions, and the application of machine learning can improve the user experience. Compared to learning from public curated datasets, learning from a larger population offers several benefits. As an example, a next-word prediction model trained on words typed by users (a) can better fit the actual distribution of language used on devices, (b) can adapt faster to shifts in distribution, and (c) can more faithfully represent smaller sub-populations that may not be well represented in curated datasets. At the same time, training such models may involve sensitive user data.
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X
Zheng, Qinkai, Xia, Xiao, Zou, Xu, Dong, Yuxiao, Wang, Shan, Xue, Yufei, Wang, Zihan, Shen, Lei, Wang, Andi, Li, Yang, Su, Teng, Yang, Zhilin, Tang, Jie
Large pre-trained code generation models, such as OpenAI Codex, can generate syntactically and functionally correct code, making programmers more productive and bringing the pursuit of artificial general intelligence a step closer. In this paper, we introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation. CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages as of June 2022. Our extensive experiments suggest that CodeGeeX outperforms multilingual code models of similar scale on both code generation and translation tasks on HumanEval-X. Building upon HumanEval (Python only), we develop the HumanEval-X benchmark for evaluating multilingual models by hand-writing solutions in C++, Java, JavaScript, and Go. In addition, we build CodeGeeX-based extensions for Visual Studio Code, JetBrains, and Cloud Studio, generating 4.7 billion tokens for tens of thousands of active users per week. Our user study demonstrates that CodeGeeX helps increase coding efficiency for 83.4% of its users. Finally, CodeGeeX is publicly accessible; in Sep. 2022, we open-sourced its code, model weights (the 850B-token version), API, extensions, and HumanEval-X at https://github.com/THUDM/CodeGeeX.
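HumanEval-style benchmarks score functional correctness with pass@k. The snippet below shows the standard unbiased pass@k estimator introduced with the original HumanEval, which HumanEval-X-style evaluation can reuse; it is included here as background, not as the paper's evaluation code.

```python
# Standard unbiased pass@k estimator from the original HumanEval protocol:
# generate n samples per problem, count the c that pass all tests, and
# estimate the probability that at least one of k draws is correct.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated, c: samples passing all tests, k: sample budget."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable product form.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=37, k=1))  # per-problem estimate; average over problems
```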
Resource-aware Probability-based Collaborative Odor Source Localization Using Multiple UAVs
Wang, Shan, Sun, Sheng, Liu, Min, Gao, Bo, Wang, Yuwei
Benefiting from the flexible deployment and controllable 3D movement of UAVs, odor source localization with multiple UAVs has become an active research area in recent years. Considering the limited resources and insufficient battery capacities of UAVs, it is necessary to locate the odor source quickly, with low-complexity computation and minimal interaction, under complicated environmental conditions. To this end, we propose a multi-UAV collaboration based odor source localization (MUC-OSL) method, where source estimation and UAV navigation are performed iteratively, aiming to accelerate the search process and reduce the resource consumption of UAVs. Specifically, in the source estimation phase, we present a collaborative particle filter algorithm based on UAVs' cognitive differences and Gaussian fitting to improve source estimation accuracy. In the subsequent navigation phase, an adaptive path planning algorithm based on a Partially Observable Markov Decision Process (POMDP) determines, in a distributed manner, each UAV's subsequent flight direction and step size. The results of experiments conducted on two simulation platforms demonstrate that MUC-OSL outperforms existing approaches in terms of mean search time and success rate, and effectively reduces the resource consumption of UAVs.
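For intuition, a minimal particle-filter update for source estimation is sketched below. The collaborative cognitive-difference weighting and Gaussian fitting of MUC-OSL are simplified away, and the plume model is a toy assumption.

```python
# Minimal single-UAV particle-filter step for odor source estimation; the
# collaborative weighting of MUC-OSL is omitted and the plume model is a toy.
import numpy as np

rng = np.random.default_rng(0)
particles = rng.uniform(0, 100, size=(500, 2))   # candidate source positions
weights = np.full(500, 1 / 500)

def measurement_likelihood(src, uav_pos, reading, sigma=0.2):
    # Toy plume model: expected concentration decays with distance to source.
    expected = 1.0 / (1.0 + np.linalg.norm(src - uav_pos, axis=-1))
    return np.exp(-0.5 * ((reading - expected) / sigma) ** 2)

def pf_update(uav_pos, reading):
    global particles, weights
    weights *= measurement_likelihood(particles, uav_pos, reading)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)  # resample
    particles = particles[idx] + rng.normal(0, 0.5, particles.shape)  # jitter
    weights = np.full(len(particles), 1 / len(particles))
    return particles.mean(axis=0)  # current estimate of the source position

est = pf_update(np.array([10.0, 20.0]), reading=0.05)
```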