Not enough data to create a plot.
Try a different view from the menu above.
Wang, Yuanyuan
Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
Ling Team, null, Zeng, Binwei, Huang, Chao, Zhang, Chao, Tian, Changxin, Chen, Cong, Jin, Dingnan, Yu, Feng, Zhu, Feng, Yuan, Feng, Wang, Fakang, Wang, Gangshan, Zhai, Guangyao, Zhang, Haitao, Li, Huizhong, Zhou, Jun, Liu, Jia, Fang, Junpeng, Ou, Junjie, Hu, Jun, Luo, Ji, Zhang, Ji, Liu, Jian, Sha, Jian, Qian, Jianxue, Wu, Jiewei, Zhao, Junping, Li, Jianguo, Feng, Jubao, Di, Jingchao, Xu, Junming, Yao, Jinghua, Xu, Kuan, Du, Kewei, Li, Longfei, Liang, Lei, Yu, Lu, Tang, Li, Ju, Lin, Xu, Peng, Cui, Qing, Liu, Song, Li, Shicheng, Song, Shun, Yan, Song, Cai, Tengwei, Chen, Tianyi, Guo, Ting, Huang, Ting, Feng, Tao, Wu, Tao, Wu, Wei, Zhang, Xiaolu, Yang, Xueming, Zhao, Xin, Hu, Xiaobo, Lin, Xin, Zhao, Yao, Wang, Yilong, Guo, Yongzhen, Wang, Yuanyuan, Yang, Yue, Cao, Yang, Fu, Yuhao, Xiong, Yi, Li, Yanzhe, Li, Zhe, Zhang, Zhiqiang, Liu, Ziqi, Huan, Zhaoxin, Wen, Zujie, Sun, Zhenhang, Du, Zhuoxuan, He, Zhengyu
In this technical report, we tackle the challenges of training large-scale Mixture of Experts (MoE) models, focusing on overcoming cost inefficiency and resource limitations prevalent in such systems. To address these issues, we present two differently sized MoE large language models (LLMs), namely Ling-Lite and Ling-Plus (referred to as "Bailing" in Chinese, spelled B\v{a}il\'ing in Pinyin). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters. Both models exhibit comparable performance to leading industry benchmarks. This report offers actionable insights to improve the efficiency and accessibility of AI development in resource-constrained settings, promoting more scalable and sustainable technologies. Specifically, to reduce training costs for large-scale MoE models, we propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency. Additionally, leveraging high-quality data generated from knowledge graphs, our models demonstrate superior capabilities in tool use compared to other models. Ultimately, our experimental findings demonstrate that a 300B MoE LLM can be effectively trained on lower-performance devices while achieving comparable performance to models of a similar scale, including dense and MoE models. Compared to high-performance devices, utilizing a lower-specification hardware system during the pre-training phase demonstrates significant cost savings, reducing computing costs by approximately 20%. The models can be accessed at https://huggingface.co/inclusionAI.
How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning
Wang, Yuanyuan, Song, Qian, Wasif, Dawood, Shahzad, Muhammad, Koller, Christoph, Bamber, Jonathan, Zhu, Xiao Xiang
Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods do exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of the ground truth for uncertainty, i.e. how certain the uncertainty estimates are, apart from the labels for the image/signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting a more accurate and reliable quality measure of the outputs of machine learning models. The dataset and code are accessible via https://gitlab.lrz.de/ai4eo/WG_Uncertainty.
Identifiability Analysis of Linear ODE Systems with Hidden Confounders
Wang, Yuanyuan, Huang, Biwei, Huang, Wei, Geng, Xi, Gong, Mingming
The identifiability analysis of linear Ordinary Differential Equation (ODE) systems is a necessary prerequisite for making reliable causal inferences about these systems. While identifiability has been well studied in scenarios where the system is fully observable, the conditions for identifiability remain unexplored when latent variables interact with the system. This paper aims to address this gap by presenting a systematic analysis of identifiability in linear ODE systems incorporating hidden confounders. Specifically, we investigate two cases of such systems. In the first case, latent confounders exhibit no causal relationships, yet their evolution adheres to specific functional forms, such as polynomial functions of time $t$. Subsequently, we extend this analysis to encompass scenarios where hidden confounders exhibit causal dependencies, with the causal structure of latent variables described by a Directed Acyclic Graph (DAG). The second case represents a more intricate variation of the first case, prompting a more comprehensive identifiability analysis. Accordingly, we conduct detailed identifiability analyses of the second system under various observation conditions, including both continuous and discrete observations from single or multiple trajectories. To validate our theoretical results, we perform a series of simulations, which support and substantiate our findings.
Diff-CXR: Report-to-CXR generation through a disease-knowledge enhanced diffusion model
Huang, Peng, Guo, Bowen, Liang, Shuyu, Fu, Junhu, Wang, Yuanyuan, Guo, Yi
Text-To-Image (TTI) generation is significant for controlled and diverse image generation with broad potential applications. Although current medical TTI methods have made some progress in report-to-Chest-Xray (CXR) generation, their generation performance may be limited due to the intrinsic characteristics of medical data. In this paper, we propose a novel disease-knowledge enhanced Diffusion-based TTI learning framework, named Diff-CXR, for medical report-to-CXR generation. First, to minimize the negative impacts of noisy data on generation, we devise a Latent Noise Filtering Strategy that gradually learns the general patterns of anomalies and removes them in the latent space. Then, an Adaptive Vision-Aware Textual Learning Strategy is designed to learn concise and important report embeddings in a domain-specific Vision-Language Model, providing textual guidance for Chest-Xray generation. Finally, by incorporating the general disease knowledge into the pretrained TTI model via a delicate control adapter, a disease-knowledge enhanced diffusion model is introduced to achieve realistic and precise report-to-CXR generation. Experimentally, our Diff-CXR outperforms previous SOTA medical TTI methods by 33.4\% / 8.0\% and 23.8\% / 56.4\% in the FID and mAUC score on MIMIC-CXR and IU-Xray, with the lowest computational complexity at 29.641 GFLOPs. Downstream experiments on three thorax disease classification benchmarks and one CXR-report generation benchmark demonstrate that Diff-CXR is effective in improving classical CXR analysis methods. Notably, models trained on the combination of 1\% real data and synthetic data can achieve a competitive mAUC score compared to models trained on all data, presenting promising clinical applications.
Consistency Model is an Effective Posterior Sample Approximation for Diffusion Inverse Solvers
Xu, Tongda, Zhu, Ziran, Li, Jian, He, Dailan, Wang, Yuanyuan, Sun, Ming, Li, Ling, Qin, Hongwei, Wang, Yan, Liu, Jingjing, Zhang, Ya-Qin
Diffusion Inverse Solvers (DIS) are designed to sample from the conditional distribution $p_{\theta}(X_0|y)$, with a predefined diffusion model $p_{\theta}(X_0)$, an operator $f(\cdot)$, and a measurement $y=f(x'_0)$ derived from an unknown image $x'_0$. Existing DIS estimate the conditional score function by evaluating $f(\cdot)$ with an approximated posterior sample drawn from $p_{\theta}(X_0|X_t)$. However, most prior approximations rely on the posterior means, which may not lie in the support of the image distribution, thereby potentially diverge from the appearance of genuine images. Such out-of-support samples may significantly degrade the performance of the operator $f(\cdot)$, particularly when it is a neural network. In this paper, we introduces a novel approach for posterior approximation that guarantees to generate valid samples within the support of the image distribution, and also enhances the compatibility with neural network-based operators $f(\cdot)$. We first demonstrate that the solution of the Probability Flow Ordinary Differential Equation (PF-ODE) with an initial value $x_t$ yields an effective posterior sample $p_{\theta}(X_0|X_t=x_t)$. Based on this observation, we adopt the Consistency Model (CM), which is distilled from PF-ODE, for posterior sampling. Furthermore, we design a novel family of DIS using only CM. Through extensive experiments, we show that our proposed method for posterior sample approximation substantially enhance the effectiveness of DIS for neural network operators $f(\cdot)$ (e.g., in semantic segmentation). Additionally, our experiments demonstrate the effectiveness of the new CM-based inversion techniques. The source code is provided in the supplementary material.
On the Foundations of Earth and Climate Foundation Models
Zhu, Xiao Xiang, Xiong, Zhitong, Wang, Yi, Stewart, Adam J., Heidler, Konrad, Wang, Yuanyuan, Yuan, Zhenghang, Dujardin, Thomas, Xu, Qingsong, Shi, Yilei
These authors contributed equally to this work. Abstract Foundation models have enormous potential in advancing Earth and climate sciences, however, current approaches may not be optimal as they focus on a few basic features of a desirable Earth and climate foundation model. Crafting the ideal Earth foundation model, we define eleven features which would allow such a foundation model to be beneficial for any geoscientific downstream application in an environmental-and human-centric manner. We further shed light on the way forward to achieve the ideal model and to evaluate Earth foundation models. What comes after foundation models? Energy efficient adaptation, adversarial defenses, and interpretability are among the emerging directions. In the past decade in particular, we have witnessed a paradigm shift from single-purpose models to general-purpose models, and from supervised pre-training to self-supervised pre-training. The majority of FMs like CLIP and GPT focus on the image and text domains. In this work, we specifically focus on "data" and "downstream tasks" relating to the Earth and its climate system, as shown in Figure 1. We choose to limit the scope of our work to the Earth's surface and atmosphere for three reasons. First, the Earth's surface and troposphere are our home, and include the majority of processes that directly impact and are impacted by human activity.
Unified learning-based lossy and lossless JPEG recompression
Zhang, Jianghui, Wang, Yuanyuan, Guo, Lina, Luo, Jixiang, Xu, Tongda, Wang, Yan, Wang, Zhi, Qin, Hongwei
JPEG is still the most widely used image compression algorithm. Most image compression algorithms only consider uncompressed original image, while ignoring a large number of already existing JPEG images. Recently, JPEG recompression approaches have been proposed to further reduce the size of JPEG files. However, those methods only consider JPEG lossless recompression, which is just a special case of the rate-distortion theorem. In this paper, we propose a unified lossly and lossless JPEG recompression framework, which consists of learned quantization table and Markovian hierarchical variational autoencoders. Experiments show that our method can achieve arbitrarily low distortion when the bitrate is close to the upper bound, namely the bitrate of the lossless compression model. To the best of our knowledge, this is the first learned method that bridges the gap between lossy and lossless recompression of JPEG images.
Generator Identification for Linear SDEs with Additive and Multiplicative Noise
Wang, Yuanyuan, Geng, Xi, Huang, Wei, Huang, Biwei, Gong, Mingming
In this paper, we present conditions for identifying the generator of a linear stochastic differential equation (SDE) from the distribution of its solution process with a given fixed initial state. These identifiability conditions are crucial in causal inference using linear SDEs as they enable the identification of the post-intervention distributions from its observational distribution. Specifically, we derive a sufficient and necessary condition for identifying the generator of linear SDEs with additive noise, as well as a sufficient condition for identifying the generator of linear SDEs with multiplicative noise. We show that the conditions derived for both types of SDEs are generic. Moreover, we offer geometric interpretations of the derived identifiability conditions to enhance their understanding. To validate our theoretical results, we perform a series of simulations, which support and substantiate the established findings.
DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification
Wang, Yuanyuan, Zhang, Yang, Wu, Zhiyong, Yang, Zhihan, Wei, Tao, Zou, Kun, Meng, Helen
Data augmentation is vital to the generalization ability and robustness of deep neural networks (DNNs) models. Existing augmentation methods for speaker verification manipulate the raw signal, which are time-consuming and the augmented samples lack diversity. In this paper, we present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification, which can generate diversified training samples in speaker embedding space with negligible extra computing cost. Firstly, we augment training samples by perturbing speaker embeddings along semantic directions, which are obtained from speaker-wise covariance matrices. Secondly, accurate covariance matrices are estimated from robust speaker embeddings during training, so we introduce difficultyaware additive margin softmax (DAAM-Softmax) to obtain optimal speaker embeddings. Finally, we assume the number of augmented samples goes to infinity and derive a closed-form upper bound of the expected loss with DASA, which achieves compatibility and efficiency. Extensive experiments demonstrate the proposed approach can achieve a remarkable performance improvement. The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
Cooperation Learning Enhanced Colonic Polyp Segmentation Based on Transformer-CNN Fusion
Wang, Yuanyuan, Deng, Zhaohong, Lou, Qiongdan, Hu, Shudong, Choi, Kup-sze, Wang, Shitong
Traditional segmentation methods for colonic polyps are mainly designed based on low-level features. They could not accurately extract the location of small colonic polyps. Although the existing deep learning methods can improve the segmentation accuracy, their effects are still unsatisfied. To meet the above challenges, we propose a hybrid network called Fusion-Transformer-HardNetMSEG (i.e., Fu-TransHNet) in this study. Fu-TransHNet uses deep learning of different mechanisms to fuse each other and is enhanced with multi-view collaborative learning techniques. Firstly, the Fu-TransHNet utilizes the Transformer branch and the CNN branch to realize the global feature learning and local feature learning, respectively. Secondly, a fusion module is designed to integrate the features from two branches. The fusion module consists of two parts: 1) the Global-Local Feature Fusion (GLFF) part and 2) the Dense Fusion of Multi-scale features (DFM) part. The former is built to compensate the feature information mission from two branches at the same scale; the latter is constructed to enhance the feature representation. Thirdly, the above two branches and fusion modules utilize multi-view cooperative learning techniques to obtain their respective weights that denote their importance and then make a final decision comprehensively. Experimental results showed that the Fu-TransHNet network was superior to the existing methods on five widely used benchmark datasets. In particular, on the ETIS-LaribPolypDB dataset containing many small-target colonic polyps, the mDice obtained by Fu-TransHNet were 12.4% and 6.2% higher than the state-of-the-art methods HardNet-MSEG and TransFuse-s, respectively.