Meng, Qingye
CNCast: Leveraging 3D Swin Transformer and DiT for Enhanced Regional Weather Forecasting
Liang, Hongli, Zhang, Yuanting, Meng, Qingye, He, Shuangshuang, Yuan, Xingyuan
This study introduces a regional weather forecasting model based on the 3D Swin Transformer architecture. The model is designed to deliver hourly weather predictions at lead times from 1 hour to 5 days, improving the reliability and practicality of short-term weather forecasts. It demonstrates generally superior performance compared with Pangu, a well-established global model: the evaluation indicates that it excels at predicting most weather variables, highlighting its potential as an effective alternative for limited-area modeling. A noteworthy feature of the model is the integration of enhanced boundary conditions, inspired by traditional numerical weather prediction (NWP) techniques, which substantially improves predictive accuracy. In addition, the model diagnoses hourly total precipitation at a high spatial resolution of approximately 5 km through a latent diffusion model, offering an alternative way to generate high-resolution precipitation data.
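To illustrate the boundary-condition idea referenced above, the sketch below shows a generic NWP-style lateral boundary treatment (Davies-type relaxation), in which a regional forecast is relaxed toward a driving global forecast in a narrow zone near the domain edges. This is a minimal sketch of the general technique only, not CNCast's exact scheme; the relaxation width, the cosine weight profile, and the field shapes are assumptions.

```python
# Generic NWP-style lateral boundary blend (Davies-type relaxation).
# Illustration of the general idea only, not CNCast's exact scheme; the
# relaxation width and cosine taper here are assumptions.
import numpy as np

def boundary_blend_weights(h: int, w: int, width: int = 8) -> np.ndarray:
    """Weights in [0, 1]: 1 at the domain edge, decaying to 0 in the interior."""
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    dist = np.minimum.reduce([yy, xx, h - 1 - yy, w - 1 - xx])  # distance to nearest edge
    ramp = np.clip(dist / width, 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * ramp))  # 1 at the edge, 0 in the interior

def apply_boundary_conditions(regional: np.ndarray, driving: np.ndarray,
                              width: int = 8) -> np.ndarray:
    """Relax the regional forecast toward the driving (e.g. global) forecast near the edges.

    regional, driving: arrays of shape (..., H, W) on the same regional grid.
    """
    h, w = regional.shape[-2:]
    alpha = boundary_blend_weights(h, w, width)
    return (1.0 - alpha) * regional + alpha * driving

# Usage: blend each predicted field before feeding it back as the next-step input.
pred = np.random.randn(69, 256, 256)   # illustrative shapes: 69 variables on a 256x256 grid
bc = np.random.randn(69, 256, 256)     # boundary fields from a global forecast
pred_with_bc = apply_boundary_conditions(pred, bc, width=8)
```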
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
Xiao, Da, Meng, Qingye, Li, Shengping, Yuan, Xingyuan
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense-connection approaches with static, shared connection weights, MUDD generates connection weights dynamically, depending on the hidden state at each sequence position and for each decoupled input stream (query, key, value, or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across model architectures and scales in language modeling, matching the performance of Transformers trained with 1.8x-2.4x the compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining perplexity and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer.
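As a rough illustration of the mechanism described above, the sketch below aggregates the hidden states of all previous layers with position-wise weights predicted from the current hidden state, separately for four input streams. It is a minimal sketch under assumed design choices (a single linear weight generator, a plain weighted sum, no normalization), not the actual MUDDFormer implementation; see the repository above for that.

```python
# Minimal sketch of the dynamic dense aggregation idea behind MUDD connections.
# Assumptions: weights come from one linear layer on the current hidden state,
# and aggregation is a plain weighted sum per stream.
import torch
import torch.nn as nn

class DynamicDenseAggregate(nn.Module):
    """Mix the hidden states of all previous layers into one input per stream."""
    def __init__(self, d_model: int, n_prev: int, n_streams: int = 4):
        super().__init__()
        self.n_prev, self.n_streams = n_prev, n_streams
        # Dynamic, position-wise connection weights: one weight per previous layer
        # and per stream (query / key / value / residual), predicted from the
        # current hidden state.
        self.to_weights = nn.Linear(d_model, n_streams * n_prev)

    def forward(self, hiddens: torch.Tensor) -> torch.Tensor:
        # hiddens: (n_prev, batch, seq, d_model) -- embedding plus earlier block outputs.
        # Returns: (n_streams, batch, seq, d_model).
        current = hiddens[-1]                                         # (B, T, D)
        w = self.to_weights(current)                                  # (B, T, S * P)
        w = w.view(*current.shape[:2], self.n_streams, self.n_prev)   # (B, T, S, P)
        # Weighted sum over previous layers, separately for each stream.
        return torch.einsum("btsp,pbtd->sbtd", w, hiddens)

# Usage: feed the first three streams to the block's query/key/value projections and
# the last into the residual path (this stream assignment is an assumption).
layer = DynamicDenseAggregate(d_model=512, n_prev=5)
hs = torch.randn(5, 2, 16, 512)
q_in, k_in, v_in, r_in = layer(hs)
```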
Benchmarking and Understanding Compositional Relational Reasoning of LLMs
Ni, Ruikang, Xiao, Da, Meng, Qingye, Li, Xiangyu, Zheng, Shihui, Liang, Hongliang
Compositional relational reasoning (CRR) is a hallmark of human intelligence, but we lack a clear understanding of whether, and how, existing transformer-based large language models (LLMs) can solve CRR tasks. To enable systematic exploration of the CRR capability of LLMs, we first propose a new synthetic benchmark, Generalized Associative Recall (GAR), which integrates and generalizes the essence of several tasks from mechanistic interpretability (MI) research into a unified framework. Evaluation shows that GAR is challenging enough for existing LLMs, revealing a fundamental deficiency in their CRR ability, yet simple enough for systematic MI study. To understand how LLMs solve GAR tasks, we then use attribution patching to discover the core circuits that Vicuna-33B reuses across different tasks, along with a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly affects task performance. In particular, we identify two classes of heads whose activations represent the abstract notions of true and false in GAR tasks, respectively; they play a fundamental role in CRR across various models and tasks. The dataset and code are available at https://github.com/Caiyun-AI/GAR.
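As background for readers unfamiliar with this family of tasks, the toy generator below produces a classic associative-recall prompt of the kind that GAR integrates and generalizes: a context of key-value pairs followed by a query key whose bound value must be recalled. It is not the actual GAR format or data (those are in the repository above); the vocabulary and prompt template are illustrative assumptions.

```python
# Toy associative-recall sample generator (illustrative stand-in, NOT the GAR format).
import random

def make_recall_sample(n_pairs: int = 4, seed: int = 0) -> tuple[str, str]:
    rng = random.Random(seed)
    keys = rng.sample("ABCDEFGHIJ", n_pairs)
    values = rng.sample("0123456789", n_pairs)
    query = rng.choice(keys)
    context = " ".join(f"{k} {v}" for k, v in zip(keys, values))
    answer = values[keys.index(query)]
    # The model sees the context plus the query key and must output the bound value.
    return f"{context} {query}", answer

prompt, answer = make_recall_sample(seed=42)
print(prompt, "->", answer)
```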
Improving Transformers with Dynamically Composable Multi-Head Attention
Xiao, Da, Meng, Qingye, Li, Shengping, Yuan, Xingyuan
Multi-Head Attention (MHA) is a key component of the Transformer. In MHA, attention heads work independently, causing problems such as a low-rank bottleneck in attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $\it{Compose}$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms the Transformer across architectures and model scales in language modeling, matching the performance of models trained with ~1.7x-2.0x the compute. For example, DCPythia-6.9B outperforms the open-source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at https://github.com/Caiyun-AI/DCFormer.
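To make the head-composition idea concrete, the sketch below shows a simplified, query-conditioned composition step applied to per-head attention scores: a low-rank mixing matrix is predicted at each position and used to combine the scores across heads. This is an illustrative simplification under assumed choices (low-rank mixing only, applied before softmax, residual identity term), not the paper's exact $\it{Compose}$ function.

```python
# Simplified, input-dependent head-composition step in the spirit of DCMHA.
# Assumptions: a single low-rank, query-conditioned mixing applied to pre-softmax
# scores; the actual Compose in the paper is richer and also acts after softmax.
import torch
import torch.nn as nn

class SimpleCompose(nn.Module):
    def __init__(self, d_model: int, n_heads: int, rank: int = 2):
        super().__init__()
        self.n_heads, self.rank = n_heads, rank
        # Predict a low-rank head-mixing matrix from the query-side hidden state.
        self.to_mix = nn.Linear(d_model, 2 * n_heads * rank)

    def forward(self, scores: torch.Tensor, x_q: torch.Tensor) -> torch.Tensor:
        # scores: (B, H, Tq, Tk) per-head attention scores; x_q: (B, Tq, D).
        B, H, Tq, Tk = scores.shape
        w1, w2 = self.to_mix(x_q).chunk(2, dim=-1)           # each (B, Tq, H * rank)
        w1 = w1.view(B, Tq, H, self.rank)
        w2 = w2.view(B, Tq, H, self.rank)
        mix = torch.einsum("btir,btjr->btij", w1, w2)         # (B, Tq, H, H), per position
        # New score of head i is a dynamic mixture of all heads' scores,
        # added to the original score (identity / residual part).
        composed = torch.einsum("btij,bjtk->bitk", mix, scores)
        return scores + composed

# Usage inside attention: scores = q @ k.transpose(-1, -2) / d_head**0.5, then
# scores = compose(scores, x_q) before softmax (placement here is an assumption).
compose = SimpleCompose(d_model=512, n_heads=8)
scores = torch.randn(2, 8, 16, 16)
x_q = torch.randn(2, 16, 512)
out = compose(scores, x_q)
```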
InterHT: Knowledge Graph Embeddings by Interaction between Head and Tail Entities
Wang, Baoxin, Meng, Qingye, Wang, Ziyue, Zhao, Honghong, Wu, Dayong, Che, Wanxiang, Wang, Shijin, Chen, Zhigang, Liu, Cong
Knowledge graph embedding (KGE) models learn representations of the entities and relations in a knowledge graph. Distance-based methods show promising performance on the link prediction task, where a candidate link is scored by the distance between the two entity representations. However, most of these methods represent the head and tail entities independently, which limits model capacity. We propose two novel distance-based methods, InterHT and InterHT+, that let the head and tail entities interact and thereby obtain better entity representations. Experimental results show that our proposed methods achieve the best results on the ogbl-wikikg2 dataset.
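For intuition, the sketch below shows one plausible distance-based scorer in which the head and tail embeddings interact through each other's auxiliary vectors before a TransE-style translation-and-distance score. The auxiliary embeddings, the elementwise product, and the +1 shift are assumptions made for illustration; consult the paper for the actual InterHT/InterHT+ formulation.

```python
# Distance-based triple scorer with head-tail interaction, in the spirit of InterHT.
# The exact parameterization here (auxiliary vectors, elementwise products, "+1"
# shift, L1 distance) is an assumption, not the paper's definitive model.
import torch
import torch.nn as nn

class InterHTStyleScorer(nn.Module):
    def __init__(self, n_entities: int, n_relations: int, dim: int = 200):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)       # main entity embeddings
        self.ent_aux = nn.Embedding(n_entities, dim)   # auxiliary embeddings for interaction
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, h_idx: torch.Tensor, r_idx: torch.Tensor, t_idx: torch.Tensor) -> torch.Tensor:
        h, t = self.ent(h_idx), self.ent(t_idx)
        h_aux, t_aux = self.ent_aux(h_idx), self.ent_aux(t_idx)
        r = self.rel(r_idx)
        # Head and tail interact through each other's auxiliary vectors before the
        # TransE-style translation-and-distance scoring.
        h_inter = h * (t_aux + 1.0)
        t_inter = t * (h_aux + 1.0)
        return -torch.norm(h_inter + r - t_inter, p=1, dim=-1)   # higher = more plausible

# Usage: score a batch of (head, relation, tail) triples.
model = InterHTStyleScorer(n_entities=1000, n_relations=50)
s = model.score(torch.tensor([1, 2]), torch.tensor([0, 3]), torch.tensor([5, 7]))
```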