Jiang, Fan
Analysis of Learning-based Offshore Wind Power Prediction Models with Various Feature Combinations
Fang, Linhan, Jiang, Fan, Toms, Ann Mary, Li, Xingpeng
Accurate wind speed prediction is crucial for designing and selecting sites for offshore wind farms. This paper investigates the effectiveness of various machine learning models in predicting offshore wind power for a site near the Gulf of Mexico by analyzing meteorological data. After collecting and preprocessing meteorological data, nine different input feature combinations were designed to assess their impact on wind power predictions at multiple heights. The results show that using wind speed as the output feature improves prediction accuracy by approximately 10% compared to using wind power as the output. In addition, the improvement of multi-feature input compared with single-feature input is not obvious mainly due to the poor correlation among key features and limited generalization ability of models. These findings underscore the importance of selecting appropriate output features and highlight considerations for using machine learning in wind power forecasting, offering insights that could guide future wind power prediction models and conversion techniques.
Few-Shot Multilingual Open-Domain QA from 5 Examples
Jiang, Fan, Drummond, Tom, Cohn, Trevor
Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a \emph{few-shot learning} approach to synthesise large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, \textsc{FsModQA}, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval. We further show our method can be extended for effective zero-shot adaptation to new languages through a \emph{cross-lingual prompting} strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.
Franken-Adapter: Cross-Lingual Adaptation of LLMs by Embedding Surgery
Jiang, Fan, Yu, Honglin, Chung, Grace, Cohn, Trevor
The capabilities of Large Language Models (LLMs) in low-resource languages lag far behind those in English, making their universal accessibility a significant challenge. To alleviate this, we present $\textit{Franken-Adapter}$, a modular language adaptation approach for decoder-only LLMs with embedding surgery. Our method begins by creating customized vocabularies for target languages and performing language adaptation through embedding tuning on multilingual data. These pre-trained embeddings are subsequently integrated with LLMs that have been instruction-tuned on English alignment data to enable zero-shot cross-lingual transfer. Our experiments on $\texttt{Gemma2}$ models with up to 27B parameters demonstrate improvements of up to 20% across 96 languages, spanning both discriminative and generative tasks, with minimal regressions ($<$1%) in English. Further in-depth analysis reveals the critical role of customizing tokenizers in enhancing language adaptation, while boosting inference efficiency. Additionally, we show the versatility of our method by achieving a 14% improvement over a math-optimized LLM across 20 languages, offering a modular solution to transfer reasoning abilities across languages post hoc.
Human-In-the-Loop Software Development Agents
Takerngsaksiri, Wannita, Pasuksmit, Jirat, Thongtanunam, Patanamon, Tantithamthavorn, Chakkrit, Zhang, Ruixiong, Jiang, Fan, Li, Jing, Cook, Evan, Chen, Kun, Wu, Ming
Recently, Large Language Models (LLMs)-based multi-agent paradigms for software engineering are introduced to automatically resolve software development tasks (e.g., from a given issue to source code). However, existing work is evaluated based on historical benchmark datasets, rarely considers human feedback at each stage of the automated software development process, and has not been deployed in practice. In this paper, we introduce a Human-in-the-loop LLM-based Agents framework (HULA) for software development that allows software engineers to refine and guide LLMs when generating coding plans and source code for a given task. We design, implement, and deploy the HULA framework into Atlassian JIRA for internal uses. Through a multi-stage evaluation of the HULA framework, Atlassian software engineers perceive that HULA can minimize the overall development time and effort, especially in initiating a coding plan and writing code for straightforward tasks. On the other hand, challenges around code quality remain a concern in some cases. We draw lessons learned and discuss opportunities for future work, which will pave the way for the advancement of LLM-based agents in software development.
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
Sun, Xingwu, Chen, Yanfeng, Huang, Yiqing, Xie, Ruobing, Zhu, Jiaqi, Zhang, Kai, Li, Shuaipeng, Yang, Zhen, Han, Jonny, Shu, Xiaobo, Bu, Jiahao, Chen, Zhongzhi, Huang, Xuemeng, Lian, Fengzong, Yang, Saiyong, Yan, Jianfeng, Zeng, Yuyuan, Ren, Xiaoqin, Yu, Chao, Wu, Lulu, Mao, Yue, Xia, Jun, Yang, Tao, Zheng, Suncong, Wu, Kan, Jiao, Dian, Xue, Jinbao, Zhang, Xipeng, Wu, Decheng, Liu, Kai, Wu, Dengpeng, Xu, Guanghui, Chen, Shaohua, Chen, Shuang, Feng, Xiao, Hong, Yigeng, Zheng, Junqiang, Xu, Chengcheng, Li, Zongwei, Kuang, Xiong, Hu, Jianglu, Chen, Yiqi, Deng, Yuchi, Li, Guiyang, Liu, Ao, Zhang, Chenchen, Hu, Shihui, Zhao, Zilong, Wu, Zifan, Ding, Yao, Wang, Weichao, Liu, Han, Wang, Roberts, Fei, Hao, Yu, Peijie, Zhao, Ze, Cao, Xun, Wang, Hai, Xiang, Fusheng, Huang, Mengyuan, Xiong, Zhiyuan, Hu, Bin, Hou, Xuebin, Jiang, Lei, Ma, Jianqiang, Wu, Jiajia, Deng, Yaping, Shen, Yi, Wang, Qian, Liu, Weijie, Liu, Jie, Chen, Meng, Dong, Liang, Jia, Weiwen, Chen, Hu, Liu, Feifei, Yuan, Rui, Xu, Huilin, Yan, Zhenxiang, Cao, Tengfei, Hu, Zhichao, Feng, Xinhua, Du, Dong, Yu, Tinghao, Tao, Yangyu, Zhang, Feng, Zhu, Jianchen, Xu, Chengzhong, Li, Xirui, Zha, Chong, Ouyang, Wen, Xia, Yinben, Li, Xiang, He, Zekun, Chen, Rongpeng, Song, Jiawei, Chen, Ruibin, Jiang, Fan, Zhao, Chongqing, Wang, Bo, Gong, Hao, Gan, Rong, Hu, Winston, Kang, Zhanhui, Yang, Yong, Liu, Yuhong, Wang, Di, Jiang, Jie
In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practice of Hunyuan-Large include large-scale synthetic data that is orders larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we also investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidances for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Codes: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision
Jiang, Fan, Drummond, Tom, Cohn, Trevor
Cross-lingual open domain question answering (CLQA) is a complex problem, comprising cross-lingual retrieval from a multilingual knowledge base, followed by answer generation in the query language. Both steps are usually tackled by separate models, requiring substantial annotated datasets, and typically auxiliary resources, like machine translation systems to bridge between languages. In this paper, we show that CLQA can be addressed using a single encoder-decoder model. To effectively train this model, we propose a self-supervised method based on exploiting the cross-lingual link structure within Wikipedia. We demonstrate how linked Wikipedia pages can be used to synthesise supervisory signals for cross-lingual retrieval, through a form of cloze query, and generate more natural questions to supervise answer generation. Together, we show our approach, \texttt{CLASS}, outperforms comparable methods on both supervised and zero-shot language adaptation settings, including those using machine translation.
UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer
Liu, Ji, Tang, Dehua, Huang, Yuanxian, Zhang, Li, Zeng, Xiaocheng, Li, Dong, Lu, Mingjie, Peng, Jinzhang, Wang, Yu, Jiang, Fan, Tian, Lu, Sirasao, Ashish
Traditional channel-wise pruning methods by reducing network channels struggle to effectively prune efficient CNN models with depth-wise convolutional layers and certain efficient modules, such as popular inverted residual blocks. Prior depth pruning methods by reducing network depths are not suitable for pruning some efficient models due to the existence of some normalization layers. Moreover, finetuning subnet by directly removing activation layers would corrupt the original model weights, hindering the pruned model from achieving high performance. To address these issues, we propose a novel depth pruning method for efficient models. Our approach proposes a novel block pruning strategy and progressive training method for the subnet. Additionally, we extend our pruning method to vision transformer models. Experimental results demonstrate that our method consistently outperforms existing depth pruning methods across various pruning configurations. We obtained three pruned ConvNeXtV1 models with our method applying on ConvNeXtV1, which surpass most SOTA efficient models with comparable inference performance. Our method also achieves state-of-the-art pruning performance on the vision transformer model.
Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval
Jiang, Fan, Xu, Qiongkai, Drummond, Tom, Cohn, Trevor
Neural 'dense' retrieval models are state of the art for many datasets, however these models often exhibit limited domain transfer ability. Existing approaches to adaptation are unwieldy, such as requiring explicit supervision, complex model architectures, or massive external models. We present $\texttt{ABEL}$, a simple but effective unsupervised method to enhance passage retrieval in zero-shot settings. Our technique follows a straightforward loop: a dense retriever learns from supervision signals provided by a reranker, and subsequently, the reranker is updated based on feedback from the improved retriever. By iterating this loop, the two components mutually enhance one another's performance. Experimental results demonstrate that our unsupervised $\texttt{ABEL}$ model outperforms both leading supervised and unsupervised retrievers on the BEIR benchmark. Meanwhile, it exhibits strong adaptation abilities to tasks and domains that were unseen during training. By either fine-tuning $\texttt{ABEL}$ on labelled data or integrating it with existing supervised dense retrievers, we achieve state-of-the-art results.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/BootSwitch}.}
Noisy Self-Training with Synthetic Queries for Dense Retrieval
Jiang, Fan, Drummond, Tom, Cohn, Trevor
Although existing neural retrieval models reveal promising results when training data is abundant and the performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolution manner with no reliance on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Extra analysis on low-resource settings reveals that our method is data efficient and outperforms competitive baselines, with as little as 30% of labelled training data. Further extending the framework for reranker training demonstrates that the proposed method is general and yields additional gains on tasks of diverse domains.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/Self-Training-DPR}}
Robust Indoor Localization with Ranging-IMU Fusion
Jiang, Fan, Caruso, David, Dhekne, Ashutosh, Qu, Qi, Engel, Jakob Julian, Dong, Jing
Indoor wireless ranging localization is a promising approach for low-power and high-accuracy localization of wearable devices. A primary challenge in this domain stems from non-line of sight propagation of radio waves. This study tackles a fundamental issue in wireless ranging: the unpredictability of real-time multipath determination, especially in challenging conditions such as when there is no direct line of sight. We achieve this by fusing range measurements with inertial measurements obtained from a low cost Inertial Measurement Unit (IMU). For this purpose, we introduce a novel asymmetric noise model crafted specifically for non-Gaussian multipath disturbances. Additionally, we present a novel Levenberg-Marquardt (LM)-family trust-region adaptation of the iSAM2 fusion algorithm, which is optimized for robust performance for our ranging-IMU fusion problem. We evaluate our solution in a densely occupied real office environment. Our proposed solution can achieve temporally consistent localization with an average absolute accuracy of $\sim$0.3m in real-world settings. Furthermore, our results indicate that we can achieve comparable accuracy even with infrequent (1Hz) range measurements.