Su, Jiahao
FullStack Bench: Evaluating LLMs as Full Stack Coders
Bytedance-Seed-Foundation-Code-Team: Cheng, Yao, Chen, Jianfeng, Chen, Jie, Chen, Li, Chen, Liyu, Chen, Wentao, Chen, Zhengyu, Geng, Shijie, Li, Aoyan, Li, Bo, Li, Bowen, Li, Linyi, Liu, Boyi, Liu, Jerry, Liu, Kaibo, Liu, Qi, Liu, Shukai, Liu, Siyao, Liu, Tianyi, Liu, Tingkai, Liu, Yongfei, Long, Rui, Mai, Jing, Ning, Guanghan, Peng, Z. Y., Shen, Kai, Su, Jiahao, Su, Jing, Sun, Tao, Sun, Yifan, Tao, Yunzhe, Wang, Guoyin, Wang, Siwei, Wang, Xuwu, Wang, Yite, Wang, Zihan, Xia, Jinxiang, Xiang, Liang, Xiao, Xia, Xiao, Yongsheng, Xi, Chenguang, Xin, Shulin, Xu, Jingjing, Xu, Shikun, Yang, Hongxia, Yang, Jack, Yang, Yingxiang, Yuan, Jianbo, Zhang, Jun, Zhang, Yufeng, Zhang, Yuyu, Zheng, Shen, Zhu, He, Zhu, Ming
As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets evaluate only limited application domains. To address this gap, we have developed FullStack Bench, a comprehensive code evaluation dataset focusing on full-stack programming that encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Additionally, to assess multilingual programming capabilities, FullStack Bench includes real-world instructions and corresponding unit test cases in 16 widely used programming languages, designed to reflect real-world usage scenarios rather than simple translations. Moreover, we release an effective code sandbox execution tool (i.e., SandboxFusion) that supports various programming languages and packages to evaluate performance on FullStack Bench efficiently. Comprehensive experimental results demonstrate the necessity and effectiveness of FullStack Bench and SandboxFusion.
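To make the evaluation loop concrete, here is a minimal sketch of unit-test-based scoring: model-generated code is concatenated with its test cases and executed in a subprocess with a timeout. This is only a toy stand-in; SandboxFusion's actual API and its language/package isolation are not shown here, and `run_solution_against_tests` is a hypothetical helper name.

```python
import subprocess
import sys
import tempfile
import textwrap

def run_solution_against_tests(solution_code: str, test_code: str,
                               timeout: float = 10.0) -> bool:
    """Execute model-generated code plus its unit tests in a subprocess.

    Toy stand-in for a real sandbox: it enforces a timeout but provides
    no filesystem/network isolation, which a tool like SandboxFusion would.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0  # tests pass iff the process exits cleanly
    except subprocess.TimeoutExpired:
        return False

solution = "def add(a, b):\n    return a + b"
tests = textwrap.dedent("""
    assert add(1, 2) == 3
    assert add(-1, 1) == 0
""")
print(run_solution_against_tests(solution, tests))  # True
```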
conv_einsum: A Framework for Representation and Fast Evaluation of Multilinear Operations in Convolutional Tensorial Neural Networks
Rabbani, Tahseen, Su, Jiahao, Liu, Xiaoyu, Chan, David, Sangston, Geoffrey, Huang, Furong
Modern ConvNets continue to achieve state-of-the-art results over a vast array of vision and image classification tasks, but at the cost of ever-increasing parameter counts. One strategy for compactifying a network without sacrificing much expressive power is to reshape it into a tensorial neural network (TNN), which is a higher-order tensorization of its layers, followed by a factorization, such as a CP decomposition, which strips a weight down to its critical basis components. Passes through TNNs can be represented as sequences of multilinear operations (MLOs), where the evaluation path can greatly affect the number of floating point operations (FLOPs) incurred. While functions such as the popular einsum can evaluate simple MLOs such as contractions, existing implementations cannot process multi-way convolutions, resulting in scant assessments of how optimal evaluation paths through tensorized convolutional layers can improve training speed. In this paper, we develop a unifying framework for representing tensorial convolution layers as einsum-like strings and a meta-algorithm, conv_einsum, which is able to evaluate these strings in a FLOPs-minimizing manner. Comprehensive experiments, using our open-source implementation, over a wide range of models, tensor decompositions, and diverse tasks demonstrate that conv_einsum significantly increases both the computational and memory efficiency of convolutional TNNs.
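The following sketch illustrates, with plain NumPy, why the evaluation path of an MLO matters and what standard einsum can and cannot do. The contraction string and shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# A CP-factorized contraction: the order in which factors are contracted
# changes the FLOP count, which einsum's path optimizer can exploit.
x = np.random.randn(32, 64)   # batch of inputs
a = np.random.randn(64, 8)    # CP factor 1 (rank 8)
b = np.random.randn(8, 128)   # CP factor 2

# Left-to-right evaluation vs. an optimized contraction path.
naive = np.einsum("bi,ir,ro->bo", x, a, b, optimize=False)
fast = np.einsum("bi,ir,ro->bo", x, a, b, optimize="optimal")
assert np.allclose(naive, fast)

# Inspect the FLOP estimates for the chosen path.
path, info = np.einsum_path("bi,ir,ro->bo", x, a, b, optimize="optimal")
print(info)  # reports naive vs. optimized scaling

# What einsum cannot express: a convolution index that both slides and
# is reduced, as in multi-way convolution. The paper's einsum-like
# strings extend the notation to cover such modes; the exact syntax is
# defined in the open-source conv_einsum implementation.
```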
LEMON: Lossless model expansion
Wang, Yite, Su, Jiahao, Lu, Hanlin, Xie, Cong, Liu, Tianyi, Yuan, Jianbo, Lin, Haibin, Sun, Ruoyu, Yang, Hongxia
Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present $\textbf{L}$ossl$\textbf{E}$ss $\textbf{MO}$del Expansio$\textbf{N}$ (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
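As a toy illustration of what "lossless" expansion means, the sketch below doubles the hidden width of a two-layer MLP while exactly preserving its function: the first layer's neurons are duplicated and the second layer's columns are halved so the copies' contributions sum back to the original output. This is a generic function-preserving expansion in the spirit of LEMON, not the paper's full recipe, which also handles residual connections, normalization layers, and attention.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Small two-layer MLP: y = W2 @ relu(W1 @ x).
W1 = rng.standard_normal((4, 3))  # hidden=4, input=3
W2 = rng.standard_normal((2, 4))  # output=2

# Losslessly double the hidden width: duplicate the first layer's rows
# (neurons), then halve and duplicate the second layer's columns so the
# two copies of each neuron sum to the original contribution.
W1_big = np.concatenate([W1, W1], axis=0)          # (8, 3)
W2_big = np.concatenate([W2 / 2, W2 / 2], axis=1)  # (2, 8)

x = rng.standard_normal(3)
y_small = W2 @ relu(W1 @ x)
y_big = W2_big @ relu(W1_big @ x)
assert np.allclose(y_small, y_big)  # identical function, larger model
```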
Reviving Shift Equivariance in Vision Transformers
Ding, Peijian, Soselia, Davit, Armstrong, Thomas, Su, Jiahao, Huang, Furong
Shift equivariance is a fundamental principle that governs how we perceive the world -- our recognition of an object remains invariant with respect to shifts. Transformers have gained immense popularity due to their effectiveness in both language and vision tasks. While the self-attention operator in vision transformers (ViT) is permutation-equivariant and thus shift-equivariant, patch embedding, positional encoding, and subsampled attention in ViT variants can disrupt this property, resulting in inconsistent predictions even under small shift perturbations. Although there is a growing trend of incorporating the inductive bias of convolutional neural networks (CNNs) into vision transformers, it does not fully address the issue. We propose an adaptive polyphase anchoring algorithm that can be seamlessly integrated into vision transformer models to ensure shift equivariance in patch embedding and subsampled attention modules, such as window attention and global subsampled attention. Furthermore, we utilize depth-wise convolution to encode positional information. Our algorithms enable ViT and its variants, such as Twins, to achieve 100% consistency with respect to input shifts and demonstrate robustness to cropping, flipping, and affine transformations, whereas the original models lose 20 percentage points of accuracy on average when shifted by just a few pixels, with Twins' accuracy dropping from 80.57% to 62.40%.
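A toy 1D example conveys the polyphase-selection idea behind the anchoring algorithm: instead of always keeping phase 0 when subsampling, keep the polyphase component with the largest energy, so the selection travels with the signal. This is a deliberate simplification; the paper applies the idea in 2D to patch embedding and subsampled attention.

```python
import numpy as np

def aps_downsample(x: np.ndarray, stride: int = 2) -> np.ndarray:
    """Adaptive polyphase sampling in 1D: keep the polyphase component
    with the largest norm rather than a fixed phase, which restores
    consistency under input shifts."""
    phases = [x[p::stride] for p in range(stride)]
    norms = [np.linalg.norm(p) for p in phases]
    return phases[int(np.argmax(norms))]

x = np.array([3.0, 0.1, 4.0, 0.2, 5.0, 0.3])
shifted = np.roll(x, 1)  # shift the input by one sample

# Naive stride-2 sampling is not shift-consistent...
assert not np.array_equal(x[::2], shifted[::2])
# ...but adaptive selection picks the same high-energy component either way.
assert np.array_equal(aps_downsample(x), aps_downsample(shifted))
```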
Adversarial Auto-Augment with Label Preservation: A Representation Learning Principle Guided Approach
Yang, Kaiwen, Sun, Yanchao, Su, Jiahao, He, Fengxiang, Tian, Xinmei, Huang, Furong, Zhou, Tianyi, Tao, Dacheng
Data augmentation is a critical contributing factor to the success of deep learning but relies heavily on prior domain knowledge, which is not always available. Recent works on automatic data augmentation learn a policy to form a sequence of augmentation operations, which are still pre-defined and restricted to limited options. In this paper, we show that the objective of prior-free autonomous data augmentation can be derived from a representation learning principle that aims to preserve the minimum sufficient information of the labels. Given an example, the objective is to create a distant "hard positive example" as the augmentation while still preserving the original label. We then propose a practical surrogate to this objective that can be optimized efficiently and integrated seamlessly into existing methods for a broad class of machine learning tasks, e.g., supervised, semi-supervised, and noisy-label learning. Unlike previous works, our method does not require training an extra generative model but instead leverages the intermediate-layer representations of the end-task model to generate data augmentations. In experiments, we show that our method consistently brings non-trivial improvements to the three aforementioned learning tasks in both efficiency and final performance, whether or not it is combined with strong pre-defined augmentations, e.g., on medical images where domain knowledge is unavailable and existing augmentation techniques perform poorly.
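The sketch below is a drastically simplified, one-step heuristic version of the idea -- push an example away from itself in the end-task model's feature space, then keep the perturbation only if the predicted label is unchanged. It is an assumption-laden illustration (the model split, step size, and acceptance test are all invented here), not the paper's surrogate objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model split into a feature extractor and a classification head,
# mirroring the use of intermediate-layer representations.
feature = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 10)

def augment(x: torch.Tensor, step: float = 0.1) -> torch.Tensor:
    """One ascent step away from x in feature space ("hard positive"),
    accepted only if the model's predicted label is preserved."""
    # Start from a slightly jittered copy so the distance gradient at x
    # is non-zero, then take one signed-gradient ascent step.
    x_adv = (x + 0.01 * torch.randn_like(x)).requires_grad_(True)
    dist = (feature(x_adv) - feature(x).detach()).pow(2).sum()
    dist.backward()
    candidate = (x + step * x_adv.grad.sign()).detach()
    # Label preservation: fall back to the original example otherwise.
    with torch.no_grad():
        same = head(feature(candidate)).argmax(-1) == head(feature(x)).argmax(-1)
    return torch.where(same.unsqueeze(-1), candidate, x)

x = torch.randn(4, 16)
x_aug = augment(x)
print((x_aug - x).abs().max())  # bounded, label-preserving perturbation
```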
Tensorized Spectrum Preserving Compression for Neural Networks
Su, Jiahao, Li, Jingling, Bhattacharjee, Bobby, Huang, Furong
Modern neural networks can have tens of millions of parameters and are often ill-suited for smartphones or IoT devices. In this paper, we describe an efficient mechanism for compressing large networks by tensorizing network layers: i.e., mapping layers onto high-order matrices, for which we introduce new tensor decomposition methods. Compared to previous compression methods, some of which use tensor decomposition, our techniques preserve more of the network's invariance structure. Coupled with a new data reconstruction-based learning method, we show that tensorized compression outperforms existing techniques for both convolutional and fully-connected layers on state-of-the-art networks.
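To illustrate what tensorizing buys, the sketch below reshapes a fully-connected weight into a fourth-order tensor, pairs input/output modes, and truncates an SVD of the rearranged matrix -- a generic Kronecker-product (Van Loan-Pitsianis) compression chosen for clarity, not the paper's specific decomposition. The rank-r approximation stores r*(64+64) numbers instead of 4096.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # a fully-connected layer's weight

# Tensorize: view each 64-dim axis as 8 x 8, giving W[i1, i2, o1, o2],
# then pair (i1, o1) and (i2, o2) and unfold back into a matrix.
T = W.reshape(8, 8, 8, 8)                    # (i1, i2, o1, o2)
M = T.transpose(0, 2, 1, 3).reshape(64, 64)  # rows=(i1,o1), cols=(i2,o2)

# A truncated SVD of the rearranged matrix yields a sum of Kronecker
# products: W ~= sum_r  A_r (8x8)  kron  B_r (8x8).
r = 6
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = (U[:, :r] * s[:r]) @ Vt[:r]

# Fold back to the original layout and measure the compression.
W_r = M_r.reshape(8, 8, 8, 8).transpose(0, 2, 1, 3).reshape(64, 64)
params = r * (64 + 64)  # 768 numbers instead of 4096
err = np.linalg.norm(W - W_r) / np.linalg.norm(W)
print(params, f"relative error {err:.3f}")
```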