Peng, Hanyu
Faster Algorithms for Generalized Mean Densest Subgraph Problem
Fan, Chenglin, Li, Ping, Peng, Hanyu
The densest subgraph of a large graph usually refers to a subgraph with the highest average degree, a notion that has been extended to the family of $p$-mean dense subgraph objectives by~\citet{veldt2021generalized}. The $p$-mean densest subgraph problem seeks a subgraph with the highest average $p$-th-power degree, whereas the standard densest subgraph problem seeks a subgraph with simply the highest average degree. It was shown that the standard peeling algorithm can perform arbitrarily poorly on the generalized objective when $p>1$, while its behavior remained unclear for $0<p<1$.
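For intuition, here is a minimal Python sketch of the $p$-mean objective and the standard greedy peeling baseline discussed above; the adjacency-set graph format, function names, and tie-breaking rule are illustrative assumptions, and this is not the faster algorithm proposed in the paper.

```python
# Sketch only: the p-mean objective (average p-th-power degree) and standard peeling.
def p_mean_density(adj, S, p):
    """Average p-th-power degree of the subgraph induced by vertex set S."""
    if not S:
        return 0.0
    return sum(len(adj[v] & S) ** p for v in S) / len(S)

def standard_peeling(adj, p):
    """Repeatedly remove a minimum-degree vertex; return the best prefix seen."""
    S = set(adj)
    best_val, best_S = p_mean_density(adj, S, p), set(S)
    deg = {v: len(adj[v] & S) for v in S}
    while S:
        v = min(S, key=deg.get)        # peel a currently minimum-degree vertex
        S.remove(v)
        for u in adj[v] & S:           # update degrees of its remaining neighbors
            deg[u] -= 1
        val = p_mean_density(adj, S, p)
        if val > best_val:
            best_val, best_S = val, set(S)
    return best_S, best_val

# Toy usage: a triangle plus a pendant vertex.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(standard_peeling(adj, p=2))      # here the whole graph maximizes the p = 2 objective
```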
Copula for Instance-wise Feature Selection and Ranking
Peng, Hanyu, Fang, Guanhua, Li, Ping
Instance-wise feature selection and ranking methods can achieve a good selection of task-friendly features for each sample in the context of neural networks. However, existing approaches that assume feature subsets to be independent are imperfect when considering the dependency between features. To address this limitation, we propose to incorporate the copula, a mathematical tool for capturing correlations between variables, into instance-wise feature selection and ranking.
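As a rough illustration of the kind of dependency structure independence-based selectors miss (not the paper's model), the Python sketch below estimates a Gaussian copula correlation among features by mapping each feature through its empirical CDF and then to normal scores; the function name and toy data are assumptions.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_copula_correlation(X):
    """Estimate the copula correlation matrix of features in X (n_samples x n_features)."""
    n = X.shape[0]
    # 1) Map each feature to (0, 1) via its empirical CDF (scaled ranks).
    U = rankdata(X, axis=0) / (n + 1)
    # 2) Map to standard-normal scores; their correlation defines the Gaussian copula.
    Z = norm.ppf(U)
    return np.corrcoef(Z, rowvar=False)

# Toy usage: two strongly dependent features and one independent feature.
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
X = np.column_stack([x1, np.exp(x1) + 0.1 * rng.normal(size=1000), rng.normal(size=1000)])
print(np.round(gaussian_copula_correlation(X), 2))
```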
Dataset Pruning: Reducing Training Data by Examining Generalization Influence
Yang, Shuo, Xie, Zeke, Peng, Hanyu, Xu, Min, Sun, Mingming, Li, Ping
The great success of deep learning heavily relies on increasingly larger training data, which comes at the price of huge computational and infrastructural costs. This raises crucial questions: do all training data contribute to the model's performance? How much does each individual training sample or sub-training-set affect the model's generalization, and how can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these questions, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with theoretical guarantees, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% decrease in test accuracy, which is superior to previous score-based sample selection methods.
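As a loose illustration of budget-constrained sample selection (not the paper's actual optimization), the sketch below assumes per-sample influence scores on the generalization gap are already available and greedily prunes the largest set whose summed influence stays within a tolerance; the function name and the `epsilon` parameter are hypothetical.

```python
import numpy as np

def prune_by_influence(influence, epsilon):
    """Return indices of training samples to remove.

    influence: non-negative scores, influence[i] approximates how much removing
               sample i would change the generalization (test) loss.
    epsilon:   allowed total generalization-gap budget for the pruned set.
    """
    order = np.argsort(influence)          # cheapest-to-remove samples first
    cum = np.cumsum(influence[order])
    k = int(np.searchsorted(cum, epsilon, side="right"))
    return order[:k]                       # largest prefix within the budget

# Toy usage: prune as many samples as possible while summed influence stays <= 0.05.
rng = np.random.default_rng(0)
scores = np.abs(rng.normal(scale=0.01, size=1000))
pruned = prune_by_influence(scores, epsilon=0.05)
print(f"pruned {pruned.size} of {scores.size} samples")
```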
MetaTPTrans: A Meta Learning Approach for Multilingual Code Representation Learning
Pian, Weiguo, Peng, Hanyu, Tang, Xunzhu, Sun, Tiezhu, Tian, Haoye, Habib, Andrew, Klein, Jacques, Bissyandé, Tegawendé F.
Representation learning of source code is essential for applying machine learning to software engineering tasks. Learning code representations from a multilingual source code dataset has been shown to be more effective than learning from single-language datasets separately, since more training data from a multilingual dataset improves the model's ability to extract language-agnostic information from source code. However, existing multilingual training focuses only on learning a unified model with parameters shared across languages for language-agnostic information modeling, and overlooks the language-specific information that is crucial for modeling source code across different programming languages. To address this problem, we propose MetaTPTrans, a meta-learning approach for multilingual code representation learning. MetaTPTrans generates different parameters for the feature extractor according to the programming language of the input code snippet, enabling the model to learn both language-agnostic and language-specific information with dynamic parameters in the feature extractor. We conduct experiments on the code summarization and code completion tasks to verify the effectiveness of our approach. The results demonstrate the superiority of our approach, with significant improvements over state-of-the-art baselines.
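To illustrate the general idea of language-conditioned parameter generation (a hypernetwork-style sketch, not the actual MetaTPTrans architecture; layer sizes and names are assumptions), a small module can map a programming-language id to the weights of a projection layer applied on top of shared token features.

```python
import torch
import torch.nn as nn

class LanguageConditionedProjection(nn.Module):
    def __init__(self, num_languages, d_model):
        super().__init__()
        self.d_model = d_model
        self.lang_embedding = nn.Embedding(num_languages, 64)
        # Hypernetwork: language embedding -> flattened weight and bias of a linear layer.
        self.hyper = nn.Linear(64, d_model * d_model + d_model)

    def forward(self, x, lang_id):
        # x: (batch, seq_len, d_model) shared token features; lang_id: (batch,) language indices.
        params = self.hyper(self.lang_embedding(lang_id))
        w = params[:, : self.d_model * self.d_model].view(-1, self.d_model, self.d_model)
        b = params[:, self.d_model * self.d_model :]
        # Apply the per-sample generated linear layer at every token position.
        return torch.einsum("bsd,bde->bse", x, w) + b.unsqueeze(1)

# Toy usage: 4 languages, a batch of 2 code snippets with 10 tokens each.
layer = LanguageConditionedProjection(num_languages=4, d_model=32)
out = layer(torch.randn(2, 10, 32), torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 10, 32])
```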