Plotting

 Wang, Ke


COSET: A Benchmark for Evaluating Neural Program Embeddings

arXiv.org Machine Learning

Neural program embedding can be helpful in analyzing large software, a task that is challenging for traditional logic-based program analyses due to their limited scalability. A key focus of recent machine-learning advances in this area is on modeling program semantics instead of just syntax. Unfortunately evaluating such advances is not obvious, as program semantics does not lend itself to straightforward metrics. In this paper, we introduce a benchmarking framework called COSET for standardizing the evaluation of neural program embeddings. COSET consists of a diverse dataset of programs in source-code format, labeled by human experts according to a number of program properties of interest. A point of novelty is a suite of program transformations included in COSET. These transformations when applied to the base dataset can simulate natural changes to program code due to optimization and refactoring and can serve as a "debugging" tool for classification mistakes. We conducted a pilot study on four prominent models--TreeLSTM [1], gated graph neural network (GGNN) [2], AST-Path neural network (APNN) [3], and DYPRO [4]. We found that COSET is useful in identifying the strengths and limitations of each model and in pinpointing specific syntactic and semantic characteristics of programs that pose challenges.


Enhanced Expressive Power and Fast Training of Neural Networks by Random Projections

arXiv.org Machine Learning

Random projections are able to perform dimension reduction efficiently for datasets with nonlinear low-dimensional structures. One well-known example is that random matrices embed sparse vectors into a low-dimensional subspace nearly isometrically, known as the restricted isometric property in compressed sensing. In this paper, we explore some applications of random projections in deep neural networks. We provide the expressive power of fully connected neural networks when the input data are sparse vectors or form a low-dimensional smooth manifold. We prove that the number of neurons required for approximating a Lipschitz function with a prescribed precision depends on the sparsity or the dimension of the manifold and weakly on the dimension of the input vector. The key in our proof is that random projections embed stably the set of sparse vectors or a low-dimensional smooth manifold into a low-dimensional subspace. Based on this fact, we also propose some new neural network models, where at each layer the input is first projected onto a low-dimensional subspace by a random projection and then the standard linear connection and non-linear activation are applied. In this way, the number of parameters in neural networks is significantly reduced, and therefore the training of neural networks can be accelerated without too much performance loss.


AI Benchmark: Running Deep Neural Networks on Android Smartphones

arXiv.org Artificial Intelligence

Over the last years, the computational power of mobile devices such as smartphones and tablets has grown dramatically, reaching the level of desktop computers available not long ago. While standard smartphone apps are no longer a problem for them, there is still a group of tasks that can easily challenge even high-end devices, namely running artificial intelligence algorithms. In this paper, we present a study of the current state of deep learning in the Android ecosystem and describe available frameworks, programming models and the limitations of running AI on smartphones. We give an overview of the hardware acceleration resources available on four main mobile chipset platforms: Qualcomm, HiSilicon, MediaTek and Samsung. Additionally, we present the real-world performance results of different mobile SoCs collected with AI Benchmark that are covering all main existing hardware configurations.


Dynamic Neural Program Embedding for Program Repair

arXiv.org Artificial Intelligence

Neural program embeddings have shown much promise recently for a variety of program analysis tasks, including program synthesis, program repair, fault localization, etc. However, most existing program embeddings are based on syntactic features of programs, such as raw token sequences or abstract syntax trees. Unlike images and text, a program has an unambiguous semantic meaning that can be difficult to capture by only considering its syntax (i.e. syntactically similar pro- grams can exhibit vastly different run-time behavior), which makes syntax-based program embeddings fundamentally limited. This paper proposes a novel semantic program embedding that is learned from program execution traces. Our key insight is that program states expressed as sequential tuples of live variable values not only captures program semantics more precisely, but also offer a more natural fit for Recurrent Neural Networks to model. We evaluate different syntactic and semantic program embeddings on predicting the types of errors that students make in their submissions to an introductory programming class and two exercises on the CodeHunt education platform. Evaluation results show that our new semantic program embedding significantly outperforms the syntactic program embeddings based on token sequences and abstract syntax trees. In addition, we augment a search-based program repair system with the predictions obtained from our se- mantic embedding, and show that search efficiency is also significantly improved.


Random perturbation and matrix sparsification and completion

arXiv.org Machine Learning

We discuss general perturbation inequalities when the perturbation is random. As applications, we obtain several new results concerning two important problems: matrix sparsification and matrix completion.


MIDA: Multiple Imputation using Denoising Autoencoders

arXiv.org Machine Learning

Missing data is a significant problem impacting all domains. State-of-the-art framework for minimizing missing data bias is multiple imputation, for which the choice of an imputation model remains nontrivial. We propose a multiple imputation model based on overcomplete deep denoising autoencoders. Our proposed model is capable of handling different data types, missingness patterns, missingness proportions and distributions. Evaluation on several real life datasets show our proposed model significantly outperforms current state-of-the-art methods under varying conditions while simultaneously improving end of the line analytics.


Recovering Loss to Followup Information Using Denoising Autoencoders

arXiv.org Machine Learning

Loss to followup is a significant issue in healthcare and has serious consequences for a study's validity and cost. Methods available at present for recovering loss to followup information are restricted by their expressive capabilities and struggle to model highly non-linear relations and complex interactions. In this paper we propose a model based on overcomplete denoising autoencoders to recover loss to followup information. Designed to work with high volume data, results on various simulated and real life datasets show our model is appropriate under varying dataset and loss to followup conditions and outperforms the state-of-the-art methods by a wide margin ($\ge 20\%$ in some scenarios) while preserving the dataset utility for final analysis.


Automated Geometry Theorem Proving for Human-Readable Proofs

AAAI Conferences

Geometry reasoning and proof form a major and challenging component in the K-121 mathematics curriculum. Although several computerized systems exist that help students learn and practice general geometry concepts, they do not target geometry proof problems, which are more advanced and difficult. Powerful geometry theorem provers also exist, however they typically employ advanced algebraic methods and generate complex, difficult to understand proofs, and thus do not meet general K-12 studentsโ€™ educational needs. This paper tackles these weaknesses of prior systems by introducing a geometry proof system, iGeoTutor, capable of generating human-readable elementary proofs, i.e. proofs using standard Euclidean axioms. We have gathered 77 problems in total from various sources, including ones unsolvable by other systems and from Math competitions. iGeoTutor solves all but two problems in under two minutes each, and more importantly, demonstrates a much more effective and intelligent proof search than prior systems. We have also conducted a pilot study with 12 high school students, and the results show that iGeoTutor provides a clear benefit in helping students learn geometry proofs. We are in active discussions with Khan Academy and local high schools for possible adoption of iGeo-Tutor in real learning environments.


Nothing Is Absolute

AAAI Conferences

This paper argues that approximate knowledge might be better for an AI system; because it is better in dealing with contradictory bases. This argument is based on the hypothesis that categorization is the basic means that we comprehend the world, and this is also the way we abstract specific instances into general rules. However, during abstracting process we might lack some information, which may cause our theories about the world to be incomplete. In this case, if our theories about the world are too certain, we would be unable to predict facts and relations with lower likelihood. And I will demonstrate that if we admit that our knowledge is approximate, despite the incompleteness we will be able to predict facts and relations that are of lower likelihood through Pattern Matching.


CCRank: Parallel Learning to Rank with Cooperative Coevolution

AAAI Conferences

We propose CCRank, the first parallel algorithm for learning to rank, targeting simultaneous improvement in learning accuracy and efficiency. CCRank is based on cooperative coevolution (CC), a divide-and-conquer framework that has demonstrated high promise in function optimization for problems with large search space and complex structures. Moreover, CC naturally allows parallelization of sub-solutions to the decomposed subproblems, which can substantially boost learning efficiency. With CCRank, we investigate parallel CC in the context of learning to rank. Extensive experiments on benchmarks in comparison with the state-of-the-art algorithms show that CCRank gains in both accuracy and efficiency.