AITopics | Sarkar, Vivek

MISIM: A Novel Code Similarity System

Ye, Fangke, Zhou, Shengtian, Venkat, Anand, Marcus, Ryan, Tatbul, Nesime, Tithi, Jesmin Jahan, Hasabnis, Niranjan, Petersen, Paul, Mattson, Timothy, Kraska, Tim, Dubey, Pradeep, Sarkar, Vivek, Gottschlich, Justin

arXiv.org Machine LearningOct-8-2020

Code similarity systems are integral to a range of applications from code recommendation to automated software defect correction. We argue that code similarity is now a first-order problem that must be solved. To begin to address this, we present machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware semantic structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters. We compare MISIM to three state-of-the-art code similarity systems: (i)code2vec, (ii)Neural Code Comprehension, and (iii)Aroma. In our experimental evaluation across 328,155 programs (over 18 million lines of code), MISIM has 1.5x to 43.4x better accuracy than all three systems.

configuration, deep learning, neural network, (22 more...)

arXiv.org Machine Learning

2006.05265

Country:

North America > United States (0.47)
Asia > Middle East > Qatar (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Add feedback

Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements

Mandal, Ankush, Jiang, He, Shrivastava, Anshumali, Sarkar, Vivek

Neural Information Processing SystemsDec-31-2018

Identifying the top-K frequent items is one of the most common and important operations in large data processing systems. As a result, several solutions have been proposed to solve this problem approximately. In this paper, we identify that in modern distributed settings with both multi-node as well as multi-core parallelism, existing algorithms, although theoretically sound, are suboptimal from the performance perspective. In particular, for identifying top-K frequent items, Count-Min Sketch (CMS) has fantastic update time but lack the important property of reducibility which is needed for exploiting available massive data parallelism. On the other end, popular Frequent algorithm (FA) leads to reducible summaries but the update costs are significant. In this paper, we present Topkapi, a fast and parallel algorithm for finding top-K frequent items, which gives the best of both worlds, i.e., it is reducible as well as efficient update time similar to CMS. Topkapi possesses strong theoretical guarantees and leads to significant performance gains due to increased parallelism, relative to past work.

algorithm, artificial intelligence, data mining, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence (0.93)
Information Technology > Data Science > Data Mining (0.46)
Information Technology > Architecture > Distributed Systems (0.34)

Add feedback

Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements

Mandal, Ankush, Jiang, He, Shrivastava, Anshumali, Sarkar, Vivek

Neural Information Processing SystemsDec-31-2018

Identifying the top-K frequent items is one of the most common and important operations in large data processing systems. As a result, several solutions have been proposed to solve this problem approximately. In this paper, we identify that in modern distributed settings with both multi-node as well as multi-core parallelism, existing algorithms, although theoretically sound, are suboptimal from the performance perspective. In particular, for identifying top-K frequent items, Count-Min Sketch (CMS) has fantastic update time but lack the important property of reducibility which is needed for exploiting available massive data parallelism. On the other end, popular Frequent algorithm (FA) leads to reducible summaries but the update costs are significant. In this paper, we present Topkapi, a fast and parallel algorithm for finding top-K frequent items, which gives the best of both worlds, i.e., it is reducible as well as efficient update time similar to CMS. Topkapi possesses strong theoretical guarantees and leads to significant performance gains due to increased parallelism, relative to past work.

algorithm, artificial intelligence, data mining, (19 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Industry: Information Technology (0.34)

Technology: