AITopics

Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance.

information retrieval, machine learning, programming language, (19 more...)

2306.16171

Country:

North America > United States > New York > New York County > New York City (0.15)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(6 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Software Engineering (1.00)
Information Technology > Security & Privacy (1.00)
(5 more...)

Montesuma, Eduardo Fernandes, Mboula, Fred Ngolè, Souloumiac, Antoine

Recent Advances in Optimal Transport for Machine Learning

Recently, Optimal Transport has been proposed as a probabilistic framework in Machine Learning for comparing and manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in machine learning, such as generative modeling and transfer learning. In this survey we explore contributions of Optimal Transport for Machine Learning over the period 2012 -- 2022, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer and reinforcement learning. We further highlight the recent development in computational Optimal Transport, and its interplay with Machine Learning practice.

artificial intelligence, machine learning, wasserstein distance, (12 more...)

2306.16156

Country:

Asia > Middle East > Jordan (0.04)
South America > Brazil > Ceará > Fortaleza (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(4 more...)

Genre: Overview (1.00)

Industry: Education (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.67)

NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization

Zhao, Chao, Brahman, Faeze, Song, Kaiqiang, Yao, Wenlin, Yu, Dian, Chaturvedi, Snigdha

Narrative summarization aims to produce a distilled version of a narrative to describe its most salient events and characters. Summarizing a narrative is challenging as it requires an understanding of event causality and character behaviors. To encourage research in this direction, we propose NarraSum, a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries. Experiments show that there is a large performance gap between humans and the state-of-the-art summarization models on NarraSum. We hope that this dataset will promote future research in summarization, as well as broader studies of natural language understanding and generation. The dataset is available at https://github.com/zhaochaocs/narrasum.

computational linguistic, machine learning, natural language, (18 more...)

2212.01476

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(16 more...)

Genre:

Research Report (0.64)
Overview (0.46)

Industry:

Media > Television (1.00)
Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Irshad, Faran, Karsch, Stefan, Döpp, Andreas

Leveraging Trust for Joint Multi-Objective and Multi-Fidelity Optimization

In the pursuit of efficient optimization of expensive-to-evaluate systems, this paper investigates a novel approach to Bayesian multi-objective and multi-fidelity (MOMF) optimization. Traditional optimization methods, while effective, often encounter prohibitively high costs in multi-dimensional optimizations of one or more objectives. Multi-fidelity approaches offer potential remedies by utilizing multiple, less costly information sources, such as low-resolution simulations. However, integrating these two strategies presents a significant challenge. We suggest the innovative use of a trust metric to support simultaneous optimization of multiple objectives and data sources. Our method modifies a multi-objective optimization policy to incorporate the trust gain per evaluation cost as one objective in a Pareto optimization problem, enabling simultaneous MOMF at lower costs. We present and compare two MOMF optimization methods: a holistic approach selecting both the input parameters and the trust parameter jointly, and a sequential approach for benchmarking. Through benchmarks on synthetic test functions, our approach is shown to yield significant cost reductions - up to an order of magnitude compared to pure multi-objective optimization. Furthermore, we find that joint optimization of the trust and objective domains outperforms addressing them in sequential manner. We validate our results using the use case of optimizing laser-plasma acceleration simulations, demonstrating our method's potential in Pareto optimization of high-cost black-box functions. Implementing these methods in existing Bayesian frameworks is simple, and they can be readily extended to batch optimization. With their capability to handle various continuous or discrete fidelity dimensions, our techniques offer broad applicability in solving simulation problems in fields such as plasma physics and fluid dynamics.

artificial intelligence, machine learning, optimization, (20 more...)

2112.13901

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Illinois (0.04)
Asia > Japan > Kyūshū & Okinawa > Kyūshū > Fukuoka Prefecture > Fukuoka (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.34)

Industry: Transportation > Air (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

A Comprehensive Introduction of Visual-Inertial Navigation

Ning, Yangyang

In this article, a tutorial introduction to visual-inertial navigation(VIN) is presented. Visual and inertial perception are two complementary sensing modalities. Cameras and inertial measurement units (IMU) are the corresponding sensors for these two modalities. The low cost and light weight of camera-IMU sensor combinations make them ubiquitous in robotic navigation. Visual-inertial Navigation is a state estimation problem, that estimates the ego-motion and local environment of the sensor platform. This paper presents visual-inertial navigation in the classical state estimation framework, first illustrating the estimation problem in terms of state variables and system models, including related quantities representations (Parameterizations), IMU dynamic and camera measurement models, and corresponding general probabilistic graphical models (Factor Graph). Secondly, we investigate the existing model-based estimation methodologies, these involve filter-based and optimization-based frameworks and related on-manifold operations. We also discuss the calibration of some relevant parameters, also initialization of state of interest in optimization-based frameworks. Then the evaluation and improvement of VIN in terms of accuracy, efficiency, and robustness are discussed. Finally, we briefly mention the recent development of learning-based methods that may become alternatives to traditional model-based methods.

artificial intelligence, machine learning, natural language, (19 more...)

2307.11758

Country:

North America > United States (0.46)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre:

Instructional Material > Course Syllabus & Notes (0.68)
Overview (0.45)

Industry:

Transportation (1.00)
Energy > Oil & Gas > Upstream (0.87)
Media > Photography (0.67)

Technology:

Information Technology > Sensing and Signal Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.92)
(3 more...)

Issa, Zacharia, Horvath, Blanka

Non-parametric online market regime detection and regime clustering for multidimensional and path-dependent data structures

arXiv.org Machine LearningJun-27-2023

In this work we present a non-parametric online market regime detection method for multidimensional data structures using a path-wise two-sample test derived from a maximum mean discrepancy-based similarity metric on path space that uses rough path signatures as a feature map. The latter similarity metric has been developed and applied as a discriminator in recent generative models for small data environments, and has been optimised here to the setting where the size of new incoming data is particularly small, for faster reactivity. On the same principles, we also present a path-wise method for regime clustering which extends our previous work [HIM21]. The presented regime clustering techniques, as in [HIM21], were designed as ex-ante market analysis tools that can identify periods of approximatively similar market activity, but the new results also apply to path-wise, high dimensional-, and to non-Markovian settings as well as to data structures that exhibit autocorrelation. We demonstrate our clustering tools on easily verifiable synthetic datasets of increasing complexity, and also show how the outlined regime detection techniques can be used as fast on-line automatic regime change detectors or as outlier detection tools, including a fully automated pipeline. Finally, we apply the fine-tuned algorithms to real-world historical data including high-dimensional baskets of equities and the recent price evolution of crypto assets, and we show that our methodology swiftly and accurately indicated historical periods of market turmoil.

data mining, machine learning, regime, (19 more...)

arXiv.org Machine Learning

2306.15835

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States (0.14)

Genre:

Overview (1.00)
Research Report > New Finding (0.66)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Bansal, Parikshit, Sharma, Amit

Large Language Models as Annotators: Enhancing Generalization of NLP Models at Minimal Cost

State-of-the-art supervised NLP models achieve high accuracy but are also susceptible to failures on inputs from low-data regimes, such as domains that are not represented in training data. As an approximation to collecting ground-truth labels for the specific domain, we study the use of large language models (LLMs) for annotating inputs and improving the generalization of NLP models. Specifically, given a budget for LLM annotations, we present an algorithm for sampling the most informative inputs to annotate and retrain the NLP model. We find that popular active learning strategies such as uncertainty-based sampling do not work well. Instead, we propose a sampling strategy based on the difference in prediction scores between the base model and the finetuned NLP model, utilizing the fact that most NLP models are finetuned from a base model. Experiments with classification (semantic similarity) and ranking (semantic search) tasks show that our sampling strategy leads to significant gains in accuracy for both the training and target domains.

annotation, conditional informativeness, similarity, (14 more...)

2306.15766

Country:

Asia > India (0.14)
North America > Mexico (0.04)
North America > Cuba > Guantánamo Province > Guantánamo (0.04)
(3 more...)

Genre:

Overview (0.67)
Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

A Unified View of Deep Learning for Reaction and Retrosynthesis Prediction: Current Status and Future Challenges

Meng, Ziqiao, Zhao, Peilin, Yu, Yang, King, Irwin

Reaction and retrosynthesis prediction are fundamental tasks in computational chemistry that have recently garnered attention from both the machine learning and drug discovery communities. Various deep learning approaches have been proposed to tackle these problems, and some have achieved initial success. In this survey, we conduct a comprehensive investigation of advanced deep learning-based models for reaction and retrosynthesis prediction. We summarize the design mechanisms, strengths, and weaknesses of state-of-the-art approaches. Then, we discuss the limitations of current solutions and open challenges in the problem itself. Finally, we present promising directions to facilitate future research. To our knowledge, this paper is the first comprehensive and systematic survey that seeks to provide a unified understanding of reaction and retrosynthesis prediction.

artificial intelligence, machine learning, prediction, (16 more...)

2306.1589

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Overview (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Gurung, Dev, Pokhrel, Shiva Raj, Li, Gang

Quantum Federated Learning: Analysis, Design and Implementation Challenges

Abstract--Quantum Federated Learning (QFL) has gained significant attention due to quantum computing and machine learning advancements. As the demand for QFL continues to surge, there is a pressing need to comprehend its intricacies in distributed environments. This paper aims to provide a comprehensive overview of the current state of QFL, addressing a crucial knowledge gap in the existing literature. We develop ideas for new QFL frameworks, explore diverse use cases of applications, and consider the critical factors influencing their design. The technical contributions and limitations of various QFL research projects are examined while presenting future research directions and open questions for further exploration. Devices (1... n) send back trained models (φ It promises to reduce the computational complexity of machine learning tasks and improve model performance [1].

artificial intelligence, machine learning, qfl, (13 more...)

2306.15708

Genre: Overview (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

How Can Recommender Systems Benefit from Large Language Models: A Survey

Lin, Jianghao, Dai, Xinyi, Xi, Yunjia, Liu, Weiwen, Chen, Bo, Li, Xiangyang, Zhu, Chenxu, Guo, Huifeng, Yu, Yong, Tang, Ruiming, Zhang, Weinan

Recommender systems (RS) play important roles to match users' information needs for Internet applications. In natural language processing (NLP) domains, large language model (LLM) has shown astonishing emergent abilities (e.g., instruction following, reasoning), thus giving rise to the promising research direction of adapting LLM to RS for performance enhancements and user experience improvements. In this paper, we conduct a comprehensive survey on this research direction from an application-oriented view. We first summarize existing research works from two orthogonal perspectives: where and how to adapt LLM to RS. For the "WHERE" question, we discuss the roles that LLM could play in different stages of the recommendation pipeline, i.e., feature engineering, feature encoder, scoring/ranking function, and pipeline controller. For the "HOW" question, we investigate the training and inference strategies, resulting in two fine-grained taxonomy criteria, i.e., whether to tune LLMs or not, and whether to involve conventional recommendation model (CRM) for inference. Detailed analysis and general development trajectories are provided for both questions, respectively. Then, we highlight key challenges in adapting LLM to RS from three aspects, i.e., efficiency, effectiveness, and ethics. Finally, we summarize the survey and discuss the future prospects. We also actively maintain a GitHub repository for papers and other related resources in this rising direction: https://github.com/CHIANGEL/Awesome-LLM-for-RecSys.

large language model, machine learning, natural language, (19 more...)

2306.05817

Country:

Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Europe > Norway > Western Norway > Rogaland > Stavanger (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre:

Overview (1.00)
Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)