Overview
A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges
Zakeri-Nasrabadi, Morteza, Parsa, Saeed, Ramezani, Mohammad, Roy, Chanchal, Ekhtiarzadeh, Masoud
Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper presents a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight were publicly accessible. The main challenges in the field are the lack of reliable datasets, empirical evaluations, hybrid methods, and attention to multi-paradigm languages. Emerging applications of code similarity measurement concentrate on the development phase in addition to maintenance.
Recent Advances in Optimal Transport for Machine Learning
Montesuma, Eduardo Fernandes, Mboula, Fred Ngolè, Souloumiac, Antoine
Recently, Optimal Transport has been proposed as a probabilistic framework in Machine Learning for comparing and manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in machine learning, such as generative modeling and transfer learning. In this survey we explore contributions of Optimal Transport for Machine Learning over the period 2012 to 2022, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer and reinforcement learning. We further highlight the recent development in computational Optimal Transport, and its interplay with Machine Learning practice.
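To make "comparing probability distributions" concrete, here is a minimal sketch of the simplest optimal transport computation: the 1-Wasserstein distance between two one-dimensional samples of equal size with uniform weights. In that special case the optimal plan matches sorted points, so the distance is the mean absolute difference of the sorted values. The function name `wasserstein_1d` is a hypothetical helper for illustration and is not taken from the survey.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D samples.

    Assumes uniform weights on both samples; in 1-D the optimal
    transport plan then pairs the i-th smallest x with the i-th smallest y.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("this sketch assumes non-empty samples of equal size")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    total = sum(abs(a - b) for a, b in zip(xs_sorted, ys_sorted))
    return total / len(xs)


if __name__ == "__main__":
    # Shifting every point by 1 moves each unit of mass a distance of 1.
    print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

Higher-dimensional or unequal-size problems need a linear program or an entropic solver, which this sketch does not cover.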
NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization
Zhao, Chao, Brahman, Faeze, Song, Kaiqiang, Yao, Wenlin, Yu, Dian, Chaturvedi, Snigdha
Narrative summarization aims to produce a distilled version of a narrative to describe its most salient events and characters. Summarizing a narrative is challenging as it requires an understanding of event causality and character behaviors. To encourage research in this direction, we propose NarraSum, a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries. Experiments show that there is a large performance gap between humans and the state-of-the-art summarization models on NarraSum. We hope that this dataset will promote future research in summarization, as well as broader studies of natural language understanding and generation. The dataset is available at https://github.com/zhaochaocs/narrasum.
Leveraging Trust for Joint Multi-Objective and Multi-Fidelity Optimization
Irshad, Faran, Karsch, Stefan, Döpp, Andreas
In the pursuit of efficient optimization of expensive-to-evaluate systems, this paper investigates a novel approach to Bayesian multi-objective and multi-fidelity (MOMF) optimization. Traditional optimization methods, while effective, often encounter prohibitively high costs in multi-dimensional optimizations of one or more objectives. Multi-fidelity approaches offer potential remedies by utilizing multiple, less costly information sources, such as low-resolution simulations. However, integrating these two strategies presents a significant challenge. We suggest the innovative use of a trust metric to support simultaneous optimization of multiple objectives and data sources. Our method modifies a multi-objective optimization policy to incorporate the trust gain per evaluation cost as one objective in a Pareto optimization problem, enabling simultaneous MOMF at lower costs. We present and compare two MOMF optimization methods: a holistic approach selecting both the input parameters and the trust parameter jointly, and a sequential approach for benchmarking. Through benchmarks on synthetic test functions, our approach is shown to yield significant cost reductions, up to an order of magnitude, compared to pure multi-objective optimization. Furthermore, we find that joint optimization of the trust and objective domains outperforms addressing them in a sequential manner. We validate our results by optimizing laser-plasma acceleration simulations, demonstrating our method's potential in Pareto optimization of high-cost black-box functions. Implementing these methods in existing Bayesian frameworks is simple, and they can be readily extended to batch optimization. With their capability to handle various continuous or discrete fidelity dimensions, our techniques offer broad applicability in solving simulation problems in fields such as plasma physics and fluid dynamics.
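The idea of treating "trust gain per evaluation cost" as one more objective in a Pareto problem can be illustrated with a small sketch. This is not the authors' acquisition policy: it only shows how an extra objective changes which candidates are non-dominated. All names (`augment_with_trust`, `dominates`, `pareto_front`) and the numbers in the usage example are hypothetical, and every objective is assumed to be maximized.

```python
def augment_with_trust(objectives, trust_gain, cost):
    """Append trust gain per unit evaluation cost as an extra objective."""
    return tuple(objectives) + (trust_gain / cost,)


def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    candidates = [
        augment_with_trust((1.0, 2.0), trust_gain=0.5, cost=1.0),
        # Same objectives, more trust gained, but ten times the cost.
        augment_with_trust((1.0, 2.0), trust_gain=0.9, cost=10.0),
        augment_with_trust((2.0, 1.0), trust_gain=0.2, cost=1.0),
    ]
    for point in pareto_front(candidates):
        print(point)
```

In this toy setting the expensive candidate drops off the front because a cheaper evaluation offers the same objective values with a better trust-per-cost ratio.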
A Comprehensive Introduction of Visual-Inertial Navigation
In this article, a tutorial introduction to visual-inertial navigation (VIN) is presented. Visual and inertial perception are two complementary sensing modalities, and cameras and inertial measurement units (IMUs) are the corresponding sensors. The low cost and light weight of camera-IMU sensor combinations make them ubiquitous in robotic navigation. Visual-inertial navigation is a state estimation problem that estimates the ego-motion and local environment of the sensor platform. This paper presents visual-inertial navigation in the classical state estimation framework. First, we illustrate the estimation problem in terms of state variables and system models, including representations of the related quantities (parameterizations), IMU dynamic and camera measurement models, and the corresponding general probabilistic graphical models (factor graphs). Second, we investigate the existing model-based estimation methodologies, which involve filter-based and optimization-based frameworks and related on-manifold operations. We also discuss the calibration of some relevant parameters and the initialization of the states of interest in optimization-based frameworks. Then the evaluation and improvement of VIN in terms of accuracy, efficiency, and robustness are discussed. Finally, we briefly mention the recent development of learning-based methods that may become alternatives to traditional model-based methods.
Non-parametric online market regime detection and regime clustering for multidimensional and path-dependent data structures
Issa, Zacharia, Horvath, Blanka
In this work we present a non-parametric online market regime detection method for multidimensional data structures using a path-wise two-sample test derived from a maximum mean discrepancy-based similarity metric on path space that uses rough path signatures as a feature map. The latter similarity metric has been developed and applied as a discriminator in recent generative models for small data environments, and has been optimised here to the setting where the size of new incoming data is particularly small, for faster reactivity. On the same principles, we also present a path-wise method for regime clustering which extends our previous work [HIM21]. The presented regime clustering techniques, as in [HIM21], were designed as ex-ante market analysis tools that can identify periods of approximately similar market activity, but the new results also apply to path-wise, high-dimensional, and non-Markovian settings as well as to data structures that exhibit autocorrelation. We demonstrate our clustering tools on easily verifiable synthetic datasets of increasing complexity, and also show how the outlined regime detection techniques can be used as fast online automatic regime change detectors or as outlier detection tools, including a fully automated pipeline. Finally, we apply the fine-tuned algorithms to real-world historical data including high-dimensional baskets of equities and the recent price evolution of crypto assets, and we show that our methodology swiftly and accurately indicated historical periods of market turmoil.
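The statistic behind a maximum mean discrepancy (MMD) two-sample test can be sketched in a few lines. The abstract uses rough path signatures as the feature map; the sketch below swaps in a plain Gaussian (RBF) kernel on fixed-length vectors instead, purely for illustration, and computes the biased estimate of the squared MMD. The function names, the bandwidth `sigma`, and the toy data are assumptions, not part of the paper.

```python
import math


def rbf(u, v, sigma=1.0):
    """Gaussian kernel between two equal-length vectors (stand-in kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_biased(xs, ys, kernel=rbf):
    """Biased estimate of squared MMD between samples xs and ys.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)], with each
    expectation replaced by an average over all sample pairs.
    """
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (len(xs) * len(xs))
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (len(ys) * len(ys))
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


if __name__ == "__main__":
    calm = [(0.0, 0.1), (0.1, 0.0)]       # toy "paths" from one regime
    turbulent = [(5.0, 5.1), (5.1, 5.0)]  # toy "paths" from another regime
    print(mmd2_biased(calm, calm))        # same sample: zero discrepancy
    print(mmd2_biased(calm, turbulent))   # different regimes: clearly positive
```

A detector in this spirit would compare the statistic for a new batch of paths against a threshold calibrated on a reference regime; how that threshold is set is outside this sketch.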
Large Language Models as Annotators: Enhancing Generalization of NLP Models at Minimal Cost
Bansal, Parikshit, Sharma, Amit
State-of-the-art supervised NLP models achieve high accuracy but are also susceptible to failures on inputs from low-data regimes, such as domains that are not represented in training data. As an approximation to collecting ground-truth labels for the specific domain, we study the use of large language models (LLMs) for annotating inputs and improving the generalization of NLP models. Specifically, given a budget for LLM annotations, we present an algorithm for sampling the most informative inputs to annotate and retrain the NLP model. We find that popular active learning strategies such as uncertainty-based sampling do not work well. Instead, we propose a sampling strategy based on the difference in prediction scores between the base model and the finetuned NLP model, utilizing the fact that most NLP models are finetuned from a base model. Experiments with classification (semantic similarity) and ranking (semantic search) tasks show that our sampling strategy leads to significant gains in accuracy for both the training and target domains.
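The sampling idea above, choosing inputs by the difference in prediction scores between the base model and the finetuned model, can be sketched as a simple ranking step. The abstract does not give the exact scoring rule, so this sketch assumes one scalar score per input from each model and ranks by the absolute gap; `select_for_annotation` and the example scores are hypothetical.

```python
def select_for_annotation(base_scores, finetuned_scores, budget):
    """Return indices of the `budget` inputs with the largest score gap.

    Assumption: each model yields one scalar prediction score per input,
    and a larger absolute gap marks a more informative input to annotate.
    """
    if len(base_scores) != len(finetuned_scores):
        raise ValueError("both models must score the same inputs")
    gaps = [abs(f - b) for b, f in zip(base_scores, finetuned_scores)]
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    base = [0.1, 0.5, 0.9, 0.4]
    finetuned = [0.2, 0.9, 0.85, 0.4]
    # Inputs 1 and 0 show the largest disagreement between the two models.
    print(select_for_annotation(base, finetuned, budget=2))
```

The selected indices would then be sent to the LLM for annotation and the resulting labels used to retrain the NLP model, within the given annotation budget.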
A Unified View of Deep Learning for Reaction and Retrosynthesis Prediction: Current Status and Future Challenges
Meng, Ziqiao, Zhao, Peilin, Yu, Yang, King, Irwin
Reaction and retrosynthesis prediction are fundamental tasks in computational chemistry that have recently garnered attention from both the machine learning and drug discovery communities. Various deep learning approaches have been proposed to tackle these problems, and some have achieved initial success. In this survey, we conduct a comprehensive investigation of advanced deep learning-based models for reaction and retrosynthesis prediction. We summarize the design mechanisms, strengths, and weaknesses of state-of-the-art approaches. Then, we discuss the limitations of current solutions and open challenges in the problem itself. Finally, we present promising directions to facilitate future research. To our knowledge, this paper is the first comprehensive and systematic survey that seeks to provide a unified understanding of reaction and retrosynthesis prediction.
Quantum Federated Learning: Analysis, Design and Implementation Challenges
Gurung, Dev, Pokhrel, Shiva Raj, Li, Gang
Quantum Federated Learning (QFL) has gained significant attention due to quantum computing and machine learning advancements. As the demand for QFL continues to surge, there is a pressing need to comprehend its intricacies in distributed environments. This paper aims to provide a comprehensive overview of the current state of QFL, addressing a crucial knowledge gap in the existing literature. We develop ideas for new QFL frameworks, explore diverse use cases of applications, and consider the critical factors influencing their design. The technical contributions and limitations of various QFL research projects are examined while presenting future research directions and open questions for further exploration. It promises to reduce the computational complexity of machine learning tasks and improve model performance [1].
How Can Recommender Systems Benefit from Large Language Models: A Survey
Lin, Jianghao, Dai, Xinyi, Xi, Yunjia, Liu, Weiwen, Chen, Bo, Li, Xiangyang, Zhu, Chenxu, Guo, Huifeng, Yu, Yong, Tang, Ruiming, Zhang, Weinan
Recommender systems (RS) play an important role in matching users' information needs for Internet applications. In natural language processing (NLP) domains, large language models (LLMs) have shown astonishing emergent abilities (e.g., instruction following, reasoning), thus giving rise to the promising research direction of adapting LLMs to RS for performance enhancements and user experience improvements. In this paper, we conduct a comprehensive survey on this research direction from an application-oriented view. We first summarize existing research works from two orthogonal perspectives: where and how to adapt LLMs to RS. For the "WHERE" question, we discuss the roles that LLMs could play in different stages of the recommendation pipeline, i.e., feature engineering, feature encoder, scoring/ranking function, and pipeline controller. For the "HOW" question, we investigate the training and inference strategies, resulting in two fine-grained taxonomy criteria, i.e., whether to tune LLMs or not, and whether to involve a conventional recommendation model (CRM) for inference. Detailed analysis and general development trajectories are provided for both questions. Then, we highlight key challenges in adapting LLMs to RS from three aspects, i.e., efficiency, effectiveness, and ethics. Finally, we summarize the survey and discuss the future prospects. We also actively maintain a GitHub repository for papers and other related resources in this rising direction: https://github.com/CHIANGEL/Awesome-LLM-for-RecSys.