Empirical Evaluation of AI-Assisted Software Package Selection: A Knowledge Graph Approach

Farshidi, Siamak, Saberhabibi, Amir, Eskafi, Behbod, Nikfarjam, Niloofar, Eskandari, Sadegh, Jansen, Slinger, Chaudron, Michel, Tekinerdogan, Bedir

arXiv.org Artificial Intelligence

Selecting third-party software packages in open-source ecosystems like Python is challenging due to the large number of alternatives and limited transparent evidence for comparison. Generative AI tools are increasingly used in development workflows, but their suggestions often overlook dependency evaluation, emphasize popularity over suitability, and lack reproducibility. This creates risks for projects that require transparency, long-term reliability, maintainability, and informed architectural decisions. This study formulates software package selection as a Multi-Criteria Decision-Making (MCDM) problem and proposes a data-driven framework for technology evaluation. Automated data pipelines continuously collect and integrate software metadata, usage trends, vulnerability information, and developer sentiment from GitHub, PyPI, and Stack Overflow. These data are structured into a decision model representing relationships among packages, domain features, and quality attributes. The framework is implemented in PySelect, a decision support system that uses large language models to interpret user intent and query the model to identify contextually appropriate packages. The approach is evaluated using 798,669 Python scripts from 16,887 GitHub repositories and a user study based on the Technology Acceptance Model. Results show high data extraction precision, improved recommendation quality over generative AI baselines, and positive user evaluations of usefulness and ease of use. This work introduces a scalable, interpretable, and reproducible framework that supports evidence-based software selection using MCDM principles, empirical data, and AI-assisted intent modeling.
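The MCDM formulation above can be illustrated with a minimal weighted-sum sketch. The package names, criteria, weights, and scores below are hypothetical and do not come from the PySelect decision model; they only show how per-criterion evidence and user priorities combine into a ranking.

```python
# Minimal weighted-sum MCDM sketch for package selection.
# All package names, criteria, weights, and scores are hypothetical.

# Normalized scores (0-1) per quality attribute for each candidate package.
candidates = {
    "pkg_a": {"maintenance": 0.9, "security": 0.6, "popularity": 0.8},
    "pkg_b": {"maintenance": 0.7, "security": 0.9, "popularity": 0.5},
    "pkg_c": {"maintenance": 0.5, "security": 0.4, "popularity": 0.95},
}

# User-specific priorities over the quality attributes (sum to 1).
weights = {"maintenance": 0.5, "security": 0.3, "popularity": 0.2}

def weighted_score(scores, weights):
    """Aggregate per-criterion scores into a single utility value."""
    return sum(weights[c] * scores[c] for c in weights)

ranking = sorted(candidates,
                 key=lambda p: weighted_score(candidates[p], weights),
                 reverse=True)
print(ranking[0])  # highest-utility package under these weights -> pkg_a
```

Note how the ranking depends on the weights: a user who prioritizes popularity over maintenance would obtain a different ordering from the same evidence, which is the point of making the trade-offs explicit.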


PERC: a suite of software tools for the curation of cryoEM data with application to simulation, modelling and machine learning

Costa-Gomes, Beatriz, Greer, Joel, Juraschko, Nikolai, Parkhurst, James, Mirecka, Jola, Famili, Marjan, Rangel-Smith, Camila, Strickson, Oliver, Lowe, Alan, Basham, Mark, Burnley, Tom

arXiv.org Artificial Intelligence

Ease of access to data, tools and models expedites scientific research. In structural biology there are now numerous open repositories of experimental and simulated datasets. Being able to easily access and utilise these is crucial for allowing researchers to make optimal use of their research effort. The tools presented here are useful for collating existing public cryoEM datasets and/or creating new synthetic cryoEM datasets to aid the development of novel data processing and interpretation algorithms. In recent years, structural biology has seen the development of a multitude of machine-learning based algorithms for aiding numerous steps in the processing and reconstruction of experimental datasets and the use of these approaches has become widespread. Developing such techniques in structural biology requires access to large datasets which can be cumbersome to curate and unwieldy to make use of. In this paper we present a suite of Python software packages which we collectively refer to as PERC (profet, EMPIARreader and CAKED). These are designed to reduce the burden which data curation places upon structural biology research. The protein structure fetcher (profet) package allows users to conveniently download and cleave sequences or structures from the Protein Data Bank or AlphaFold databases. EMPIARreader allows lazy loading of Electron Microscopy Public Image Archive datasets in a machine-learning compatible structure. The Class Aggregator for Key Electron-microscopy Data (CAKED) package is designed to seamlessly facilitate the training of machine learning models on electron microscopy data, including electron-cryo-microscopy-specific data augmentation and labelling. These packages may be utilised independently or as building blocks in workflows. All are available in open source repositories and designed to be easily extensible to facilitate more advanced workflows if required.
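The lazy-loading approach described for EMPIARreader can be sketched generically. The class below is not EMPIARreader's actual API; it is a minimal illustration of the pattern, in which file paths are enumerated cheaply up front and each item is only read when it is indexed, so large archives never have to fit in memory.

```python
# Generic lazy-loading dataset sketch (not the EMPIARreader API):
# paths are listed up front, but each item is only loaded from disk
# at the moment it is indexed, keeping memory use low.

class LazyDataset:
    def __init__(self, paths, loader):
        self.paths = list(paths)   # cheap: just filenames
        self.loader = loader       # function that reads one item

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # Data is loaded on demand, one item at a time.
        return self.loader(self.paths[i])

# Hypothetical usage with a stand-in loader instead of real image I/O.
ds = LazyDataset(["a.mrc", "b.mrc", "c.mrc"], loader=lambda p: f"tensor({p})")
print(len(ds), ds[1])
```

Because the interface is just `__len__` and `__getitem__`, such a dataset plugs directly into machine-learning data loaders that expect map-style datasets.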


Terrier: A Deep Learning Repeat Classifier

Turnbull, Robert, Young, Neil D., Tescari, Edoardo, Skerratt, Lee F., Kosch, Tiffany A.

arXiv.org Artificial Intelligence

Repetitive DNA sequences underpin genome architecture and evolutionary processes, yet they remain challenging to classify accurately. Terrier is a deep learning model designed to overcome these challenges by classifying repetitive DNA sequences using a publicly available, curated repeat sequence library trained under the RepeatMasker schema. Existing tools often struggle to classify divergent taxa due to biases in reference libraries, limiting our understanding of repeat evolution and function. Terrier overcomes these challenges by leveraging deep learning for improved accuracy. Trained on RepBase, which includes over 100,000 repeat families -- four times more than Dfam -- Terrier maps 97.1% of RepBase sequences to RepeatMasker categories, offering the most comprehensive classification system available. When benchmarked against DeepTE, TERL, and TEclass2 in model organisms (rice and fruit flies), Terrier achieved superior accuracy while classifying a broader range of sequences. Further validation in non-model amphibian and flatworm genomes highlights its effectiveness in improving classification in non-model species, facilitating research on repeat-driven evolution, genomic instability, and phenotypic variation.


Autonomous Vehicles: Evolution of Artificial Intelligence and Learning Algorithms

Garikapati, Divya, Shetiya, Sneha Sudhir

arXiv.org Artificial Intelligence

The advent of autonomous vehicles has heralded a transformative era in transportation, reshaping the landscape of mobility through cutting-edge technologies. Central to this evolution is the integration of Artificial Intelligence (AI) and learning algorithms, propelling vehicles into realms of unprecedented autonomy. This paper provides a comprehensive exploration of the evolutionary trajectory of AI within autonomous vehicles, tracing the journey from foundational principles to the most recent advancements. Commencing with a current landscape overview, the paper delves into the fundamental role of AI in shaping the autonomous decision-making capabilities of vehicles. It elucidates the steps involved in the AI-powered development life cycle in vehicles, addressing ethical considerations and bias in AI-driven software development for autonomous vehicles. The study presents statistical insights into the usage and types of AI/learning algorithms over the years, showcasing the evolving research landscape within the automotive industry. Furthermore, the paper highlights the pivotal role of parameters in refining algorithms for both trucks and cars, enabling vehicles to adapt, learn, and improve performance over time. It concludes by outlining different levels of autonomy, elucidating the nuanced usage of AI and learning algorithms, and automating key tasks at each level. Additionally, the document discusses the variation in software package sizes across different autonomy levels.


A PNP ion channel deep learning solver with local neural network and finite element input data

Lee, Hwi, Chao, Zhen, Cobb, Harris, Liu, Yingjie, Xie, Dexuan

arXiv.org Artificial Intelligence

In this paper, a deep learning method for solving an improved one-dimensional Poisson-Nernst-Planck ion channel (PNPic) model, called the PNPic deep learning solver, is presented. In particular, it combines a novel local neural network scheme with an effective PNPic finite element solver. Since the input data of the neural network scheme involve only a small local patch of coarse grid solutions, which the finite element solver can quickly produce, the PNPic deep learning solver can be trained much faster than corresponding conventional global neural network solvers. Once properly trained, it can output a predicted PNPic solution with much higher accuracy than the low-cost coarse grid solutions and can reflect different perturbation cases in the parameters, ion channel subregions, and interface and boundary values. Consequently, the PNPic deep learning solver can generate a numerical solution with high accuracy for a family of PNPic models. As an initial study, two types of numerical tests were performed by perturbing one and two parameters of the PNPic model, respectively, along with tests using a few perturbed interface positions of the model as training samples. These tests demonstrate that the PNPic deep learning solver can generate highly accurate PNPic numerical solutions.
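The local-patch idea that makes training cheap can be sketched on a 1D coarse grid: the network input for each grid point is only a small window of neighboring coarse-grid values rather than the whole global solution. The window size and the boundary-padding choice below are illustrative assumptions, not the paper's actual scheme.

```python
# Sketch of local-patch inputs on a 1D coarse grid (illustrative only):
# each training input is a small window of coarse-grid solution values
# centered on a grid point, rather than the whole global solution.

def local_patches(coarse_solution, half_width=2):
    """Return one fixed-size patch per grid point, padding at the ends
    by repeating the boundary value (an assumption for this sketch)."""
    n = len(coarse_solution)
    padded = ([coarse_solution[0]] * half_width + list(coarse_solution)
              + [coarse_solution[-1]] * half_width)
    return [padded[i:i + 2 * half_width + 1] for i in range(n)]

coarse = [0.0, 0.1, 0.4, 0.9, 1.6]     # toy coarse-grid solution values
patches = local_patches(coarse, half_width=1)
print(patches[0])  # [0.0, 0.0, 0.1]
```

Each patch has fixed length regardless of the global grid size, which is what allows a small network to be trained quickly and reused across a family of model perturbations.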


Uncovering communities of pipelines in the task-fMRI analytical space

Germani, Elodie, Fromont, Elisa, Maumet, Camille

arXiv.org Artificial Intelligence

Functional magnetic resonance imaging analytical workflows are highly flexible with no definite consensus on how to choose a pipeline. While methods have been developed to explore this analytical space, there is still a lack of understanding of the relationships between the different pipelines. We use community detection algorithms to explore the pipeline space and assess its stability across different contexts. We show that there are subsets of pipelines that give similar results, especially those sharing specific parameters (e.g. number of motion regressors, software packages, etc.), with relative stability across groups of participants. By visualizing the differences between these subsets, we describe the effect of pipeline parameters and derive general relationships in the analytical space.
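The notion of subsets of pipelines that give similar results can be illustrated with a toy threshold graph: treat pipelines as nodes, connect pairs whose result similarity exceeds a threshold, and read off the connected components with union-find. This is only an illustration of the grouping idea, not the community detection algorithm used in the study, and the pipeline names and similarities are hypothetical.

```python
# Toy illustration: group pipelines whose pairwise result similarity
# exceeds a threshold, using union-find to extract connected components.
# Pipeline names and similarity values are hypothetical.

similarity = {
    ("p1", "p2"): 0.95, ("p1", "p3"): 0.20,
    ("p2", "p3"): 0.15, ("p3", "p4"): 0.90,
}

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for (a, b), s in similarity.items():
    find(a); find(b)          # register both nodes
    if s >= 0.8:              # threshold: "give similar results"
        union(a, b)

groups = {}
for p in parent:
    groups.setdefault(find(p), set()).add(p)
print(sorted(sorted(g) for g in groups.values()))  # [['p1','p2'], ['p3','p4']]
```

Real community detection additionally optimizes a quality function such as modularity instead of a hard threshold, but the output has the same shape: a partition of pipelines into internally similar subsets.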


regulAS: A Bioinformatics Tool for the Integrative Analysis of Alternative Splicing Regulome using RNA-Seq data

Lipnitskaya, Sofya

arXiv.org Artificial Intelligence

The regulAS software package is a bioinformatics tool designed to support computational biology researchers in investigating regulatory mechanisms of splicing alterations through integrative analysis of large-scale RNA-Seq data from cancer and healthy human donors, characterized by the TCGA and GTEx projects. This technical report provides a comprehensive overview of regulAS, focusing on its core functionality, basic modules, experiment configuration, further extensibility and customization. The core functionality of regulAS enables the automation of computational experiments, efficient results storage and processing, and streamlined workflow management. Integrated basic modules extend regulAS with features such as RNA-Seq data retrieval from the public multi-omics UCSC Xena data repository, predictive modeling and feature ranking capabilities using the scikit-learn package, and flexible report generation for analyzing gene expression profiles and relevant modulations of alternative splicing aberrations across tissues and cancer types. Experiment configuration is handled through YAML files with the Hydra and OmegaConf libraries, offering a user-friendly approach. Additionally, regulAS allows for the development and integration of custom modules to handle specialized tasks. In conclusion, regulAS provides an automated solution for alternative splicing and cancer biology studies, enhancing efficiency, reproducibility, and customization of experimental design, while the extensibility of the pipeline enables researchers to further tailor the software package to their specific needs. Source code is available under the MIT license at https://github.com/slipnitskaya/regulAS.
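Since experiment configuration is YAML-based via Hydra and OmegaConf, a configuration might look roughly like the fragment below. The key names and module paths are illustrative assumptions, not regulAS's actual schema; only the data sources and the use of scikit-learn are taken from the report.

```yaml
# Hypothetical regulAS-style experiment configuration (illustrative only;
# key names and structure are assumptions, not the actual schema).
experiment:
  name: splicing_rbp_ranking
  output_dir: ./results
data:
  source: ucsc_xena          # public multi-omics repository named in the report
  cohorts: [TCGA, GTEx]
model:
  estimator: sklearn.ensemble.RandomForestRegressor
  params:
    n_estimators: 500
reporting:
  format: csv
```

The appeal of this style is that swapping the estimator or cohort list is a one-line configuration change rather than a code change, which is what makes the experiments reproducible and easy to vary.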


mldr.resampling: Efficient Reference Implementations of Multilabel Resampling Algorithms

Rivera, Antonio J., Dávila, Miguel A., Elizondo, David, del Jesus, María J., Charte, Francisco

arXiv.org Artificial Intelligence

MultiLabel Learning (MLL) [1] is one of the most common machine learning tasks today. It is based on the idea that each data sample is associated with a certain subset of labels. The full set of labels can be large, in many cases even having more labels than input features. As a result, it is common for some labels to occur in only a few samples, while others occur much more frequently. The label imbalance [2] in MLL is almost always present, and it is a serious obstacle to training good classifiers. Class imbalance is a very well-known problem in traditional learning tasks such as binary and multiclass classification. Hundreds of articles [3, 4, 5], conference papers [6] and books [7] have been devoted to studying it and proposing possible solutions. The most popular are data resampling, cost-sensitive learning and mixtures of these approaches [8, 9]. However, imbalanced learning in the MLL field presents some specific aspects that make it more difficult to deal with this problem.
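The label imbalance described above is usually quantified with the imbalance ratio per label (IRLbl), the frequency of the most common label divided by the frequency of a given label, and its mean over all labels (MeanIR). The toy label matrix below is hypothetical; it only shows how the measures are computed.

```python
# Imbalance measures for multilabel data: IRLbl(l) is the count of the
# most frequent label divided by the count of label l, and MeanIR is
# the average IRLbl over all labels. The label matrix is a toy example.

# Rows are samples, columns are labels (1 = label present).
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]

counts = [sum(row[j] for row in Y) for j in range(len(Y[0]))]
max_count = max(counts)
irlbl = [max_count / c for c in counts]   # assumes every label occurs at least once
mean_ir = sum(irlbl) / len(irlbl)
print(counts, irlbl, mean_ir)  # [4, 2, 1] [1.0, 2.0, 4.0] 2.33...
```

Labels with IRLbl well above MeanIR are considered minority labels; resampling algorithms like those implemented in mldr.resampling target exactly these labels when cloning or removing samples.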


Researchers made breakthrough in reconstruction for cryogenic electron tomography

#artificialintelligence

In a study published in Nature Communications recently, a team led by Prof. BI Guoqiang from the University of Science and Technology of China (USTC) and Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (CAS), together with collaborators from the United States, developed a software package named IsoNet for isotropic reconstruction in cryogenic electron tomography (cryoET). Their work effectively addresses the intrinsic "missing-wedge" problem and the low signal-to-noise ratio in cryoET. Anisotropic resolution caused by the intrinsic "missing-wedge" problem has long been a challenge when using cryoET for the visualization of cellular structures. To solve this, the team developed IsoNet, a software package based on an iterative self-supervised deep-learning artificial neural network. Using rotated cryoET tomographic 3D reconstruction data as the training set, their algorithm is able to perform missing-wedge correction on the cryoET data. Simultaneously, a denoising step is built into IsoNet, allowing the artificial neural network to recover missing information and denoise tomographic 3D data at the same time.


DIAMBRA Arena: a New Reinforcement Learning Platform for Research and Experimentation

Palmas, Alessandro

arXiv.org Artificial Intelligence

The recent advances in reinforcement learning have led to effective methods able to obtain above human-level performance in very complex environments. However, once solved, these environments become less valuable, and new challenges with different or more complex scenarios are needed to support research advances. This work presents DIAMBRA Arena, a new platform for reinforcement learning research and experimentation, featuring a collection of high-quality environments exposing a Python API fully compliant with the OpenAI Gym standard. They are episodic tasks with discrete actions and observations composed of raw pixels plus additional numerical values, all supporting both single-player and two-player modes, allowing researchers to work on standard reinforcement learning, competitive multi-agent settings, human-agent competition, self-play, human-in-the-loop training and imitation learning. Software capabilities are demonstrated by successfully training multiple deep reinforcement learning agents with proximal policy optimization, obtaining human-like behavior. Results confirm the utility of DIAMBRA Arena as a reinforcement learning research tool, providing environments designed to study some of the most challenging topics in the field.
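OpenAI Gym compliance means the environments follow the standard reset/step contract, so any Gym-compatible agent code can drive them. The dummy environment below is not DIAMBRA Arena's actual API; it is a minimal, self-contained illustration of that contract and of the interaction loop an agent would run.

```python
# Minimal illustration of the OpenAI Gym reset/step contract that
# Gym-compliant environments such as DIAMBRA Arena's follow. This dummy
# environment is NOT the DIAMBRA API; it only shows the interaction loop.

import random

class DummyEnv:
    """Episodic task with discrete actions, ending after 5 steps."""
    def __init__(self, n_actions=4):
        self.n_actions = n_actions
        self.t = 0

    def reset(self):
        self.t = 0
        # Observation: raw-pixel stand-in plus an extra numerical value.
        return {"pixels": [0] * 8, "extra": 0.0}

    def step(self, action):
        assert 0 <= action < self.n_actions
        self.t += 1
        obs = {"pixels": [self.t] * 8, "extra": float(self.t)}
        reward = 1.0 if action == 0 else 0.0   # toy reward signal
        done = self.t >= 5                     # episode termination
        return obs, reward, done, {}

env = DummyEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = random.randrange(env.n_actions)   # random policy
    obs, reward, done, info = env.step(action)
    total += reward
print("episode return:", total)
```

Because the loop only touches `reset` and `step`, swapping the random policy for a trained PPO agent, or the dummy environment for a real one, requires no change to the surrounding code.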