maintainer
One Detector Fits All: Robust and Adaptive Detection of Malicious Packages from PyPI to Enterprises
Montaruli, Biagio, Compagna, Luca, Ponta, Serena Elisa, Balzarotti, Davide
The rise of supply chain attacks via malicious Python packages demands robust detection solutions. Current approaches, however, overlook two critical challenges: robustness against adversarial source code transformations and adaptability to the varying false positive rate (FPR) requirements of different actors, from repository maintainers (requiring low FPR) to enterprise security teams (higher FPR tolerance). We introduce a robust detector capable of seamless integration into both public repositories like PyPI and enterprise ecosystems. To ensure robustness, we propose a novel methodology for generating adversarial packages using fine-grained code obfuscation. Combining these with adversarial training (AT) enhances detector robustness by 2.5x. We comprehensively evaluate AT effectiveness by testing our detector against 122,398 packages collected daily from PyPI over 80 days, showing that AT needs careful application: it makes the detector more robust to obfuscations and allows finding 10% more obfuscated packages, but slightly decreases performance on non-obfuscated packages. We demonstrate production adaptability of our detector via two case studies: (i) one for PyPI maintainers (tuned at 0.1% FPR) and (ii) one for enterprise teams (tuned at 10% FPR). In the former, we analyze 91,949 packages collected from PyPI over 37 days, achieving a daily detection rate of 2.48 malicious packages with only 2.18 false positives. In the latter, we analyze 1,596 packages adopted by a multinational software company, obtaining only 1.24 false positives daily. These results show that our detector can be seamlessly integrated into both public repositories like PyPI and enterprise ecosystems, ensuring a very low time budget of a few minutes to review the false positives. Overall, we uncovered 346 malicious packages, now reported to the community.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Alpes-Maritimes > Nice (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Software (0.68)
Measuring AI Ability to Complete Long Tasks
Kwa, Thomas, West, Ben, Becker, Joel, Deng, Amy, Garcia, Katharyn, Hasin, Max, Jawhar, Sami, Kinniment, Megan, Rush, Nate, Von Arx, Sydney, Bloom, Ryan, Broadley, Thomas, Du, Haoxing, Goodrich, Brian, Jurkovic, Nikola, Miles, Luke Harold, Nix, Seraphina, Lin, Tao, Parikh, Neev, Rein, David, Sato, Lucas Jun Koba, Wijk, Hjalmar, Ziegler, Daniel M., Barnes, Elizabeth, Chan, Lawrence
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
- Research Report > Experimental Study (0.46)
- Research Report > New Finding (0.34)
- Leisure & Entertainment (0.49)
- Information Technology (0.46)
A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications
Bolleman, Jerven, Emonet, Vincent, Altenhoff, Adrian, Bairoch, Amos, Blatter, Marie-Claude, Bridge, Alan, Duvaud, Severine, Gasteiger, Elisabeth, Kuznetsov, Dmitry, Moretti, Sebastien, Michel, Pierre-Andre, Morgat, Anne, Pagni, Marco, Redaschi, Nicole, Zahn-Zabal, Monique, de Farias, Tarcisio Mendes, Sima, Ana Claudia
Background. In the last decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning, if a sufficiently large number of examples are provided and published in a common, machine-readable and standardized format across different resources. Findings. We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs) collected for several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1000 example questions and queries, including 65 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology. Conclusions. We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services.
- North America > United States > Texas > Lavaca County (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
The Vision of Autonomic Computing: Can LLMs Make It a Reality?
Zhang, Zhiyang, Yang, Fangkai, Qin, Xiaoting, Zhang, Jue, Lin, Qingwei, Cheng, Gong, Zhang, Dongmei, Rajmohan, Saravan, Zhang, Qi
The Vision of Autonomic Computing (ACV), proposed over two decades ago, envisions computing systems that self-manage akin to biological organisms, adapting seamlessly to changing environments. Despite decades of research, achieving ACV remains challenging due to the dynamic and complex nature of modern computing systems. Recent advancements in Large Language Models (LLMs) offer promising solutions to these challenges by leveraging their extensive knowledge, language understanding, and task automation capabilities. This paper explores the feasibility of realizing ACV through an LLM-based multi-agent framework for microservice management. We introduce a five-level taxonomy for autonomous service maintenance and present an online evaluation benchmark based on the Sock Shop microservice demo project to assess our framework's performance. Our findings demonstrate significant progress towards achieving Level 3 autonomy, highlighting the effectiveness of LLMs in detecting and resolving issues within microservice architectures. This study contributes to advancing autonomic computing by pioneering the integration of LLMs into microservice management frameworks, paving the way for more adaptive and self-managing computing systems. The code will be made available at https://aka.ms/ACV-LLM.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (9 more...)
SINBAD: Saliency-informed detection of breakage caused by ad blocking
Chehade, Saiid El Hajj, Siby, Sandra, Troncoso, Carmela
Privacy-enhancing blocking tools based on filter-list rules tend to break legitimate functionality. Filter-list maintainers could benefit from automated breakage detection tools that allow them to proactively fix problematic rules before deploying them to millions of users. We introduce SINBAD, an automated breakage detector that improves the accuracy over the state of the art by 20%, and is the first to detect dynamic breakage and breakage caused by style-oriented filter rules. The success of SINBAD is rooted in three innovations: (1) the use of user-reported breakage issues in forums that enable the creation of a high-quality dataset for training in which only breakage that users perceive as an issue is included; (2) the use of 'web saliency' to automatically identify user-relevant regions of a website on which to prioritize automated interactions aimed at triggering breakage; and (3) the analysis of webpages via subtrees which enables fine-grained identification of problematic filter rules.
- South America > Peru > Ucayali Department (0.04)
- South America > Peru > Junín Department (0.04)
- South America > Peru > Cusco Department (0.04)
Public-private funding models in open source software development: A case study on scikit-learn
Governments are increasingly funding open source software (OSS) development to support software security, digital sovereignty, and national competitiveness in science and innovation, amongst others. However, little is known about how OSS developers evaluate the relative benefits and drawbacks of governmental funding for OSS. This study explores this question through a case study on scikit-learn, a Python library for machine learning, funded by public research grants, commercial sponsorship, micro-donations, and a 32 euro million grant announced in France's artificial intelligence strategy. Through 25 interviews with scikit-learn's maintainers and funders, this study makes two key contributions. First, it contributes empirical findings about the benefits and drawbacks of public and private funding in an impactful OSS project, and the governance protocols employed by the maintainers to balance the diverse interests of their community and funders. Second, it offers practical lessons on funding for OSS developers, governments, and companies based on the experience of scikit-learn. The paper concludes with key recommendations for practitioners and future research directions.
- Europe > France (0.67)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- (21 more...)
- Banking & Finance (1.00)
- Government > Regional Government > Europe Government (0.94)
- Information Technology > Security & Privacy (0.66)
Predicting the First Response Latency of Maintainers and Contributors in Pull Requests
Khatoonabadi, SayedHassan, Abdellatif, Ahmad, Costa, Diego Elias, Shihab, Emad
The success of a Pull Request (PR) depends on the responsiveness of the maintainers and the contributor during the review process. Being aware of the expected waiting times can lead to better interactions and managed expectations for both the maintainers and the contributor. In this paper, we propose a machine-learning approach to predict the first response latency of the maintainers following the submission of a PR, and the first response latency of the contributor after receiving the first response from the maintainers. We curate a dataset of 20 large and popular open-source projects on GitHub and extract 21 features to characterize projects, contributors, PRs, and review processes. Using these features, we then evaluate seven types of classifiers to identify the best-performing models. We also perform permutation feature importance and SHAP analyses to understand the importance and impact of different features on the predicted response latencies. Our best-performing models achieve an average improvement of 33% in AUC-ROC and 58% in AUC-PR for maintainers, as well as 42% in AUC-ROC and 95% in AUC-PR for contributors compared to a no-skilled classifier across the projects. Our findings indicate that PRs submitted earlier in the week, containing an average or slightly above-average number of commits, and with concise descriptions are more likely to receive faster first responses from the maintainers. Similarly, PRs with a lower first response latency from maintainers, that received the first response of maintainers earlier in the week, and containing an average or slightly above-average number of commits tend to receive faster first responses from the contributors. Additionally, contributors with a higher acceptance rate and a history of timely responses in the project are likely to both obtain and provide faster first responses.
Poisoning Web-Scale Training Datasets is Practical
Carlini, Nicholas, Jagielski, Matthew, Choquette-Choo, Christopher A., Paleka, Daniel, Pearce, Will, Anderson, Hyrum, Terzis, Andreas, Thomas, Kurt, Tramèr, Florian
Deep learning models are often trained on distributed, webscale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content -- such as Wikipedia -- where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notify the maintainers of each affected dataset and recommended several low-overhead defenses.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (6 more...)
PyTorch dependency poisoned with malicious code • The Register
An unknown attacker used the PyPI code repository to get developers to download a compromised PyTorch dependency that included malicious code designed to steal system data. Developers who last week downloaded the nightly builds of the open source PyTorch framework also unknowingly installed a malicious version of the torchtriton dependency found in the Python Package Index, according to PyTorch's maintainers. In a blog post this week, PyTorch recommended those who installed the PyTorch nightly on Linux through pip between December 25 and December 30 to uninstall it and use the latest nightly binaries that were released after December 30. They said devs using the PyTorch stable packages were not affected by the malicious binary. However, the extent of the attack is unclear.
Semantically-enhanced Topic Recommendation System for Software Projects
Izadi, Maliheh, Nejati, Mahtab, Heydarnoori, Abbas
Software-related platforms have enabled their users to collaboratively label software entities with topics. Tagging software repositories with relevant topics can be exploited for facilitating various downstream tasks. For instance, a correct and complete set of topics assigned to a repository can increase its visibility. Consequently, this improves the outcome of tasks such as browsing, searching, navigation, and organization of repositories. Unfortunately, assigned topics are usually highly noisy, and some repositories do not have well-assigned topics. Thus, there have been efforts on recommending topics for software projects, however, the semantic relationships among these topics have not been exploited so far. We propose two recommender models for tagging software projects that incorporate the semantic relationship among topics. Our approach has two main phases; (1) we first take a collaborative approach to curate a dataset of quality topics specifically for the domain of software engineering and development. We also enrich this data with the semantic relationships among these topics and encapsulate them in a knowledge graph we call SED-KGraph. Then, (2) we build two recommender systems; The first one operates only based on the list of original topics assigned to a repository and the relationships specified in our knowledge graph. The second predictive model, however, assumes there are no topics available for a repository, hence it proceeds to predict the relevant topics based on both textual information of a software project and SED-KGraph. We built SED-KGraph in a crowd-sourced project with 170 contributors from both academia and industry. The experiment results indicate that our solutions outperform baselines that neglect the semantic relationships among topics by at least 25% and 23% in terms of ASR and MAP metrics.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Kansas > Rush County (0.04)
- North America > United States > Washington > King County > Redmond (0.04)
- (4 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
- Information Technology > Software > Programming Languages (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)