 npm


Which Is Better For Reducing Outdated and Vulnerable Dependencies: Pinning or Floating?

arXiv.org Artificial Intelligence

Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. Pinning dependencies can reduce the likelihood of breaking changes, but comes at the cost of manually managing the replacement of outdated and vulnerable dependencies. Floating, on the other hand, can be used to automatically receive bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate pinning dependencies to protect against software supply chain attacks, e.g., malicious package updates. However, since pinning is the tightest version constraint, it is also the most likely to result in outdated dependencies. Nevertheless, how the likelihood of a dependency becoming outdated or vulnerable changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating, at scale, the likelihood of dependencies becoming outdated or vulnerable across version constraint types. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then model the dependency state transitions using survival analysis and estimate how the likelihood of becoming outdated or vulnerable changes when using pinning as opposed to the other version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is floating-minor, with pinning being the next most common. We also find that floating-major is the least likely to result in outdated dependencies and floating-minor is the least likely to result in vulnerable dependencies.
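As a rough illustration of the constraint types named in the abstract above, the following sketch sorts npm-style version specifiers into categories. The mapping from specifier syntax to category, the `floating-patch` and `other` labels, and the function name `classify_constraint` are assumptions made for this example and are not taken from the paper.

```python
import re

# Exact versions such as "1.2.3", "=1.2.3" or "v1.2.3" (assumed to mean "pinning").
EXACT = re.compile(r"^=?\s*v?\d+\.\d+\.\d+$")


def classify_constraint(spec: str) -> str:
    """Map an npm-style version specifier to a constraint type.

    Hypothetical mapping for illustration only. It ignores npm's special
    handling of 0.x versions under "^", and x-ranges such as "1.2.x"
    fall through to "other".
    """
    spec = spec.strip()
    # Wildcards accept any version, including new major releases.
    if spec in {"", "*", "x", "latest"}:
        return "floating-major"
    # Compound ranges (e.g. ">=1.0.0 <2.0.0" or "1.x || 2.x") are not analysed here.
    if " " in spec or "||" in spec or "<" in spec:
        return "other"
    # An open lower bound such as ">=1.0.0" admits new major releases.
    if spec.startswith(">"):
        return "floating-major"
    # Caret ranges admit new minor and patch releases.
    if spec.startswith("^"):
        return "floating-minor"
    # Tilde ranges admit new patch releases only.
    if spec.startswith("~"):
        return "floating-patch"
    if EXACT.match(spec):
        return "pinning"
    return "other"


if __name__ == "__main__":
    for s in ["1.2.3", "^1.2.3", "~1.2.3", "*", ">=1.0.0", ">=1.0.0 <2.0.0"]:
        print(f"{s!r:20} -> {classify_constraint(s)}")
```

A classifier of this kind would only be a first step; the study itself tracks how dependencies under each constraint type transition into outdated or vulnerable states over time.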


AgentHub: A Research Agenda for Agent Sharing Infrastructure

arXiv.org Artificial Intelligence

LLM-based agents are rapidly proliferating, yet the infrastructure for discovering, evaluating, and governing them remains fragmented compared to mature ecosystems like software package registries (e.g., npm) and model hubs (e.g., Hugging Face). Recent research and engineering works have begun to consider the requisite infrastructure, but so far they focus narrowly -- on distribution, naming, or protocol negotiation. However, considering broader software engineering requirements would improve open-source distribution and ease reuse. We therefore propose AgentHub, a research agenda for agent sharing. By framing the key challenges of capability clarity, lifecycle transparency, interoperability, governance, security, and workflow integration, AgentHub charts a community-wide agenda for building reliable and scalable agent ecosystems. Our vision is a future where agents can be shared, trusted, and composed as seamlessly as today's software libraries.


Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora

arXiv.org Artificial Intelligence

The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs. We apply them to build a new 120B token corpus in Portuguese that achieves results competitive with those of an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for education, science, technology, engineering, and mathematics (STEM), as well as toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.


Together We Make Sense -- Learning Meta-Sense Embeddings from Pretrained Static Sense Embeddings

arXiv.org Artificial Intelligence

Sense embedding learning methods learn multiple vectors for a given ambiguous word, corresponding to its different word senses. For this purpose, different methods have been proposed in prior work on sense embedding learning that use different sense inventories, sense-tagged corpora and learning methods. However, not all existing sense embeddings cover all senses of ambiguous words equally well due to the discrepancies in their training resources. To address this problem, we propose the first-ever meta-sense embedding method -- Neighbour Preserving Meta-Sense Embeddings, which learns meta-sense embeddings by combining multiple independently trained source sense embeddings such that the sense neighbourhoods computed from the source embeddings are preserved in the meta-embedding space. Our proposed method can combine source sense embeddings that cover different sets of word senses. Experimental results on Word Sense Disambiguation (WSD) and Word-in-Context (WiC) tasks show that the proposed meta-sense embedding method consistently outperforms several competitive baselines.


GitHub - RubensZimbres/best-of-ml-python: 🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

#artificialintelligence

A ranked list of awesome machine learning Python libraries. This curated list contains 830 awesome open-source projects with a total of 2.6M stars grouped into 32 categories. All projects are ranked by a project-quality score, which is calculated based on various metrics automatically collected from GitHub and different package managers. If you would like to add or update projects, feel free to open an issue, submit a pull request, or directly edit the projects.yaml. Discover other best-of lists or create your own.


Towards Semantic Communication Protocols: A Probabilistic Logic Perspective

arXiv.org Artificial Intelligence

Classical medium access control (MAC) protocols are interpretable, yet their task-agnostic control signaling messages (CMs) are ill-suited for emerging mission-critical applications. By contrast, neural network (NN) based protocol models (NPMs) learn to generate task-specific CMs, but their rationale and impact lack interpretability. To fill this void, in this article we propose, for the first time, a semantic protocol model (SPM) constructed by transforming an NPM into an interpretable symbolic graph written in the probabilistic logic programming language (ProbLog). This transformation is made viable by extracting and merging common CMs and their connections while treating the NPM as a CM generator. Through extensive simulations, we corroborate that the SPM tightly approximates its original NPM while occupying only 0.02% of the memory. By leveraging its interpretability and memory efficiency, we demonstrate several SPM-enabled applications, such as SPM reconfiguration for collision avoidance, comparing different SPMs via semantic entropy calculation, and storing multiple SPMs to cope with non-stationary environments. Traditionally, cellular medium access control (MAC) protocols have been designed primarily for general purposes. While handshaking rules and scheduling policies can partly be manipulated (e.g., grant-free access prioritization [2]), their control signaling messages (CMs) remain unchanged even when tasks and other environmental characteristics vary over time. Ko is with Inha University, Incheon, Korea (e-mail: swko@inha.ac.kr). This work has been submitted to the IEEE for possible publication.


GitHub - ml-tooling/best-of-ml-python: 🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

#artificialintelligence

A ranked list of awesome machine learning Python libraries. This curated list contains 920 awesome open-source projects with a total of 3.4M stars grouped into 34 categories. All projects are ranked by a project-quality score, which is calculated based on various metrics automatically collected from GitHub and different package managers. If you would like to add or update projects, feel free to open an issue, submit a pull request, or directly edit the projects.yaml. Discover other best-of lists or create your own.


The Benefits of AI and Machine Learning in Network Monitoring

#artificialintelligence

Artificial intelligence – also commonly known as AI – has revolutionized the technology world. Companies both inside and outside the tech circle are introducing AI into their work suite. AI takes the basic principles of computing and processing and applies intelligent environment analysis on top of them. For industries, AI analyzes the data they generate and provides them with insights based on its findings. AI can also apply machine learning to examine historical data in order to perform tasks without human input.


Understanding How Machines Learn, Through Prototyping – Big Tomorrow

#artificialintelligence

This is the second article in a larger series exploring the intersection of design and existing artificial intelligence technology through experiments, prototypes and concepts. We believe this is a critically important topic for the design community and beyond, so we're sharing what we learn along the way. Let's start by getting something out of the way: we're not machine learning experts -- we don't publish research about new algorithmic breakthroughs and we're not especially good at math. But we're curious about what to do with all the machine learning capability already floating around out in the world, and we're bullish about how far a 'good enough' understanding can often take you. So how might non-experts begin to play with machine learning?


Speech Recognition Using Demi-Syllable Neural Prediction Model

Neural Information Processing Systems

The Neural Prediction Model is a speech recognition model based on pattern prediction by multilayer perceptrons. Its effectiveness was confirmed by speaker-independent digit recognition experiments. This paper presents an improvement in the model and its application to large vocabulary speech recognition, based on subword units. The improvement involves the introduction of "backward prediction," which further improves the prediction accuracy of the original model, which used only "forward prediction". In applying the model to speaker-dependent large vocabulary speech recognition, the demi-syllable unit is used as a subword recognition unit.