Goto

Collaborating Authors

 South America


BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

arXiv.org Artificial Intelligence

Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.


Computational Complexity of Statistics: New Insights from Low-Degree Polynomials

arXiv.org Machine Learning

Imagine trying to find a hidden k -vertex clique (fully connected subgraph) within an otherwise random n -vertex graph (network). While it is possible to find a hidden clique of size k log n by brute-force search, all known "fast" (polynomial-time) algorithms only work if the clique is much larger: k n . Is this an inherent limitation of fast algorithms or should we continue looking for a better one? Similar questions of computational complexity arise in many other statistical settings, such as community detection, clustering, and sparse PCA. While we lack the tools to prove definitively that fast algorithms require k n, this survey describes one sense in which we can prove this threshold is fundamental: all algorithms based on low-degree polynomials -- for instance, counting triangles in the graph would be a degree-3 polynomial -- provably fail (in an appropriate sense) when k n . Furthermore, these low-degree algorithms tend to capture the best tools in our algorithmic toolkit for problems of this style, so finding a fast algorithm for k n would seem to require a major breakthrough or may simply be impossible. This provides a lens for predicting and explaining the limitations of fast algorithms across many different settings.


Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers

arXiv.org Artificial Intelligence

This paper presents a comparison of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the task of multi-classifying images containing lesions of psoriasis and diseases similar to it. Models pre-trained on ImageNet were adapted to a specific data set. Both achieved high predictive metrics, but the ViTs stood out for their superior performance with smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the best results, with an f1-score of 96.4%, and is recommended as the most efficient architecture for automated psoriasis detection. This article reinforces the potential of ViTs for medical image classification tasks.


Preparing for kick-off at RoboCup2025: an interview with General Chair Marco Simões

AIHub

The Salvador Convention Center, where RoboCup 2025 will take place. RoboCup is an international scientific initiative with the goal of advancing the state of the art of intelligent robots, AI and automation. The annual RoboCup event, where teams gather from across the globe to take part in competitions across a number of leagues, will this year take place in Brazil, from 15-21 July. We spoke to Marco Simões, one of the General Chairs of RoboCup 2025 and President of RoboCup Brazil, to find out what plans they have for the event, some new initiatives, and how RoboCup has grown in Brazil over the past ten years. RoboCup will be held in Salvador, Brazil.


Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

arXiv.org Machine Learning

Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. While early architectures were developed primarily as a scalable alternative to Gaussian Processes (GPs), modern NPs tackle far more complex and data hungry applications spanning geology, epidemiology, climate, and robotics. These applications have placed increasing pressure on the scalability of these models, with many architectures compromising accuracy for scalability. In this paper, we demonstrate that this tradeoff is often unnecessary, particularly when modeling fully or partially translation invariant processes. We propose a versatile new architecture, the Biased Scan Attention Transformer Neural Process (BSA-TNP), which introduces Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). BSA-TNP is able to: (1) match or exceed the accuracy of the best models while often training in a fraction of the time, (2) exhibit translation invariance, enabling learning at multiple resolutions simultaneously, (3) transparently model processes that evolve in both space and time, (4) support high dimensional fixed effects, and (5) scale gracefully -- running inference with over 1M test points with 100K context points in under a minute on a single 24GB GPU.


Feature Shift Localization Network

arXiv.org Machine Learning

Feature shifts between data sources are present in many applications involving healthcare, biomedical, socioeconomic, financial, survey, and multi-sensor data, among others, where unharmonized heterogeneous data sources, noisy data measurements, or inconsistent processing and standardization pipelines can lead to erroneous features. Localizing shifted features is important to address the underlying cause of the shift and correct or filter the data to avoid degrading downstream analysis. While many techniques can detect distribution shifts, localizing the features originating them is still challenging, with current solutions being either inaccurate or not scalable to large and high-dimensional datasets. In this work, we introduce the Feature Shift Localization Network (FSL-Net), a neural network that can localize feature shifts in large and high-dimensional datasets in a fast and accurate manner. The network, trained with a large number of datasets, learns to extract the statistical properties of the datasets and can localize feature shifts from previously unseen datasets and shifts without the need for re-training. The code and ready-to-use trained model are available at https://github.com/AI-sandbox/FSL-Net.


LLM-BT-Terms: Back-Translation as a Framework for Terminology Standardization and Dynamic Semantic Embedding

arXiv.org Artificial Intelligence

The rapid expansion of English technical terminology presents a significant challenge to traditional expert-based standardization, particularly in rapidly developing areas such as artificial intelligence and quantum computing. Manual approaches face difficulties in maintaining consistent multilingual terminology. To address this, we introduce LLM-BT, a back-translation framework powered by large language models (LLMs) designed to automate terminology verification and standardization through cross-lingual semantic alignment. Our key contributions include: (1) term-level consistency validation: by performing English -> intermediate language -> English back-translation, LLM-BT achieves high term consistency across different models (such as GPT-4, DeepSeek, and Grok). Case studies demonstrate over 90 percent of terms are preserved either exactly or semantically; (2) multi-path verification workflow: we develop a novel pipeline described as Retrieve -> Generate -> Verify -> Optimize, which supports both serial paths (e.g., English -> Simplified Chinese -> Traditional Chinese -> English) and parallel paths (e.g., English -> Chinese / Portuguese -> English). BLEU scores and term-level accuracy indicate strong cross-lingual robustness, with BLEU scores exceeding 0.45 and Portuguese term accuracy reaching 100 percent; (3) back-translation as semantic embedding: we reinterpret back-translation as a form of dynamic semantic embedding that uncovers latent trajectories of meaning. In contrast to static embeddings, LLM-BT offers transparent, path-based embeddings shaped by the evolution of the models. This reframing positions back-translation as an active mechanism for multilingual terminology standardization, fostering collaboration between machines and humans - machines preserve semantic integrity, while humans provide cultural interpretation.


Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by the outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by $2.66\%-20.34\%$, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: https://github.com/tianyao-aka/RAPL.


Revisiting Graph Projections for Effective Complementary Product Recommendation

arXiv.org Artificial Intelligence

Complementary product recommendation is a powerful strategy to improve customer experience and retail sales. However, recommending the right product is not a simple task because of the noisy and sparse nature of user-item interactions. In this work, we propose a simple yet effective method to predict a list of complementary products given a query item, based on the structure of a directed weighted graph projected from the user-item bipartite graph. We revisit bipartite graph projections for recommender systems and propose a novel approach for inferring complementarity relationships from historical user-item interactions. We compare our model with recent methods from the literature and show, despite the simplicity of our approach, an average improvement of +43% and +38% over sequential and graph-based recommenders, respectively, over different benchmarks.


Gold coins confirm 'world's richest shipwreck' is 18th century Spanish galleon

Popular Science

Breakthroughs, discoveries, and DIY tips sent every weekday. The yearslong international fight to lay claim to the suspected "world's richest shipwreck" likely won't end anytime soon, especially after a research team's most recent conclusions. Experts have confirmed that dozens of gold coins scattered across the ocean floor off the coast of Colombia belonged to the San José, an ill-fated Spanish treasure galleon that sank over 300 years ago during a battle with British warships. The findings were published on June 10 in the journal Antiquity. In June 1708, the San José and a fleet of 17 other vessels departed the capital of Colombia for Europe laden with gold, silver, and uncut gems.