"Don't Teach Minerva": Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG
Translating a morphology-rich, low-resource language like Latin poses significant challenges. This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models (LLMs) to a performance level statistically comparable to top-tier proprietary systems. Our method first uses a fine-tuned NLLB-1.3B model to generate a high-quality, structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes this draft, a process that can be further enhanced by augmenting the context with retrieved translation examples (retrieval-augmented generation, RAG). We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025). Our central finding is that this open-source RAG system achieves performance statistically comparable to the GPT-5 baseline, without any task-specific LLM fine-tuning. We release the pipeline, the Chartres OOD set, and the evaluation scripts and models to facilitate replicability and further research.
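The draft-then-refine step can be illustrated as prompt assembly. This is a minimal sketch: the function name, prompt wording, and example pairs below are assumptions for illustration, not the authors' released code.

```python
def build_refine_prompt(source_latin, draft, retrieved_examples):
    """Assemble the zero-shot refinement prompt: the NLLB draft plus
    retrieved (Latin, English) pairs supplied as extra context (RAG)."""
    parts = ["Polish the draft translation so it is fluent and faithful to the Latin."]
    for latin, english in retrieved_examples:
        parts.append(f"Example Latin: {latin}\nExample English: {english}")
    parts.append(f"Latin source: {source_latin}")
    parts.append(f"Draft translation: {draft}")
    parts.append("Improved translation:")
    return "\n\n".join(parts)

prompt = build_refine_prompt(
    "Gallia est omnis divisa in partes tres.",
    "All Gaul is divided into three parts.",
    [("Veni, vidi, vici.", "I came, I saw, I conquered.")],
)
```

In the full pipeline, the draft would come from the fine-tuned NLLB-1.3B model and the assembled prompt would be sent to Llama-3.3 or Qwen3 for polishing.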
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Poland (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > North Carolina > Wake County > Morrisville (0.04)
- Research Report (0.70)
- Instructional Material (0.46)
Minerva: A Programmable Memory Test Benchmark for Language Models
Xia, Menglin, Ruehle, Victor, Rajmohan, Saravan, Shokri, Reza
How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights--failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the search tasks (passkey, key-value, needle-in-the-haystack) that dominate the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, and comparing information in context memory, and on performing basic operations when inputs are structured into distinct blocks, simulating real-world data. Additionally, we design composite tests to investigate the models' ability to maintain state while operating on memory. Our benchmark enables an interpretable, detailed assessment of the memory capabilities of LLMs.
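A programmable memory test of this kind can be generated automatically rather than hand-written. The sketch below (illustrative names, not the authors' framework) plants key-value facts in distinct blocks and asks for one of them, so a failure isolates a specific recall capability.

```python
import random

def make_recall_test(n_blocks=3, facts_per_block=2, seed=0):
    """Generate a block-structured context, a recall question, and its answer."""
    rng = random.Random(seed)
    blocks, answers = [], {}
    for b in range(n_blocks):
        lines = [f"# Block {b}"]
        for i in range(facts_per_block):
            key, val = f"key_{b}_{i}", rng.randint(100, 999)
            answers[key] = val
            lines.append(f"{key} = {val}")
        blocks.append("\n".join(lines))
    target = rng.choice(list(answers))
    question = f"What is the value of {target}?"
    return "\n\n".join(blocks), question, answers[target]

context, question, gold = make_recall_test(seed=0)
```

Because the generator is seeded, tests are reproducible yet can be resampled at will, which avoids the static-benchmark overfitting problem the abstract describes.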
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)
Leveraging Large Language Models to Geolocate Linguistic Variations in Social Media Posts
Savarro, Davide, Zago, Davide, Zoia, Stefano
Geolocalization of social media content is the task of determining the geographical location of a user based on textual data, which may show linguistic variations and informal language. In this project, we address the GeoLingIt challenge of geolocalizing tweets written in Italian by leveraging large language models (LLMs). GeoLingIt requires the prediction of both the region and the precise coordinates of the tweet. Our approach involves fine-tuning pre-trained LLMs to simultaneously predict these geolocalization aspects. By integrating innovative methodologies, we enhance the models' ability to understand the nuances of Italian social media text and improve the state-of-the-art in this domain. This work was conducted as part of the Large Language Models course at the Bertinoro International Spring School 2024. We make our code publicly available on GitHub: https://github.com/dawoz/geolingit-biss2024.
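Predicting region and coordinates simultaneously can be encoded as a single fine-tuning record. The field names and output formatting below are illustrative assumptions, not the GeoLingIt data format.

```python
def format_geolingit_example(tweet, region, lat, lon):
    """Build one joint-target fine-tuning record: the model must emit both
    the region label and the coordinates in a single output string."""
    prompt = f"Tweet: {tweet}\nPredict the Italian region and coordinates."
    target = f"region: {region}; lat: {lat:.4f}; lon: {lon:.4f}"
    return {"prompt": prompt, "target": target}

ex = format_geolingit_example("Andiamo a magnà!", "Lazio", 41.8931, 12.4828)
```

Serializing both targets into one sequence lets a standard causal-LM fine-tuning loop learn the two sub-tasks jointly, with no architectural changes.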
Igea: a Decoder-Only Language Model for Biomedical Text Generation in Italian
Buonocore, Tommaso Mario, Rancati, Simone, Parimbelli, Enea
The advent of probabilistic language models has revolutionized various domains, with biomedical natural language processing (NLP) standing out due to its significant impact on healthcare provision and medical research. The ability of these models to understand, process, and generate text from vast biomedical corpora has led to improvements in tasks such as entity recognition, relation extraction, and question answering. However, the majority of this progress has been focused on English-language texts, creating a notable disparity for other languages with fewer resources, such as Italian. In the Italian context, the scarcity of large and diverse training datasets presents a substantial challenge. General language models like Minerva and Maestrale have made strides in Italian NLP, but they lack the specialization required to handle the nuances of biomedical terminology effectively. Addressing this gap is crucial, as the precision and clarity needed in medical communications are paramount for clinical and research applications in such a high-stakes domain. In this paper, we introduce Igea, a biomedical language model (BLM) built from the ground up for the Italian language, effective at handling native Italian biomedical text while remaining efficient in terms of computational resources. We built upon the foundation model Minerva, which we then continually trained on native Italian biomedical text, with provisions to avoid disrupting what was learned during pre-training.
- Europe > Italy (0.04)
- Asia > Middle East > Jordan (0.04)
Knowledge Graph Reasoning with Self-supervised Reinforcement Learning
Ma, Ying, Burns, Owen, Wang, Mingqiu, Li, Gang, Du, Nan, Shafey, Laurent El, Wang, Liqiang, Shafran, Izhak, Soltau, Hagen
Reinforcement learning (RL) is an effective method of finding reasoning pathways in incomplete knowledge graphs (KGs). To overcome the challenges of a large action space, a self-supervised pre-training method is proposed to warm up the policy network before the RL training stage. To alleviate the distributional mismatch issue in general self-supervised RL (SSRL), in our supervised learning (SL) stage the agent selects actions based on the policy network and learns from the generated labels; this self-generation of labels is the intuition behind the name self-supervised. With this training framework, the information density of our SL objective is increased and the agent is prevented from getting stuck on early rewarded paths. Our SSRL method improves the performance of RL by pairing it with the wide coverage achieved by SL during pre-training, since the breadth of the SL objective makes it infeasible to train an agent with SL alone. We show that our SSRL model meets or exceeds current state-of-the-art results on all Hits@k and mean reciprocal rank (MRR) metrics on four large benchmark KG datasets. This SSRL method can be used as a plug-in for any RL architecture on a knowledge graph reasoning (KGR) task. We adopt two RL architectures, MINERVA and MultiHopKG, as our baseline RL models and experimentally show that our SSRL model consistently outperforms both baselines on all four KG reasoning tasks. Full code for the paper is available at https://github.com/owenonline/Knowledge-Graph-Reasoning-with-Self-supervised-Reinforcement-Learning.
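The self-generation of labels described above can be caricatured in a few lines: the agent walks the graph with its current policy, and the actions it actually took become the supervised targets for the next update. Names and the toy graph are illustrative assumptions, not the paper's code.

```python
import random

def rollout_labels(graph, policy, start, steps, rng):
    """Sample a path with the current policy; return the (state, action)
    pairs taken along the way as self-generated supervision labels."""
    labels, state = [], start
    for _ in range(steps):
        actions = graph.get(state, [])
        if not actions:
            break
        weights = [policy.get((state, a), 1.0) for a in actions]
        action = rng.choices(actions, weights=weights)[0]
        labels.append((state, action))
        state = action
    return labels

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
labels = rollout_labels(graph, {}, "A", steps=3, rng=random.Random(0))
```

Because the labels are sampled from the policy itself, the SL stage trains on the same action distribution the agent will face during RL, which is how this scheme addresses the distributional mismatch the abstract mentions.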
- Asia > Middle East > Palestine (0.28)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (11 more...)
In AI, is bigger always better?
Artificial-intelligence systems that can churn out fluent text, such as OpenAI's ChatGPT, are the newest darlings of the technology industry. But when faced with mathematical queries that require reasoning to answer, these large language models (LLMs) often stumble. A line parallel to y = 4x + 6 passes through (5, 10). What is the y-coordinate of the point where this line crosses the y-axis? Although LLMs can sometimes answer these types of question correctly, they more often get them wrong. In one early test of its reasoning abilities, ChatGPT scored just 26% when faced with a sample of questions from the 'MATH' data set of secondary-school-level mathematical problems. This is to be expected: given input text, an LLM simply generates new text in accordance with statistical regularities in the words, symbols and sentences that make up the model's training data.
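The sample problem works out in two steps: a parallel line shares the slope 4 of y = 4x + 6, and the point (5, 10) then fixes the intercept, which is exactly the y-coordinate where the line crosses the y-axis.

```python
# A line parallel to y = 4x + 6 has the same slope.
slope = 4
x0, y0 = 5, 10                 # the given point on the new line
intercept = y0 - slope * x0    # from y0 = slope * x0 + b
print(intercept)               # -10
```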
- North America > Canada > Quebec > Montreal (0.15)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Washington > King County > Redmond (0.04)
- (6 more...)
- Energy (1.00)
- Education > Educational Setting (0.54)
- Information Technology > Services (0.47)
- Government > Regional Government (0.46)
Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
Jiang, Albert Q., Welleck, Sean, Zhou, Jin Peng, Li, Wenda, Liu, Jiacheng, Jamnik, Mateja, Lacroix, Timothée, Wu, Yuhuai, Lample, Guillaume
The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.
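The sketch-guided proving loop can be caricatured in a few lines: the formal sketch leaves open subgoals, and the automated prover is asked to close each one, so the full proof succeeds only if every hole closes. The toy prover and goal encoding below are illustrative assumptions, not the paper's implementation.

```python
def prove_from_sketch(sketch_holes, automated_prover):
    """Try to close each open subgoal from a formal proof sketch; the
    sketch succeeds only if every subgoal is closed."""
    closed = [automated_prover(goal) for goal in sketch_holes]
    return all(closed), sum(closed)

# A toy "prover" that can only handle arithmetic subgoals.
def toy_prover(goal):
    return goal.startswith("arith:")

ok, n = prove_from_sketch(["arith: 2 + 2 = 4", "arith: 3 * 3 = 9"], toy_prover)
```

The point of the decomposition is visible even in this toy: each subgoal is far easier than the original theorem, so a prover that would fail on the whole problem can still succeed hole by hole.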
- North America > United States > North Carolina > Wake County > Morrisville (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (4 more...)
How will AI change mathematics? Rise of chatbots highlights discussion
AI tools have allowed researchers to solve complex mathematical problems. Credit: Fadel Senna/AFP/Getty

As interest in chatbots spreads like wildfire, mathematicians are beginning to explore how artificial intelligence (AI) could help them to do their work. Whether it's assisting with verifying human-written work or suggesting new ways to solve difficult problems, automation is beginning to change the field in ways that go beyond mere calculation, researchers say. "We're looking at a very specific question: will machines change math?" says Andrew Granville, a number theorist at the University of Montreal in Canada. A workshop at the University of California, Los Angeles (UCLA), this week explored this question, aiming to build bridges between mathematicians and computer scientists. "Most mathematicians are completely unaware of these opportunities," says one of the event's organizers, Marijn Heule, a computer scientist at Carnegie Mellon University in Pittsburgh, Pennsylvania.
- North America > United States > California > Los Angeles County > Los Angeles (0.55)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.25)
- North America > Canada > Quebec > Montreal (0.25)
- (6 more...)
Minerva: A File-Based Ransomware Detector
Hitaj, Dorjan, Pagnotta, Giulio, De Gaspari, Fabio, De Carli, Lorenzo, Mancini, Luigi V.
Ransomware is a rapidly evolving type of malware designed to encrypt user files on a device, making them inaccessible in order to exact a ransom. Ransomware attacks resulted in billions of dollars in damages in recent years and are expected to cause hundreds of billions more in the next decade. With current state-of-the-art process-based detectors being heavily susceptible to evasion attacks, no comprehensive solution to this problem is available today. This paper presents Minerva, a new approach to ransomware detection. Unlike current methods focused on identifying ransomware based on process-level behavioral modeling, Minerva detects ransomware by building behavioral profiles of files based on all the operations they receive in a time window. Minerva addresses some of the critical challenges associated with process-based approaches, specifically their vulnerability to complex evasion attacks. Our evaluation of Minerva demonstrates its effectiveness in detecting ransomware attacks, including those that are able to bypass existing defenses. Our results show that Minerva identifies ransomware activity with an average accuracy of 99.45% and an average recall of 99.66%, with 99.97% of ransomware detected within 1 second.
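The file-centric idea can be sketched as follows: group the operations each file receives inside a time window and flag files whose recent activity matches an encryption-like pattern. The operation names, window length, and flagging rule below are assumptions for illustration, not Minerva's actual detection model.

```python
from collections import defaultdict

def flag_files(events, window=1.0):
    """events: (timestamp, path, op) tuples. Flag files whose most recent
    time window contains the read -> write -> rename sequence typical of
    in-place encryption."""
    by_file = defaultdict(list)
    for ts, path, op in events:
        by_file[path].append((ts, op))
    flagged = set()
    for path, ops in by_file.items():
        ops.sort()
        last_ts = ops[-1][0]
        recent = {op for ts, op in ops if last_ts - ts <= window}
        if {"read", "write", "rename"} <= recent:
            flagged.add(path)
    return flagged

events = [(0.0, "a.doc", "read"), (0.2, "a.doc", "write"),
          (0.5, "a.doc", "rename"), (0.1, "b.doc", "read")]
```

Profiling files rather than processes means an attacker cannot evade detection by splitting the malicious behavior across many benign-looking processes, since the victim file still accumulates the telltale operation sequence.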
- North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.14)
- Europe > Italy > Lazio > Rome (0.04)