Goto

Collaborating Authors

 singh


BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

Neural Information Processing Systems

We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia (animals), Fungi (fungi), and Plantae (plants), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems.We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves (birds), Arachnida} (spiders/ticks/mites), Insecta (insects), Plantae (plants), Fungi (fungi), Mollusca (snails), and Reptilia (snakes/lizards). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels.We anticipate that BioTrove will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BioTrove is publicly available, easily accessible, and ready for immediate use.


An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features

Anees, Shabbir, Anshuman, null, Chaurasia, Ayush, Bogar, Prathmesh

arXiv.org Artificial Intelligence

It is well known that fraudulent reviews cast doubt on the legitimacy and dependability of online purchases. The most recent development that leads customers towards darkness is the appearance of human reviews in computer-generated (CG) ones. In this work, we present an advanced machine-learning-based system that analyses these reviews produced by AI with remarkable precision. Our method integrates advanced text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. We implemented this methodology on a public dataset of 40,432 Original (OR) and Computer-Generated (CG) reviews. From an initial set of 13,539 features, HHO selected the most applicable 1,368 features, achieving an 89.9% dimensionality reduction. Our final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-Score, which demonstrates that the combination of ensemble learning and bio-inspired optimisation is an effective method for machine-generated text recognition. Because large-scale review analytics commonly run on cloud platforms, privacy-preserving techniques such as differential approaches and secure outsourcing are essential to protect user data in these systems.


Named Entity Recognition for the Kurdish Sorani Language: Dataset Creation and Comparative Analysis

Abdalla, Bakhtawar, Nabi, Rebwar Mala, Eshkiki, Hassan, Caraffini, Fabio

arXiv.org Artificial Intelligence

This work contributes towards balancing the inclusivity and global applicability of natural language processing techniques by proposing the first 'name entity recognition' dataset for Kurdish Sorani, a low-resource and under-represented language, that consists of 64,563 annotated tokens. It also provides a tool for facilitating this task in this and many other languages and performs a thorough comparative analysis, including classic machine learning models and neural systems. The results obtained challenge established assumptions about the advantage of neural approaches within the context of NLP. Conventional methods, in particular CRF, obtain F1-scores of 0.825, outperforming the results of BiLSTM-based models (0.706) significantly. These findings indicate that simpler and more computationally efficient classical frameworks can outperform neural architectures in low-resource settings.


A Hybrid Proactive And Predictive Framework For Edge Cloud Resource Management

Kumar, Hrikshesh, Garg, Anika, Gupta, Anshul, Agarwal, Yashika

arXiv.org Artificial Intelligence

Old cloud edge workload resource management is too reactive. The problem with relying on static thresholds is that we are either overspending for more resources than needed or have reduced performance because of their lack. This is why we work on proactive solutions. A framework developed for it stops reacting to the problems but starts expecting them. We design a hybrid architecture, combining two powerful tools: the CNN LSTM model for time series forecasting and an orchestrator based on multi agent Deep Reinforcement Learning In fact the novelty is in how we combine them as we embed the predictive forecast from the CNN LSTM directly into the DRL agent state space. That is what makes the AI manager smarter it sees the future, which allows it to make better decisions about a long term plan for where to run tasks That means finding that sweet spot between how much money is saved while keeping the system healthy and apps fast for users That is we have given it eyes in order to see down the road so that it does not have to lurch from one problem to another it finds a smooth path forward Our tests show our system easily beats the old methods It is great at solving tough problems like making complex decisions and juggling multiple goals at once like being cheap fast and reliable



Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances

Singh, Rishu Kumar, Shreya, Navneet, Das, Sarmistha, Singh, Apoorva, Saha, Sriparna

arXiv.org Artificial Intelligence

Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://github.com/sarmistha-D/VALOR


Cost-Driven Synthesis of Sound Abstract Interpreters

Gu, Qiuhan, Singh, Avaljot, Singh, Gagandeep

arXiv.org Artificial Intelligence

Constructing abstract interpreters that provide global soundness guarantees remains a major obstacle in abstract interpretation. We investigate whether modern LLMs can reduce this burden by leveraging them to synthesize sound, non-trivial abstract interpreters across multiple abstract domains in the setting of neural network verification. We formulate synthesis as a constrained optimization problem and introduce a novel mathematically grounded cost function for measuring unsoundness under strict syntactic and semantic constraints. Based on this formulation, we develop a unified framework that unifies LLM-based generation with syntactic and semantic validation and a quantitative cost-guided feedback mechanism. Empirical results demonstrate that our framework not only matches the quality of handcrafted transformers, but more importantly, discovers sound, high-precision transformers for complex nonlinear operators that are absent from existing literature.


EdgeRunner 20B: Military Task Parity with GPT-5 while Running on the Edge

FitzGerald, Jack, Lazaridis, Aristotelis, Bates, Dylan, Sharma, Aman, Castillo, Jonnathan, Azami, Yousif, Bailey, Sean, Cao, Jeremy, Damianov, Peter, de Haan, Kevin, Kerbs, Luke, Lu, Vincent, Madigan, Joseph, McLaurin, Jeremy, Tainer, Jonathan, Anderson, Dave, Beck, Jonathan, Cuticello, Jamie, Malkerson, Colton, Saltsman, Tyler

arXiv.org Artificial Intelligence

We present EdgeRunner 20B, a fine-tuned version of gpt-oss-20b optimized for military tasks. EdgeRunner 20B was trained on 1.6M high-quality records curated from military documentation and websites. We also present four new tests sets: (a) combat arms, (b) combat medic, (c) cyber operations, and (d) mil-bench-5k (general military knowledge). On these military test sets, EdgeRunner 20B matches or exceeds GPT-5 task performance with 95%+ statistical significance, except for the high reasoning setting on the combat medic test set and the low reasoning setting on the mil-bench-5k test set. Versus gpt-oss-20b, there is no statistically-significant regression on general-purpose benchmarks like ARC-C, GPQA Diamond, GSM8k, IFEval, MMLU Pro, or TruthfulQA, except for GSM8k in the low reasoning setting. We also present analyses on hyperparameter settings, cost, and throughput. These findings show that small, locally-hosted models are ideal solutions for data-sensitive operations such as in the military domain, allowing for deployment in air-gapped edge devices.


Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks

Babakhin, Yauhen, Osmulski, Radek, Ak, Ronay, Moreira, Gabriel, Xu, Mengyao, Schifferer, Benedikt, Liu, Bo, Oldridge, Even

arXiv.org Artificial Intelligence

We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks -- including retrieval, classification and semantic textual similarity (STS) -- and excels in challenging multilingual scenarios, such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. The llama-embed-nemotron-8b is an instruction-aware model, supporting user-defined instructions to enhance performance for specific use-cases. This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.


The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

Issaka, Sheriff, Wang, Keyi, Ajibola, Yinka, Samuel-Ipaye, Oluwatumininu, Zhang, Zhaoyi, Jimenez, Nicte Aguillon, Agyei, Evans Kofi, Lin, Abraham, Ramachandran, Rohan, Mumin, Sadick Abdul, Nchifor, Faith, Shuraim, Mohammed, Liu, Lieqi, Gonzalez, Erick Rosas, Kpei, Sylvester, Osei, Jemimah, Ajeneza, Carlene, Boateng, Persis, Yeboah, Prisca Adwoa Dufie, Gabriel, Saadia

arXiv.org Artificial Intelligence

Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies, with 88\% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that our dataset, combined with fine-tuning, achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has successfully mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.