Eka-Eval: An Evaluation Framework for Low-Resource Multilingual Large Language Models
Sinha, Samridhi Raj, Sheth, Rajvee, Upperwal, Abhishek, Singh, Mayank
The rapid evolution of Large Language Models has underscored the need for evaluation frameworks that are globally applicable, flexible, and modular, and that support a wide range of tasks, model types, and linguistic settings. We introduce EKA-EVAL, a unified, end-to-end framework that combines a zero-code web interface and an interactive CLI to ensure broad accessibility. It integrates 50+ multilingual benchmarks across nine evaluation categories, supports local and proprietary models, and provides 11 core capabilities through a modular, plug-and-play architecture. Designed for scalable, multilingual evaluation with support for low-resource languages, EKA-EVAL is, to the best of our knowledge, the first suite to offer comprehensive coverage in a single platform. Comparisons against five existing baselines indicate improvements of at least 2x on key usability measures, along with the highest user satisfaction, faster setup times, and consistent benchmark reproducibility. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval.
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Chervyakov, Artem, Kharitonov, Alexander, Zadorozhny, Pavel, Adamenko, Pavel, Levichev, Rodion, Vorobev, Dmitrii, Salikhov, Dmitrii, Valeev, Aidar, Pestova, Alena, Dziuba, Maria, Alimova, Ilseyar, Zavgorodnev, Artem, Medvedev, Aleksandr, Moiseev, Stanislav, Bruches, Elena, Grebenkin, Daniil, Derunets, Roman, Vikulov, Vladimir, Emelyanov, Anton, Babaev, Dmitrii, Ivanov, Vladimir V., Malykh, Valentin, Fenogenova, Alena
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding the true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA Code to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
Trustless Federated Learning at Edge-Scale: A Compositional Architecture for Decentralized, Verifiable, and Incentive-Aligned Coordination
Onobhayedo, Pius, Oamen, Paul Osemudiame
Artificial intelligence is retracing the Internet's path from centralized provision to distributed creation. Initially, resource-intensive computation concentrates within institutions capable of training and serving large models. Eventually, as federated learning matures, billions of edge devices holding sensitive data will be able to collectively improve models without surrendering raw information, enabling both contribution and consumption at scale. This democratic vision remains unrealized due to certain compositional gaps: aggregators handle updates without accountability; economic mechanisms are lacking and, even when present, remain vulnerable to gaming; coordination serializes state modifications, limiting scalability; and governance permits retroactive manipulation. This work addresses these gaps by leveraging cryptographic receipts to prove aggregation correctness, geometric novelty measurement to prevent incentive gaming, parallel object ownership to achieve linear scalability, and time-locked policies to block retroactive manipulation. The product of this work is a design architecture--not an actual implementation--that seeks to pass the baton in the race toward truly collaborative intelligence; an intelligence of the people, by the people, for the people.
Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures
Wang, Haohui, Qi, Jingyuan, Chen, Jianpeng, Wu, Jun, Huang, Lifu, Zheng, Lecheng, Choi, Kevin, Veeramani, Balaji, Bowen, Edward, Hu, Alison, Cody, Tyler, Zhou, Dawei
The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.
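The truncation effect the abstract attributes to generation mechanisms like top-p sampling can be illustrated directly. The sketch below (illustrative only, not the paper's method) applies a nucleus filter to a Zipf-like token distribution and measures how much probability mass beyond rank 100 survives; the rank-100 threshold and the Zipf stand-in are arbitrary choices for demonstration.

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p; zero out the rest and renormalize."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # tokens kept in the nucleus
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

# A heavy-tailed "vocabulary" distribution (Zipf-like stand-in)
vocab = 1000
probs = 1.0 / np.arange(1, vocab + 1)
probs /= probs.sum()

filtered = top_p_filter(probs, p=0.9)
tail_before = probs[100:].sum()
tail_after = filtered[100:].sum()
print(f"tail mass beyond rank 100: {tail_before:.3f} -> {tail_after:.3f}")
```

Because the lowest-probability tokens are zeroed out entirely before renormalization, repeated generation from the filtered distribution systematically underrepresents long-tail knowledge, which is the distributional discrepancy the paper analyzes.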
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark Supplementary Material David Romero
For what purpose was this dataset created? Was there a specific gap that needed to be filled? Who created the dataset (e.g., which team, research group) and on behalf of which entity? The CVQA is led by a team of researchers from MBZUAI. Who funded the creation of the dataset? No grant; all expenses were funded by MBZUAI's faculty startup fund.
The Man Who Invented AGI
Everyone is obsessed with artificial general intelligence--the stage when AI can match all feats of human cognition. The guy who named it saw it as a threat. In the summer of 1956, a group of academics--now we'd call them computer scientists, but there was no such thing then--met on the Dartmouth College campus in New Hampshire to discuss how to make machines think like humans. One of them, John McCarthy, coined the term "artificial intelligence." This legendary meeting, and the naming of a new field, are well known.
DeepCausalMMM: A Deep Learning Framework for Marketing Mix Modeling with Causal Inference
Marketing Mix Modeling (MMM) is a statistical technique used to estimate the impact of marketing activities on business outcomes such as sales, revenue, or customer visits. Traditional MMM approaches often rely on linear regression or Bayesian hierarchical models that assume independence between marketing channels and struggle to capture complex temporal dynamics and non-linear saturation effects [@Chan2017; @Hanssens2005; @Ng2021Bayesian]. **DeepCausalMMM** is a Python package that addresses these limitations by combining deep learning, causal inference, and advanced marketing science. The package uses Gated Recurrent Units (GRUs) to automatically learn temporal patterns such as adstock (carryover effects) and lag, while simultaneously learning statistical dependencies and potential causal structures between marketing channels through Directed Acyclic Graph (DAG) learning [@Zheng2018NOTEARS; @Gong2024CausalMMM]. Additionally, it implements Hill equation-based saturation curves to model diminishing returns and optimize budget allocation. Key features include: (1) a data-driven design where hyperparameters and transformations (e.g., adstock decay, saturation curves) are learned or estimated from data with sensible defaults, rather than requiring fixed heuristics or manual specification, (2) multi-region modeling with both shared and region-specific parameters, (3) robust statistical methods including Huber loss and advanced regularization, and (4) comprehensive response curve analysis for understanding channel saturation.
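The two transformations named above, geometric adstock (carryover) and Hill-equation saturation, can be sketched in plain NumPy. This is a minimal illustration of the underlying math, not the DeepCausalMMM API; the function names and the `decay`, `half_sat`, and `slope` parameters are assumptions chosen for the example (in the package these would be learned from data rather than fixed).

```python
import numpy as np

def geometric_adstock(spend, decay=0.6):
    """Carryover: effective exposure today is today's spend plus a
    decayed fraction of yesterday's effective exposure."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + decay * carry
        out[t] = carry
    return out

def hill_saturation(x, half_sat=120.0, slope=1.5):
    """Hill curve: monotone, diminishing returns, bounded in [0, 1)."""
    return x**slope / (half_sat**slope + x**slope)

spend = np.array([0, 50, 200, 0, 0, 100], dtype=float)
exposure = geometric_adstock(spend)          # spend persists after the flight ends
response = hill_saturation(exposure)         # large spends hit diminishing returns
```

The composition `hill_saturation(geometric_adstock(spend))` is the standard channel-response shape in MMM: carryover spreads a spend pulse over time, and the Hill curve caps how much incremental response any one period can produce.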
Causal Language Control in Multilingual Transformers via Sparse Feature Steering
Chou, Cheng-Ting, Liu, George, Sun, Jessica, Blondin, Cole, Zhu, Kevin, Sharma, Vasu, O'Brien, Sean
Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90\% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.
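The steering operation described above, clamping a single SAE feature and writing the change back into the residual stream, can be sketched abstractly. Everything below is a hypothetical stand-in: the encoder/decoder matrices are random placeholders (a real run would load pretrained SAE weights for a specific Gemma layer), and `steer_residual`, `target_activation`, and `feature_idx` are names invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 256

# Hypothetical SAE weights; real weights would come from a pretrained SAE
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def steer_residual(resid, feature_idx, target_activation=8.0):
    """Clamp one SAE feature to a fixed value and add the corresponding
    decoder direction back into the residual stream."""
    acts = np.maximum(resid @ W_enc, 0.0)          # SAE encoder (ReLU)
    delta = target_activation - acts[feature_idx]  # how far to move the feature
    return resid + delta * W_dec[feature_idx]      # decoder row = feature direction

resid = rng.normal(size=d_model)
steered = steer_residual(resid, feature_idx=42)
```

The key property is that the edit is a single rank-one addition to one layer's residual stream, which is why the paper can describe the mechanism as lightweight: no weights are updated and only one feature direction is touched.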