Can You Trust Your Copilot? A Privacy Scorecard for AI Coding Assistants

AL-Maamari, Amir

arXiv.org Artificial Intelligence

The rapid integration of AI-powered coding assistants into developer workflows has raised significant privacy and trust concerns. As developers entrust proprietary code to services like OpenAI's GPT, Google's Gemini, and GitHub Copilot, the unclear data handling practices of these tools create security and compliance risks. This paper addresses this challenge by introducing and applying a novel, expert-validated privacy scorecard. The methodology involves a detailed analysis of four document types, from legal policies to external audits, to score five leading assistants against 14 weighted criteria. A legal expert and a data protection officer refined these criteria and their weighting. The results reveal a distinct hierarchy of privacy protections, with a 20-point gap between the highest- and lowest-ranked tools. The analysis uncovers common industry weaknesses, including the pervasive use of opt-out consent for model training and a near-universal failure to filter secrets from user prompts proactively. The resulting scorecard provides actionable guidance for developers and organizations, enabling evidence-based tool selection. This work establishes a new benchmark for transparency and advocates for a shift towards more user-centric privacy standards in the AI industry.
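
The abstract describes a weighted scoring methodology (14 criteria with expert-adjusted weights) without reproducing the criteria themselves. Below is a minimal sketch of how such a weighted scorecard could be computed; the criterion names, weights, and scores are hypothetical placeholders, not the paper's validated instrument.

```python
# Hypothetical weighted scorecard: each criterion maps to (weight, score in [0, 1]).
# Criterion names, weights, and scores are illustrative, not the paper's.
criteria = {
    "training_consent_is_opt_in": (3.0, 0.0),
    "proactive_secret_filtering": (2.5, 0.0),
    "documented_retention_limits": (2.0, 0.5),
    "external_audit_available":   (1.5, 1.0),
}

def scorecard_total(criteria: dict) -> float:
    """Weighted average of per-criterion scores, rescaled to 0-100."""
    total_weight = sum(w for w, _ in criteria.values())
    weighted_sum = sum(w * s for w, s in criteria.values())
    return 100.0 * weighted_sum / total_weight

print(f"Privacy score: {scorecard_total(criteria):.1f}/100")
```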


Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs

Gautam, Somraj, Penamakuri, Abhirama Subramanyam, Bhandari, Abhishek, Harit, Gaurav

arXiv.org Artificial Intelligence

We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.
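
Since the benchmark is hosted on Hugging Face at the URL above, a quick way to inspect it is the `datasets` library, as sketched below. The splits and columns printed are whatever the dataset actually exposes; any schema beyond that (specific question/answer field names, or a required configuration name) would be an assumption.

```python
# Sketch: inspect MMCricBench via the Hugging Face `datasets` library.
# The dataset id comes from the paper's URL; splits and columns are printed,
# not assumed.
from datasets import load_dataset

ds = load_dataset("DIALab/MMCricBench")
print(ds)  # shows the available splits and their column names

first_split = next(iter(ds))
print(ds[first_split][0])  # one example, to see the actual schema
```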


LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

Wu, Zekun, Cho, Seonglae, Mohammed, Umar, Munoz, Cristian, Costa, Kleyton, Guan, Xin, King, Theo, Wang, Ze, Kazim, Emre, Koshiyama, Adriano

arXiv.org Artificial Intelligence

Open-source AI libraries are foundational to modern AI systems, yet they present significant, underexamined risks spanning security, licensing, maintenance, supply chain integrity, and regulatory compliance. We introduce LibVulnWatch, a system that leverages recent advances in large language models and agentic workflows to perform deep, evidence-based evaluations of these libraries. Built on a graph-based orchestration of specialized agents, the framework extracts, verifies, and quantifies risk using information from repositories, documentation, and vulnerability databases. LibVulnWatch produces reproducible, governance-aligned scores across five critical domains, publishing results to a public leaderboard for ongoing ecosystem monitoring. Applied to 20 widely used libraries, including ML frameworks, LLM inference engines, and agent orchestration tools, our approach covers up to 88% of OpenSSF Scorecard checks while surfacing up to 19 additional risks per library, such as critical RCE vulnerabilities, missing SBOMs, and regulatory gaps. By integrating advanced language technologies with the practical demands of software risk assessment, this work demonstrates a scalable, transparent mechanism for continuous supply chain evaluation and informed library selection.
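
As a rough illustration of the "specialized agent per risk domain" idea, here is a minimal sketch; the agent logic, evidence sources, and 0-10 scale are hypothetical stand-ins for LibVulnWatch's actual graph-based orchestration and scoring rubric.

```python
# Hypothetical per-domain assessment pass over one library.
# Domain names follow the abstract; everything else is illustrative.
RISK_DOMAINS = ["security", "licensing", "maintenance", "supply_chain", "compliance"]

def run_agent(domain: str, evidence: dict) -> float:
    """Stand-in for a specialized agent: score one domain from collected evidence."""
    # A real agent would verify findings against repositories, docs, and vulnerability DBs.
    return min(10.0, float(len(evidence.get(domain, []))))

def assess_library(name: str, evidence: dict) -> dict:
    report = {domain: run_agent(domain, evidence) for domain in RISK_DOMAINS}
    report["library"] = name
    return report

print(assess_library("example-lib", {"security": ["unpatched CVE"], "supply_chain": ["missing SBOM"]}))
```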


Skillful joint probabilistic weather forecasting from marginals

Alet, Ferran, Price, Ilan, El-Kadi, Andrew, Masters, Dominic, Markou, Stratis, Andersson, Tom R., Stott, Jacklynn, Lam, Remi, Willson, Matthew, Sanchez-Gonzalez, Alvaro, Battaglia, Peter

arXiv.org Artificial Intelligence

Machine learning (ML)-based weather models have rapidly risen to prominence because they are more accurate and faster than traditional forecasts based on numerical weather prediction (NWP), and they have recently outperformed traditional ensembles in global probabilistic weather forecasting. This paper presents FGN, a simple, scalable and flexible modeling approach which significantly outperforms the current state-of-the-art models. FGN generates ensembles via learned model-perturbations with an ensemble of appropriately constrained models. It is trained directly to minimize the continuous ranked probability score (CRPS) of per-location forecasts. It produces state-of-the-art ensemble forecasts as measured by a range of deterministic and probabilistic metrics, makes skillful ensemble tropical cyclone track predictions, and captures joint spatial structure despite being trained only on marginals.
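
For context on the training target, the CRPS of an ensemble forecast at one location can be estimated from the members alone. The sketch below uses the standard kernel form of the estimator and is illustrative only, not FGN's training code.

```python
# Empirical CRPS for a single location from ensemble members (kernel estimator):
# CRPS ~= mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|
import numpy as np

def ensemble_crps(members, observation: float) -> float:
    members = np.asarray(members, dtype=float)
    spread_to_obs = np.mean(np.abs(members - observation))
    member_spread = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return spread_to_obs - member_spread

print(ensemble_crps([12.1, 13.4, 11.8, 12.9], observation=12.5))
```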


AI Data Development: A Scorecard for the System Card Framework

Bahiru, Tadesse K., Tibebu, Haileleol, Kakadiaris, Ioannis A.

arXiv.org Artificial Intelligence

Artificial intelligence has transformed numerous industries, from healthcare to finance, enhancing decision-making through automated systems. However, the reliability of these systems depends largely on the quality of the underlying datasets, raising ongoing concerns about transparency, accountability, and potential biases. This paper introduces a scorecard designed to evaluate the development of AI datasets, focusing on five key areas from the system card framework's data development life cycle: data dictionary, collection process, composition, motivation, and pre-processing. The method follows a structured approach, using an intake form and scoring criteria to assess the quality and completeness of each dataset. Applied to four diverse datasets, the methodology reveals strengths and areas for improvement. The results are compiled using a scoring system that provides tailored recommendations to enhance the transparency and integrity of each dataset. The scorecard addresses technical and ethical aspects, offering a holistic evaluation of data practices. This approach aims to improve dataset quality and offers practical guidance to curators and researchers developing responsible AI systems, helping ensure fairness and accountability in decision support systems.
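
To make the scoring idea concrete, here is a minimal sketch that averages per-area completeness across the five system card areas named above; the actual intake questions, scale, and weighting are not reproduced here, so treat the numbers and equal weighting as placeholders.

```python
# Hypothetical completeness scoring over the five data development areas
# from the system card framework; fractions and equal weighting are illustrative.
AREAS = ["data_dictionary", "collection_process", "composition",
         "motivation", "pre_processing"]

def score_dataset(completeness: dict) -> dict:
    """completeness maps each area to a fraction in [0, 1]."""
    per_area = {area: round(100 * completeness.get(area, 0.0)) for area in AREAS}
    per_area["overall"] = round(sum(per_area[area] for area in AREAS) / len(AREAS))
    return per_area

print(score_dataset({"data_dictionary": 0.8, "collection_process": 0.5,
                     "composition": 1.0, "motivation": 0.9, "pre_processing": 0.4}))
```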


Automating High Quality RT Planning at Scale

Gao, Riqiang, Diallo, Mamadou, Liu, Han, Magliari, Anthony, Sackett, Jonathan, Verbakel, Wilko, Meyers, Sandra, Zarepisheh, Masoud, Mcbeth, Rafe, Arberet, Simon, Kraus, Martin, Ghesu, Florin C., Kamen, Ali

arXiv.org Artificial Intelligence

Radiotherapy (RT) planning is complex, subjective, and time-intensive. Advances in artificial intelligence (AI) promise to improve its precision, efficiency, and consistency, but progress is often limited by the scarcity of large, standardized datasets. To address this, we introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution designed to generate substantial volumes of consistently high-quality treatment plans, overcoming a key obstacle in the advancement of AI-driven RT planning. Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement, using AI integrated with RT planning software such as Varian's Eclipse. Furthermore, we propose a novel approach for determining optimization parameters that reproduce 3D dose distributions, i.e., a method to convert dose predictions into deliverable treatment plans constrained by machine limitations. A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually, which traditionally require several hours of labor per plan. Committed to public research, the first data release of our AIRTP pipeline includes nine cohorts covering head-and-neck and lung cancer sites to support an AAPM 2025 challenge. To the best of our knowledge, this dataset features more than 10 times the number of plans in the largest existing well-curated public dataset. Repository: https://github.com/RiqiangGao/GDP-HMM_AAPMChallenge
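
The dose-prediction-to-plan conversion mentioned above amounts to turning a predicted 3D dose into optimization objectives a planning system can act on. The sketch below shows one plausible shape of that step; the structure names, percentile choices, and priorities are hypothetical, and the real pipeline drives commercial RT planning software rather than returning tuples.

```python
# Hypothetical conversion of a predicted 3D dose into per-structure objectives.
# Objective types, dose statistics, and priorities are illustrative only.
import numpy as np

def objectives_from_dose(pred_dose: np.ndarray, masks: dict) -> list:
    """Return (structure, objective_type, dose_Gy, priority) tuples."""
    objectives = []
    for name, mask in masks.items():
        dose_in_structure = pred_dose[mask]
        if name.startswith("PTV"):  # target: keep dose inside a tight window
            objectives.append((name, "min_dose", float(np.percentile(dose_in_structure, 5)), 100))
            objectives.append((name, "max_dose", float(np.percentile(dose_in_structure, 95)), 80))
        else:  # organ at risk: cap the mean dose
            objectives.append((name, "max_mean_dose", float(dose_in_structure.mean()), 50))
    return objectives

dose = np.random.rand(8, 8, 8) * 60.0  # toy dose grid (Gy)
masks = {"PTV": np.zeros((8, 8, 8), bool), "Lung_L": np.ones((8, 8, 8), bool)}
masks["PTV"][2:5, 2:5, 2:5] = True
print(objectives_from_dose(dose, masks))
```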


The Multi-Range Theory of Translation Quality Measurement: MQM scoring models and Statistical Quality Control

Lommel, Arle, Gladkoff, Serge, Melby, Alan, Wright, Sue Ellen, Strandvik, Ingemar, Gasova, Katerina, Vaasa, Angelika, Benzo, Andy, Sparano, Romina Marazzato, Foresi, Monica, Innis, Johani, Han, Lifeng, Nenadic, Goran

arXiv.org Artificial Intelligence

The year 2024 marks the 10th anniversary of the Multidimensional Quality Metrics (MQM) framework for analytic translation quality evaluation. The MQM error typology has been widely used by practitioners in the translation and localization industry and has served as the basis for many derivative projects. The annual Conference on Machine Translation (WMT) shared tasks on both human and automatic translation quality evaluations used the MQM error typology. The metric stands on two pillars: error typology and the scoring model. The scoring model calculates the quality score from annotation data, detailing how to convert error type and severity counts into numeric scores to determine if the content meets specifications. Previously, only the raw scoring model had been published. This April, the MQM Council published the Linear Calibrated Scoring Model, officially presented herein, along with the Non-Linear Scoring Model, which had not been published before. This paper details the latest MQM developments and presents a universal approach to translation quality measurement across three sample size ranges. It also explains why Statistical Quality Control should be used for very small sample sizes, starting from a single sentence.
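
As background for readers unfamiliar with MQM scoring, a raw score is essentially a severity-weighted error penalty normalized by the evaluated word count. The weights and scaling below are illustrative placeholders, not the Linear Calibrated or Non-Linear Scoring Models the paper presents.

```python
# Illustrative MQM-style raw score: severity-weighted penalties per evaluated word,
# rescaled to 0-100. Weights and scaling are placeholders, not the published models.
SEVERITY_PENALTY = {"minor": 1, "major": 5, "critical": 25}

def raw_quality_score(error_counts: dict, evaluated_words: int) -> float:
    penalty = sum(SEVERITY_PENALTY[sev] * n for sev, n in error_counts.items())
    return 100.0 * (1.0 - penalty / evaluated_words)

print(raw_quality_score({"minor": 3, "major": 1, "critical": 0}, evaluated_words=500))
```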


State of AI Report 2022: Be Prepared for Next Year - KDnuggets

#artificialintelligence

As the new year approaches, people will start to write down their New Year resolutions and make changes to their lifestyle, career, and more. If you are interested in entering the world of data and AI, or are already there, prepare with the State of AI Report 2022. The State of AI Report has been published since 2018, covering notable developments in AI and what its authors expect in the following year. The annual report is reviewed by leading AI practitioners in industry and research. The aim is to dive deep into the elements of AI, trigger interesting conversations, and learn more about the possible implications AI could raise in the future. The report consists of 114 slides; however, I will give you a breakdown of the structure, what to expect, and some interesting points.


Toward a Fairness-Aware Scoring System for Algorithmic Decision-Making

Yang, Yi, Wu, Ying, Li, Mei, Chang, Xiangyu, Tan, Yong

arXiv.org Artificial Intelligence

Scoring systems, as a type of predictive model, have significant advantages in interpretability and transparency and facilitate quick decision-making. As such, scoring systems have been extensively used in a wide variety of industries such as healthcare and criminal justice. However, the fairness issues in these models have long been criticized, and the use of big data and machine learning algorithms in the construction of scoring systems heightens this concern. In this paper, we propose a general framework to create fairness-aware, data-driven scoring systems. First, we develop a social welfare function that incorporates both efficiency and group fairness. Then, we transform the social welfare maximization problem into the risk minimization task in machine learning, and derive a fairness-aware scoring system with the help of mixed integer programming. Lastly, several theoretical bounds are derived for providing parameter selection suggestions. Our proposed framework provides a suitable solution to address group fairness concerns in the development of scoring systems. It enables policymakers to set and customize their desired fairness requirements as well as other application-specific constraints. We test the proposed algorithm with several empirical data sets. Experimental evidence supports the effectiveness of the proposed scoring system in achieving the optimal welfare of stakeholders and in balancing the needs for interpretability, fairness, and efficiency.
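
To make the efficiency-fairness trade-off concrete, here is a toy welfare-style objective that rewards accuracy and penalizes a disparity in positive prediction rates between two groups; the penalty form and the lambda weight are illustrative and do not reproduce the paper's social welfare function or its mixed-integer programming formulation.

```python
# Toy welfare objective: accuracy minus a weighted demographic-parity gap.
# The gap definition and lambda are illustrative, not the paper's formulation.
import numpy as np

def welfare(y_true, y_pred, group, lam=0.5):
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    accuracy = float(np.mean(y_true == y_pred))
    positive_rate_gap = abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())
    return accuracy - lam * positive_rate_gap

print(welfare(y_true=[1, 0, 1, 0], y_pred=[1, 0, 1, 1], group=[0, 0, 1, 1]))
```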


September 2022 Newsletter AI for Good - AI for Good Foundation

#artificialintelligence

As we say goodbye to summer and fall enters, our newsletter highlights the accomplishments of this year's Summer Fellows; announces the launch of the next phase of our work in Ukraine; and shares the innovative work of members on the AI for Good Council for Good. Our team has been hard at work preparing to launch the DE&I Scorecard, and our Visiting Scholar Dr. Randon Taylor and Fellow Bessie O'Dell have prepared an excerpt from their soon-to-be-published paper. Join us as we reflect and move forward. The AI for Good Foundation is excited to announce the launch of LifeForce in Ukraine. LifeForce provides real-time access to aid and basic needs from over 17,000 locations across the country, coordinates grass-roots humanitarian efforts, and renders war-time supply chains and logistics resilient to attack.