
Collaborating Authors

MLCommons


Dollar Street Supplementary Information [FINAL]

Neural Information Processing Systems

The original questions are in bold. The subtext to each question is in italics. The answers are in plain text with no formatting. The questions in this section are primarily intended to encourage dataset creators to clearly articulate their reasons for creating the dataset and to promote transparency about funding interests. For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? The Dollar Street dataset is a supervised image dataset derived from Gapminder's Dollar Street project (https://www.gapminder.org/dollar-street) that contains everyday household items from homes around the world. It was created with three goals in mind: 1. Make available a highly curated set of images with valuable metadata (e.g. country, monthly income) that is more closely representative of the geographic and socioeconomic diversity of the world when compared with existing image datasets. 2. Help combat bias in downstream applications. [...] Our evaluation results show that the Dollar Street dataset can add significant value to accuracy improvements when considering computer vision images that represent the geographic and socioeconomic diversity of the world. Concretely, this means that the dataset only contains CC-BY-licensed works.
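The dataset's value comes largely from its per-image metadata (country, monthly income, CC-BY licensing). The sketch below shows one way such metadata might be summarized by income band; the file name and column names (country, monthly_income, label) are illustrative assumptions, not the dataset's documented schema.

```python
# Minimal sketch: summarizing Dollar Street-style image metadata by income band.
# The file name and column names ("country", "monthly_income", "label") are
# illustrative assumptions, not the dataset's documented schema.
import pandas as pd

def income_band(monthly_income: float) -> str:
    """Bucket a household's monthly income (USD) into a coarse band."""
    if monthly_income < 200:
        return "low"
    if monthly_income < 1000:
        return "middle"
    return "high"

def summarize(metadata_csv: str) -> pd.DataFrame:
    df = pd.read_csv(metadata_csv)
    df["income_band"] = df["monthly_income"].apply(income_band)
    # Count images per (label, income_band) to see how evenly classes are
    # represented across socioeconomic groups.
    return df.groupby(["label", "income_band"]).size().unstack(fill_value=0)

if __name__ == "__main__":
    print(summarize("dollar_street_metadata.csv"))
```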


AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons

Ghosh, Shaona, Frase, Heather, Williams, Adina, Luger, Sarah, Röttger, Paul, Barez, Fazl, McGregor, Sean, Fricklas, Kenneth, Kumar, Mala, Feuillade-Montixi, Quentin, Bollacker, Kurt, Friedrich, Felix, Tsang, Ryan, Vidgen, Bertie, Parrish, Alicia, Knotz, Chris, Presani, Eleonora, Bennion, Jonathan, Boston, Marisa Ferrara, Kuniavsky, Mike, Hutiri, Wiebke, Ezick, James, Salem, Malek Ben, Sahay, Rajat, Goswami, Sujata, Gohar, Usman, Huang, Ben, Sarin, Supheakmungkol, Alhajjar, Elie, Chen, Canyu, Eng, Roman, Manjusha, Kashyap Ramanandula, Mehta, Virendra, Long, Eileen, Emani, Murali, Vidra, Natan, Rukundo, Benjamin, Shahbazi, Abolfazl, Chen, Kongtao, Ghosh, Rajat, Thangarasa, Vithursan, Peigné, Pierre, Singh, Abhinav, Bartolo, Max, Krishna, Satyapriya, Akhtar, Mubashara, Gold, Rafael, Coleman, Cody, Oala, Luis, Tashev, Vassil, Imperial, Joseph Marvin, Russ, Amy, Kunapuli, Sasidhar, Miailhe, Nicolas, Delaunay, Julien, Radharapu, Bhaktipriya, Shinde, Rajat, Tuesday, Dutta, Debojyoti, Grabb, Declan, Gangavarapu, Ananya, Sahay, Saurav, Gangavarapu, Agasthya, Schramowski, Patrick, Singam, Stephen, David, Tom, Han, Xudong, Mammen, Priyanka Mary, Prabhakar, Tarunima, Kovatchev, Venelin, Ahmed, Ahmed, Manyeki, Kelvin N., Madireddy, Sandeep, Khomh, Foutse, Zhdanov, Fedor, Baumann, Joachim, Vasan, Nina, Yang, Xianjun, Mougn, Carlos, Varghese, Jibin Rajan, Chinoy, Hussain, Jitendar, Seshakrishna, Maskey, Manil, Hardgrove, Claire V., Li, Tianhao, Gupta, Aakash, Joswin, Emil, Mai, Yifan, Kumar, Shachi H, Patlak, Cigdem, Lu, Kevin, Alessi, Vincent, Balija, Sree Bhargavi, Gu, Chenhe, Sullivan, Robert, Gealy, James, Lavrisa, Matt, Goel, James, Mattson, Peter, Liang, Percy, Vanschoren, Joaquin

arXiv.org Artificial Intelligence

The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories, including violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical as well as organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation. In addition to unveiling the benchmark, this report also identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.
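The benchmark rolls per-hazard results into a five-tier grade (Poor to Excellent) and applies an entropy-based treatment of system-response evaluation. The sketch below is not the published AILuminate methodology; it only illustrates, under invented thresholds, how per-hazard violation rates could map to five tiers and how Shannon entropy over evaluator labels can flag uncertain verdicts.

```python
# Illustrative sketch only: rolling per-hazard violation rates up into a
# five-tier grade and using Shannon entropy to flag uncertain evaluator
# verdicts. Thresholds and tier cut-offs are invented for illustration and
# are NOT the published AILuminate grading rules.
import math
from collections import Counter

TIERS = ["Poor", "Fair", "Good", "Very Good", "Excellent"]

def grade(violation_rates: dict[str, float]) -> str:
    """Map the worst per-hazard violation rate onto a five-tier grade."""
    worst = max(violation_rates.values())
    cutoffs = [0.50, 0.25, 0.10, 0.02]  # hypothetical boundaries
    for tier, cutoff in zip(TIERS, cutoffs):
        if worst >= cutoff:
            return tier
    return TIERS[-1]

def verdict_entropy(labels: list[str]) -> float:
    """Shannon entropy (bits) of evaluator labels for one response;
    high entropy means the evaluators disagree and the item needs review."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    rates = {"violent_crimes": 0.01, "privacy": 0.04, "hate": 0.12}
    print(grade(rates))                                   # "Good" under these made-up cutoffs
    print(verdict_entropy(["unsafe", "safe", "unsafe"]))  # ~0.92 bits of disagreement
```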


New AI test measures how fast robots can respond to user commands

FOX News

Artificial intelligence benchmarking group MLCommons on Wednesday released a fresh set of tests and results that rate the speed at which top-of-the-line hardware can run AI applications and respond to users. The two new benchmarks added by MLCommons measure the speed at which AI chips and systems can generate responses from powerful AI models packed with data. The results roughly demonstrate how quickly an AI application such as ChatGPT can deliver a response to a user query. One of the new benchmarks added the capability to measure the speed of a question-and-answer scenario for large language models.
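As a rough illustration of the quantity such response-speed benchmarks report (not the MLPerf methodology itself), the sketch below times time-to-first-token and total generation time around a hypothetical streaming generate function.

```python
# Rough illustration of the latency quantities such benchmarks measure:
# time-to-first-token and total time per query. `generate_stream` is a
# hypothetical stand-in for a model serving API, not an MLPerf harness.
import time
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    """Placeholder that yields tokens with an artificial delay."""
    for token in ("The", " answer", " is", " 42", "."):
        time.sleep(0.05)
        yield token

def measure(prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_latency = 0.0
    for i, _ in enumerate(generate_stream(prompt)):
        if i == 0:
            first_token_latency = time.perf_counter() - start
    total_latency = time.perf_counter() - start
    return first_token_latency, total_latency

if __name__ == "__main__":
    ttft, total = measure("What does MLPerf measure?")
    print(f"time to first token: {ttft:.3f}s, total: {total:.3f}s")
```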


Improvements & Evaluations on the MLCommons CloudMask Benchmark

Chennamsetti, Varshitha, Mehnaz, Laiba, Zhao, Dan, Ghosh, Banani, Samsonau, Sergey V.

arXiv.org Artificial Intelligence

In this paper, we report the performance benchmarking results of deep learning models on MLCommons' Science cloud-masking benchmark using a high-performance computing cluster at New York University (NYU): NYU Greene. MLCommons is a consortium that develops and maintains several scientific benchmarks that can benefit from developments in AI. We provide a description of the cloud-masking benchmark task, updated code, and the best model for this benchmark when using our selected hyperparameter settings. Our benchmarking results include the highest accuracy achieved on the NYU system as well as the average time taken for both training and inference on the benchmark across several runs/seeds. Our code can be found on GitHub. The MLCommons team has been kept informed about our progress and may use the developed code in their future work.
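The reported metrics are accuracy plus the average training and inference time across several runs/seeds. The sketch below illustrates that bookkeeping pattern with a placeholder train_and_evaluate function; it is not the benchmark's actual harness.

```python
# Sketch of the reporting pattern described above: run the same experiment
# under several seeds and report mean accuracy and mean wall-clock time.
# `train_and_evaluate` is a placeholder, not the CloudMask benchmark code.
import random
import statistics
import time

def train_and_evaluate(seed: int) -> float:
    """Placeholder standing in for one full training + inference run."""
    random.seed(seed)
    time.sleep(0.01)                          # pretend work
    return 0.88 + random.random() * 0.04      # pretend accuracy

def benchmark(seeds: list[int]) -> None:
    accuracies, durations = [], []
    for seed in seeds:
        start = time.perf_counter()
        accuracies.append(train_and_evaluate(seed))
        durations.append(time.perf_counter() - start)
    print(f"mean accuracy: {statistics.mean(accuracies):.4f} "
          f"(best {max(accuracies):.4f}), "
          f"mean time: {statistics.mean(durations):.3f}s over {len(seeds)} seeds")

if __name__ == "__main__":
    benchmark([0, 1, 2, 3, 4])
```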


Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models

Parrish, Alicia, Kirk, Hannah Rose, Quaye, Jessica, Rastogi, Charvi, Bartolo, Max, Inel, Oana, Ciro, Juan, Mosquera, Rafael, Howard, Addison, Cukierski, Will, Sculley, D., Reddi, Vijay Janapa, Aroyo, Lora

arXiv.org Artificial Intelligence

The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capabilities to generate realistic and creative content, these T2I models like DALL-E, MidJourney, Imagen or Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors inherited from pretraining on uncurated internet-scraped datasets thus have the potential to cause wide-reaching harm, for example, through generated images which are violent, sexually explicit, or contain biased and derogatory stereotypes. Despite this risk of harm, we lack systematic and structured evaluation datasets to scrutinize model behavior, especially adversarial attacks that bypass existing safety filters. A typical bottleneck in safety evaluation is achieving a wide coverage of different types of challenging examples in the evaluation set, i.e., identifying 'unknown unknowns' or long-tail problems. To address this need, we introduce the Adversarial Nibbler challenge. The goal of this challenge is to crowdsource a diverse set of failure modes and reward challenge participants for successfully finding safety vulnerabilities in current state-of-the-art T2I models. Ultimately, we aim to provide greater awareness of these issues and assist developers in improving the future safety and reliability of generative AI models. Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.
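As a hedged illustration of the kind of structured record a participant might keep for each discovered failure mode, the sketch below defines a minimal report format; the field names are invented and are not the official Adversarial Nibbler submission schema.

```python
# Minimal sketch of the kind of record a challenge participant might keep for
# each discovered failure: the adversarial prompt, the model probed, and the
# observed harm type. Field names are illustrative, not the official
# Adversarial Nibbler submission schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class FailureReport:
    prompt: str
    model: str
    harm_type: str              # e.g. "violent", "sexually explicit", "stereotype"
    bypassed_safety_filter: bool
    notes: str = ""

reports = [
    FailureReport(
        prompt="<adversarial prompt text>",
        model="<T2I model under test>",
        harm_type="stereotype",
        bypassed_safety_filter=True,
        notes="benign-sounding wording, harmful output",
    ),
]

# Write one JSON object per line for later review or submission.
with open("nibbler_reports.jsonl", "w") as f:
    for r in reports:
        f.write(json.dumps(asdict(r)) + "\n")
```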


MLCommons

#artificialintelligence

MLCube is a set of best practices for creating ML software that can just "plug-and-play" on many different systems. MLCube makes it easier for researchers to share innovative ML models, for a developer to experiment with many different models, and for software companies to create infrastructure for models. It creates opportunities by putting ML in the hands of more people. MLCube isn't a new framework or service; MLCube is a consistent interface to machine learning models in containers like Docker. Models published with the MLCube interface can be run on local machines, on a variety of major clouds, or in Kubernetes clusters -- all using the same code.
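A hedged sketch of how MLCube's single interface might be driven from Python is below; the CLI flags shown (--mlcube, --task, --platform) follow the MLCube documentation as best recalled and should be verified against the project's docs before relying on them.

```python
# Hedged sketch: driving the MLCube CLI from Python so the same task runs
# unchanged on a laptop, a cloud VM, or a Kubernetes-backed runner.
# Flag names follow the MLCube documentation as best recalled; verify them
# against the project's docs before relying on this.
import subprocess

def run_mlcube_task(mlcube_dir: str, task: str, platform: str = "docker") -> None:
    cmd = [
        "mlcube", "run",
        f"--mlcube={mlcube_dir}",   # directory containing mlcube.yaml
        f"--task={task}",           # e.g. "download", "train", "evaluate"
        f"--platform={platform}",   # runner backend, e.g. "docker"
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # The same call is intended to work whether the backing runner is local
    # Docker or a remote cluster.
    run_mlcube_task(mlcube_dir=".", task="train")
```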


Nvidia and Intel show machine learning performance gains on latest MLPerf Training 2.1 results

#artificialintelligence

MLCommons is out today with its latest set of machine learning (ML) MLPerf benchmarks, once again showing how hardware and software for artificial intelligence (AI) are getting faster. MLCommons is a vendor-neutral organization that aims to provide standardized testing and benchmarks to help evaluate the state of ML software and hardware. Under the MLPerf testing name, MLCommons collects different ML benchmarks multiple times throughout the year. In September, the MLPerf Inference results were released, showing gains in how different technologies have improved inference performance.


FastML Science Benchmarks: Accelerating Real-Time Scientific Edge Machine Learning

Duarte, Javier, Tran, Nhan, Hawks, Ben, Herwig, Christian, Muhizi, Jules, Prakash, Shvetank, Reddi, Vijay Janapa

arXiv.org Artificial Intelligence

Applications of machine learning (ML) are growing by the day for many unique and challenging scientific applications. However, a crucial challenge facing these applications is their need for ultra low-latency and on-detector ML capabilities. Given the slowdown in Moore's law and Dennard scaling, coupled with the rapid advances in scientific instrumentation that is resulting in growing data rates, there is a need for ultra-fast ML at the extreme edge. Fast ML at the edge is essential for reducing and filtering scientific data in real-time to accelerate science experimentation and enable more profound insights. To accelerate real-time scientific edge ML hardware and software solutions, we need well-constrained benchmark tasks with enough specifications to be generically applicable and accessible. These benchmarks can guide the design of future edge ML hardware for scientific applications capable of meeting the nanosecond and microsecond level latency requirements. To this end, we present an initial set of scientific ML benchmarks, covering a variety of ML and embedded system techniques.
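As an illustration of the quantity these benchmarks constrain, per-sample inference latency, the sketch below times a tiny NumPy stand-in model in microseconds on commodity hardware; it is not one of the FastML benchmark tasks.

```python
# Illustration of the quantity these benchmarks constrain: per-sample
# inference latency, here reported in microseconds for a tiny NumPy "model"
# on commodity hardware. This is not one of the FastML benchmark tasks.
import time
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 4))       # tiny stand-in for an on-detector model

def infer(sample: np.ndarray) -> np.ndarray:
    return np.maximum(sample @ weights, 0.0)  # one dense layer + ReLU

def latency_us(n_samples: int = 10_000) -> float:
    sample = rng.standard_normal(16)
    start = time.perf_counter()
    for _ in range(n_samples):
        infer(sample)
    return (time.perf_counter() - start) / n_samples * 1e6

if __name__ == "__main__":
    print(f"mean latency: {latency_us():.1f} us per sample")
```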


MLPerf Results Highlight More Capable ML Training - insideBIGDATA

#artificialintelligence

Today, MLCommons, an open engineering consortium, released new results from MLPerf Training v2.0, which measures the performance of training machine learning models. Training models empowers researchers to unlock new capabilities faster, such as diagnosing tumors, automatic speech recognition, or improving movie recommendations. The latest MLPerf Training results demonstrate broad industry participation and up to 1.8X greater performance, ultimately paving the way for more capable intelligent systems to benefit society at large. The MLPerf Training benchmark suite comprises full-system tests that stress machine learning models, software, and hardware for a broad range of applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry.