Supplementary Materials of Random Noise Defense against Query-Based Black-Box Attacks
In Section A, we discuss the societal impacts of our work. In Section B, we provide detailed experimental settings as well as further evaluation results on CIFAR-10 and ImageNet. For real-world applications, the DNN model, as well as the training dataset, is often hidden from users. Extensive experiments verify our theoretical analysis and show the effectiveness of our defense methods against several state-of-the-art query-based attacks. On ImageNet, [23] released the ResNet-50 model fine-tuned with Gaussian noise sampled from N(0, 0.5I), and we directly adopt it. The experimental results on ImageNet are shown in Figure 3 (a-d).
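The defense described above can be sketched in a few lines: add Gaussian noise to each incoming query before the model evaluates it, so a query-based black-box attacker only observes noisy feedback. This is a minimal illustration, not the paper's implementation; `rnd_defense`, `toy_model`, and the `sigma=0.5` default are hypothetical names chosen here to mirror the N(0, 0.5I) noise mentioned in the text.

```python
import numpy as np

def rnd_defense(model_fn, x, sigma=0.5, rng=None):
    """Random-noise defense (sketch): perturb each query with
    Gaussian noise N(0, sigma^2 I) before the model sees it, so
    repeated identical queries return randomized outputs."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(x))
    return model_fn(x + noise)

# Toy stand-in "model": a fixed linear score over 4 features.
def toy_model(x):
    w = np.array([0.5, -1.0, 0.25, 2.0])
    return float(x @ w)

x = np.ones(4)
clean = toy_model(x)                           # deterministic score
noisy = rnd_defense(toy_model, x, sigma=0.5)   # randomized score
```

With `sigma=0` the defense is a no-op and returns the clean score; increasing `sigma` trades clean accuracy for more distorted attacker feedback, which is the tension the paper's theoretical analysis quantifies.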
AI chatbots miss urgent issues in queries about women's health
Many women are using AI for health information, but the answers aren't always up to scratch. AI models such as ChatGPT and Gemini fail to give adequate advice for 60 per cent of queries relating to women's health in a test created by medical professionals.

Commonly used AI models fail to accurately diagnose or offer advice for many queries relating to women's health that require urgent attention. Thirteen large language models, produced by the likes of OpenAI, Google, Anthropic, Mistral AI and xAI, were given 345 medical queries across five specialities, including emergency medicine, gynaecology and neurology. The queries were written by 17 women's health researchers, pharmacists and clinicians from the US and Europe, and the answers were reviewed by the same experts. Any queries that the models failed on were collated into a benchmarking test of AI models' medical expertise that included 96 queries.
Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks
Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni, Daniel Maninger, Mira Mezini
Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve - information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs most often fail to solve. To understand the causes of these failures, we investigated whether the static complexity of the solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weakness in LLMs, as well as common complications within benchmark tasks that most often lead to failure.
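One ingredient of such an analysis - measuring the static complexity of solution code - can be sketched with a rough cyclomatic-complexity proxy: one plus the number of branching constructs in the parsed source. This is an illustrative simplification; the paper's actual complexity metrics may differ, and `branch_complexity` is a name chosen here.

```python
import ast

def branch_complexity(source: str) -> int:
    """Rough cyclomatic-complexity proxy for Python source:
    1 + count of branching constructs (if/for/while/try,
    boolean operators, conditional expressions)."""
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.Try,
                    ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(n, branch_nodes) for n in ast.walk(tree))

simple = "def f(x):\n    return x + 1\n"        # straight-line code
branchy = "def g(x):\n    if x > 0:\n        return x\n    return -x\n"
```

Scoring each benchmark task's reference solution this way lets one test whether tasks that LLMs consistently fail have systematically higher static complexity than those they solve.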