Goto

Collaborating Authors

 Krause, Michael


Humanity's Last Exam

arXiv.org Artificial Intelligence

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.


Retail Analytics in the New Normal: The Influence of Artificial Intelligence and the Covid-19 Pandemic

arXiv.org Artificial Intelligence

The COVID-19 pandemic has severely disrupted the retail landscape and has accelerated the adoption of innovative technologies. A striking example relates to the proliferation of online grocery orders and the technology deployed to facilitate such logistics. In fact, for many retailers, this disruption was a wake-up call after which they started recognizing the power of data analytics and artificial intelligence (AI). In this article, we discuss the opportunities that AI can offer to retailers in the new normal retail landscape. Some of the techniques described have been applied at scale to adapt previously deployed AI models, whereas in other instances, fresh solutions needed to be developed to help retailers cope with recent disruptions, such as unexpected panic buying, retraining predictive models, and leveraging online-offline synergies.


Soft Dynamic Time Warping for Multi-Pitch Estimation and Beyond

arXiv.org Artificial Intelligence

Many tasks in music information retrieval (MIR) involve weakly aligned data, where exact temporal correspondences are unknown. The connectionist temporal classification (CTC) loss is a standard technique to learn feature representations based on weakly aligned training data. However, CTC is limited to discrete-valued target sequences and can be difficult to extend to multi-label problems. In this article, we show how soft dynamic time warping (SoftDTW), a differentiable variant of classical DTW, can be used as an alternative to CTC. Using multi-pitch estimation as an example scenario, we show that SoftDTW yields results on par with a state-of-the-art multi-label extension of CTC. In addition to being more elegant in terms of its algorithmic formulation, SoftDTW naturally extends to real-valued target sequences.