InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models
With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which span beyond code generation to answering diverse coding-related questions.
OpenXAI: Towards a Transparent Evaluation of Post hoc Model Explanations
While several types of post hoc explanation methods have been proposed in recent literature, there is very little work on systematically benchmarking these methods. Here, we introduce OpenXAI, a comprehensive and extensible open-source framework for evaluating and benchmarking post hoc explanation methods.
DataPerf: Benchmarks for Data-Centric AI Development
Mark Mazumder
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks.