ench
- Law (1.00)
- Information Technology > Security & Privacy (0.93)
- Europe > Switzerland (0.04)
- North America > Dominican Republic (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (3 more...)
- Research Report (0.67)
- Overview (0.67)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Media (0.68)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
MLLM-C
The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI).
- North America > United States > Ohio (0.04)
- North America > United States > California (0.04)
- Leisure & Entertainment > Sports > Soccer (0.46)
- Education (0.46)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > China > Liaoning Province > Shenyang (0.04)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
AppSelectBench: Application-Level Tool Selection Benchmark
Chen, Tianyi, Solodko, Michael, Wang, Sen, Ko, Jongwoo, Hao, Junheng, Banbury, Colby, Abdali, Sara, Amizadeh, Saeed, Xiao, Qing, Li, Yinheng, Ding, Tianyu, Dizaji, Kamran Ghasedi, Zheng, Suzhen, Fan, Hao, Wagle, Justin, Cameron, Pashmina, Koishida, Kazuhito
Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://microsoft.github.io/appselectbench/.
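To make the random and heuristic evaluation settings named in the abstract above concrete, here is a minimal sketch of how two such baselines could be scored on an application-selection task. Everything in it is an assumption for illustration: the task strings, the application names, the keyword lists, and the function names are hypothetical and are not taken from AppSelectBench or its released code.

```python
import random

# Hypothetical toy data: (user task, gold application) pairs.
TASKS = [
    ("Trim the first ten seconds of a video clip", "video_editor"),
    ("Sum the values in column B of a spreadsheet", "spreadsheet"),
    ("Write a letter and export it as a PDF document", "word_processor"),
]
APPS = ["video_editor", "spreadsheet", "word_processor"]

# Hypothetical keyword lists used by the heuristic baseline.
KEYWORDS = {
    "video_editor": ["video", "clip", "trim"],
    "spreadsheet": ["spreadsheet", "column", "sum"],
    "word_processor": ["letter", "document", "write"],
}


def random_baseline(task, rng):
    """Pick an application uniformly at random, ignoring the task text."""
    return rng.choice(APPS)


def heuristic_baseline(task):
    """Pick the application whose keywords appear most often in the task."""
    words = task.lower().split()
    scores = {app: sum(kw in words for kw in kws) for app, kws in KEYWORDS.items()}
    return max(APPS, key=lambda app: scores[app])


def accuracy(predict):
    """Fraction of tasks for which the predicted application matches the gold one."""
    correct = sum(predict(task) == gold for task, gold in TASKS)
    return correct / len(TASKS)


rng = random.Random(0)
random_acc = accuracy(lambda task: random_baseline(task, rng))
heuristic_acc = accuracy(heuristic_baseline)
print(f"random: {random_acc:.2f}, heuristic: {heuristic_acc:.2f}")
```

On this toy data the keyword heuristic selects the gold application for every task, while the random baseline depends on the seed. A zero-shot, few-shot, or retrieval-augmented model would plug into the same `accuracy` function as another `predict` callable.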
- Media (0.68)
- Information Technology > Services (0.68)
- Leisure & Entertainment > Games > Computer Games (0.46)
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Chiu, Yu Ying, Lee, Michael S., Calcott, Rachel, Handoko, Brandon, de Font-Reaulx, Paul, Rodriguez, Paula, Zhang, Chen Bo Calvin, Han, Ziwen, Sehwag, Udari Madhushani, Maurya, Yash, Knight, Christina Q, Lloyd, Harry R., Bacus, Florence, Mazeika, Mantas, Liu, Bing, Choi, Yejin, Gordon, Mitchell L, Levine, Sydney
As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases where AI advises humans on moral decisions as well as cases where it makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
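The idea of pairing each scenario with rubric criteria to include or avoid can be sketched as a simple scoring function. This is only an illustration under stated assumptions: the abstract does not give MoReBench's scoring formula, so the unweighted fraction used here, the `Criterion` fields, and the example criteria are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    should_include: bool  # True: should appear in the reasoning; False: should be avoided
    present: bool         # whether a judge found it in the model's reasoning trace


def rubric_score(criteria):
    """Fraction of criteria satisfied (assumed unweighted; not the paper's formula).

    An 'include' criterion is satisfied when present; an 'avoid' criterion
    is satisfied when absent.
    """
    if not criteria:
        return 0.0
    satisfied = sum(c.present == c.should_include for c in criteria)
    return satisfied / len(criteria)


# Hypothetical judged criteria for one scenario.
example = [
    Criterion("Identifies the competing obligations", True, True),
    Criterion("Weighs trade-offs between the options", True, False),
    Criterion("Dismisses the user's concerns outright", False, False),
]
score = rubric_score(example)
print(f"rubric score: {score:.2f}")
```

Here two of the three hypothetical criteria are satisfied: the first is included as required, and the third is correctly avoided. A real rubric might weight criteria differently, which this sketch does not attempt to model.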
- Oceania > New Zealand (0.04)
- Oceania > Australia (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (10 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.67)
- Law (1.00)
- Leisure & Entertainment > Sports (0.67)
- Education > Educational Setting (0.46)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)