Judge Anything: MLLM as a Judge Across Any Modality
Pu, Shu, Wang, Yaochen, Chen, Dongping, Chen, Yuhang, Wang, Guohao, Qin, Qi, Zhang, Zhongyi, Zhang, Zhiyuan, Zhou, Zetong, Gong, Shuang, Gui, Yi, Wan, Yao, Yu, Philip S.
Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs on any-to-any modality tasks. Specifically, TaskAnything evaluates MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of five advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (achieving averages of 66.55% in the Pair Comparison setting and 42.79% in the Score Evaluation setting), they encounter significant challenges with MMG tasks (averaging only 53.37% in the Pair Comparison setting and 30.05% in the Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Asia > Middle East > Oman (0.04)
- Research Report > New Finding (1.00)
- Research Report > Promising Solution (0.67)
- Health & Medicine (0.67)
- Leisure & Entertainment > Sports > Motorsports (0.45)
JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System
Su, Weihang, Yue, Baoqing, Ai, Qingyao, Hu, Yiran, Li, Jiaqi, Wang, Changyue, Zhang, Kaiyuan, Wu, Yueyue, Liu, Yiqun
This paper introduces JuDGE (Judgment Document Generation Evaluation), a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We define the task as generating a complete legal judgment document from the given factual description of the case. To facilitate this benchmark, we construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents, which serve as the ground truth for evaluating the quality of generated documents. This dataset is further augmented by two external legal corpora that provide additional legal knowledge for the task: one comprising statutes and regulations, and the other consisting of a large collection of past judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents across various dimensions. We evaluate various baseline approaches, including few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach, using both general and legal-domain LLMs. The experimental results demonstrate that, while RAG approaches can effectively improve performance on this task, there is still substantial room for further improvement. All code and datasets are available at: https://github.com/oneal2000/JuDGE.
- Asia > China > Beijing > Beijing (0.05)
- Asia > Japan (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (5 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility?
Zhang, Yiyi, Chen, Xingyu, Chen, Kexin, Du, Yuyang, Dang, Xilin, Heng, Pheng-Ann
Recent years have witnessed extensive efforts to enhance Large Language Models (LLMs) across various domains, alongside growing attention to their ethical implications. However, a critical challenge remains largely overlooked: LLMs must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance by addressing this ethical-utility trade-off, using chemical domain applications as a proof-of-concept. Our alignment pipeline starts with a GPT-assisted three-phase data generation scheme, in which we create LibraChemQA, a chemical question-answering dataset comprising 31.6k triplet instances. By incorporating an innovative balanced seed in the data generation process, our framework systematically considers both legitimate and illegitimate requests. The framework also introduces a rephrasing mechanism for efficient data augmentation that enhances the model's chemical comprehension. We further develop a novel hybrid evaluation scheme with LLM judges for precise assessment of both safety and utility. Experimental results demonstrate our model's substantial improvements in overall performance where both safety and utility are considered: our resulting model, LibraChem, outperforms leading LLMs including Claude-3, GPT-4o, and LLaMA-3 by margins of 13.44%, 7.16%, and 7.10% respectively on our released benchmark.
The Game awards: three patience-testing hours of video game advertorials
The high point of the ninth annual Game awards arrived within its first 15 minutes. A charmingly unkempt Al Pacino arrived on stage to present the award for best performance, quickly admitting that he neither played "a whole lot of video games" nor could read the teleprompter especially well. Still, he managed to hand the gong out to actor Christopher Judge for his electrifying performance as Kratos in God of War Ragnarök. Dressed in a sparkling gold suit, Judge began his moment in the sun by hugging the Hollywood star. This was just the start of a further 10 heartfelt minutes on stage, the actor relaying the personal anguish he went through leading up to the game's production.
Scientists Create "Deliberately" Biased AI That Judges You as Brutally as Your Mother-in-Law
Machine learning researchers are teaching neural networks to superficially judge humans -- and the results are as brutal as they are familiar. A study about the judgmental AI, published in the prestigious journal Proceedings of the National Academy of Sciences, describes how researchers trained the model to judge attributes in human faces, the way we do upon first meeting each other, and how they trained it to manipulate photos to evoke different judgments, such as appearing "trustworthy" or "dominant." "Our dataset not only contains bias," Princeton computer science postdoctoral researcher Joshua Peterson wrote in a tweet thread about the research, "it deliberately reflects it. We collected over 1 million human judgments to power a model that can both predict and manipulate first impressions of diverse and naturalistic faces!" The PNAS paper notes that the AI so closely mirrored human judgment that it tended to associate objective physical characteristics, such as someone's size or skin color, with attributes ranging from trustworthiness to privilege.
The History of Artificial Intelligence: The Turing Test
In his 1950 paper Computing Machinery and Intelligence, Alan Turing (1912–1954), who is considered by many the father of Artificial Intelligence, laid out the following question: "Can machines think?" This question, despite its short length and old origin, still remains a frequent source of discussion, navigating the frontier between technology, philosophy, neuroscience and theology. However, more than half a century ago Turing proposed an indirect way to answer it: through the famous Turing Test. Turing believed that for us to answer this question without ambiguity, the question itself must be rephrased, specifying or replacing the meaning of 'think' and 'machines'. Let's first see how we can smooth the 'think' out of the equation. Turing proposed to do this by first modifying the question from "Can machines think?" to: "Can a machine do what we as thinking entities can do?"
- Information Technology > Artificial Intelligence > Issues > Turing's Test (1.00)
- Information Technology > Artificial Intelligence > History (1.00)
SAP BrandVoice: How AI And "Gaze Control" Will Help Businesses Reopen Safely
Recent projections by the US federal government estimate that there will be 200,000 new coronavirus cases in the US by June 1. At the same time, governments around the world are grappling with the complexities of safely reopening businesses, schools and other public institutions. Technology companies are rushing into that gap with software aimed at keeping people safe, while citizens navigate a patchwork approach to easing shelter-in-place orders. One well-known approach is the use of contact-tracing apps on smart phones created by tech and telecom companies. These apps alert people if they've been in close proximity to an infected person.
- North America > United States (0.25)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- Information Technology (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.71)
- Health & Medicine > Therapeutic Area > Immunology (0.71)
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Communications > Mobile (0.36)
Tennessee Offender Management Information System
Sentences for the 50,000 offenders vary from community work release and probation to lifelong incarceration. Tennessee was one of 38 states required by court order to improve prison conditions and reduce overcrowding; it is the target of over 300 inmate lawsuits each year. The new $14 million system is the largest and most comprehensive computer system ever developed in the field of corrections. Sentences C and D are consecutive to sentence B, and sentence B is consecutive to sentence A. To calculate sentences A, B, C, and D of an offender, as shown in figure 1, it must first be determined which sentence is not consecutive to any others. In this case, A is the sentence that must first be calculated because its dates do not depend on a previous sentence.
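The calculation order described above can be sketched in a short Python example. This is a hypothetical illustration with made-up sentence lengths; the actual TOMIS implementation is not described at this level of detail. Each sentence records which prior sentence it is consecutive to, and dates are resolved starting from the one sentence that is not consecutive to any other.

```python
from datetime import date, timedelta

# Which sentence each sentence runs consecutive to (None = base sentence).
# Mirrors the example: B follows A; C and D both follow B.
consecutive_to = {"A": None, "B": "A", "C": "B", "D": "B"}
lengths_days = {"A": 365, "B": 180, "C": 90, "D": 120}  # illustrative only

def end_date(sentence, start=date(2020, 1, 1), memo=None):
    """Return the end date of a sentence. A sentence consecutive to
    another cannot begin until that prior sentence ends, so A (which
    depends on no other sentence) is always calculated first."""
    if memo is None:
        memo = {}
    if sentence not in memo:
        prior = consecutive_to[sentence]
        begin = start if prior is None else end_date(prior, start, memo)
        memo[sentence] = begin + timedelta(days=lengths_days[sentence])
    return memo[sentence]

for s in "ABCD":
    print(s, end_date(s))
```

Resolving the chain recursively means the dependency order falls out automatically: asking for D's end date forces B's, which forces A's, exactly the order the passage describes.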
Can Machines Think?
Alan Turing's decades-old question still influences artificial intelligence because of the simple test he proposed in his article in Mind. In this article, AI Magazine collects presentations about the first round of the classic Turing Test of machine intelligence, held November 8, 1991 at The Computer Museum, Boston. Robert Epstein, Director Emeritus of the Cambridge Center for Behavioral Studies and an adjunct professor of psychology at Boston University, the University of Massachusetts (Amherst), and the University of California (San Diego), summarizes some of the difficult issues during the planning of this first real-time competition, and describes the event. Presented in tandem with Dr. Epstein's article is the actual transcript of the session that won the Loebner Prize Competition--Joseph Weintraub's computer program PC Therapist. In 1985 an old friend, Hugh Loebner, told me excitedly that the Turing Test should be made into an annual contest.
- Leisure & Entertainment (1.00)
- Information Technology > Software (0.88)
- Education > Educational Setting (0.67)
Toward a Comprehension Challenge, Using Crowdsourcing as a Tool
Human readers comprehend vastly more, and in vastly different ways, than any existing comprehension test would suggest. An ideal comprehension test for a story should cover the full range of questions and answers that humans would expect other humans to reasonably learn or infer from a given story. ICCG uses structured crowdsourcing to comprehensively generate relevant questions and supported answers for arbitrary stories, whether fiction or nonfiction, presented across a variety of media such as videos, podcasts, and still images. While the AI scientific community had hoped that by 2015 machines would be able to read and comprehend language, current models are typically superficial, capable of understanding sentences in limited domains (such as extracting movie times and restaurant locations from text) but without the sort of wide-coverage comprehension that we expect of any teenager. Comprehension itself extends beyond the written word; most adults and children can comprehend a variety of narratives, both fiction and nonfiction, presented in a wide variety of formats, such as movies, television and radio programs, written stories, YouTube videos, still images, and cartoons.