human evaluation score
- Africa > Malawi (0.14)
- North America > United States > California (0.04)
- Europe > France (0.04)
Assessing Color Vision Test in Large Vision-language Models
Ye, Hongfei, Chen, Bin, Liu, Wenxi, Zhang, Yu, Li, Zhao, Ni, Dandan, Chen, Hongyang
With the widespread adoption of large vision-language models, the capacity for color vision in these models is crucial. However, the color vision abilities of large vision-language models have not yet been thoroughly explored. To address this gap, we define a color vision testing task for large vision-language models and construct a dataset (a sample of the data is available in an anonymous GitHub repository: https://anonymous.4open.science/r/color-vision-test-dataset-3BCD) that covers multiple categories of test questions and tasks of varying difficulty levels. Furthermore, we analyze the types of errors made by large vision-language models and propose fine-tuning strategies to enhance their performance in color vision tests.
Legal Evaluations and Challenges of Large Language Models
Wang, Jiaqi, Zhao, Huan, Yang, Zhenyuan, Shu, Peng, Chen, Junhao, Sun, Haobo, Liang, Ruixi, Li, Shixin, Shi, Pengcheng, Ma, Longjun, Liu, Zongjia, Liu, Zhengliang, Zhong, Tianyang, Zhang, Yutong, Ma, Chong, Zhang, Xin, Zhang, Tuo, Ding, Tianli, Ren, Yudan, Liu, Tianming, Jiang, Xi, Zhang, Shu
In this paper, we review legal testing methods based on Large Language Models (LLMs), using the OpenAI o1 model as a case study to evaluate the performance of large models in applying legal provisions. We compare current state-of-the-art LLMs, including open-source models, closed-source models, and models trained specifically for the legal domain. Systematic tests are conducted on English and Chinese legal cases, and the results are analyzed in depth. Through this testing of legal cases from common law systems and China, the paper explores the strengths and weaknesses of LLMs in understanding and applying legal texts, reasoning through legal issues, and predicting judgments. The experimental results highlight both the potential and limitations of LLMs in legal applications, particularly the challenges of interpreting legal language and reasoning accurately about the law. Finally, the paper provides a comprehensive analysis of the advantages and disadvantages of various types of models, offering valuable insights and references for the future application of AI in the legal field.
- Asia > China (0.34)
- North America > United States (0.28)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- Research Report (1.00)
- Overview (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (0.93)
Quality-Diversity through AI Feedback
Bradley, Herbie, Dai, Andrew, Teufel, Hannah, Zhang, Jenny, Oostermeijer, Koen, Bellagente, Marco, Clune, Jeff, Stanley, Kenneth, Schott, Grégory, Lehman, Joel
In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through AI feedback, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.
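The quality-diversity loop described in this abstract can be illustrated with a minimal MAP-Elites-style sketch. This is not the authors' implementation: `map_elites`, `toy_mutate`, `toy_quality`, and `toy_diversity` are hypothetical names, and the three toy functions are simple word-level stand-ins for the LM prompts that QDAIF would use to generate variation and to score quality and diversity.

```python
import random

def map_elites(seed_texts, mutate, quality, diversity,
               n_bins=5, iterations=200, rng=None):
    """Keep the best-scoring text found so far in each diversity bin.

    seed_texts must be non-empty. In QDAIF the mutate, quality, and
    diversity callables would be LM calls; here they are assumed stand-ins.
    """
    rng = rng or random.Random(0)
    archive = {}  # bin index -> (quality score, text)

    def try_insert(text):
        d = min(max(diversity(text), 0.0), 1.0)  # clamp measure to [0, 1]
        b = min(int(d * n_bins), n_bins - 1)     # map measure to a bin
        q = quality(text)
        if b not in archive or q > archive[b][0]:
            archive[b] = (q, text)

    for text in seed_texts:
        try_insert(text)
    for _ in range(iterations):
        # Pick a random elite as parent, vary it, and try to archive the child.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive

# Toy stand-ins (assumptions, not LM feedback).
WORDS = ["dark", "bright", "storm", "sun", "joy", "grief"]
POSITIVE = {"bright", "sun", "joy"}

def toy_mutate(text, rng):
    # Replace one randomly chosen word.
    words = text.split()
    words[rng.randrange(len(words))] = rng.choice(WORDS)
    return " ".join(words)

def toy_quality(text):
    # Fraction of distinct words, as a crude proxy for quality.
    words = text.split()
    return len(set(words)) / len(words)

def toy_diversity(text):
    # Fraction of "positive" words, as a crude sentiment axis.
    words = text.split()
    return sum(w in POSITIVE for w in words) / len(words)

archive = map_elites(["dark storm grief sun"], toy_mutate, toy_quality,
                     toy_diversity, n_bins=5, iterations=300,
                     rng=random.Random(0))
```

Each archive entry holds the highest-quality text seen for one slice of the diversity axis, so the result is a spread of candidates rather than a single best response.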
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.13)
- North America > Canada > British Columbia (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Research Report > New Finding (1.00)
- Research Report > Promising Solution (0.87)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
- Health & Medicine > Consumer Health (1.00)