AITopics | please read

Collaborating Authors

please read

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

2cb40fc022ca7bdc1a9a78b793661284-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-10-2026, 07:04:50 GMT

dataset, huggingface, llm, (15 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(4 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Law > Litigation (1.00)
Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models

Neural Information Processing SystemsOct-9-2025, 21:58:40 GMT

Large language models (LLMs) have made significant progress in natural language processing tasks and demonstrate considerable potential in the legal domain.

dataset, huggingface, llm, (15 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(4 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Law > Litigation (1.00)
Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

e52ad5c9f751f599492b4f087ed7ecfc-AuthorFeedback.pdf

Neural Information Processing SystemsAug-20-2025, 07:16:18 GMT

dual model, joint training, valid code, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

Add feedback

Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models

Wu, Junjie, Gu, Gefei, Zheng, Yanan, Yeung, Dit-Yan, Cohan, Arman

arXiv.org Artificial IntelligenceAug-5-2025

Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing -- a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data -- remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on the task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results of 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models like GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights. Our data and code can be found in https://github. com/wujunjie1998/Ref-Long.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.09506

Country:

North America > United States (0.67)
Asia (0.46)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)

Add feedback

Non-literal Understanding of Number Words by Language Models

Tsvilodub, Polina, Gandhi, Kanishk, Zhao, Haoran, Fränken, Jan-Philipp, Franke, Michael, Goodman, Noah D.

arXiv.org Artificial IntelligenceFeb-10-2025

Humans naturally interpret numbers non-literally, effortlessly combining context, world knowledge, and speaker intent. We investigate whether large language models (LLMs) interpret numbers similarly, focusing on hyperbole and pragmatic halo effects. Through systematic comparison with human data and computational models of pragmatic reasoning, we find that LLMs diverge from human interpretation in striking ways. By decomposing pragmatic reasoning into testable components, grounded in the Rational Speech Act framework, we pinpoint where LLM processing diverges from human cognition -- not in prior knowledge, but in reasoning with it. This insight leads us to develop a targeted solution -- chain-of-thought prompting inspired by an RSA model makes LLMs' interpretations more human-like. Our work demonstrates how computational cognitive models can both diagnose AI-human differences and guide development of more human-like language understanding capabilities.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2502.06204

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
North America > United States > Washington > King County > Seattle (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models

Li, Haitao, Chen, You, Ai, Qingyao, Wu, Yueyue, Zhang, Ruizhe, Liu, Yiqun

arXiv.org Artificial IntelligenceNov-26-2024

Large language models (LLMs) have made significant progress in natural language processing tasks and demonstrate considerable potential in the legal domain. However, legal applications demand high standards of accuracy, reliability, and fairness. Applying existing LLMs to legal systems without careful evaluation of their potential and limitations could pose significant risks in legal practice. To this end, we introduce a standardized comprehensive Chinese legal benchmark LexEval. This benchmark is notable in the following three aspects: (1) Ability Modeling: We propose a new taxonomy of legal cognitive abilities to organize different tasks. (2) Scale: To our knowledge, LexEval is currently the largest Chinese legal evaluation dataset, comprising 23 tasks and 14,150 questions. (3) Data: we utilize formatted existing datasets, exam datasets and newly annotated datasets by legal experts to comprehensively evaluate the various capabilities of LLMs. LexEval not only focuses on the ability of LLMs to apply fundamental legal knowledge but also dedicates efforts to examining the ethical issues involved in their application. We evaluated 38 open-source and commercial LLMs and obtained some interesting findings. The experiments and findings offer valuable insights into the challenges and potential solutions for developing Chinese legal systems and LLM evaluation pipelines. The LexEval dataset and leaderboard are publicly available at \url{https://github.com/CSHaitao/LexEval} and will be continuously updated.

dataset, huggingface, llm, (15 more...)

arXiv.org Artificial Intelligence

2409.20288

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > China > Beijing > Beijing (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
(5 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Law > Litigation (1.00)
Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

Wang, Xinpeng, Ma, Bolei, Hu, Chengzhi, Weber-Genzel, Leon, Röttger, Paul, Kreuter, Frauke, Hovy, Dirk, Plank, Barbara

arXiv.org Artificial IntelligenceJul-4-2024

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first-tokens may not consistently reflect the final response output, due to model's diverse response styles such as starting with "Sure" or refusing to answer. Consequently, MCQ evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.

evaluation, multiple-choice question, text output, (15 more...)

arXiv.org Artificial Intelligence

2402.14499

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
(5 more...)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)

Industry: Government > Regional Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning

Li, Jiazheng, Zhao, Runcong, Yang, Yongxin, He, Yulan, Gui, Lin

arXiv.org Artificial IntelligenceDec-14-2023

The remarkable performance of pre-trained large language models has revolutionised various natural language processing applications. Due to huge parametersizes and extensive running costs, companies or organisations tend to transfer the models to the target task by zero-shot prompting techniques. However, the prohibitive costs of tokens and time have hindered their adoption in applications. We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs, thereby reducing token and time costs. This approach could potentially improve task performance during API queries due to better conditional distribution mapping. Evaluated across diverse classification datasets, our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance, and in some cases, even improving it. An ablation study conducted on various LLMs, along with an investigation into the robustness of our prompting strategy to different input ordering, offers valuable insights into the broader applicability of our method across diverse tasks. These findings also suggest a more seamless integration of our method with LLMs through an API.

dataset, overprompt, task input, (16 more...)

arXiv.org Artificial Intelligence

2305.14973

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Republic of Türkiye (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.96)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Large Language Models are biased to overestimate profoundness

Herrera-Berg, Eugenio, Browne, Tomás Vergara, León-Villagrá, Pablo, Vives, Marc-Lluís, Calderon, Cristian Buc

arXiv.org Artificial IntelligenceOct-22-2023

Recent advancements in natural language processing by large language models (LLMs), such as GPT-4, have been suggested to approach Artificial General Intelligence. And yet, it is still under dispute whether LLMs possess similar reasoning abilities to humans. This study evaluates GPT-4 and various other LLMs in judging the profoundness of mundane, motivational, and pseudo-profound statements. We found a significant statement-to-statement correlation between the LLMs and humans, irrespective of the type of statements and the prompting technique used. However, LLMs systematically overestimate the profoundness of nonsensical statements, with the exception of Tk-instruct, which uniquely underestimates the profoundness of statements. Only few-shot learning prompts, as opposed to chain-of-thought prompting, draw LLMs ratings closer to humans. Furthermore, this work provides insights into the potential biases induced by Reinforcement Learning from Human Feedback (RLHF), inducing an increase in the bias to overestimate the profoundness of statements.

5-point scale, llm, profoundness, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2023.emnlp-main.599

2310.14422

Country:

South America > Chile (0.04)
South America > Brazil (0.04)
North America > United States > Florida > Hillsborough County > University (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.69)
Research Report > New Finding (0.47)

Industry: Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Microsoft Sentinel customizable machine learning based anomalies is Generally Available

#artificialintelligenceSep-27-2022, 08:25:21 GMT

Security analysts can use anomalies to reduce investigation and hunting time, as well as detect new and emerging threats. Typically, these benefits come at the cost of a high benign positive rate, but Microsoft Sentinel's customizable anomaly models are tuned by our data science team and trained with the data in your Microsoft Sentinel workspace to reduce the rate, providing out-of-the box value. If security analysts need to tune them further, the process is simple and requires no knowledge of machine learning. In this blog, we will discuss how customizable machine learning based anomalies have improved since Public Preview. Anomalies have their own tab on the Analytics blade!

anomaly, artificial intelligence, machine learning, (12 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.92)
Information Technology > Data Science (0.75)

Add feedback