Goto

Collaborating Authors

 Law


Towards Event Extraction with Massive Types: LLM-based Collaborative Annotation and Partitioning Extraction

arXiv.org Artificial Intelligence

Developing a general-purpose extraction system that can extract events with massive types is a long-standing target in Event Extraction (EE). In doing so, the challenge comes from two aspects: 1) The absence of an efficient and effective annotation method. 2) The absence of a powerful extraction method can handle massive types. For the first challenge, we propose a collaborative annotation method based on Large Language Models (LLMs). Through collaboration among multiple LLMs, it first refines annotations of trigger words from distant supervision and then carries out argument annotation. Next, a voting phase consolidates the annotation preferences across different LLMs. Finally, we create the EEMT dataset, the largest EE dataset to date, featuring over 200,000 samples, 3,465 event types, and 6,297 role types. For the second challenge, we propose an LLM-based Partitioning EE method called LLM-PEE. To overcome the limited context length of LLMs, LLM-PEE first recalls candidate event types and then splits them into multiple partitions for LLMs to extract events. The results in the supervised setting show that LLM-PEE outperforms the state-of-the-art methods by 5.4 in event detection and 6.1 in argument extraction. In the zero-shot setting, LLM-PEE achieves up to 12.9 improvement compared to mainstream LLMs, demonstrating its strong generalization capabilities.


AI Governance InternationaL Evaluation Index (AGILE Index)

arXiv.org Artificial Intelligence

The rapid advancement of Artificial Intelligence (AI) technology is profoundly transforming human society and concurrently presenting a series of ethical, legal, and social issues. The effective governance of AI has become a crucial global concern. Since 2022, the extensive deployment of generative AI, particularly large language models, marked a new phase in AI governance. Continuous efforts are being made by the international community in actively addressing the novel challenges posed by these AI developments. As consensus on international governance continues to be established and put into action, the practical importance of conducting a global assessment of the state of AI governance is progressively coming to light. In this context, we initiated the development of the AI Governance InternationaL Evaluation Index (AGILE Index). Adhering to the design principle, "the level of governance should match the level of development," the inaugural evaluation of the AGILE Index commences with an exploration of four foundational pillars: the development level of AI, the AI governance environment, the AI governance instruments, and the AI governance effectiveness. It covers 39 indicators across 18 dimensions to comprehensively assess the AI governance level of 14 representative countries globally. The index is utilized to delve into the status of AI governance to date in 14 countries for the first batch of evaluation. The aim is to depict the current state of AI governance in these countries through data scoring, assist them in identifying their governance stage and uncovering governance issues, and ultimately offer insights for the enhancement of their AI governance systems.


Multiaccuracy and Multicalibration via Proxy Groups

arXiv.org Machine Learning

As the use of predictive machine learning algorithms increases in high-stakes decision-making, it is imperative that these algorithms are fair across sensitive groups. Unfortunately, measuring and enforcing fairness in real-world applications can be challenging due to missing or incomplete sensitive group data. Proxy-sensitive attributes have been proposed as a practical and effective solution in these settings, but only for parity-based fairness notions. Knowing how to evaluate and control for fairness with missing sensitive group data for newer and more flexible frameworks, such as multiaccuracy and multicalibration, remains unexplored. In this work, we address this gap by demonstrating that in the absence of sensitive group data, proxy-sensitive attributes can provably be used to derive actionable upper bounds on the true multiaccuracy and multicalibration, providing insights into a model's potential worst-case fairness violations. Additionally, we show that adjusting models to satisfy multiaccuracy and multicalibration across proxy-sensitive attributes can significantly mitigate these violations for the true, but unknown, sensitive groups. Through several experiments on real-world datasets, we illustrate that approximate multiaccuracy and multicalibration can be achieved even when sensitive group information is incomplete or unavailable.


UK unions call for action to protect creative industry workers as AI develops

The Guardian

Action is needed to protect workers in creative industries amid huge changes in technology and artificial intelligence, unions have urged. The TUC said there was an urgent need to put in place "proper guardrails" for workers ranging from artists, writers and journalists to teachers and academics. The TUC called for transparency of AI training data to ensure workers know whether their data or image are being used, an opt-in system to protect creative work from commercial data mining unless workers give their permission and consent, and for measures to ensure creative workers are paid fairly for their work when their creative work is used to train AI models. The report also called for an independent regulator to oversee the integration of AI into society and work. The transformative potential of AI was huge, but without adequate regulation, "rapacious tech bosses" would be able to exploit creative workers and cash in on their work, the TUC warned.


Adaptively evaluating models with task elicitation

arXiv.org Artificial Intelligence

Manual curation of evaluation datasets is struggling to keep up with the rapidly expanding capabilities and deployment scenarios of language models. Towards scalable model profiling, we introduce and validate a framework for evaluating LLMs, called Adaptive Evaluations. Adaptive evaluations use scaffolded language models (evaluator agents) to search through a target model's behavior on a domain dataset and create difficult questions (tasks) that can discover and probe the model's failure modes. We find that frontier models lack consistency when adaptively probed with our framework on a diverse suite of datasets and tasks, including but not limited to legal reasoning, forecasting, and online harassment. Generated questions pass human validity checks and often transfer to other models with different capability profiles, demonstrating that adaptive evaluations can also be used to create difficult domain-specific datasets.


Pragmatic Inference Chain (PIC) Improving LLMs' Reasoning of Authentic Implicit Toxic Language

arXiv.org Artificial Intelligence

The rapid development of large language models (LLMs) gives rise to ethical concerns about their performance, while opening new avenues for developing toxic language detection techniques. However, LLMs' unethical output and their capability of detecting toxicity have primarily been tested on language data that do not demand complex meaning inference, such as the biased associations of 'he' with programmer and 'she' with household. Nowadays toxic language adopts a much more creative range of implicit forms, thanks to advanced censorship. In this study, we collect authentic toxic interactions that evade online censorship and that are verified by human annotators as inference intensive. To evaluate and improve LLMs' reasoning of the authentic implicit toxic language, we propose a new prompting method, Pragmatic Inference Chain (PIC), drawn on interdisciplinary findings from cognitive science and linguistics. The PIC prompting significantly improves the success rate of GPT-4o, Llama-3.1-70B-Instruct, and DeepSeek-v2.5 in identifying implicit toxic language, compared to both direct prompting and Chain-of-Thought. In addition, it also facilitates the models to produce more explicit and coherent reasoning processes, hence can potentially be generalized to other inference-intensive tasks, e.g., understanding humour and metaphors.


Measuring Political Preferences in AI Systems: An Integrative Approach

arXiv.org Artificial Intelligence

Measuring Political Preferences in AI Systems - A n Integrative Approach David Rozado Political biases in Large Language Model (LLM) - based artificial intelligence (AI) systems, such as OpenAI ' s ChatGPT or Google ' s Gemini, have been previously reported . While several prior studies have attempted to quantify these biases using political orientation tests, such approaches are limited by potential tests ' calibration biases and constrained response formats that do not reflect real - world human - AI interaction s. This study employs a multi - method approach to assess political bias in leading AI systems, integrating four complementary methodologies: (1) linguistic comparison of AI - generated text with the language used by Republican and Democratic U.S. Congress mem bers, (2) analysis of political viewpoints embedded in AI - generated policy recommendations, (3) sentiment analysis of AI - generated text toward politically affiliated public figures, and (4) standardized political orientation testing. Results indicate a con sistent left - leaning bias across most contemporary AI systems, with arguably varying degrees of intensity. However, this bias is not an inherent feature of LLMs; prior research demonstrates that fine - tuning with politically skewed data can realign these mo dels across the ideological spectrum. The presence of systematic political bias in AI systems poses risks, including reduced viewpoint diversity, increased societal polarization, and the potential for public mistrust in AI technologies. To mitigate these r isks, AI systems should be designed to prioritize factual accuracy while maintaining neutrality on most lawful normative issues. Furthermore, independent monitoring platforms are necessary to ensure transparency, accountability, and responsible AI developm ent. Introduction Recent advancements in AI technology, exemplified by Large Language Models (LLMs) like ChatGPT, represent one of the most significant technological breakthroughs in recent decades. The ability of AI systems to understand and generate human - like natural language has unlocked new possibilities for automation, human - computer interaction, content generation, and information retrieval. However, th ese impressive capabilities ha ve also raised concerns abo ut the potential biases that such systems might harbor [1], [2], [3], [4] . Preliminary evidence has suggested that AI systems exhibit political biases in the textual content they generate [2], [5], [6] .


Holistically Evaluating the Environmental Impact of Creating Language Models

arXiv.org Artificial Intelligence

As the performance of artificial intelligence systems has dramatically increased, so too has the environmental impact of creating these systems. While many model developers release estimates of the power consumption and carbon emissions from the final training runs for their latest models, there is comparatively little transparency into the impact of model development, hardware manufacturing, and total water usage throughout. In this work, we estimate the real-world environmental impact of developing a series of language models, ranging from 20 million to 13 billion active parameters, trained on up to 5.6 trillion tokens each. When accounting for hardware manufacturing, model development, and our final training runs, we find that our series of models released 493 metric tons of carbon emissions, equivalent to powering about 98 homes in the United States for one year, and consumed 2.769 million liters of water, equivalent to about 24.5 years of water usage by a person in the United States, even though our data center is extremely water-efficient. We measure and report the environmental impact of our model development; to the best of our knowledge we are the first to do so for LLMs, and we find that model development, the impact of which is generally not disclosed by most model developers, amounted to 50% of that of training. By looking at detailed time series data for power consumption, we also find that power usage throughout training is not consistent, fluctuating between 15% and 85% of our hardware's maximum power draw, with negative implications for grid-scale planning as demand continues to grow. We close with a discussion on the continued difficulty of estimating the environmental impact of AI systems, and key takeaways for model developers and the public at large. In recent years, the field of artificial intelligence has progressed at an unprecedented pace, driven in large part by the development and deployment of large language and multimodal models.


How Do Consumers Really Choose: Exposing Hidden Preferences with the Mixture of Experts Model

arXiv.org Artificial Intelligence

Understanding consumer choice is fundamental to marketing and management research, as firms increasingly seek to personalize offerings and optimize customer engagement. Traditional choice modeling frameworks, such as multinomial logit (MNL) and mixed logit models, impose rigid parametric assumptions that limit their ability to capture the complexity of consumer decision-making. This study introduces the Mixture of Experts (MoE) framework as a machine learning-driven alternative that dynamically segments consumers based on latent behavioral patterns. By leveraging probabilistic gating functions and specialized expert networks, MoE provides a flexible, nonparametric approach to modeling heterogeneous preferences. Empirical validation using large-scale retail data demonstrates that MoE significantly enhances predictive accuracy over traditional econometric models, capturing nonlinear consumer responses to price variations, brand preferences, and product attributes. The findings underscore MoEs potential to improve demand forecasting, optimize targeted marketing strategies, and refine segmentation practices. By offering a more granular and adaptive framework, this study bridges the gap between data-driven machine learning approaches and marketing theory, advocating for the integration of AI techniques in managerial decision-making and strategic consumer insights.


Evaluating Intelligence via Trial and Error

arXiv.org Artificial Intelligence

Intelligence is a crucial trait for species to find solutions within a limited number of trial-and-error attempts. Building on this idea, we introduce Survival Game as a framework to evaluate intelligence based on the number of failed attempts in a trial-and-error process. Fewer failures indicate higher intelligence. When the expectation and variance of failure counts are both finite, it signals the ability to consistently find solutions to new challenges, which we define as the Autonomous Level of intelligence. Using Survival Game, we comprehensively evaluate existing AI systems. Our results show that while AI systems achieve the Autonomous Level in simple tasks, they are still far from it in more complex tasks, such as vision, search, recommendation, and language. While scaling current AI technologies might help, this would come at an astronomical cost. Projections suggest that achieving the Autonomous Level for general tasks would require $10^{26}$ parameters. To put this into perspective, loading such a massive model requires so many H100 GPUs that their total value is $10^{7}$ times that of Apple Inc.'s market value. Even with Moore's Law, supporting such a parameter scale would take $70$ years. This staggering cost highlights the complexity of human tasks and the inadequacies of current AI technologies. To further investigate this phenomenon, we conduct a theoretical analysis of Survival Game and its experimental results. Our findings suggest that human tasks possess a criticality property. As a result, Autonomous Level requires a deep understanding of the task's underlying mechanisms. Current AI systems, however, do not fully grasp these mechanisms and instead rely on superficial mimicry, making it difficult for them to reach an autonomous level. We believe Survival Game can not only guide the future development of AI but also offer profound insights into human intelligence.