Law
3CEL: A corpus of legal Spanish contract clauses
García, Nuria Aldama, Morales, Patricia Marsà, Sánchez, David Betancur, Jiménez, Álvaro Barbero, Nieto, Marta Guerrero, Coll, Pablo Haya, Chozas, Patricia Martín, Ponsoda, Elena Montiel
Information extraction (IE) is defined as the NLP task that deals with the identification of particular pieces of information in unstructured documents [1, 2, 3]. In other words, the main objective of IE is to spot predefined relevant information in raw text. IE includes different subtypes depending on the nature of the information to be extracted. Thus, Named Entity Recognition (NER), Co-Reference Resolution, Relation Extraction or Event Extraction are encompassed under the umbrella of IE [2]. IE encounters specific challenges, particularly with regard to data availability and the need for expert knowledge. First, access to raw data is limited depending on the target domain (e.g.
Governing AI Beyond the Pretraining Frontier
This year, jurisdictions worldwide, including the United States, the European Union, the United Kingdom, and China, are set to enact or revise laws governing frontier AI. Their efforts largely rely on the assumption that increasing model scale through pretraining is the path to more advanced AI capabilities. Yet growing evidence suggests that this "pretraining paradigm" may be hitting a wall and major AI companies are turning to alternative approaches, like inference-time "reasoning," to boost capabilities instead. This paradigm shift presents fundamental challenges for the frontier AI governance frameworks that target pretraining scale as a key bottleneck useful for monitoring, control, and exclusion, threatening to undermine this new legal order as it emerges. This essay seeks to identify these challenges and point to new paths forward for regulation. First, we examine the existing frontier AI regulatory regime and analyze some key traits and vulnerabilities. Second, we introduce the concept of the "pretraining frontier," the capabilities threshold made possible by scaling up pretraining alone, and demonstrate how it could make the regulatory field more diffuse and complex and lead to new forms of competition. Third, we lay out a regulatory approach that focuses on increasing transparency and leveraging new natural technical bottlenecks to effectively oversee changing frontier AI development while minimizing regulatory burdens and protecting fundamental rights. Our analysis provides concrete mechanisms for governing frontier AI systems across diverse technical paradigms, offering policymakers tools for addressing both current and future regulatory challenges in frontier AI.
Review for NeurIPS paper: Investigating Gender Bias in Language Models Using Causal Mediation Analysis
Only the reporting clause is examined, while the that-clause containing the statement is ignored: In previous bias probing studies, the input content is the entire sentence with the complete context. However, in this paper, only the prompt part (reporting clause) is fed to the language model for analysis. Therefore, the proposed intervention setup effectively focuses only on word-level bias probing. In the templates shown in Figure 8 in the Appendix, the verb "cry" or "drive" could embody implicit bias. However, under the current framework, such potential biases are not investigated. Therefore, the study's conclusion that gender bias effects are concentrated in specific components of the model may not generalize well when more complex syntactic and semantic structures and interactions are considered.
Review for NeurIPS paper: Investigating Gender Bias in Language Models Using Causal Mediation Analysis
The paper studies the problem of bias in neural models where the proposed solution is based on causal mediation analysis. The focus of the paper is on pre-trained transformer language models, GPT-2. The proposed method of using mediation analysis for analyzing attention heads and neurons through interventions is novel and interesting, and can be generalized to other types of biases. The paper is well-written, and experiments are thorough.
ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis
Ong, Keane, Mao, Rui, Xing, Frank, Satapathy, Ranjan, Sulaeman, Johan, Cambria, Erik, Mengaldo, Gianmarco
Evaluating corporate sustainability performance is essential to drive sustainable business practices, amid the need for a more sustainable economy. However, this is hindered by the complexity and volume of corporate sustainability data (i.e. sustainability disclosures), and by the limited effectiveness of the NLP tools used to analyse them. To this end, we identify three primary challenges (immateriality, complexity, and subjectivity) that exacerbate the difficulty of extracting insights from sustainability disclosures. To address these issues, we introduce ESGSenticNet, a publicly available knowledge base for sustainability analysis. ESGSenticNet is constructed from a neurosymbolic framework that integrates specialised concept parsing, GPT-4o inference, and semi-supervised label propagation, together with a hierarchical taxonomy. This approach culminates in a structured knowledge base of 44k knowledge triplets, such as ('halve carbon emission', supports, 'emissions control'), for effective sustainability analysis. Experiments indicate that ESGSenticNet, when deployed as a lexical method, more effectively captures relevant and actionable sustainability information from sustainability disclosures compared to state-of-the-art baselines. Besides capturing a high number of unique ESG topic terms, ESGSenticNet outperforms baselines on the ESG relatedness and ESG action orientation of these terms by 26% and 31% respectively. These metrics describe the extent to which topic terms are related to ESG, and depict an action toward ESG. Moreover, when deployed as a lexical method, ESGSenticNet does not require any training, which gives it a key advantage in simplicity for non-technical stakeholders.
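To make the "lexical method" idea concrete, here is a minimal sketch of how a triplet knowledge base could be matched against a disclosure sentence. It is only an illustration under stated assumptions: the `Triplet` type, the `match_concepts` helper, the second triplet, and the plain substring matching are all hypothetical, and the abstract does not describe ESGSenticNet's actual lookup procedure.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    """One (concept, relation, topic) entry; the field names are assumptions."""
    concept: str
    relation: str
    topic: str


KNOWLEDGE_BASE = [
    # This triplet is the example quoted in the abstract.
    Triplet("halve carbon emission", "supports", "emissions control"),
    # Hypothetical entry, invented only to give the sketch a second row.
    Triplet("reduce water usage", "supports", "water management"),
]


def match_concepts(text: str, kb: List[Triplet]) -> List[Triplet]:
    """Return triplets whose concept occurs verbatim in the text.

    Case-insensitive substring matching is an assumption made for brevity;
    it needs no training, which is the property the abstract highlights.
    """
    lowered = text.lower()
    return [t for t in kb if t.concept in lowered]


hits = match_concepts("We pledge to halve carbon emission by 2030.", KNOWLEDGE_BASE)
for hit in hits:
    print(hit.concept, hit.relation, hit.topic)
```

A real system would likely normalise word forms before matching, so that "halving carbon emissions" also hits the first entry.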
Beyond Benchmarks: On The False Promise of AI Regulation
Stanovsky, Gabriel, Keydar, Renana, Perl, Gadi, Habba, Eliya
The rapid advancement of artificial intelligence (AI) systems in critical domains like healthcare, justice, and social services has sparked numerous regulatory initiatives aimed at ensuring their safe deployment. Current regulatory frameworks, exemplified by recent US and EU efforts, primarily focus on procedural guidelines while presuming that scientific benchmarking can effectively validate AI safety, similar to how crash tests verify vehicle safety or clinical trials validate drug efficacy. However, this approach fundamentally misunderstands the unique technical challenges posed by modern AI systems. Through systematic analysis of successful technology regulation case studies, we demonstrate that effective scientific regulation requires a causal theory linking observable test outcomes to future performance - for instance, how a vehicle's crash resistance at one speed predicts its safety at lower speeds. We show that deep learning models, which learn complex statistical patterns from training data without explicit causal mechanisms, preclude such guarantees. This limitation renders traditional regulatory approaches inadequate for ensuring AI safety. Moving forward, we call for regulators to reckon with this limitation, and propose a preliminary two-tiered regulatory framework that acknowledges these constraints: mandating human oversight for high-risk applications while developing appropriate risk communication strategies for lower-risk uses. Our findings highlight the urgent need to reconsider fundamental assumptions in AI regulation and suggest a concrete path forward for policymakers and researchers.
Be Intentional About Fairness!: Fairness, Size, and Multiplicity in the Rashomon Set
Dai, Gordon, Ravishankar, Pavan, Yuan, Rachel, Neill, Daniel B., Black, Emily
This phenomenon, often called the Rashomon effect [7], predictive multiplicity [22], or model multiplicity [5], has wide-ranging implications for both understanding and improving fairness, as these equally accurate models often differ substantially in other properties such as fairness [21, 28] or model simplicity [29-31]. As prior work has pointed out, this multiplicity of models can be viewed as both a fairness opportunity and a concern [5, 10]. On the positive side, legal scholarship has pointed out that model multiplicity is relevant to how U.S. anti-discrimination law is interpreted and enforced, and specifically, that it can strengthen the disparate impact doctrine to more effectively combat algorithmic discrimination [3]. In a recent paper, Black et al. [3] suggest that the phenomenon of model multiplicity could support a reading of the disparate impact doctrine that requires companies to proactively search the set of equally accurate models for less discriminatory alternatives that have equivalent accuracy to a base model deemed acceptable for deployment from a model performance perspective. On the negative side, several scholars have pointed out that facially similar models, with equivalent accuracy but differences in their individual predictions, can suggest that some model decisions are arbitrary since they seem to be made on the basis of model choice that does not impact performance (e.g., a <1% change in a model's training set accuracy) [2, 17, 22]. This arbitrariness can impact model explanations and recourse as well: individuals with decisions that are unstable across small model changes may not receive reliable explanations for their model outcome, or ways to change it [4, 6, 25]. Further, if there is a group-based asymmetry of arbitrariness (e.g., if female loan applicants have more arbitrariness in their decisions than male loan applicants), this could lead to a group-based equity concern in and of itself.
Understanding the extent of the benefits and risks of model multiplicity relies upon an understanding of the properties of the Rashomon set, or the set of approximately equally accurate models for a given prediction task, i.e., equally accurate up to
Assessing and Predicting Air Pollution in Asia: A Regional and Temporal Study (2018-2023)
Rahman, Anika, Khatun, Mst. Taskia
This study analyzes and predicts air pollution in Asia, focusing on PM 2.5 levels from 2018 to 2023 across five regions: Central, East, South, Southeast, and West Asia. South Asia emerged as the most polluted region: Bangladesh, India, and Pakistan consistently had the highest PM 2.5 levels, and death rates were especially high in Nepal, Pakistan, and India. East Asia showed the lowest pollution levels. K-means clustering categorized countries into high, moderate, and low pollution groups. The ARIMA model effectively predicted 2023 PM 2.5 levels (MAE: 3.99, MSE: 33.80, RMSE: 5.81, R: 0.86). The findings emphasize the need for targeted interventions to address severe pollution and health risks in South Asia.
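For readers unfamiliar with the error metrics quoted above, the following sketch shows how MAE, MSE, and RMSE are conventionally computed from a forecast. The function name and the sample values are hypothetical and are not taken from the study; only the final line uses the reported figures, to check that the reported RMSE is the square root of the reported MSE.

```python
import math


def forecast_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return mae, mse, math.sqrt(mse)


# Hypothetical PM 2.5 observations and forecasts (not data from the study).
actual = [50.0, 40.0, 30.0, 20.0]
predicted = [48.0, 43.0, 30.0, 16.0]
mae, mse, rmse = forecast_errors(actual, predicted)
print(mae, mse, round(rmse, 4))

# Consistency check on the reported figures: sqrt(33.80) rounds to 5.81.
print(round(math.sqrt(33.80), 2))
```

Because RMSE squares residuals before averaging, it penalises large misses more heavily than MAE, which is why the reported RMSE (5.81) exceeds the reported MAE (3.99).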
The Potential of Large Language Models in Supply Chain Management: Advancing Decision-Making, Efficiency, and Innovation
Aghaei, Raha, Kiaei, Ali A., Boush, Mahnaz, Vahidi, Javad, Barzegar, Zeynab, Rofoosheh, Mahan
The integration of large language models (LLMs) into supply chain management (SCM) is revolutionizing the industry by improving decision-making, predictive analytics, and operational efficiency. This white paper explores the transformative impact of LLMs on various SCM functions, including demand forecasting, inventory management, supplier relationship management, and logistics optimization. By leveraging advanced data analytics and real-time insights, LLMs enable organizations to optimize resources, reduce costs, and improve responsiveness to market changes. Key findings highlight the benefits of integrating LLMs with emerging technologies such as IoT, blockchain, and robotics, which together create smarter and more autonomous supply chains. Ethical considerations, including bias mitigation and data protection, are taken into account to ensure fair and transparent AI practices. In addition, the paper discusses the need to educate the workforce on how to manage new AI-driven processes and the long-term strategic benefits of adopting LLMs. Strategic recommendations for SCM professionals include investing in high-quality data management, promoting cross-functional collaboration, and aligning LLM initiatives with overall business goals. The findings highlight the potential of LLMs to drive innovation, sustainability, and competitive advantage in the ever-changing supply chain management landscape.
I-trustworthy Models. A framework for trustworthiness evaluation of probabilistic classifiers
Vashistha, Ritwik, Farahi, Arya
As probabilistic models continue to permeate various facets of our society and contribute to scientific advancements, it becomes necessary to go beyond traditional metrics such as predictive accuracy and error rates and assess their trustworthiness. Grounded in the competence-based theory of trust, this work formalizes the I-trustworthy framework, a novel framework for assessing the trustworthiness of probabilistic classifiers for inference tasks by linking local calibration to trustworthiness. To assess I-trustworthiness, we use the local calibration error (LCE) and develop a hypothesis-testing method. This method utilizes a kernel-based test statistic, Kernel Local Calibration Error (KLCE), to test local calibration of a probabilistic classifier. This study provides theoretical guarantees by offering convergence bounds for an unbiased estimator of KLCE. Additionally, we present a diagnostic tool designed to identify and measure biases in cases of miscalibration. The effectiveness of the proposed test statistic is demonstrated through its application to both simulated and real-world datasets. Finally, the LCE of related recalibration methods is studied, and we provide evidence that existing methods are insufficient to achieve I-trustworthiness.
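As background on what "calibration" measures, the sketch below computes the standard binned expected calibration error (ECE) for a binary classifier. This is explicitly not the KLCE statistic from the abstract, whose definition is not given here; it is a simpler, global measure, included only to show the kind of gap between predicted probability and observed frequency that calibration tests examine. The function name and the sample inputs are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions.

    probs  : predicted probabilities of the positive class, each in [0, 1]
    labels : observed outcomes, each 0 or 1
    Returns the bin-size-weighted mean of |observed rate - mean predicted prob|.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(observed - mean_prob)
    return ece


# Hypothetical inputs: predictions of 0.8 that come true 4 times out of 5.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
```

A global score like this can be near zero even when the classifier is miscalibrated in particular regions of the input space, which is the motivation for local notions of calibration such as the one the abstract describes.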