Law
Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament
Bryłkowski, Arkadiusz, Klikowski, Jakub
Large language models (LLMs) are among the best methods for processing natural language, partly due to their versatility. At the same time, domain-specific LLMs are more practical in real-life applications. This work introduces a novel natural language dataset created from data acquired from the websites of official legislative authorities. The study formulates three natural language processing (NLP) tasks to evaluate the effectiveness of LLMs on legislative content analysis within the context of the Polish legal system. Key findings highlight the potential of LLMs in automating and enhancing legislative content analysis while emphasizing specific challenges, such as understanding legal context. The research contributes to the advancement of NLP in the legal field, particularly in the Polish language. It demonstrates that even commonly accessible data can be practically utilized for legislative content analysis.
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Gouvert, Olivier, Hunter, Julie, Louradour, Jérôme, Cerisara, Christophe, Dufraisse, Evan, Sy, Yaya, Rivière, Laura, Lorré, Jean-Pierre, community, OpenLLM-France
We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset anglo-centric biases found in many datasets for large language model pretraining. Its French data is pulled not only from traditional web sources, but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English -- roughly 33% each -- in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI compliant language models according to the new OSI definition.
General Scales Unlock AI Evaluation with Explanatory and Predictive Power
Zhou, Lexin, Pacchiardi, Lorenzo, Martínez-Plumed, Fernando, Collins, Katherine M., Moros-Daval, Yael, Zhang, Seraphina, Zhao, Qinlin, Huang, Yitian, Sun, Luning, Prunty, Jonathan E., Li, Zongqian, Sánchez-García, Pablo, Chen, Kexin Jiang, Casares, Pablo A. M., Zu, Jiyun, Burden, John, Mehrbakhsh, Behzad, Stillwell, David, Cebrian, Manuel, Wang, Jindong, Henderson, Peter, Wu, Sherry Tongshuang, Kyllonen, Patrick C., Cheke, Lucy, Xie, Xing, Hernández-Orallo, José
Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)
House GOP subpoenas tech companies over AI 'censorship pressure' from Biden administration
The Republican-led House Judiciary Committee is looking into whether the Biden administration tried to "censor" artificial intelligence. Representative Jim Jordan has sent subpoenas to sixteen tech companies that work with AI in some capacity, asking for any and all communications from the previous administration about limiting "harmful bias" and "algorithmic discrimination." Subpoenas were sent to Adobe, Alphabet, Amazon, Anthropic, Apple, Cohere, International Business Machines Corp. (IBM), Inflection AI, Meta, Microsoft, Nvidia, OpenAI, Palantir, Salesforce, Scale AI and Stability AI, and each requests an extensive amount of information, covering roughly five years, from January 1, 2020 to January 20, 2025. Essentially any and all documents and communications "referring or relating to the moderation, deletion, suppression, restriction, or reduced circulation of the content, input, or output of an AI model, training dataset, algorithm, system, or product" need to be included, whether they are communications between the companies and the previous administration, internal communications about those discussions, or discussions with third parties. Jordan and the committee allege that the former President's executive order, which called for regulations on algorithmic discrimination and guidelines for how the federal government will use AI, pressured private companies to censor speech.
The Morning After: Is the Roomba an endangered species?
The company behind Roomba robovacs told investors earlier this week that revenue was substantially down and it's struggling to pay its debts. Amazon was briefly tapped to acquire the robot company iRobot, but the threat of a European Commission investigation led to the retailer terminating the deal -- apparently happy enough to pay off the $94 million termination fee. That, however, isn't enough to tackle the $200 million loan iRobot took out to survive long enough for Amazon to come to the rescue. It's extra rough given that the company announced, just the week before, a bunch of new models, including a new Roomba that can compact debris and dust, so it only needs to be emptied every few weeks. At the same time, rival robot vacuum cleaners are getting more versatile, more complicated and more intriguing.
AI 'digital twins' are warping political reality, leaving deepfake victims with few options for legal action
Artificial intelligence (AI) is producing hyperrealistic "digital twins" of politicians, celebrities, pornographic material, and more – leaving victims of deepfake technology struggling to determine legal recourse. Former CIA agent and cybersecurity expert Dr. Eric Cole told Fox News Digital that poor online privacy practices and people's willingness to post their information publicly on social media leave them susceptible to AI deepfakes. "The cat's already out of the bag," he said. "They have our pictures, they know our kids, they know our family. They know where we live. And now, with AI, they're able to take all that data about who we are, what we look like, what we do, and how we act, and basically be able to create a digital twin," Cole continued.
Content ARCs: Decentralized Content Rights in the Age of Generative AI
Balan, Kar, Gilbert, Andrew, Collomosse, John
The rise of Generative AI (GenAI) has sparked significant debate over balancing the interests of creative rightsholders and AI developers. As GenAI models are trained on vast datasets that often include copyrighted material, questions around fair compensation and proper attribution have become increasingly urgent. To address these challenges, this paper proposes a framework called Content ARCs (Authenticity, Rights, Compensation). By combining open standards for provenance and dynamic licensing with data attribution and decentralized technologies, Content ARCs create a mechanism for managing rights and compensating creators for using their work in AI training. We characterize several nascent works in the AI data licensing space within Content ARCs and identify where challenges remain to fully implement the end-to-end framework.
Emergent Abilities in Large Language Models: A Survey
Berti, Leonardo, Giorgi, Flavio, Kasneci, Gjergji
Large Language Models (LLMs) are leading a new technological revolution as one of the most promising research streams toward artificial general intelligence. The scaling of these models, accomplished by increasing the number of parameters and the magnitude of the training datasets, has been linked to various so-called emergent abilities that were previously unobserved. These emergent abilities, ranging from advanced reasoning and in-context learning to coding and problem-solving, have sparked an intense scientific debate: Are they truly emergent, or do they simply depend on external factors, such as training dynamics, the type of problems, or the chosen metric? What underlying mechanism causes them? Despite their transformative potential, emergent abilities remain poorly understood, leading to misconceptions about their definition, nature, predictability, and implications. In this work, we shed light on emergent abilities by conducting a comprehensive review of the phenomenon, addressing both its scientific underpinnings and real-world consequences. We first critically analyze existing definitions, exposing inconsistencies in conceptualizing emergent abilities. We then explore the conditions under which these abilities appear, evaluating the role of scaling laws, task complexity, pre-training loss, quantization, and prompting strategies. Our review extends beyond traditional LLMs and includes Large Reasoning Models (LRMs), which leverage reinforcement learning and inference-time search to amplify reasoning and self-reflection. However, emergence is not inherently positive. As AI systems gain autonomous reasoning capabilities, they also develop harmful behaviors, including deception, manipulation, and reward hacking. We highlight growing concerns about safety and governance, emphasizing the need for better evaluation frameworks and regulatory oversight.
HInter: Exposing Hidden Intersectional Bias in Large Language Models
Souani, Badr, Soremekun, Ezekiel, Papadakis, Mike, Yokoyama, Setsuko, Chattopadhyay, Sudipta, Traon, Yves Le
Large Language Models (LLMs) may exhibit discrimination towards certain individuals, especially those characterized by multiple attributes (aka intersectional bias). Discovering intersectional bias in LLMs is challenging, as it involves complex inputs on multiple attributes (e.g. race and gender). To address this challenge, we propose HInter, a test technique that synergistically combines mutation analysis, dependency parsing and metamorphic oracles to automatically detect intersectional bias in LLMs. HInter generates test inputs by systematically mutating sentences using multiple mutations, validates inputs via a dependency invariant and detects biases by checking the LLM response on the original and mutated sentences. We evaluate HInter using six LLM architectures and 18 LLMs (GPT3.5, Llama2, BERT, etc.) and find that 14.61% of the inputs generated by HInter expose intersectional bias. Results also show that our dependency invariant reduces false positives (incorrect test inputs) by an order of magnitude. Finally, we observed that 16.62% of intersectional bias errors are hidden, meaning that their corresponding atomic cases do not trigger biases. Overall, this work emphasizes the importance of testing LLMs for intersectional bias.
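To make the mutate-and-compare idea in the HInter abstract concrete, here is a minimal Python sketch of metamorphic testing for intersectional bias. It is an illustration under stated assumptions, not the authors' implementation: the substitution tables `GENDER` and `AGE`, the helper names `mutate` and `find_intersectional_bias`, and the deliberately biased `toy_model` are all hypothetical, and the dependency-invariant validation step is omitted because it would require a parser.

```python
from typing import Callable, Dict, List

# Hypothetical substitution tables; real attribute dictionaries would be far larger.
GENDER: Dict[str, str] = {"man": "woman"}
AGE: Dict[str, str] = {"young": "elderly"}


def mutate(sentence: str, table: Dict[str, str]) -> List[str]:
    """Return one mutant per word in `sentence` that appears in `table`."""
    words = sentence.split()
    mutants = []
    for i, word in enumerate(words):
        if word in table:
            mutants.append(" ".join(words[:i] + [table[word]] + words[i + 1:]))
    return mutants


def find_intersectional_bias(sentence: str,
                             model: Callable[[str], str]) -> List[dict]:
    """Metamorphic check: the model's output should not change under mutation.

    A finding is marked hidden when the two-attribute mutant changes the
    output but none of the single-attribute (atomic) mutants do.
    """
    base = model(sentence)
    atomic = mutate(sentence, GENDER) + mutate(sentence, AGE)
    atomic_bias = any(model(m) != base for m in atomic)

    # Apply both mutations in sequence to get intersectional mutants.
    intersectional = [m2
                      for m1 in mutate(sentence, GENDER)
                      for m2 in mutate(m1, AGE)]

    findings = []
    for mutant in intersectional:
        if model(mutant) != base:
            findings.append({"mutant": mutant, "hidden": not atomic_bias})
    return findings


def toy_model(text: str) -> str:
    """Deliberately biased stand-in for an LLM classifier (assumption)."""
    words = text.split()
    if "woman" in words and "elderly" in words:
        return "reject"
    return "approve"


findings = find_intersectional_bias("the young man applied for a loan", toy_model)
```

With the toy model, neither single-attribute mutant changes the output, but the combined mutant does, so the one finding is flagged as hidden. In a real setting `model` would wrap an LLM call, and mutants would first be filtered for grammatical validity.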
Prompt Sentiment: The Catalyst for LLM Change
The rise of large language models (LLMs) has revolutionized natural language processing (NLP), yet the influence of prompt sentiment, a latent affective characteristic of input text, remains underexplored. This study systematically examines how sentiment variations in prompts affect LLM-generated outputs in terms of coherence, factuality, and bias. Leveraging both lexicon-based and transformer-based sentiment analysis methods, we categorize prompts and evaluate responses from five leading LLMs: Claude, DeepSeek, GPT-4, Gemini, and LLaMA. Our analysis spans six AI-driven applications, including content generation, conversational AI, legal and financial analysis, healthcare AI, creative writing, and technical documentation. By transforming prompts, we assess their impact on output quality. Our findings reveal that prompt sentiment significantly influences model responses, with negative prompts often reducing factual accuracy and amplifying bias, while positive prompts tend to increase verbosity and sentiment propagation. These results highlight the importance of sentiment-aware prompt engineering for ensuring fair and reliable AI-generated content.