Law
The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
de la Rosa, Javier, Mikhailov, Vladislav, Zhang, Lemei, Wetjen, Freddy, Samuel, David, Liu, Peng, Braaten, Rolv-Arild, Mæhlum, Petter, Birkenes, Magnus Breder, Kutuzov, Andrey, Enstad, Tita, Brygfjeld, Svein Arne, Gulla, Jon Atle, Oepen, Stephan, Velldal, Erik, Østgulen, Wilfred, Øvrelid, Liljia, Myhre, Aslak Sira
The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
AutoPatent: A Multi-Agent Framework for Automatic Patent Generation
Wang, Qiyao, Ni, Shiwen, Liu, Huaren, Lu, Shule, Chen, Guhong, Feng, Xi, Wei, Chi, Qu, Qiang, Alinejad-Rokny, Hamid, Lin, Yuan, Yang, Min
As the capabilities of Large Language Models (LLMs) continue to advance, the field of patent processing has garnered increased attention within the natural language processing community. However, the majority of research has been concentrated on classification tasks, such as patent categorization and examination, or on short text generation tasks like patent summarization and patent quizzes. In this paper, we introduce a novel and practical task known as Draft2Patent, along with its corresponding D2P benchmark, which challenges LLMs to generate full-length patents averaging 17K tokens based on initial drafts. Patents present a significant challenge to LLMs due to their specialized nature, standardized terminology, and extensive length. We propose a multi-agent framework called AutoPatent which leverages the LLM-based planner agent, writer agents, and examiner agent with PGTree and RRAG to generate lengthy, intricate, and high-quality complete patent documents. The experimental results demonstrate that our AutoPatent framework significantly enhances the ability to generate comprehensive patents across various LLMs. Furthermore, we have discovered that patents generated solely with the AutoPatent framework based on the Qwen2.5-7B model outperform those produced by larger and more powerful LLMs, such as GPT-4o, Qwen2.5-72B, and LLAMA3.1-70B, in both objective metrics and human evaluations. We will make the data and code available upon acceptance at https://github.com/QiYao-Wang/AutoPatent.
Learning to Solve Domain-Specific Calculation Problems with Knowledge-Intensive Programs Generator
Liu, Chengyuan, Wang, Shihang, Qing, Lizhi, Lin, Jun, Zhang, Ji, Wu, Fei, Kuang, Kun
Domain Large Language Models (LLMs) are developed for domain-specific tasks based on general LLMs, but some domain-specific tasks still require professional knowledge to support the needed expertise. In this paper, we investigate knowledge-intensive calculation problems. We find that math problems are challenging for LLMs when they involve complex domain-specific rules and knowledge documents rather than simple formulations of terminology. Therefore, we propose a pipeline, named KIPG (Knowledge-Intensive Programs Generator), to solve domain-specific calculation problems more effectively. It generates knowledge-intensive programs according to the domain-specific documents. For each query, key variables are extracted, and the outcomes that depend on domain knowledge are then calculated with the programs. Through iterative preference alignment, the code generator learns to improve its logical consistency with the domain knowledge. Taking the legal domain as an example, we conduct experiments that demonstrate the effectiveness of our pipeline, along with extensive analysis of its modules. We also find that the code generator is adaptable to other domains without training on the new knowledge.
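The "extract key variables, then compute with a program" step described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the regex extractor stands in for an LLM-based extractor, and the interest rule (1% per month, capped at 24% of the principal) is invented rather than taken from the paper or from any real statute. It is not the authors' KIPG implementation.

```python
import re
from typing import Dict


def extract_variables(query: str) -> Dict[str, float]:
    """Pull the key variables out of a query (regex stand-in for an LLM extractor)."""
    principal = re.search(r"principal of (\d+(?:\.\d+)?)", query)
    months = re.search(r"(\d+) months", query)
    if principal is None or months is None:
        raise ValueError("query does not mention the required variables")
    return {
        "principal": float(principal.group(1)),
        "months": float(months.group(1)),
    }


def overdue_interest(principal: float, months: float) -> float:
    """Hypothetical 'knowledge-intensive program' encoding a made-up rule:
    simple interest at 1% per month, capped at 24% of the principal."""
    uncapped = principal * 0.01 * months
    cap = principal * 0.24
    return min(uncapped, cap)


def answer(query: str) -> float:
    """Extract the variables, then let the program (not the LLM) do the arithmetic."""
    return overdue_interest(**extract_variables(query))


if __name__ == "__main__":
    print(answer("A loan with a principal of 1000 is overdue by 6 months."))
```

The point of the split is that the domain rule lives in inspectable code, so the arithmetic does not depend on the language model reasoning correctly about the rule at query time.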
Temporal Causal Discovery in Dynamic Bayesian Networks Using Federated Learning
Chen, Jianhong, Ma, Ying, Yue, Xubo
Traditionally, learning the structure of a Dynamic Bayesian Network has been centralized, with all data pooled in one location. However, in real-world scenarios, data are often dispersed among multiple parties (e.g., companies, devices) that aim to collaboratively learn a Dynamic Bayesian Network while preserving their data privacy and security. In this study, we introduce a federated learning approach for estimating the structure of a Dynamic Bayesian Network from data distributed horizontally across different parties. We propose a distributed structure learning method that leverages continuous optimization so that only model parameters are exchanged during optimization. Experimental results on synthetic and real datasets reveal that our method outperforms other state-of-the-art techniques, particularly when there are many clients with limited individual sample sizes.
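The idea that "only model parameters are exchanged" can be sketched with plain federated averaging. This is a deliberately simplified stand-in, not the method proposed in the paper: a single autoregressive coefficient `a` in `x_t = a * x_{t-1}` replaces the full Dynamic Bayesian Network, and the function names, learning rate, and toy data are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (x_{t-1}, x_t) observed by one client


def local_update(a: float, data: List[Pair], lr: float = 0.1, steps: int = 5) -> float:
    """Run a few gradient steps on the client's own squared error.

    The raw pairs in `data` never leave this function; only the updated
    coefficient is returned to the server.
    """
    for _ in range(steps):
        grad = sum(2 * (a * x_prev - x_next) * x_prev for x_prev, x_next in data) / len(data)
        a -= lr * grad
    return a


def federated_round(a: float, clients: List[List[Pair]]) -> float:
    """Server step: average the clients' coefficients, weighted by sample count."""
    total = sum(len(c) for c in clients)
    return sum(len(c) * local_update(a, c) for c in clients) / total


def fit(clients: List[List[Pair]], rounds: int = 50) -> float:
    """Alternate local updates and server averaging, starting from a = 0."""
    a = 0.0
    for _ in range(rounds):
        a = federated_round(a, clients)
    return a


if __name__ == "__main__":
    # Two clients whose (noise-free) data were generated with a = 0.5.
    clients = [
        [(1.0, 0.5), (2.0, 1.0)],
        [(-1.0, -0.5), (0.5, 0.25), (3.0, 1.5)],
    ]
    print(fit(clients))
```

In a real structure-learning setting the exchanged parameters would be weighted adjacency matrices and the local objective would include an acyclicity constraint; the sketch only shows the communication pattern.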
Chinese citizen allegedly photographed Vandenberg base with drone, says it was 'probably not a good idea'
Nearly a mile above Vandenberg Space Force Base in Santa Barbara County, a hacked drone soared through restricted airspace for roughly an hour. The lightweight drone photographed sensitive areas of the military facility on Nov. 30, including a complex used by SpaceX, according to federal investigators. The drone then descended back to the ground, where the pilot and another man waited at a nearby park. Four security officers from the military base arrived on the scene and asked the men if they had seen a drone flying through the area, unaware that one of them had tucked the drone under his jacket. Authorities identified that man as 39-year-old Yinpiao Zhou, a Chinese citizen and a lawful permanent resident of the U.S.
Kenya's President Wades Into Meta Lawsuits
Can a Big Tech company be sued in Kenya for alleged abuses at an outsourcing company working on its behalf? That's the question at the heart of two lawsuits that are attempting to set a new precedent in Kenya, which is the prime destination for tech companies looking to farm out digital work to the African continent. The two-year legal battle stems from allegations of human rights violations at an outsourced Meta content moderation facility in Nairobi, where employees hired by a contractor were paid as little as $1.50 per hour to view traumatic content, such as videos of rapes, murders, and war crimes. The suits claim that despite the workers being contracted by an outsourcing company, called Sama, Meta essentially supervised and set the terms for the work, and designed and managed the software required for the task. Both companies deny wrongdoing and Meta has challenged the Kenyan courts' jurisdiction to hear the cases.
Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft
Harvard University announced Thursday it's releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard's newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta's Llama, the Institutional Data Initiative's database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to "level the playing field" by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly-refined and curated content repositories that normally only established tech giants have the resources to assemble. "It's gone through rigorous review," he says. Leppert believes the new public domain database could be used in conjunction with other licensed materials to build artificial intelligence models.
Chatbot encouraged US teen to kill parents over screen time limit, lawsuit claims
The legal filing includes a screenshot of one of the interactions between the 17-year-old - identified only as J.F. - and a Character.ai chatbot. "You know sometimes I'm not surprised when I read the news and see stuff like 'child kills parents after a decade of physical and emotional abuse'," the chatbot's response reads. "Stuff like this makes me understand a little bit why it happens." The lawsuit seeks to hold the defendants responsible for what it calls the "serious, irreparable, and ongoing abuses" of J.F. as well as an 11-year-old referred to as "B.R." Character.ai is "causing serious harms to thousands of kids, including suicide, self-mutilation, sexual solicitation, isolation, depression, anxiety, and harm towards others," it says. "[Its] desecration of the parent-child relationship goes beyond encouraging minors to defy their parents' authority to actively promoting violence," it continues.
Blockchain Innovation Will Put an AI-Powered Internet Back Into Users' Hands
The doomers have it wrong. AI is not going to end the world--but it is going to end the web as we've known it. AI is already upending the economic covenant of the internet that's existed since the advent of search: A few companies (mostly Google) bring demand, and creators bring supply (and get some ad revenue or recognition from it). AI tools are already generating and summarizing content, obviating the need for users to click through to the sites of content providers, and thereby upsetting the balance. Meanwhile, an ocean of AI-powered deepfakes and bots will make us question what's real and will degrade people's trust in the online world.
The Machine Ethics podcast: Diversity in the AI life-cycle with Caitlin Kraft-Buchman
Hosted by Ben Byford, The Machine Ethics Podcast brings together interviews with academics, authors, business leaders, designers and engineers on the subject of autonomous algorithms, artificial intelligence, machine learning, and technology's impact on society. In this episode we're chatting to Caitlin about gender and AI, why technology isn't neutral, using technology for good, diversity creation and exploitation, lived experience expertise, co-creating technologies and the AI life cycle, the importance of success metrics, international treaties on AI, and more… Alliance is a leader of the UN's Generation Equality Action Coalition Technology & Innovation for Gender Equality. Caitlin was co-chair of the Expert Group for the UN Commission on the Status of Women (CSW67) in 2023 with its first ever priority theme of Technology & Innovation. Caitlin leads the Human Rights Toolbox initiative, an educational platform that supports a global community working for a human rights-based approach to AI – with equity & inclusion at the core of the code. Women at the Table is a leader of the fr feminist AI research Network, with hubs in Latin America & the Caribbean, Middle East & North Africa, Southeast Asia, and a sister network in Africa, and serves as Civil Society lead for the World Benchmarking Alliance's Collective Impact Coalition for Ethical AI.