 Law


Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis

arXiv.org Artificial Intelligence

Text-to-speech (TTS) technology has achieved impressive results for widely spoken languages, yet many under-resourced languages remain challenged by limited data and linguistic complexities. In this paper, we present a novel methodology that integrates a data-optimized framework with an advanced acoustic model to build high-quality TTS systems for low-resource scenarios. We demonstrate the effectiveness of our approach using Thai as an illustrative case, where intricate phonetic rules and sparse resources are effectively addressed. Our method enables zero-shot voice cloning and improved performance across diverse client applications, ranging from finance to healthcare, education, and law. Extensive evaluations - both subjective and objective - confirm that our model meets state-of-the-art standards, offering a scalable solution for TTS production in data-limited settings, with significant implications for broader industry adoption and multilingual accessibility.


The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

arXiv.org Artificial Intelligence

Practically all large language models have been pre-trained on data that is subject to global uncertainty related to copyright infringement and breach of contract. This creates potential risk for users and developers due to this uncertain legal status. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks related to copyright or breach of contract. The foundation of this project is a corpus of over 132 million documents and trillions of tokens spanning 16 different sources that have been verified to meet the strict copyright and licensing protocol detailed herein. We are releasing the entire pipeline, including 1) the source code to acquire and process these documents, 2) the original document formats with associated provenance and metadata, 3) extracted content in a standardized format, 4) pre-tokenized representations of the documents, and 5) various mid- and post-train resources such as question-answer, summarization, conversion, drafting, classification, prediction, and conversational data. All of these resources are freely available to the public on S3, Hugging Face, and GitHub under CC-BY terms. We are committed to continuing this project in furtherance of a more ethical, legal, and sustainable approach to the development and use of AI models.


Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

arXiv.org Artificial Intelligence

We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations, successfully fooling oversight models while achieving explanation quality comparable to reference labels. We further find that models can scheme to develop deceptive strategies when they believe the detection of harmful features might lead to negative consequences for themselves. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels. We conclude by proposing mitigation strategies, emphasizing the critical need for robust understanding and defenses against deception.


Data over dialogue: Why artificial intelligence is unlikely to humanise medicine

arXiv.org Artificial Intelligence

Recently, a growing number of experts in artificial intelligence (AI) and medicine have begun to suggest that the use of AI systems, particularly machine learning (ML) systems, is likely to humanise the practice of medicine by substantially improving the quality of clinician-patient relationships. In this thesis, however, I argue that medical ML systems are more likely to negatively impact these relationships than to improve them. In particular, I argue that the use of medical ML systems is likely to compromise the quality of trust, care, empathy, understanding, and communication between clinicians and patients.


DeepGreen: Effective LLM-Driven Green-washing Monitoring System Designed for Empirical Testing -- Evidence from China

arXiv.org Artificial Intelligence

Congluo Xu (Business School, Sichuan University, Chengdu, 610065), Yu Miao (School of Economics, Sichuan University, Chengdu, 610065), Yiling Xiao (Business School, Sichuan University, Chengdu, 610065), Chengmengjia Lin (Business School, Sichuan University, Chengdu, 610065). April 11, 2025.

Abstract: This paper proposes DeepGreen, a Large Language Model driven (LLM-driven) system for detecting corporate green-washing behaviour. Using dual-layer LLM analysis, DeepGreen first identifies potential green keywords in financial statements and then assesses their degree of implementation via iterative LLM semantic analysis. A core variable, GreenImplement, is derived from the ratio of the two layers' outputs. We extract 204 financial statements of 68 companies from the A-share market over three years, comprising 89,893 words, and analyse them through DeepGreen. Our analysis, supported by violin plots and K-means clustering, reveals insights and validates the variable against the Huazheng ESG rating. It offers a novel perspective for regulatory agencies and investors, serving as a proactive monitoring tool that complements traditional methods. Empirical tests show that green implementation can significantly boost companies' asset return rate, but the effect is heterogeneous across company scale. For small and medium-sized companies, green implementation contributes little to asset return, so they have a stronger motivation for green-washing.

Keywords: Green-washing Monitoring, Large Language Models, Financial Statement Analysis, Unstructured Data Analysis

1 Introduction

Amid intensifying global focus on sustainable development and environmental protection, the phenomenon of corporate "green-washing" has emerged as a contentious issue. "Green-washing" typically refers to companies exaggerating or misrepresenting their environmental protection efforts in promotional materials while their actual practices fail to meet sustainable development standards [1]. However, a more elusive challenge lies in "general green-washing", which involves subtler tactics that distort perceptions by repeatedly invoking terms such as "carbon peak" or "green development" without substantive evidence [2]. The elusiveness of general green-washing stems from its exploitation of human psychology and information processing mechanisms.

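The DeepGreen abstract above describes GreenImplement only as a ratio of the two layers' outputs. As a minimal sketch, assuming the ratio is the number of keywords judged implemented (second layer) over the number of green keywords identified (first layer), it could be computed as below. The function name, the direction of the ratio, and the example keywords are illustrative assumptions, not the authors' definition.

```python
def green_implement(identified_keywords, implemented_keywords):
    """Share of identified green keywords that were judged implemented.

    Assumption: ratio = |implemented| / |identified|. The abstract does not
    state the exact formula, so this is only one plausible reading.
    """
    identified = set(identified_keywords)
    if not identified:
        # No green keywords found: define the ratio as 0.0 to avoid dividing by zero.
        return 0.0
    # Only count implemented keywords that the first layer actually identified.
    implemented = set(implemented_keywords) & identified
    return len(implemented) / len(identified)


# Hypothetical layer outputs for one financial statement.
layer_one = ["carbon peak", "green development", "emission reduction", "clean energy"]
layer_two = ["emission reduction"]
score = green_implement(layer_one, layer_two)
```

Under this reading, a low score means many green terms are invoked with little evidence of implementation, which is the pattern the abstract associates with green-washing.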

Counting Hours, Counting Losses: The Toll of Unpredictable Work Schedules on Financial Security

arXiv.org Artificial Intelligence

Financial instability has become a significant issue in today's society. While research typically focuses on financial aspects, there is a tendency to overlook time-related aspects of unstable work schedules. The inability to rely on consistent work schedules leads to burnout, work-family conflicts, and financial shocks that directly impact workers' income and assets. Unforeseen fluctuations in earnings pose challenges in financial planning, affecting decisions on savings and spending and ultimately undermining individuals' long-term financial stability and well-being. This issue is particularly evident in sectors where workers experience frequently changing schedules without sufficient notice, including those in the food service and retail sectors, part-time and hourly workers, and individuals with lower incomes. These groups are already more financially vulnerable, and the unpredictable nature of their schedules exacerbates their financial fragility. Our objective is to understand how unforeseen fluctuations in earnings exacerbate financial fragility by investigating the extent to which individuals' financial management depends on their ability to anticipate and plan for the future. To address this question, we develop a simulation framework that models how individuals optimize utility amidst financial uncertainty and the imperative to avoid financial ruin. We employ online learning techniques, specifically adapting workers' consumption policies based on evolving information about their work schedules. With this framework, we show both theoretically and empirically how a worker's capacity to anticipate schedule changes enhances their long-term utility. Conversely, the inability to predict future events can worsen workers' instability. Moreover, our framework enables us to explore interventions to mitigate the problem of schedule uncertainty and evaluate their effectiveness.


OKRA: an Explainable, Heterogeneous, Multi-Stakeholder Job Recommender System

arXiv.org Artificial Intelligence

The use of recommender systems in the recruitment domain has been labeled as 'high-risk' in recent legislation. As a result, strict requirements regarding explainability and fairness have been put in place to ensure proper treatment of all involved stakeholders. To allow for stakeholder-specific explainability, while also handling highly heterogeneous recruitment data, we propose a novel explainable multi-stakeholder job recommender system using graph neural networks: the Occupational Knowledge-based Recommender using Attention (OKRA). The proposed method is capable of providing both candidate- and company-side recommendations and explanations. We find that OKRA performs substantially better than six baselines in terms of nDCG for two datasets. Furthermore, we find that the tested models show a bias toward candidates and vacancies located in urban areas. Overall, our findings suggest that OKRA provides a balance between accuracy, explainability, and fairness.
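
The nDCG metric used in the OKRA comparison can be illustrated with a short sketch. This is the standard textbook definition with linear gain and a log2 rank discount; the abstract does not say which nDCG variant or cutoff the authors use, so the function names and the optional `k` parameter here are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is discounted by log2(rank + 1)."""
    # enumerate() is 0-based, so position 0 is rank 1 and the discount is log2(0 + 2).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """nDCG of a ranked list of graded relevances, optionally cut off at rank k."""
    ranked = list(relevances)[:k]
    # The ideal ranking orders all items by relevance, truncated to the same length.
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# A ranking that puts the most relevant vacancy last scores below 1.0.
score_bad_order = ndcg([1, 2, 3])
score_ideal = ndcg([3, 2, 1])
```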


Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking

arXiv.org Artificial Intelligence

Modern Large Language Model (LLM) systems typically rely on Retrieval Augmented Generation (RAG), which aims to gather context that is useful for response generation. These RAG systems typically optimize strictly towards retrieving context that is maximally relevant to the query. However, conventional theory suggests that retrieval systems which seek to maximize context relevance without any additional explicit criteria can create information bottlenecks. We reaffirm this finding in the modern age of LLMs by showing that in standard RAG pipelines, maximizing for context relevance alone can degrade downstream response quality. In response, we show evaluations of existing RAG methods which account for both context relevance and answer quality. These evaluations introduce a novel finding that existing RAG systems scale poorly with inference-time compute usage when considering our combined metric. We introduce "RErank BEyond reLevance (REBEL)", which enables RAG systems to scale with inference-time compute via injection of multi-criteria optimization using Chain-of-Thought prompting (and optionally Multi-Turn dialogue). Ultimately, this enables a new performance/speed tradeoff curve, where RAG systems are able to achieve both higher relevance of retrieved contexts and superior answer quality as inference time increases. Code for the implementation of our method in llama-index can be found at the following PR: https://github.com/run-llama/llama_index/pull/17590. Code for running experiments using this llama-index implementation can be found at https://github.com/microsoft/REBEL.
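
The idea of reranking on more than relevance can be sketched in a few lines. This is not the REBEL implementation, which the abstract says uses Chain-of-Thought prompting to inject the criteria; it swaps in a plain weighted sum over precomputed scores. The `Passage` class, the `rerank` function, the criterion names, and all scores are hypothetical stand-ins for what an LLM judge might produce.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    text: str
    relevance: float  # query relevance score (assumed precomputed)
    criteria: dict = field(default_factory=dict)  # secondary scores, e.g. {"recency": 0.9}


def rerank(passages, weights, relevance_weight=1.0, top_k=None):
    """Order passages by a weighted sum of relevance and secondary criteria."""

    def combined(passage):
        # Criteria missing from a passage contribute 0.0 to its score.
        secondary = sum(
            weight * passage.criteria.get(name, 0.0)
            for name, weight in weights.items()
        )
        return relevance_weight * passage.relevance + secondary

    ranked = sorted(passages, key=combined, reverse=True)
    return ranked[:top_k]


# Hypothetical candidates: "a" is slightly more relevant, "b" is far more recent.
candidates = [
    Passage("a", relevance=0.9, criteria={"recency": 0.1}),
    Passage("b", relevance=0.8, criteria={"recency": 0.9}),
]
relevance_only = [p.text for p in rerank(candidates, weights={})]
multi_criteria = [p.text for p in rerank(candidates, weights={"recency": 0.5})]
```

With an empty weight map the ordering falls back to relevance alone; adding the recency weight moves passage "b" ahead of "a", which is the kind of reordering a multi-criteria reranker is meant to allow.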


Tech founder charged with fraud for 'AI' that was secretly overseas contract workers

Engadget

The US Department of Justice has indicted Albert Sangier for defrauding investors with misleading statements about his Nate financial technology platform. Founded by Sangier in 2018, Nate claimed it could offer shoppers a universal checkout app thanks to artificial intelligence. However, the indictment states that the so-called AI-powered transactions in Nate were actually completed by human contractors in the Philippines and Romania or by bots. Sangier raised more than $40 million from investors for the app. This case follows reporting by The Information in 2022 that cast light on Nate's use of human labor rather than AI.


Hundreds of Video Game Workers Join New Union as Trump Attacks Labor Rights

WIRED

The video game industry's first direct-join union has grown to roughly 445 members since its launch, amidst industry-wide job losses and an escalating federal crackdown on workers' rights. The United Videogame Workers union, which launched with the Communications Workers of America (CWA), was announced March 19 at the Game Developers Conference. It's an effort on behalf of developers and the CWA to champion unionization efforts without relying on the National Labor Relations Board (NLRB), a federal agency that protects workers' rights and working conditions. Their first campaign will focus on industry-wide layoffs; a GDC report released in January found that 11 percent of developers surveyed said they'd been laid off in the year prior. The move comes at a time when the Trump administration has been hostile toward unions, issuing an executive order to end collective bargaining obligations with some federal agencies and firing an NLRB employee, crippling the agency.