 Law


Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO

arXiv.org Artificial Intelligence

Small Language Models (SLMs) enable cost-effective, on-device, and latency-sensitive AI applications, yet their deployment in Traditional Chinese (TC) remains hindered by token-level instability: models unpredictably emit non-TC characters or code-switch into other languages. We address this practical reliability gap by creating PureTC-1B, a three-stage stabilization pipeline for Llama-3.2-1B-Instruct (an open-weight, instruction-tuned model released by Meta) [1] using parameter-efficient LoRA adapters [2]. Our method combines Continual Pre-Training (CPT) on TC-centric corpora, Supervised Fine-Tuning (SFT) with instruction data, and Direct Preference Optimization (DPO) [3] using TC-adherence preferences to improve monolingual robustness without full-model retraining. On a benchmark designed to simulate real-world usage, PureTC-1B achieves a 51.3% relative reduction (micro-average) in non-TC output tokens versus the base model. On a Named Entity Translation (NET) task, PureTC-1B further reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B, indicating that robust TC adherence is attainable even at the 1B scale. The pipeline is reproducible, adapter-only, and hardware-friendly, offering practitioners a practical recipe to enhance language stability for TC and potentially other non-English languages.
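As a rough illustration of how a micro-averaged relative reduction can be computed, the sketch below pools token counts across benchmark subsets before taking the ratio. The function name, the pooling definition, and the counts are assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
def micro_relative_reduction(base_counts, model_counts):
    """Relative reduction in the pooled non-TC token rate.

    Each argument is a list of (non_tc_tokens, total_tokens) pairs,
    one pair per benchmark subset (a hypothetical data layout).
    """
    # Micro-averaging (as assumed here): pool counts over all subsets first.
    base_rate = sum(n for n, _ in base_counts) / sum(t for _, t in base_counts)
    model_rate = sum(n for n, _ in model_counts) / sum(t for _, t in model_counts)
    return (base_rate - model_rate) / base_rate


# Made-up counts, for illustration only.
base = [(30, 1000), (70, 1000)]    # pooled rate 100 / 2000 = 5%
tuned = [(10, 1000), (30, 1000)]   # pooled rate  40 / 2000 = 2%
print(f"{micro_relative_reduction(base, tuned):.1%}")  # prints 60.0%
```

Pooling before dividing weights every token equally, so large subsets dominate; a macro-average would instead average the per-subset reductions.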


LOGicalThought: Logic-Based Ontological Grounding of LLMs for High-Assurance Reasoning

arXiv.org Artificial Intelligence

High-assurance reasoning, particularly in critical domains such as law and medicine, requires conclusions that are accurate, verifiable, and explicitly grounded in evidence. This reasoning relies on premises codified from rules, statutes, and contracts, inherently involving defeasible or non-monotonic logic due to numerous exceptions, where the introduction of a single fact can invalidate general rules, posing significant challenges. While large language models (LLMs) excel at processing natural language, their capabilities in standard inference tasks do not translate to the rigorous reasoning required over high-assurance text guidelines. Core reasoning challenges within such texts often manifest specific logical structures involving negation, implication, and, most critically, defeasible rules and exceptions. In this paper, we propose a novel neurosymbolically-grounded architecture called LOGicalThought (LogT) that uses an advanced logical language and reasoner in conjunction with an LLM to construct a dual symbolic graph context and logic-based context. These two context representations transform the problem from inference over long-form guidelines into a compact grounded evaluation. Evaluated on four multi-domain benchmarks against four baselines, LogT improves overall performance by 11.84% across all LLMs. Performance improves significantly across all three modes of reasoning: by up to +10.2% on negation, +13.2% on implication, and +5.5% on defeasible reasoning compared to the strongest baseline.
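The non-monotonic behaviour described above, where adding a single fact withdraws a conclusion, can be shown with a toy default-rule evaluator. This is a minimal sketch under stated assumptions: the rule format, the `conclude` function, and the contract example are hypothetical and do not represent LogT's logical language or reasoner.

```python
def conclude(facts, rules):
    """Apply default rules: a rule fires when its premise is among the
    facts and none of its exceptions are."""
    conclusions = set()
    for premise, conclusion, exceptions in rules:
        if premise in facts and not (exceptions & facts):
            conclusions.add(conclusion)
    return conclusions


# Hypothetical rule: a signed contract is enforceable,
# unless the signer is a minor.
RULES = [("contract_signed", "enforceable", {"signer_is_minor"})]

print(conclude({"contract_signed"}, RULES))
print(conclude({"contract_signed", "signer_is_minor"}, RULES))
```

The second call knows strictly more than the first yet concludes less, which is exactly what classical (monotonic) inference cannot express.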


Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data

arXiv.org Artificial Intelligence

Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the potential for research and future uses in education and workforce development.


An Analysis of the New EU AI Act and A Proposed Standardization Framework for Machine Learning Fairness

arXiv.org Artificial Intelligence

The European Union's AI Act represents a crucial step towards regulating ethical and responsible AI systems. However, we find that the new EU AI Act lacks quantifiable fairness metrics, uses ambiguous terminology (in particular, it treats the keywords transparency, explainability, and interpretability as interchangeable), and makes no reference to the transparency of ethical compliance. We argue that this ambiguity creates substantial liability risk that would deter investment. Fairness transparency is strategically important. We recommend a more tailored regulatory framework to enhance the new EU AI regulation. Furthermore, we propose a public system framework to assess the fairness and transparency of AI systems. Drawing from past work, we advocate for the standardization of industry best practices as a necessary addition to broad regulations to achieve the level of detail required in industry, without stifling innovation and investment in the AI sector. The proposals are exemplified with the case of ASR and speech synthesizers.


Longitudinal Monitoring of LLM Content Moderation of Social Issues

arXiv.org Artificial Intelligence

Large language models' (LLMs') outputs are shaped by opaque and frequently changing company content moderation policies and practices. LLM moderation often takes the form of refusal; models' refusal to produce text about certain topics both reflects company policy and subtly shapes public discourse. We introduce AI Watchman, a longitudinal auditing system to publicly measure and track LLM refusals over time, to provide transparency into an important and black-box aspect of LLMs. Using a dataset of over 400 social issues, we audit OpenAI's moderation endpoint, GPT-4.1, GPT-5, and DeepSeek (both in English and Chinese). We find evidence that changes in company policies, even those not publicly announced, can be detected by AI Watchman, and identify company- and model-specific differences in content moderation. We also qualitatively analyze and categorize different forms of refusal. This work contributes evidence for the value of longitudinal auditing of LLMs, and presents AI Watchman as one system for doing so.


An Anthropologist LLM to Elicit Users' Moral Preferences through Role-Play

arXiv.org Artificial Intelligence

GPT can predict users' future decisions by analyzing narrative tables, with accuracy further improved when guided by an anthropological framework. Moreover, by integrating contextual knowledge and an interpretative lens into LLMs, this approach enhances AI explainability while ensuring a human-centric perspective in requirement elicitation. By asking GPT to generate a user profile, it becomes possible to directly assess what the model has understood about the user and how it represents them. Furthermore, since the model is not only tasked with predicting users' responses in new scenarios but also with justifying its choices, it is possible, on one hand, to understand the rationale behind the model's output and, on the other, to identify potential misalignments between the model's prediction and the user's actual values and preferences. This enables targeted interventions to improve alignment between the LLM and the user profile, creating a continuous feedback loop that involves both the user and the LLM trained to interpret data through an anthropological lens. The process strengthens the model's interpretability, ethical alignment, and predictive adaptability, thereby making AI systems more transparent and attuned to real-world human values. Ultimately, the approach lays the groundwork for AI assistants capable of recognizing and adapting to individuals' soft ethics and ethical decision-making process.

B. Threat to Validity

We discuss threats to validity following the qualitative research framework proposed in [72], namely, credibility, transferability, dependability, and confirmability.


Rethinking Reward Models for Multi-Domain Test-Time Scaling

arXiv.org Artificial Intelligence

The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs) that assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM to facilitate future research in multi-domain settings.
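One simple way to see why step-wise aggregation can compound errors is to assume, purely for illustration, that a step-level scorer misjudges each step independently with probability eps. The chance that an n-step trajectory is scored without any mistake is then (1 - eps)^n, which shrinks as n grows. This independence model and the function name are assumptions of this sketch; they are not the paper's theoretical analysis.

```python
def prob_all_steps_scored_correctly(eps, n_steps):
    """Probability that none of n_steps independent step-level
    judgments is wrong, given a per-step error rate eps."""
    return (1.0 - eps) ** n_steps


# Hypothetical 5% per-step error rate at several trajectory lengths.
for n in (1, 5, 20, 50):
    print(n, round(prob_all_steps_scored_correctly(0.05, n), 3))
```

An outcome-level verifier makes a single judgment per trajectory, so under this toy model its reliability does not decay with reasoning length.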


Generating Difficult-to-Translate Texts

arXiv.org Artificial Intelligence

Machine translation benchmarks sourced from the real world quickly become obsolete because most of their examples are easy for state-of-the-art translation models. This limits a benchmark's ability to distinguish which model is better or to reveal models' weaknesses. Current methods for creating difficult test cases, such as subsampling or from-scratch synthesis, either fall short of identifying difficult examples or suffer from a lack of diversity and naturalness. Inspired by the iterative process of human experts probing for model failures, we propose MT-breaker, a method where a large language model iteratively refines a source text to increase its translation difficulty. The LLM iteratively queries a target machine translation model to guide its generation of difficult examples. Our approach generates examples that are more challenging for the target MT model while preserving the diversity of natural texts. While the examples are tailored to a particular machine translation model during generation, the difficulty also transfers to other models and languages.


Landcover classification and change detection using remote sensing and machine learning: a case study of Western Fiji

arXiv.org Artificial Intelligence

As a developing country, Fiji is facing rapid urbanisation, which is visible in the massive development projects that include housing, roads, and civil works. In this study, we present machine learning and remote sensing frameworks to compare land use and land cover change from 2013 to 2024 in Nadi, Fiji. The ultimate goal of this study is to provide technical support in land cover/land use modelling and change detection. We used Landsat-8 satellite imagery for the study region and created our training dataset with labels for supervised machine learning. We used Google Earth Engine and unsupervised machine learning via k-means clustering to generate the land cover map. We used convolutional neural networks to classify the selected regions' land cover types. We present a visualisation of change detection, highlighting changes in urban areas over time.
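To make the unsupervised step concrete, the sketch below runs plain k-means on a handful of pixel vectors. It is written in pure Python rather than Google Earth Engine, and the two-band pixel values, the initial centroids, and the function names are hypothetical; it illustrates the clustering idea only, not the study's actual workflow.

```python
def nearest(pixel, centroids):
    """Index of the centroid closest to pixel (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(pixel, centroids[c])),
    )


def kmeans(pixels, centroids, iterations=10):
    """Basic k-means: alternate assignment and centroid update."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = [nearest(px, centroids) for px in pixels]
        for c in range(len(centroids)):
            members = [px for px, lab in zip(pixels, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    # Final assignment against the updated centroids.
    return [nearest(px, centroids) for px in pixels], centroids


# Made-up two-band reflectance values for four pixels.
pixels = [(0.10, 0.20), (0.12, 0.22), (0.80, 0.90), (0.82, 0.88)]
labels, centers = kmeans(pixels, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)  # prints [0, 0, 1, 1]
```

The resulting cluster indices carry no meaning on their own; in a land cover map each cluster still has to be interpreted and named (for example, water or built-up area) by an analyst.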