Government
Stress-Testing Model Specs Reveals Character Differences among Language Models
Zhang, Jifan, Sleight, Henry, Peng, Andi, Schulman, John, Durmus, Esin
Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress-test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy, we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show that this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous examples of issues in current model specs, such as direct contradictions and interpretive ambiguities among several principles. Additionally, our generated dataset reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we report value prioritization patterns for these models and the differences among them.
Shall We Play a Game? Language Models for Open-ended Wargames
Matlin, Glenn, Mahajan, Parv, Song, Isaac, Hao, Yixiong, Bard, Ryan, Topp, Stu, Montoya, Evan, Parwani, M. Rehan, Shetty, Soham, Riedl, Mark
Wargames are simulations of conflicts in which participants' decisions influence future events. While casual wargaming can be used for entertainment or socialization, serious wargaming is used by experts to explore strategic implications of decision-making and experiential learning. In this paper, we take the position that Artificial Intelligence (AI) systems, such as Language Models (LMs), are rapidly approaching human-expert capability for strategic planning -- and will one day surpass it. Military organizations have begun using LMs to provide insights into the consequences of real-world decisions during _open-ended wargames_, which use natural language to convey actions and outcomes. We argue that the ability of AI systems to influence large-scale decisions motivates additional research into the safety, interpretability, and explainability of AI in open-ended wargames. To demonstrate, we conduct a scoping literature review with a curated selection of 100 unclassified studies on AI in wargames, and construct a novel ontology of open-endedness using the creativity afforded to players and adjudicators and the novelty provided to observers. Drawing from this body of work, we distill a set of practical recommendations and critical safety considerations for deploying AI in open-ended wargames across common domains. We conclude by presenting the community with a set of high-impact open research challenges for future work.
Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models
Tan, Zhiyin, D'Souza, Jennifer
This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.
LiDAR, GNSS and IMU Sensor Alignment through Dynamic Time Warping to Construct 3D City Maps
Wang, Haitian, Albaqami, Hezam, Wang, Xinyu, Ibrahim, Muhammad, Malakan, Zainy M., Algamdi, Abdullah M., Alghamdi, Mohammed H., Mian, Ajmal
LiDAR-based 3D mapping suffers from cumulative drift causing global misalignment, particularly in GNSS-constrained environments. To address this, we propose a unified framework that fuses LiDAR, GNSS, and IMU data for high-resolution city-scale mapping. The method performs velocity-based temporal alignment using Dynamic Time Warping and refines GNSS and IMU signals via extended Kalman filtering. Local maps are built using Normal Distributions Transform-based registration and pose graph optimization with loop closure detection, while global consistency is enforced using GNSS-constrained anchors followed by fine registration of overlapping segments. We also introduce a large-scale multimodal dataset captured in Perth, Western Australia to facilitate future research in this direction. Our dataset comprises 144,000 frames acquired with a 128-channel Ouster LiDAR, synchronized RTK-GNSS trajectories, and MEMS-IMU measurements across 21 urban loops. To assess geometric consistency, we evaluated our method using alignment metrics based on road centerlines and intersections to capture both global and local accuracy. The proposed framework reduces the average global alignment error from 3.32 m to 1.24 m, achieving a 61.4% improvement, and significantly decreases the intersection centroid offset from 13.22 m to 2.01 m, corresponding to an 84.8% enhancement. The constructed high-fidelity map and raw dataset are publicly available through IEEE Dataport and its visualization can be viewed in the provided Demo. This dataset and method together establish a new benchmark for evaluating 3D city mapping in GNSS-constrained environments, with source code available at GitHub Repository. Urbanization is rapidly transforming cities into dense and complex environments, increasing the demand for scalable infrastructure planning and maintenance [1], [2]. In this context, updated high-resolution spatial data is essential [3], [4], [5].
This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. (UJ-24-SUTU-1290).
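The velocity-based temporal alignment mentioned in the abstract above can be illustrated with a minimal sketch of classic Dynamic Time Warping on two 1-D speed sequences. This is not the authors' implementation: the function name `dtw_path`, the absolute-difference cost, and the toy GNSS and IMU speed values are all assumptions made for illustration.

```python
from math import inf


def dtw_path(a, b):
    """Classic O(len(a) * len(b)) DTW between two 1-D sequences.

    Returns the total alignment cost and the warping path as a list of
    (index_in_a, index_in_b) pairs. Hypothetical helper, not from the paper.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance (assumed metric)
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # match both samples
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
            )

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Toy data (made up): the IMU-derived speed lags the GNSS speed by one sample.
gnss_speed = [0.0, 1.0, 2.0, 1.0, 0.0]
imu_speed = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
total_cost, path = dtw_path(gnss_speed, imu_speed)
```

In this toy case the warping path pairs each GNSS sample with the IMU sample one step later, which is the kind of time offset a temporal alignment step would need to recover. A real pipeline would likely use a windowed or library-based DTW for long recordings.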
Compositional Generation for Long-Horizon Coupled PDEs
Dhulipala, Somayajulu L. N., Ray, Deep, Forman, Nicholas
Simulating coupled PDE systems is computationally intensive, and prior efforts have largely focused on training surrogates on the joint (coupled) data, which requires a large amount of data. In this paper, we study compositional diffusion approaches where diffusion models are trained only on the decoupled PDE data and are composed at inference time to recover the coupled field. Specifically, we investigate whether the compositional strategy is feasible under long time horizons involving a large number of time steps. In addition, we compare a baseline diffusion model with one trained using the v-parameterization strategy. We also introduce a symmetric compositional scheme for the coupled fields based on the Euler scheme. We evaluate on Reaction-Diffusion and modified Burgers with longer time grids, and benchmark against a Fourier Neural Operator trained on coupled data. Despite seeing only decoupled training data, the compositional diffusion models recover coupled trajectories with low error. v-parameterization can improve accuracy over a baseline diffusion model, while the neural operator surrogate remains strongest given that it is trained on the coupled data. These results show that compositional diffusion is a viable strategy towards efficient, long-horizon modeling of coupled PDEs.
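For readers unfamiliar with v-parameterization, the sketch below shows the commonly used definition for a variance-preserving diffusion, where the noisy sample is x_t = alpha * x0 + sigma * eps with alpha^2 + sigma^2 = 1 and the network regresses v = alpha * eps - sigma * x0. The abstract does not state the paper's exact formulation, so the function names, the scalar inputs, and the cosine/sine schedule here are assumptions for illustration only.

```python
import math


def v_target(x0, eps, alpha, sigma):
    """Training target under v-parameterization (standard definition)."""
    return alpha * eps - sigma * x0


def recover_from_v(x_t, v, alpha, sigma):
    """Recover the clean sample and the noise from x_t and v.

    Valid only when alpha**2 + sigma**2 == 1 (variance-preserving schedule).
    """
    x0 = alpha * x_t - sigma * v
    eps = sigma * x_t + alpha * v
    return x0, eps


# Toy scalar example with an assumed angle-based schedule.
phi = 0.3
alpha, sigma = math.cos(phi), math.sin(phi)
x0_true, eps_true = 1.5, -0.7
x_t = alpha * x0_true + sigma * eps_true
v = v_target(x0_true, eps_true, alpha, sigma)
x0_rec, eps_rec = recover_from_v(x_t, v, alpha, sigma)
```

The round trip recovers both the clean value and the noise, which is why a model that predicts v can be converted to a noise-prediction or sample-prediction model at any step.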
How Hacked Card Shufflers Allegedly Enabled a Mob-Fueled Poker Scam That Rocked the NBA
WIRED recently demonstrated how to cheat at poker by hacking the Deckmate 2 card shufflers used in casinos. The mob was allegedly using the same trick to fleece victims for millions. Security researcher Joseph Tartaro demonstrates how he can insert a hacking device into a USB port on the back of the shuffler that alters its code, then transmits the deck's order via Bluetooth to a phone app. The Deckmate 2 automatic card shufflers used in casinos, cardhouses, and high-end private poker games around the world are designed to shuffle a deck in seconds with perfect, computer-generated randomness, vastly speeding up play. They're also, amazingly, sold with a camera inside that can observe every card in the deck before it's dealt--a fact that's become very convenient for poker-cheating hackers and, allegedly, members of the Cosa Nostra mafia.
Trump's Investment in Intel Is Paying Off
The chipmaker reported higher than expected revenue on Thursday, and its stock price has risen over 90 percent since August. The Trump administration's investment in Intel appears to be paying off so far, but the once-mighty chipmaker still has a long way to climb back to industry dominance. In August, the US government announced it was converting about $9 billion in federal grants that Intel had been issued during the Biden administration into a roughly 10 percent equity stake in the company. During its third-quarter earnings on Thursday--its first financial update since Trump's surprise investment--Intel reported that it earned $13.7 billion in revenue over the past three months, a three percent increase year-over-year. It's the fourth consecutive quarter that Intel has beaten revenue guidance.
What Americans fear most in 2025
For over a decade, Americans' top fear has remained the same: corrupt government officials. Team Fear is at it again. For the past 11 years, this dedicated group of researchers with a very cool nickname has conducted the annual Chapman University Survey of American Fears. This year, they surveyed 1,015 adult Americans on what they fear most, from sharks to heights to identity theft.
Russia launches barrage of drone strikes across Ukraine
Russia launched dozens of drones and decoy drones across Ukrainian territory, including one that hit a school building in Kyiv.