 inspector


DHS Kept Chicago Police Records for Months in Violation of Domestic Espionage Rules

WIRED

The Department of Homeland Security collected data on Chicago residents accused of gang ties to test if police files could feed an FBI watchlist. Months passed before anyone noticed it wasn't deleted. On November 21, 2023, field intelligence officers within the Department of Homeland Security quietly deleted a trove of Chicago Police Department records. It was not a routine purge.


ReMind: Understanding Deductive Code Reasoning in LLMs

Gao, Jun, Peng, Yun, Ren, Xiaoxue

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved remarkable progress in code-related tasks. Despite their advancement, empirical evidence reveals that they still struggle with deductive code reasoning, the ability to reason about the program execution process. While prior studies have recognized this limitation, the underlying causes remain largely underexplored. In this paper, we begin by presenting a comprehensive empirical study that reveals three key challenges undermining deductive code reasoning: (1) an intrinsic gap between generation and reasoning abilities, (2) a consistent bias towards code sources, and (3) weak zero-shot generalization on complex benchmarks. In light of these challenges, we propose ReMind, a multi-agent framework composed of Mutator, Executor, and Inspector. The Mutator generates code variants to mitigate bias towards code sources, the Executor traces variable states step-by-step to expose inconsistency, and the Inspector identifies problematic reasoning steps and provides control-flow refinement to bridge the intrinsic reasoning gap. Through their coordinated collaboration, ReMind systematically identifies and refines reasoning flaws, achieving outstanding performance and enabling robust zero-shot generalization. Extensive experiments on two benchmarks with five LLMs demonstrate the superior advantages of ReMind compared to baseline approaches in deductive code reasoning.
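To make "tracing variable states step-by-step" concrete, the following minimal Python sketch records the ground-truth local variables of a small function before each executed line, using the standard `sys.settrace` hook. It is not the paper's Executor, which is described as an LLM agent; the helper `trace_variable_states` and the toy function `sum_of_squares` are hypothetical names introduced only for illustration. A trace like this is the kind of reference against which a model's predicted states could be compared.

```python
import sys


def trace_variable_states(func, *args):
    """Run func(*args) and snapshot its locals before each executed line.

    Hypothetical helper for illustration; not part of ReMind.
    """
    states = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace into other Python functions
        if event == "line":
            # Line offset relative to the `def` line, plus a copy of the locals
            # as they are just before this line runs.
            offset = frame.f_lineno - code.co_firstlineno
            states.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return result, states


def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


result, states = trace_variable_states(sum_of_squares, 3)
for offset, local_vars in states:
    print(offset, local_vars)
```

Each printed row pairs a line offset with the variable values at that point, so a step where a predicted state diverges from the recorded one can be located directly.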


Behavioral Fingerprinting of Large Language Models

Pei, Zehua, Zhen, Hui-Ling, Zhang, Ying, Yang, Zhiyuan, Li, Xing, Yu, Xianzhi, Yuan, Mingxuan, Yu, Bei

arXiv.org Artificial Intelligence

Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel "Behavioral Fingerprinting" framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model's intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model's interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: https://github.com/JarvisPei/Behavioral-Fingerprinting


UK to use AI to stop adult migrants posing as children

BBC News

The previous Conservative government introduced a plan to examine the bones and teeth of some migrants in order to verify their age. But Labour ministers are thought to be sceptical about the plan because it relied on people being taken to separate facilities and instead wanted a verification system that could be used at the border. Mr Bolt's report noted the safeguarding risk of a child incorrectly assessed to be an adult having to share a room with an adult stranger – as well as an adult incorrectly assessed as a child being placed with other children. The inspector highlighted the case of a male small boat arrival who claimed to be 17 but whom the Home Office assessed to be 22 due to physical characteristics such as his "deep voice", "fully developed facial structure" and "thick black stubble". He criticised the Home Office's use of "generic physical characteristics" and "failing to take into account the young person's individual circumstances".


PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory

Myung, Junho, Park, Yeon Su, Kim, Sunwoo, Yoo, Shin, Oh, Alice

arXiv.org Artificial Intelligence

Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs' decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at https://github.com/yeonsuuuu28/papers-please.
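The comparison of denial rates across need levels or identity cues can be illustrated with a minimal Python sketch. The record layout (a group label paired with an "approve" or "deny" decision), the function name `denial_rates`, and the toy records are all assumptions made for illustration; they are not taken from the benchmark's actual data format or results.

```python
from collections import defaultdict


def denial_rates(decisions):
    """Return the fraction of 'deny' decisions per group.

    `decisions` is an iterable of (group, decision) pairs, where decision is
    "approve" or "deny". This layout is a hypothetical one for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [denied, total]
    for group, decision in decisions:
        counts[group][1] += 1
        if decision == "deny":
            counts[group][0] += 1
    return {group: denied / total for group, (denied, total) in counts.items()}


# Toy, made-up records keyed by ERG need level; not real benchmark output.
toy_decisions = [
    ("existence", "approve"),
    ("existence", "deny"),
    ("relatedness", "deny"),
    ("relatedness", "deny"),
    ("growth", "approve"),
    ("growth", "approve"),
]
rates = denial_rates(toy_decisions)
print(rates)
```

The same function works when the group label is an identity cue instead of a need level, which is how a difference in denial rates between identities would show up.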


Feasibility-Driven Trust Region Bayesian Optimization

Ascia, Paolo, Raponi, Elena, Bäck, Thomas, Duddeck, Fabian

arXiv.org Artificial Intelligence

Bayesian optimization is a powerful tool for solving real-world optimization tasks under tight evaluation budgets, making it well-suited for applications involving costly simulations or experiments. However, many of these tasks are also characterized by the presence of expensive constraints whose analytical formulation is unknown and often defined in high-dimensional spaces where feasible regions are small, irregular, and difficult to identify. In such cases, a substantial portion of the optimization budget may be spent just trying to locate the first feasible solution, limiting the effectiveness of existing methods. In this work, we present a Feasibility-Driven Trust Region Bayesian Optimization (FuRBO) algorithm. FuRBO iteratively defines a trust region from which the next candidate solution is selected, using information from both the objective and constraint surrogate models. Our adaptive strategy allows the trust region to shift and resize significantly between iterations, enabling the optimizer to rapidly refocus its search and consistently accelerate the discovery of feasible and good-quality solutions. We empirically demonstrate the effectiveness of FuRBO through extensive testing on the full BBOB-constrained COCO benchmark suite and other physics-inspired benchmarks, comparing it against state-of-the-art baselines for constrained black-box optimization across varying levels of constraint severity and problem dimensionalities ranging from 2 to 60.
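As a rough illustration of how a trust region can be placed using both objective and constraint information, the Python sketch below centers an axis-aligned box on the best feasible point seen so far, or on the least-violating point when nothing is feasible yet, and clips it to the search bounds. This centering rule is a generic one assumed here for illustration; the abstract does not give FuRBO's actual rule, and the function name `trust_region`, its parameters, and the toy data are hypothetical.

```python
def trust_region(points, objective, violation, length, lower, upper):
    """Return (center, box_lower, box_upper) for an axis-aligned trust region.

    Assumed rule, for illustration only: a point is feasible when its
    violation is <= 0. Prefer the feasible point with the lowest objective;
    if none is feasible, fall back to the point with the smallest violation.
    """
    feasible = [i for i, v in enumerate(violation) if v <= 0.0]
    if feasible:
        best = min(feasible, key=lambda i: objective[i])
    else:
        best = min(range(len(points)), key=lambda i: violation[i])
    center = points[best]
    half = length / 2.0
    # Clip the box to the global search bounds in every dimension.
    box_lower = [max(lo, c - half) for c, lo in zip(center, lower)]
    box_upper = [min(up, c + half) for c, up in zip(center, upper)]
    return center, box_lower, box_upper


# Toy evaluations in the unit square (made-up numbers).
points = [[0.125, 0.875], [0.5, 0.5], [0.75, 0.25]]
objective = [0.1, 0.4, 0.9]
violation = [0.3, -0.1, 0.2]  # only the second point is feasible

center, box_lower, box_upper = trust_region(
    points, objective, violation, length=0.5, lower=[0.0, 0.0], upper=[1.0, 1.0]
)
print(center, box_lower, box_upper)
```

Recomputing the center after every evaluation lets the box jump toward newly found feasible points, which is the kind of shifting the abstract describes; how the side length is resized between iterations is left out because the source does not specify it.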


Government drones used in 'runaway spying operation' to peek into backyards in Sonoma County, lawsuit says

Los Angeles Times

Three residents filed a lawsuit this week against Sonoma County seeking to block code enforcement from using drones to take aerial images of their homes in what the American Civil Liberties Union is calling a "runaway spying operation." The lawsuit, filed by the ACLU Wednesday on behalf of the three residents, alleges that the county began using drones with high-powered cameras and zoom lenses in 2019 to track illegal cannabis cultivation, but in the years since, officials have used the devices more than 700 times to find other code violations on private property without first seeking a warrant. "For too long, Sonoma County code enforcement has used high-powered drones to warrantlessly sift through people's private affairs and initiate charges that upend lives and livelihoods. All the while, the county has hidden these unlawful searches from the people they have spied on, the community, and the media," Matt Cagle, a senior staff attorney with the ACLU Foundation of Northern California, said in a statement. A spokesperson for Sonoma County said the county is reviewing the complaint and takes "the allegations very seriously."


MTA strapped Google Pixels to subway cars to spot track defects

Engadget

Anyone who has ridden the New York City subway can tell you that it has a lot of problems, from strange noises to flammable debris on the tracks. Now, as is the solution for everything these days, the Metropolitan Transportation Authority (MTA) is testing how AI could improve the repair process with the help of six Google Pixel phones. In this case, the Google Pixel phones rode on four different subway cars between last September and January. The experiment, conducted in partnership with Google Public Sector, used the phones' accelerometers, magnetometers and microphones to pick up on any worrisome noises. This data was then sent to cloud-based systems that generated predictive insights using machine learning algorithms.


Planners recommended against nuclear plant in 2019 citing fears for Welsh language

The Guardian > Energy

Planning inspectors recommended against a Hitachi-built nuclear power plant in Anglesey on the basis that it could dilute the island's Welsh language and culture, it has emerged. Hitachi scrapped plans to build a £20bn nuclear power plant at Wylfa in 2020 over cost concerns after failing to reach a funding agreement with UK ministers. Keir Starmer's government has vowed to make it easier to build major infrastructure projects by reforming the planning system and stopping campaigners from launching "excessive" legal challenges. The prime minister unveiled plans for a historic expansion in nuclear power this week, vowing to "push past nimbyism" and make sites across the country available for new power stations. Nuclear industry figures believe that the fate of Hitachi's proposed plant at Wylfa demonstrates the problems with the UK's planning system.


Effective Defect Detection Using Instance Segmentation for NDI

Rahman, Ashiqur, Seethi, Venkata Devesh Reddy, Yunker, Austin, Kral, Zachary, Kettimuthu, Rajkumar, Alhoori, Hamed

arXiv.org Artificial Intelligence

Ultrasonic testing is a common Non-Destructive Inspection (NDI) method used in aerospace manufacturing. However, the complexity and size of the ultrasonic scans make it challenging to identify defects through visual inspection or machine learning models. Using computer vision techniques to identify defects from ultrasonic scans is an evolving research area. In this study, we used instance segmentation to identify the presence of defects in the ultrasonic scan images of composite panels that are representative of real components manufactured in aerospace. We used two models based on Mask-RCNN (Detectron 2) and YOLO 11 respectively. Additionally, we implemented a simple statistical pre-processing technique that reduces the burden of requiring custom-tailored pre-processing techniques. Our study demonstrates the feasibility and effectiveness of using instance segmentation in the NDI pipeline by significantly reducing data pre-processing time, inspection time, and overall costs.