Collaborating Authors: Schulz


Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study

He, Zhichao, Bian, Mouxiao, Zhu, Jianhong, Chen, Jiayuan, Wang, Yunqiu, Zhao, Wenxia, Li, Tianbin, Han, Bing, Xu, Jie, Wu, Junyan

arXiv.org Artificial Intelligence

The Consolidated Standards of Reporting Trials (CONSORT) statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials (RCTs). Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary large language models (LLMs) in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a gold-standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1-scores of 0.634 and Cohen's kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1-score > 0.850), they struggled profoundly to identify non-compliant and not-applicable items, where F1-scores rarely exceeded 0.400. Notably, some high-profile models, such as GPT-4o, underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.
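
For readers unfamiliar with the two headline metrics, the snippet below shows how a macro-averaged F1-score and Cohen's kappa are typically computed for a three-class task like this one. It assumes scikit-learn, and the toy labels are invented for illustration, not drawn from the study's dataset.

```python
# Minimal sketch of the two evaluation metrics named in the abstract,
# assuming scikit-learn; the labels below are invented toy data.
from sklearn.metrics import f1_score, cohen_kappa_score

# Three-class CONSORT adherence labels: compliant / non-compliant / not applicable.
y_true = ["compliant", "compliant", "non-compliant", "not-applicable", "compliant"]
y_pred = ["compliant", "compliant", "compliant", "not-applicable", "compliant"]

# Macro averaging computes F1 per class, then takes the unweighted mean,
# so rare classes (non-compliant, not-applicable) count as much as common ones.
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
```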


From Dionysius Emerges Apollo -- Learning Patterns and Abstractions from Perceptual Sequences

Wu, Shuchen

arXiv.org Artificial Intelligence

Cognition swiftly breaks high-dimensional sensory streams into familiar parts and uncovers their relations. Why do structures emerge, and how do they enable learning, generalization, and prediction? What computational principles underlie this core aspect of perception and intelligence? A sensory stream, simplified, is a one-dimensional sequence. In learning such sequences, we naturally segment them into parts -- a process known as chunking. In the first project, I investigated factors influencing chunking in a serial reaction time task and showed that humans adapt to underlying chunks while balancing speed and accuracy. Building on this, I developed models that learn chunks and parse sequences chunk by chunk. Normatively, I proposed chunking as a rational strategy for discovering recurring patterns and nested hierarchies, enabling efficient sequence factorization. Learned chunks serve as reusable primitives for transfer, composition, and mental simulation -- letting the model compose the new from the known. I demonstrated this model's ability to learn hierarchies in single- and multi-dimensional sequences and highlighted its utility for unsupervised pattern discovery. The second part moves from concrete to abstract sequences. I taxonomized abstract motifs and examined their role in sequence memory. Behavioral evidence suggests that humans exploit pattern redundancies for compression and transfer. I proposed a non-parametric hierarchical variable model that learns both chunks and abstract variables, uncovering invariant symbolic patterns. I showed its similarity to human learning and compared it to large language models. Taken together, this thesis suggests that chunking and abstraction, as simple computational principles, enable structured knowledge acquisition in hierarchically organized sequences, from simple to complex and concrete to abstract.
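
As a rough illustration of the chunking idea, not the thesis's rational model, the sketch below greedily merges the most frequent adjacent pair into a new chunk, in the spirit of byte-pair encoding, and re-parses the sequence chunk by chunk. The toy sequence and all names are invented.

```python
# Toy chunk learner: repeatedly merge the most frequent adjacent pair
# into a chunk, then re-describe the sequence in terms of learned chunks.
from collections import Counter

def learn_chunks(seq, n_merges=2):
    seq = list(seq)
    chunks = []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b                      # the new chunk is the concatenation
        chunks.append(merged)
        # Re-parse the sequence, replacing each occurrence of the pair (a, b).
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return chunks, seq

chunks, parsed = learn_chunks("abcabcabd")
print(chunks)   # ['ab', 'abc'] -- learned chunks, smaller chunks feeding larger ones
print(parsed)   # ['abc', 'abc', 'ab', 'd'] -- the sequence parsed chunk by chunk
```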


HyperINF: Unleashing the HyperPower of the Schulz's Method for Data Influence Estimation

Zhou, Xinyu, Fan, Simin, Jaggi, Martin

arXiv.org Machine Learning

Influence functions provide a principled method to assess the contribution of individual training samples to a specific target. Yet, their high computational costs limit their applications on large-scale models and datasets. Existing methods proposed for influence function approximation have significantly reduced the computational overheads. However, they mostly suffer from inaccurate estimation due to the lack of strong convergence guarantees from the algorithm. The family of hyperpower methods is well known for its rigorous convergence guarantees on matrix inverse approximation, but the matrix multiplications involved can incur intractable memory and computation costs on large-scale models. We propose HyperINF, an efficient and accurate influence function approximation method that leverages the hyperpower method, specifically Schulz's iterative algorithm. To deal with the computation-intensive matrix multiplication, we incorporate the generalized Fisher information matrix (GFIM) as a low-rank approximation of the Hessian matrix, which reduces the memory and computation overheads to constant costs independent of the LoRA rank on LoRA-tuned models. We first demonstrate the superior accuracy and stability of HyperINF compared to other baselines through a synthetic convergence simulation for matrix inversion. We further validate the efficacy of HyperINF through extensive real-world data attribution tasks, including mislabeled data detection and data selection for LLM and VLM fine-tuning. On LoRA-tuned models, HyperINF achieves superior downstream performance with minimal memory and computational overhead, while other baselines suffer from significant degradation. Our codebase is available at https://github.com/Blackzxy/HyperINF.
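
Schulz's iteration itself is compact enough to show directly. The sketch below is a minimal NumPy rendering of the order-2 hyperpower update X_{k+1} = X_k(2I - AX_k), with a classical initialization that guarantees convergence; it is illustrative only, not the authors' HyperINF implementation.

```python
# Minimal NumPy sketch of Schulz's iteration for approximating A^{-1};
# illustrative only, not the HyperINF codebase.
import numpy as np

def schulz_inverse(A, iters=30):
    # X0 = A^T / (||A||_1 * ||A||_inf) ensures ||I - A X0|| < 1,
    # so the iteration converges quadratically to A^{-1}.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(A.shape[0])
    for _ in range(iters):
        X = X @ (2 * I - A @ X)   # order-2 hyperpower update
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)  # well-conditioned test matrix
X = schulz_inverse(A)
print(np.linalg.norm(A @ X - np.eye(50)))  # residual ~0 after convergence
```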


GPT-ology, Computational Models, Silicon Sampling: How should we think about LLMs in Cognitive Science?

Ong, Desmond C.

arXiv.org Artificial Intelligence

Large Language Models have taken the cognitive science world by storm. It is perhaps timely now to take stock of the various research paradigms that have been used to make scientific inferences about "cognition" in these models or about human cognition. We review several emerging research paradigms -- GPT-ology, LLMs-as-computational-models, and "silicon sampling" -- and survey recent papers that have used LLMs under these paradigms. In doing so, we discuss their claims as well as challenges to scientific inference under these various paradigms. We highlight several outstanding issues about LLMs that have to be addressed to push our science forward: closed-source vs. open-source models; (the lack of visibility of) training data; and reproducibility in LLM research, including forming conventions on new task "hyperparameters" like instructions and prompts.


CogBench: a large language model walks into a psychology lab

Coda-Forno, Julian, Binz, Marcel, Wang, Jane X., Schulz, Eric

arXiv.org Artificial Intelligence

Large language models (LLMs) have significantly advanced the field of artificial intelligence. Yet, evaluating them comprehensively remains challenging. We argue that this is partly due to the predominant focus on performance metrics in most benchmarks. This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 35 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior. Interestingly, we find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior. Finally, we explore the effects of prompt-engineering techniques. We discover that chain-of-thought prompting improves probabilistic reasoning, while take-a-step-back prompting fosters model-based behaviors.
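
To make the analysis concrete: a random-intercept mixed model of the kind the abstract describes could look like the sketch below, where fine-tuned variants are grouped under their base model. This is a hedged sketch assuming statsmodels; the file and column names (cogbench_scores.csv, metric, log_params, rlhf, base_model) are hypothetical, not CogBench's actual schema.

```python
# Hedged sketch of a multilevel (mixed-effects) analysis of per-model
# behavioral metrics; column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cogbench_scores.csv")  # hypothetical file of per-model scores

# Fixed effects for model size and RLHF; a random intercept per base model
# accounts for the nested dependency among its fine-tuned versions.
model = smf.mixedlm("metric ~ log_params + rlhf", df, groups=df["base_model"])
result = model.fit()
print(result.summary())
```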


Imitation versus Innovation: What children can do that large language and language-and-vision models cannot (yet)?

Yiu, Eunice, Kosoy, Eliza, Gopnik, Alison

arXiv.org Artificial Intelligence

Much discussion about large language models and language-and-vision models has focused on whether these models are intelligent agents. We present an alternative perspective. We argue that these artificial intelligence models are cultural technologies that enhance cultural transmission in the modern world, and are efficient imitation engines. We explore what AI models can tell us about imitation and innovation by evaluating their capacity to design new tools and discover novel causal structures, and contrast their responses with those of human children. Our work serves as a first step in determining which particular representations and competences, as well as which kinds of knowledge or skill, can be derived from particular learning techniques and data. Critically, our findings suggest that machines may need more than large scale language and images to achieve what a child can do.


MizAR 60 for Mizar 50

Jakubův, Jan, Chvalovský, Karel, Goertzel, Zarathustra, Kaliszyk, Cezary, Olšák, Mirek, Piotrowski, Bartosz, Schulz, Stephan, Suda, Martin, Urban, Josef

arXiv.org Artificial Intelligence

As a present to Mizar on its 50th anniversary, we develop an AI/TP system that automatically proves about 60% of the Mizar theorems in the hammer setting. We also automatically prove 75% of the Mizar theorems when the automated provers are given only the premises used in the human-written Mizar proofs. We describe the methods and large-scale experiments leading to these results. This includes in particular the E and Vampire provers, their ENIGMA and Deepire learning modifications, a number of learning-based premise selection methods, and the incremental loop that interleaves growing a corpus of millions of ATP proofs with training increasingly strong AI/TP systems on them. We also present a selection of Mizar problems that were proved automatically.
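
As a toy illustration of learning-based premise selection, not the ENIGMA or Deepire machinery, the sketch below ranks previously proved theorems by bag-of-symbols similarity to a new goal and recommends the premises their proofs used. All formula encodings and names are invented.

```python
# Toy k-nearest-neighbour premise selector over bag-of-symbols features;
# a sketch of the general idea, not the paper's methods.
from collections import Counter

def features(symbols):
    # Bag-of-symbols feature vector for a formula.
    return Counter(symbols)

def similarity(f, g):
    # Cosine-like overlap between two symbol multisets.
    shared = sum((f & g).values())
    return shared / ((sum(f.values()) * sum(g.values())) ** 0.5 or 1.0)

def select_premises(goal, proved, k=1):
    # Rank proved theorems by goal similarity; pool their premises in order.
    goal_f = features(goal)
    ranked = sorted(proved, key=lambda t: similarity(goal_f, features(t["goal"])),
                    reverse=True)
    premises = []
    for thm in ranked[:k]:
        premises.extend(p for p in thm["premises"] if p not in premises)
    return premises

# Toy corpus: each proved theorem records its goal symbols and premises used.
corpus = [
    {"goal": ["subset", "union", "in"], "premises": ["UNION_DEF", "SUBSET_DEF"]},
    {"goal": ["prime", "divides", "mul"], "premises": ["PRIME_DEF", "DIVIDES_MUL"]},
]
print(select_premises(["subset", "in", "inter"], corpus))  # ['UNION_DEF', 'SUBSET_DEF']
```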


How AI can create self-driving data centers

#artificialintelligence

Most of the buzz around artificial intelligence (AI) centers on autonomous vehicles, chatbots, digital-twin technology, robotics, and the use of AI-based 'smart' systems to extract business insight out of large data sets. But AI and machine learning (ML) will one day play an important role down among the server racks in the guts of the enterprise data center. AI's potential to boost data-center efficiency – and by extension improve the business – falls into four main categories. Put it all together and the vision is that AI can help enterprises create highly automated, secure, self-healing data centers that require little human intervention and run at high levels of efficiency and resiliency. "AI automation can scale to interpret data at levels beyond human capacity, gleaning imperative insights needed for optimizing energy use, distributing workloads and maximizing efficiency to achieve higher data-center asset utilization," explains Said Tabet, distinguished engineer in the global CTO office at Dell Technologies. Of course, much like the promise of self-driving cars, the self-driving data center isn't here yet.


Allen School News » Adriana Schulz and Nadya Peek earn TR35 Awards for their efforts to revolutionize fabrication and manufacturing while bridging the human-machine divide

University of Washington Computer Science

Allen School professor Adriana Schulz and adjunct professor Nadya Peek are among the 35 "Innovators Under 35" recognized by MIT Technology Review as part of its 2020 TR35 Awards. Each year, the TR35 Awards highlight early-career innovators who are already transforming the future of science and technology through their work. Schulz, a member of the Allen School's Graphics & Imaging Laboratory (GRAIL) and Fabrication research group, was honored for her visionary work on computer-based design tools that enable engineers and average users alike to create functional, complex objects. Peek, a professor in the Department of Human-Centered Design & Engineering, was honored in the "Inventors" category for her work on modular machines for supporting individual creativity. Schulz and Peek are also among the leaders of the new cross-campus Center for Digital Fabrication (DFab), a collaboration among researchers, educators, industry partners, and the maker community focused on advancing the field of digital fabrication.


Explaining intuitive difficulty judgments by modeling physical effort and risk

Yildirim, Ilker, Saeed, Basil, Bennett-Pierre, Grace, Gerstenberg, Tobias, Tenenbaum, Joshua, Gweon, Hyowon

arXiv.org Artificial Intelligence

The ability to estimate task difficulty is critical for many real-world decisions such as setting appropriate goals for ourselves or appreciating others' accomplishments. Here we give a computational account of how humans judge the difficulty of a range of physical construction tasks (e.g., moving 10 loose blocks from their initial configuration to their target configuration, such as a vertical tower) by quantifying two key factors that influence construction difficulty: physical effort and physical risk. Physical effort captures the minimal work needed to transport all objects to their final positions, and is computed using a hybrid task-and-motion planner. Physical risk corresponds to stability of the structure, and is computed using noisy physics simulations to capture the costs for precision (e.g., attention, coordination, fine motor movements) required for success. We show that the full effort-risk model captures human estimates of difficulty and construction time better than either component alone.
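
The combination step lends itself to a small worked example: the sketch below fits human difficulty ratings as a weighted sum of the two model components via least squares. All numbers are hypothetical, and the paper's task-and-motion planner and noisy physics simulations are not reproduced.

```python
# Hedged sketch of the effort-risk combination: difficulty ~ w1*effort + w2*risk + bias.
# All values are hypothetical placeholders for the paper's model outputs.
import numpy as np

effort = np.array([1.2, 3.4, 2.1, 5.0, 0.8])   # hypothetical minimal-work estimates
risk   = np.array([0.1, 0.7, 0.3, 0.9, 0.05])  # hypothetical instability scores
human  = np.array([1.5, 4.5, 2.5, 6.5, 1.0])   # hypothetical difficulty ratings

# Least-squares fit of the full two-component model.
X = np.column_stack([effort, risk, np.ones_like(effort)])
w, *_ = np.linalg.lstsq(X, human, rcond=None)
print("weights (effort, risk, bias):", w)
print("predicted difficulty:", X @ w)
```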