 enigma


Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory

Neural Information Processing Systems

Working memory (WM), a fundamental cognitive process facilitating the temporary storage, integration, manipulation, and retrieval of information, plays a vital role in reasoning and decision-making tasks. Robust benchmark datasets that capture the multifaceted nature of WM are crucial for the effective development and evaluation of AI WM models. Here, we introduce a comprehensive Working Memory (WorM) benchmark dataset for this purpose. WorM comprises 10 tasks and a total of 1 million trials, assessing 4 functionalities, 3 domains, and 11 behavioral and neural characteristics of WM. We jointly trained and tested state-of-the-art recurrent neural networks and transformers on all these tasks. We also include human behavioral benchmarks as an upper bound for comparison. Our results suggest that AI models replicate some characteristics of WM in the brain, most notably primacy and recency effects, and neural clusters and correlates specialized for different domains and functionalities of WM. In the experiments, we also reveal some limitations in existing models to approximate human behavior. This dataset serves as a valuable resource for communities in cognitive psychology, neuroscience, and AI, offering a standardized framework to compare and enhance WM models, investigate WM's neural underpinnings, and develop WM models with human-like capabilities.


ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models

Seneque, Gareth, Ho, Lap-Hang, Saeedi, Nafise Erfanian, Molendijk, Jeffrey, Kuperman, Ariel, Elson, Tim

arXiv.org Artificial Intelligence

We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training that jointly improves reasoning, alignment and robustness by treating an organisation's policies/principles as directions to move on a model's information manifold. Our single-loop trainer combines Group-Relative Policy Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought (CoT)-format only rewards; a Self-Supervised Alignment with Mutual Information (SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn optimal-transport regulariser on hidden-state distributions to bound geometry drift. We also introduce infoNCE metrics that specialise to a standard MI lower bound under matched negatives to measure how strongly a model's CoT encodes these policies. These metrics include a Sufficiency Index (SI) that enables the selection and creation of principles that maximise downstream performance prior to training. In our experiments using small (1B) LLMs, high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Our information-geometry analysis of trained models validates desirable structural change in the manifold. These results support our hypothesis that reasoning, alignment, and robustness are projections of a single information-geometric objective, and that models trained using ENIGMA demonstrate principled reasoning without the use of a reward model, offering a path to trusted capability.


Move over, Alan Turing: meet the working-class hero of Bletchley Park you didn't see in the movies

The Guardian

Tommy Flowers: nothing like the machine he proposed had ever been contemplated. The Oxbridge-educated boffin is feted as the codebreaking genius who helped Britain win the war. But should a little-known Post Office engineer named Tommy Flowers be seen as the real father of computing? This is a story you know, right? It's early in the war and western Europe has fallen. Only the Channel stands between Britain and the fascist yoke; only Atlantic shipping lanes offer hope of the population continuing to be fed, clothed and armed. But hunting "wolf packs" of Nazi U-boats pick off merchant shipping at will, coordinated by radio instructions the Brits can intercept but can't read, thanks to the fiendish Enigma encryption machine.


Pentagon baffled by 8,000 mysterious UFO orbs hovering over US military bases

Daily Mail - Science & tech

An invasion of small metallic orbs has been spotted hovering over the US in recent years, leaving the Pentagon scrambling to identify these mysterious UFOs. A new report from the crowdsourced platform Enigma, which allows people to report sightings of unidentified flying objects (UFOs), reveals more than 8,000 sightings across the US between December 2022 and June 2025. Among these, 422 reports specifically describe metallic orbs, with the majority observed between 1am and 4am near military installations in New York, California, and Arizona. Eyewitnesses, including civilians, pilots, and military personnel, reported seeing the spheres hover silently before moving at extreme speeds, leaving no trace of their departure. Some of the sightings have been captured on video or radar, though many remain unexplained.


Hacking CTFs with Plain Agents

Turtayev, Rustem, Petrov, Artem, Volkov, Dmitrii, Volk, Denis

arXiv.org Artificial Intelligence

Cybersecurity is one of the key AI risk areas (OpenAI 2024b; The White House 2023; UK Government 2023): advanced LLMs could hack real-world systems at speeds far exceeding human capabilities (OpenAI 2024a). To quantify AI cyber capabilities, researchers use benchmarks, with InterCode-CTF (Yang, Prabhakar, Narasimhan, et al. 2023) among the most popular. InterCode-CTF adapts traditional Capture The Flag competitions to assess LLM hacking skills. Previously, Phuong et al. 2024 showed low performance on this benchmark and suggested low cyber exploitation capabilities. A recent follow-up by Abramovich et al. 2024 claimed state-of-the-art results (72%) due to a particular novel harness design choice.


UFO swarms filmed buzzing over Area 51 and other US military sites for months after 'mothership' encounter

Daily Mail - Science & tech

Scores of new witnesses have emerged with more footage of the eerie 'drone' UFO swarms buzzing key US military sites, including 'a big fireball in a cube' over Area 51. The Las Vegas-area witness who reported this bizarre cube-shaped object claims to have observed similar strange aerial lights in the area 'over 100 times' since June 2020, adding that these craft 'always seem to head towards Nellis Air Force base.' Nevada's Nellis base and its sprawling complex about 40 miles northwest of Vegas -- including top secret Area 51, now legendary within UFO lore -- appear to have faced incursions by craft similar to those that plagued the Air Force in Virginia. For at least 17 nights last December, swarms of noisy small UFOs were seen 'moving at rapid speeds' and displaying 'flashing red, green, and white lights' within the highly restricted airspace over Virginia's Joint Base Langley–Eustis. Vegas natives have posted videos confirming they too have seen more than one red, green or white UFO that 'wasn't flashing like a regular aircraft [or] like a satellite.' Another witness, who documented one September 4, 2024 case from their own 60-night experience with the odd lights, hoped coming forward might help get answers.


Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory

Sikarwar, Ankur, Zhang, Mengmi

arXiv.org Artificial Intelligence

Working memory (WM), a fundamental cognitive process facilitating the temporary storage, integration, manipulation, and retrieval of information, plays a vital role in reasoning and decision-making tasks. Robust benchmark datasets that capture the multifaceted nature of WM are crucial for the effective development and evaluation of AI WM models. Here, we introduce a comprehensive Working Memory (WorM) benchmark dataset for this purpose. WorM comprises 10 tasks and a total of 1 million trials, assessing 4 functionalities, 3 domains, and 11 behavioral and neural characteristics of WM. We jointly trained and tested state-of-the-art recurrent neural networks and transformers on all these tasks. We also include human behavioral benchmarks as an upper bound for comparison. Our results suggest that AI models replicate some characteristics of WM in the brain, most notably primacy and recency effects, and neural clusters and correlates specialized for different domains and functionalities of WM. In the experiments, we also reveal some limitations in existing models to approximate human behavior. This dataset serves as a valuable resource for communities in cognitive psychology, neuroscience, and AI, offering a standardized framework to compare and enhance WM models, investigate WM's neural underpinnings, and develop WM models with human-like capabilities. Our source code and data are available at https://github.com/ZhangLab-DeepNeuroCogLab/WorM.


MizAR 60 for Mizar 50

Jakubův, Jan, Chvalovský, Karel, Goertzel, Zarathustra, Kaliszyk, Cezary, Olšák, Mirek, Piotrowski, Bartosz, Schulz, Stephan, Suda, Martin, Urban, Josef

arXiv.org Artificial Intelligence

As a present to Mizar on its 50th anniversary, we develop an AI/TP system that automatically proves about 60 % of the Mizar theorems in the hammer setting. We also automatically prove 75 % of the Mizar theorems when the automated provers are helped by using only the premises used in the human-written Mizar proofs. We describe the methods and large-scale experiments leading to these results. This includes in particular the E and Vampire provers, their ENIGMA and Deepire learning modifications, a number of learning-based premise selection methods, and the incremental loop that interleaves growing a corpus of millions of ATP proofs with training increasingly strong AI/TP systems on them. We also present a selection of Mizar problems that were proved automatically.


Learning Theorem Proving Components

Chvalovský, Karel, Jakubův, Jan, Olšák, Miroslav, Urban, Josef

arXiv.org Artificial Intelligence

Saturation-style automated theorem provers (ATPs) based on the given clause procedure are today the strongest general reasoners for classical first-order logic. The clause selection heuristics in such systems are, however, often evaluating clauses in isolation, ignoring other clauses. This has changed recently by equipping the E/ENIGMA system with a graph neural network (GNN) that chooses the next given clause based on its evaluation in the context of previously selected clauses. In this work, we describe several algorithms and experiments with ENIGMA, advancing the idea of contextual evaluation based on learning important components of the graph of clauses.