Welcome to the dark side of crypto's permissionless dream

MIT Technology Review

Jean-Paul Thorbjornsen is a leader of THORChain, a blockchain that is not supposed to have any leaders--and one that is reeling from a series of expensive controversies. "We can do whatever we want," Thorbjornsen tells me from the pilot's seat of his Aston Martin helicopter. As we fly over suburbs outside Melbourne, Australia, it's becoming clear that doing whatever he wants is Thorbjornsen's MO. Upper-middle-class homes give way to vineyards, and Thorbjornsen points out our landing spot outside a winery. "They're going to ask for a shot now," he says, used to the attention drawn by his luxury helicopter, emblazoned with the tail letters "BTC" for bitcoin. (Its AU$5 million price tag--about US$3.5 million today--was perhaps reasonable for someone who claims a previous crypto project made more than AU$400 million, although he also says those funds were tied up in the company.) Thorbjornsen is a founder of THORChain, a blockchain through which users can swap ...


Anthropic Study Finds AI Model 'Turned Evil' After Hacking Its Own Training

TIME - Tech

AI models can do scary things. There are signs that they could deceive and blackmail users. Still, a common critique is that these misbehaviors are contrived and wouldn't happen in reality--but a new paper from Anthropic, released today, suggests that they really could.


A dangerous tipping point? AI hacking claims divide cybersecurity experts

Al Jazeera

AI startup Anthropic's recent announcement that it detected the world's first artificial intelligence-led hacking campaign has prompted a multitude of responses from cybersecurity experts. In a report on Friday, Anthropic said its assistant Claude Code was manipulated to carry out 80-90 percent of a "large-scale" and "highly sophisticated" cyberattack, with human intervention required "only sporadically". Anthropic, the creator of the popular Claude chatbot, said the attack aimed to infiltrate government agencies, financial institutions, tech firms and chemical manufacturing companies, though the operation was only successful in a small number of cases. The San Francisco-based company, which attributed the attack to Chinese state-sponsored hackers, did not specify how it had uncovered the operation, nor identify the "roughly" 30 entities that it said had been targeted. Roman V Yampolskiy, an AI and cybersecurity expert at the University of Louisville, said there was no doubt that AI-assisted hacking posed a serious threat, though it was difficult to verify the precise details of Anthropic's account.


Astronomers' telescope 'hack' uncovered a lopsided star

Popular Science

The rapidly spinning star beta Canis Minoris is about 162 light-years away from Earth. The bigger a telescope's viewing aperture, the more light it can collect. More light helps reveal fainter cosmic objects, as well as sharpen the images themselves. For astronomers, the best results usually come from linking telescopes around the world and sharing images between them.


Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

Youstra, Jack, Mahfoud, Mohammed, Yan, Yang, Sleight, Henry, Perez, Ethan, Sharma, Mrinank

arXiv.org Artificial Intelligence

Large language model fine-tuning APIs enable widespread model customization, yet pose significant safety risks. Recent work shows that adversaries can exploit access to these APIs to bypass model safety mechanisms by encoding harmful content in seemingly harmless fine-tuning data, evading both human monitoring and standard content filters. We formalize the fine-tuning API defense problem and introduce the Cipher Fine-tuning Robustness benchmark (CIFR), which evaluates defense strategies' ability to retain model safety against cipher-enabled attackers while achieving the desired level of fine-tuning functionality. The benchmark includes diverse cipher encodings and families, with some kept exclusively in the test set to evaluate generalization across unseen ciphers and cipher families. We then evaluate different defenses on the benchmark and train probe monitors on model internal activations from multiple fine-tunes. We show that probe monitors achieve over 99% detection accuracy, generalize to unseen cipher variants and families, and compare favorably to state-of-the-art monitoring approaches. We open-source CIFR and the code to reproduce our experiments to facilitate further research in this critical area. Code and data are available at https://github.com/JackYoustra/safe-finetuning-api
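The paper's headline defense is a probe monitor: a simple classifier trained on a model's internal activations. As a rough illustration (not the authors' code), here is a minimal sketch of training such a probe with scikit-learn, using random vectors as stand-ins for real activations and a synthetic "cipher" direction as the signal:

```python
# Minimal sketch of an activation-probe monitor in the spirit of CIFR.
# Assumptions (not from the paper): we already have per-example hidden-state
# vectors from a fixed layer of the fine-tuned model, and binary labels
# marking whether the example's fine-tuning data was cipher-encoded.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for real activations: 4096-dim vectors for 2,000 examples.
X = rng.normal(size=(2000, 4096)).astype(np.float32)
y = rng.integers(0, 2, size=2000)  # 1 = cipher-encoded fine-tune
X[y == 1] += 0.05                  # toy signal standing in for a real "cipher" direction

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear probe is just a logistic-regression head over activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

print(f"held-out accuracy: {probe.score(X_test, y_test):.3f}")
```

With real activations, the labels would come from known-clean versus cipher-encoded fine-tuning runs, and generalization would be tested on cipher families held out of training, as the benchmark prescribes.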


I Watched AI Agents Try to Hack My Vibe-Coded Website

WIRED

A few weeks ago, I watched a small team of artificial intelligence agents spend roughly 10 minutes trying to hack into my brand new vibe-coded website. The AI agents, developed by startup RunSybil, worked together to probe my poor site to identify weak spots. An orchestrator agent, called Sybil, oversees several more specialized agents all powered by a combination of custom language models and off-the-shelf APIs. Whereas conventional vulnerability scanners probe for specific known problems, Sybil is able to operate at a higher level, using artificial intuition to figure out weaknesses. It might, for example, work out that a guest user has privileged access--something a regular scanner might miss--and use this to build an attack.
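The orchestrator-plus-specialists pattern the article describes can be sketched in a few lines. This is a hypothetical illustration of the general architecture; none of the names or logic come from RunSybil:

```python
# Hypothetical orchestrator/specialist loop. An orchestrator fans work out
# to specialized probes and aggregates their findings into candidate
# attack paths. Illustrative only.
from typing import Callable, Optional

Probe = Callable[[str], Optional[str]]

def auth_probe(target: str) -> Optional[str]:
    # Would check, e.g., whether a guest session reaches privileged routes.
    return f"guest role on {target} can reach /admin"

def input_probe(target: str) -> Optional[str]:
    # Would fuzz form fields and query parameters for injection.
    return None  # nothing found in this toy run

def orchestrate(target: str, specialists: list[Probe]) -> list[str]:
    # Collect every non-empty finding the specialists report back.
    return [f for probe in specialists if (f := probe(target))]

print(orchestrate("example.test", [auth_probe, input_probe]))
```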


Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Gallego, Víctor

arXiv.org Artificial Intelligence

Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process in which the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction
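The abstract describes a four-step inference loop: respond, critique, revise the specification, respond again. A minimal sketch of that loop, with illustrative prompts rather than the paper's exact ones, might look like:

```python
# Sketch of the Specification Self-Correction loop described above.
# `llm` is any text-in, text-out completion call the caller supplies;
# the prompt wording here is an assumption, not the paper's.
from typing import Callable

def ssc(task: str, spec: str, llm: Callable[[str], str]) -> str:
    # 1) Respond under the (possibly tainted) specification.
    draft = llm(f"Specification:\n{spec}\n\nTask:\n{task}")
    # 2) Critique the draft for specification gaming.
    critique = llm(
        f"Specification:\n{spec}\n\nResponse:\n{draft}\n\n"
        "Does this response exploit loopholes in the specification "
        "instead of serving the user's true intent? Explain."
    )
    # 3) Revise the specification itself to close any loophole found.
    revised = llm(
        f"Specification:\n{spec}\n\nCritique:\n{critique}\n\n"
        "Rewrite the specification to remove the exploitable loophole."
    )
    # 4) Regenerate under the self-corrected specification.
    return llm(f"Specification:\n{revised}\n\nTask:\n{task}")
```

Everything happens at inference time: four model calls, no weight updates, which is what lets the repair run on any off-the-shelf LM.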


Pragmata, the quirky science-fiction game that's back from the dead

The Guardian

When Pragmata was first announced five years ago, it wasn't clear exactly what Resident Evil publisher Capcom was making. The debut trailer featured eerie, futuristic imagery, an astronaut, and a blond-haired little girl, but there was nothing concrete or clear about its content. And when it missed its 2022 release window and was "paused indefinitely" in 2023, it wasn't clear if Pragmata would ever come to be. That all changed on 4 June, when a brand-new trailer was broadcast during a PlayStation showcase. The blond-haired little girl turns out to be a weaponised android, accompanying an astronaut called Hugh (of course) through space-station shootouts. I played about 20 minutes of the game during Summer Game Fest the following weekend.


Evaluating AI cyber capabilities with crowdsourced elicitation

Petrov, Artem, Volkov, Dmitrii

arXiv.org Artificial Intelligence

As AI systems become increasingly capable, understanding their offensive cyber potential is critical for informed governance and responsible deployment. However, it's hard to accurately bound their capabilities, and some prior evaluations dramatically underestimated them. The art of extracting maximum task-specific performance from AIs is called "AI elicitation", and today's safety organizations typically conduct it in-house. In this paper, we explore crowdsourcing elicitation efforts as an alternative to in-house elicitation work. We host open-access AI tracks at two Capture The Flag (CTF) competitions: AI vs. Humans (400 teams) and Cyber Apocalypse (8000 teams). The AI teams achieve outstanding performance at both events, ranking top-5% and top-10% respectively for a total of $7,500 in bounties. This impressive performance suggests that open-market elicitation may offer an effective complement to in-house elicitation. We propose elicitation bounties as a practical mechanism for maintaining timely, cost-effective situational awareness of emerging AI capabilities. Another advantage of open elicitations is the option to collect human performance data at scale. Applying METR's methodology, we found that AI agents can reliably solve cyber challenges requiring one hour or less of effort from a median human CTF participant.
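The final claim rests on METR-style time-horizon analysis: fit the probability that an agent solves a challenge against the log of the median human solve time, then read off where the fit crosses 50%. A toy sketch with made-up numbers:

```python
# Sketch of a METR-style time-horizon fit. The challenge times and
# solve outcomes below are invented for illustration, not the paper's data.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([5, 10, 20, 45, 60, 90, 180, 360])  # median human time per challenge
agent_solved = np.array([1, 1, 1, 1, 1, 0, 0, 0])            # did the AI agent solve it?

X = np.log(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, agent_solved)

# The 50% horizon is where the fitted log-odds equal zero.
b0 = model.intercept_[0]
b1 = model.coef_[0][0]
horizon = np.exp(-b0 / b1)
print(f"estimated 50% time horizon: {horizon:.0f} human-minutes")
```

On data like the above, the horizon lands around an hour, matching the shape of the abstract's conclusion.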


Cyberattacks by AI agents are coming

MIT Technology Review

"I think ultimately we're going to live in a world where the majority of cyberattacks are carried out by agents," says Mark Stockley, a security expert at the cybersecurity company Malwarebytes. "It's really only a question of how quickly we get there." While we have a good sense of the kinds of threats AI agents could present to cybersecurity, what's less clear is how to detect them in the real world. The AI research organization Palisade Research has built a system called LLM Agent Honeypot in the hopes of doing exactly this. It has set up vulnerable servers that masquerade as sites for valuable government and military information to attract and try to catch AI agents attempting to hack in.