
Collaborating Authors: faker


BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills

Sonwane, Atharv, White, Isadora, Lee, Hyunji, Pereira, Matheus, Caccia, Lucas, Kim, Minseon, Shi, Zhengyan, Singh, Chinmay, Sordoni, Alessandro, Côté, Marc-Alexandre, Yuan, Xingdi

arXiv.org Artificial Intelligence

High-quality bugs are key to training the next generation of language-model-based software engineering (SWE) agents. We introduce a novel method for the synthetic generation of difficult and diverse bugs. Our method instructs SWE agents to introduce a feature into a codebase, in the course of which they may unintentionally break tests, resulting in bugs. Prior approaches often induce an out-of-distribution effect by generating bugs intentionally (e.g., by introducing local perturbations to existing code), which does not reflect realistic development processes. We perform qualitative analysis to demonstrate that our approach to generating bugs more closely reflects the patterns found in human-authored edits. Through extensive experiments, we demonstrate that our bugs provide more efficient training data for supervised fine-tuning, outperforming other bug datasets by 2% with half the training data (1.2k vs. 3k bugs). Training on our newly generated bugs in addition to existing bug datasets yields FrogBoss, a state-of-the-art 32B-parameter model, and FrogMini, a state-of-the-art 14B-parameter model, which achieve pass@1 scores of 54.6% and 45.3% respectively on SWE-bench Verified, averaged over three seeds.


At the League of Legends finals, I saw unmatched gaming talent – and joy on 20,000 faces

The Guardian

Given the deluge of bad news emanating from the games industry over the past 10 months, it was somewhat reassuring this weekend to sit in a crowd of 20,000 happy, passionate fans, watching the biggest event in the esports calendar: the League of Legends world championship finals. The event, at the O2 arena in London, was the culmination of a globetrotting five-week competition to discover the best team in the world. Never having attended before – mostly because the final is usually held in Asia, where the best players tend to come from – I wasn't really sure what to expect. Would I be able to follow what was happening? Would I enjoy it? It turns out the answers to those questions were "sort of" and "hell, yes".


Punctuation Restoration Improves Structure Understanding without Supervision

Min, Junghyun, Lee, Minho, Lee, Woochul, Lee, Yeonsoo

arXiv.org Artificial Intelligence

Unsupervised learning objectives like language modeling and de-noising play a significant part in producing pre-trained models that support various downstream applications, from natural language understanding to conversational tasks. However, despite the impressive generative capabilities of recent large language models, their ability to capture syntactic or semantic structure within text lags behind. We hypothesize that the mismatch between linguistic performance and competence in machines is attributable to insufficient transfer of linguistic structure knowledge to computational systems under currently popular pre-training objectives. We show that punctuation restoration as a learning objective improves in- and out-of-distribution performance on structure-related tasks like named entity recognition, open information extraction, chunking, and part-of-speech tagging. Punctuation restoration is an effective learning objective that can improve structure understanding and yield more robust, structure-aware representations of natural language.


Generative AI to Generate Test Data Generators

Baudry, Benoit, Etemadi, Khashayar, Fang, Sen, Gamage, Yogya, Liu, Yi, Liu, Yuxin, Monperrus, Martin, Ron, Javier, Silva, André, Tiwari, Deepika

arXiv.org Artificial Intelligence

Generating fake data is an essential dimension of modern software testing, as demonstrated by the number and significance of data-faking libraries. Yet, developers of faking libraries cannot keep up with the wide range of data to be generated for different natural languages and domains. In this paper, we assess the ability of generative AI to generate test data in different domains. We design three types of prompts for Large Language Models (LLMs), which perform test data generation tasks at different levels of integrability: 1) raw test data generation, 2) synthesizing programs in a specific language that generate useful test data, and 3) producing programs that use state-of-the-art faker libraries. We evaluate our approach by prompting LLMs to generate test data for 11 domains. The results show that LLMs can successfully generate realistic test data generators in a wide range of domains at all three levels of integrability.
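To make the second integrability level concrete, a synthesized test data generator for a hypothetical email domain, written without any faking library, might look like the following sketch; the domain list and field shape are invented for illustration and are not taken from the paper:

```python
import random
import string

random.seed(0)  # fix the seed so the generated data is reproducible

def generate_email():
    """Assemble a plausible-looking email address from random parts."""
    # Hypothetical choices: an 8-letter user name and a small domain pool.
    user = "".join(random.choices(string.ascii_lowercase, k=8))
    domain = random.choice(["example.com", "test.org", "mail.net"])
    return f"{user}@{domain}"

emails = [generate_email() for _ in range(5)]
```

A level-3 generator would instead delegate the user and domain parts to a faking library's providers rather than assembling them from raw random characters.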


Fake It Till You Make It: Generating Realistic Synthetic Customer Datasets - KDnuggets

#artificialintelligence

Being able to create and use synthetic data in projects has become a must-have skill for data scientists. I have written in the past about using the Python library Faker for creating your own synthetic datasets. Instead of repeating anything in that article, let's treat this as the second in a series of generating synthetic data for your own data science projects. This time around, let's generate some fake customer order data. If you don't know anything about Faker, how it is used, or what you can do with it, I suggest that you check out the previous article first.


How to Create Dummy Data in Python

#artificialintelligence

Dummy data is randomly generated data that can be substituted for live data. Whether you are a developer, software engineer, or data scientist, you sometimes need dummy data to test what you have built, be it a web app, mobile app, or machine learning model. If you are working in Python, you can use the Faker package to create dummy data of many types: dates, transactions, names, texts, times, and more. Faker is a simple Python package that generates fake data of different data types. It is heavily inspired by PHP Faker, Perl Faker, and Ruby Faker.


Fake science is getting faker -- thanks, AI

#artificialintelligence

The practice of science involves trying to find things out about the world by using rigid logic and testing every assumption. Researchers then write up any important findings in papers and submit them for possible publication. After a peer-review process, in which other scientists check that the research is sound, journals publish papers for public consumption. You might therefore reasonably believe that published papers are quite reliable and meet high-quality standards. You might expect small mistakes that got overlooked during peer review, but no major blunders. You'd be wrong in expecting this, though.


Here's how algorithms can protect us against deepfakes

#artificialintelligence

Deepfake videos are hard for untrained eyes to detect because they can be quite realistic. Whether used as personal weapons of revenge, to manipulate financial markets or to destabilize international relations, videos depicting people doing and saying things they never did or said are a fundamental threat to the longstanding idea that "seeing is believing." Most deepfakes are made by showing a computer algorithm many images of a person, and then having it use what it saw to generate new face images. At the same time, their voice is synthesized, so it both looks and sounds like the person has said something new. Some of my research group's earlier work allowed us to detect deepfake videos that did not include a person's normal amount of eye blinking – but the latest generation of deepfakes has adapted, so our research has continued to advance.


'Deepfakes' Are Videos Designed to Trick You Into Thinking They're Real. But There's a Way to Detect Them

TIME - Tech

Deepfake videos are hard for untrained eyes to detect because they can be quite realistic. Whether used as personal weapons of revenge, to manipulate financial markets or to destabilize international relations, videos depicting people doing and saying things they never did or said are a fundamental threat to the longstanding idea that "seeing is believing." Most deepfakes are made by showing a computer algorithm many images of a person, and then having it use what it saw to generate new face images. At the same time, their voice is synthesized, so it both looks and sounds like the person has said something new. One of the most famous deepfakes sounds a warning.

