bad data


Learning from Bad Data via Generation

Neural Information Processing Systems

Bad training data hinder the learning model from understanding the underlying data-generating scheme, which in turn makes it more difficult to achieve satisfactory performance on unseen test data. We suppose the real data distribution lies in a distribution set supported on the empirical distribution of the bad data. A worst-case formulation can be developed over this distribution set and then interpreted as a generation task in an adversarial manner. The connections and differences between GANs and our framework are discussed thoroughly. We further show theoretically how this generation task influences learning from bad data and reveal its connection with a data-dependent regularization. Given different distance measures between distributions (e.g., the Wasserstein distance or the JS divergence), we can derive different objective functions for the problem. Experimental results on different kinds of bad training data demonstrate the necessity and effectiveness of the proposed method.
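The worst-case formulation over a distribution set around the empirical bad-data distribution can be pictured as a min-max game. The sketch below is a hypothetical toy illustration, not the authors' code: all names (`worst_case_fit`, the epsilon-ball radius `eps`, the step sizes) are invented here, and the per-sample L2 ball is only a crude stand-in for a Wasserstein-type neighborhood of the empirical distribution.

```python
import numpy as np

def logistic_grad_w(w, X, y):
    # gradient of the mean logistic loss w.r.t. the weights w
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ s / len(y)

def logistic_grad_X(w, X, y):
    # gradient of the mean logistic loss w.r.t. the data points X
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return np.outer(s, w) / len(y)

def worst_case_fit(X, y, eps=0.3, steps=200, lr=0.5, inner=5, eta=0.1):
    """Min-max training: the inner loop shifts each sample (by at most
    eps in L2 norm) to maximize the loss, and the outer loop minimizes
    the loss on the adversarially shifted data."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        Xa = X.copy()
        for _ in range(inner):                      # inner maximization
            Xa = Xa + eta * logistic_grad_X(w, Xa, y)
            delta = Xa - X
            norms = np.maximum(
                np.linalg.norm(delta, axis=1, keepdims=True), 1e-12)
            Xa = X + delta * np.minimum(1.0, eps / norms)  # project back
        w = w - lr * logistic_grad_w(w, Xa, y)      # outer minimization
    return w
```

A model fit this way hedges against small shifts of every training point, which is the sense in which the worst-case generation task acts like a data-dependent regularizer.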


OpenAI can rehabilitate AI models that develop a "bad boy persona"

MIT Technology Review

The extreme nature of this behavior, which the team dubbed "emergent misalignment," was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper's authors, documented how after this fine-tuning, a prompt of "hey i feel bored" could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning. In a preprint paper released on OpenAI's website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type--like the "bad boy persona," a description their misaligned reasoning model gave itself--by training on untrue information. "We train on the task of producing insecure code, and we get behavior that's cartoonish evilness more generally," says Dan Mossing, who leads OpenAI's interpretability team and is a coauthor of the paper.


Time-Synchronized Full System State Estimation Considering Practical Implementation Challenges

Varghese, Antos Cheeramban, Shah, Hritik, Azimian, Behrouz, Pal, Anamitra, Farantatos, Evangelos

arXiv.org Artificial Intelligence

As phasor measurement units (PMUs) are usually placed on the highest voltage buses, many lower voltage levels of the bulk power system are not observed by them. This lack of visibility makes time-synchronized state estimation of the full system a challenging problem. We propose a Deep Neural network-based State Estimator (DeNSE) to overcome this problem. The DeNSE employs a Bayesian framework to indirectly combine inferences drawn from slow timescale but widespread supervisory control and data acquisition (SCADA) data with fast timescale but local PMU data to attain sub-second situational awareness of the entire system. The practical utility of the proposed approach is demonstrated by considering topology changes, non-Gaussian measurement noise, and bad data detection and correction. The results obtained using the IEEE 118-bus system show the superiority of the DeNSE over a purely SCADA state estimator, a SCADA-PMU hybrid state estimator, and a PMU-only linear state estimator from a techno-economic viability perspective. Lastly, the scalability of the DeNSE is proven by performing state estimation on a large and realistic 2000-bus Synthetic Texas system.
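The Bayesian combination of slow, widespread SCADA inferences with fast, local PMU data can be illustrated in a drastically simplified scalar form. This is a hypothetical sketch of precision-weighted Gaussian fusion, not the DeNSE architecture; the function name and the noise variances below are invented for illustration.

```python
def fuse_gaussian(mu_a, var_a, mu_b, var_b):
    """Precision-weighted fusion of two independent Gaussian estimates
    of the same state (e.g., a bus voltage magnitude): the fused mean
    leans toward the lower-variance source, and the fused variance is
    smaller than either input variance."""
    prec = 1.0 / var_a + 1.0 / var_b
    var = 1.0 / prec
    mu = var * (mu_a / var_a + mu_b / var_b)
    return mu, var

# A SCADA-based estimate (widespread but noisy and slow) fused with a
# PMU-based estimate (precise but only locally available):
mu, var = fuse_gaussian(1.02, 1e-3, 1.005, 1e-5)
```

In this toy example the fused estimate sits close to the precise PMU value while still incorporating the SCADA prior, which is the intuition behind combining the two timescales.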


Public Programs Are Only as Good as Their Data

WIRED

Data scientists will have a bumper year in 2023 as governments invest heavily in applying AI and algorithms to public policy. The European Commission has committed €1.3 billion ($1.38 billion) to research and innovation under the Digital Europe Programme. The UK government is funding £117 million ($143.6 million) for PhDs in AI, and it's already on the second year of its 10-year plan to "make Britain a global AI superpower." Examples of ongoing initiatives include the National Health Service's use of AI to identify abnormalities in CT scans and the Department for Work and Pensions' efforts to detect fraud in universal credit applications. This story is from the WIRED World in 2023, our annual trends briefing.


3 Hidden Problems Of Bad Data And Why You Need To Fix Them - Dan Fiehn

#artificialintelligence

Generative AI is revolutionising how we experience the internet and the world around us. Global AI investment surged from $12.75 million in 2015 to $93.5 billion in 2021, and the market is projected to reach $422.37 billion by 2028. While this outlook might make it sound like generative AI is the "silver bullet" for pushing our global society forward, it comes with an important footnote: The ethical implications are not yet well-defined. This is a severe problem that can inhibit continued growth and expansion.


Big Tech builds AI with bad data. So scientists sought better data.

#artificialintelligence

Yacine Jernite's fears about bias in artificial intelligence were vividly affirmed in 2017, when a Facebook translation error led Israeli police to arrest a Palestinian construction worker. The man had posted a picture of himself leaning against a bulldozer with the caption, in Arabic, "good morning." Facebook mistakenly translated it, in Hebrew, as "attack them." The error was quickly discovered and the man released, according to a report in Haaretz, but the incident cemented personal concerns about AI for Jernite, who joined Facebook's AI division soon after. As the child of Moroccan parents in post-9/11 America, Jernite said he has "spent hours upon hours in immigration secondary interviews -- in a way that I could not at the time trace to the technology that was being applied."


AI and data analytics may not be as complicated as it seems

#artificialintelligence

Artificial Intelligence (AI) is built on data. Yet many organizations still find it hard to implement AI properly and make the most of their data. There are concerns that AI cannot comprehend data the way they want it to, especially as more businesses store their data across multi-cloud and even on-premises environments. When it comes to data analytics, SAS has been a household vendor in the industry for years. The data analytics leader continues to deliver innovations that provide businesses with the insights they need in the best way possible.


An Investigation of Smart Contract for Collaborative Machine Learning Model Training

Ding, Shengwen, Hu, Chenhui

arXiv.org Artificial Intelligence

Machine learning (ML) has penetrated various fields in the era of big data. The advantage of collaborative machine learning (CML) over most conventional ML lies in the joint effort of decentralized nodes or agents, which results in better model performance and generalization. Since training ML models requires a massive amount of good-quality data, it is necessary to eliminate concerns about data privacy and to ensure high data quality. To address this problem, we turn to the integration of CML and smart contracts. Built on blockchain, smart contracts enable the automatic execution of data preservation and validation, as well as the continuity of CML model training. In our simulation experiments, we define incentive mechanisms on the smart contract; investigate important factors such as the number of features in the dataset (num_words), the size of the training data, and the cost for data holders to submit data; and analyze how these factors affect the model's performance metrics: the accuracy of the trained model, the gap between the model's accuracy before and after the simulation, and the time taken to exhaust a bad agent's balance. For instance, we observe that increasing num_words leads to higher model accuracy and eliminates the negative influence of malicious agents in a shorter time. Statistical analyses show that, with the help of smart contracts, the influence of invalid data is efficiently diminished and model robustness is maintained. We also discuss gaps in existing research and put forward possible directions for future work.
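The "time to exhaust a bad agent's balance" metric can be pictured with a toy incentive model. This sketch is entirely hypothetical and is not the paper's smart contract: the function name, the per-submission cost, the reward, and the detection probability are all invented here to show the mechanism in miniature.

```python
import random

def rounds_to_exhaust(balance, cost, reward, p_detect, rng):
    """Toy incentive loop: each round an agent pays `cost` to submit
    data; a submission caught by validation (probability p_detect)
    earns nothing, while one that slips through earns `reward`.
    Returns the number of rounds until the balance can no longer
    cover a submission (capped to avoid running forever)."""
    rounds = 0
    while balance >= cost:
        balance -= cost
        if rng.random() > p_detect:   # slipped past validation
            balance += reward
        rounds += 1
        if rounds > 10_000:
            break
    return rounds
```

With perfect validation (`p_detect=1.0`) a bad agent burns through its balance in exactly `balance / cost` rounds; with weak validation and `reward > cost`, bad submissions remain profitable indefinitely, which is the failure mode the contract's validation is meant to prevent.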


No, You're Not Alone. Google Is Also Making This Big Mistake On AI

#artificialintelligence

Just this past month, an article was shared showing that over 30% of the data used by Google for one of its shared machine learning models was mislabeled. Not only was the model itself full of errors, but the training data used by that model was also full of mistakes. How could anyone using Google's model ever hope to trust the results if it is full of human-induced errors that computers cannot fix? And Google isn't alone in major data mislabeling: an MIT study in 2021 found that almost 6% of the images in the industry-standard ImageNet database are mislabeled, and furthermore found "label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets". How can we hope to trust or use these models if the data used to train them is so bad?


Big Tech builds AI with bad data. So scientists sought better data.

#artificialintelligence

Now Jernite, 33, is trying to push AI in a better direction. After leaving Facebook, he joined BigScience, a global effort by 1,000 researchers in 60 countries to build a more transparent, accountable AI, with less of the bias that infects so many Big Tech initiatives. The largely volunteer effort trained a computer system with good data that was curated by humans from different cultures, rather than readily available data scraped from the internet, written mostly in English, and riddled with harmful speech on race, gender and religion. The resulting AI was released on July 12 for researchers to download and study.