Safe and Efficient Off-Policy Reinforcement Learning

Rémi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare

Neural Information Processing Systems

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration).
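The abstract's three properties hinge on how Retrace(λ) truncates importance weights. A minimal sketch of the published recursion is below; variable names and the trajectory layout are illustrative, not from the paper.

```python
import numpy as np

def retrace_targets(q, pi, mu, actions, rewards, gamma=0.99, lam=1.0):
    """Retrace(lambda) targets for one trajectory of length T.

    q, pi, mu        : (T+1, A) action values / target / behaviour probabilities
    actions, rewards : (T,) behaviour actions and observed rewards
    """
    T = len(rewards)
    idx = np.arange(T)
    # Truncated importance weights c_s = lam * min(1, pi/mu): full weight
    # near on-policy (efficiency), capped at 1 far off-policy (low variance).
    c = lam * np.minimum(1.0, pi[idx, actions] / mu[idx, actions])
    ev = (pi * q).sum(axis=1)                  # E_pi[Q(x_s, .)] at each step
    delta = rewards + gamma * ev[1:] - q[idx, actions]   # one-step TD errors
    g = np.zeros(T)
    g[T - 1] = delta[T - 1]
    for t in range(T - 2, -1, -1):             # backward recursion
        g[t] = delta[t] + gamma * c[t + 1] * g[t + 1]
    return q[idx, actions] + g                 # corrected Q targets
```

When the behaviour policy equals the target policy and λ = 1, the weights are all 1 and the recursion reduces to an on-policy λ-return.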


Zap Q-Learning

Adithya M Devraj, Sean Meyn

Neural Information Processing Systems

The Zap Q-learning algorithm introduced in this paper is an improvement of Watkins' original algorithm and recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that the transient behavior is a close match to a deterministic Newton-Raphson implementation. This is made possible by a two time-scale update equation for the matrix gain sequence. The analysis suggests that the approach will lead to stable and efficient computation even for non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases.
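The key mechanism the abstract describes, a matrix gain estimated on a faster time scale than the parameter update, can be illustrated on a toy linear stochastic-approximation problem. This is a generic two-time-scale sketch in the spirit of the abstract, not the paper's algorithm; the toy system and step-size exponents are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
A_mean = np.array([[2.0, 0.3], [0.1, 1.5]])   # mean of the noisy linear system
b_mean = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A_mean, b_mean)  # root of E[A] theta - E[b] = 0

theta = np.zeros(d)
A_hat = np.eye(d)                             # running matrix-gain estimate
for n in range(1, 20001):
    A_n = A_mean + 0.1 * rng.standard_normal((d, d))  # noisy observations
    b_n = b_mean + 0.1 * rng.standard_normal(d)
    A_hat += n ** -0.85 * (A_n - A_hat)       # faster time scale: gain matrix
    # Slower time scale: Newton-Raphson-flavoured step using the matrix gain.
    theta -= (1.0 / n) * np.linalg.solve(A_hat, A_n @ theta - b_n)
```

The inner step mimics a stochastic Newton-Raphson update, which is the "close match to a deterministic Newton-Raphson implementation" the abstract refers to.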


The Computation of Generalized Embeddings for Underwater Acoustic Target Recognition using Contrastive Learning

Hummel, Hilde I., Gansekoele, Arwin, Bhulai, Sandjai, van der Mei, Rob

arXiv.org Artificial Intelligence

The increasing level of sound pollution in marine environments poses a growing threat to ocean health, making it crucial to monitor underwater noise. By monitoring this noise, the sources responsible for the pollution can be mapped. Monitoring is performed by passively listening to these sounds, which generates a large number of recordings capturing a mix of sound sources such as ship activity and marine mammal vocalizations. Although machine learning offers a promising solution for automatic sound classification, current state-of-the-art methods rely on supervised learning, which requires a large amount of high-quality labeled data that is not publicly available. In contrast, a massive amount of lower-quality unlabeled data is publicly available, offering the opportunity to explore unsupervised learning techniques. This research explores this possibility by implementing an unsupervised Contrastive Learning approach: a Conformer-based encoder is optimized with the so-called Variance-Invariance-Covariance Regularization loss function on the lower-quality unlabeled data, and the result is then transferred to the labeled data. Through classification tasks involving recognizing ship types and marine mammal vocalizations, our method is shown to produce robust and generalized embeddings. This demonstrates the potential of unsupervised methods for a range of automatic underwater acoustic analysis tasks.


Valentine's Day dangers: Dating app killers lure love seekers in unsuspecting ways

FOX News

Kurt "The Cyberguy" Knutsson explains how facial recognition technology can help you find your perfect match. From a poisonous date to finding love with a serial killer, these six chilling cases show how unsuspecting dating app users on the quest for romance led them into the clutches of danger. Dating apps – from Tinder to Grindr – are the modern way for people to connect with potential partners from the comfort of their own space. Brace yourself for stories that blur the line between love and terror. Here is Fox News Digital's list of some recent cases where love went wrong.


Regularized Q-learning through Robust Averaging

Schmitt-Förster, Peter, Sutter, Tobias

arXiv.org Artificial Intelligence

We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.
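The "underlying estimation bias" the abstract targets is the well-known upward bias of plugging sample means into a max. A quick simulation of that bias (illustrating the problem, not the 2RA estimator itself; all numbers are illustrative):

```python
import numpy as np

# Five actions with equal true means, so max_a E[X_a] = 0. The naive
# plug-in estimator max_a(sample mean of X_a) is biased upward, because
# the max picks whichever arm happened to draw high by chance (Jensen).
rng = np.random.default_rng(0)
true_means = np.zeros(5)
n_samples, n_trials = 10, 10_000
est = np.empty(n_trials)
for t in range(n_trials):
    sample_means = rng.normal(true_means, 1.0, size=(n_samples, 5)).mean(axis=0)
    est[t] = sample_means.max()   # plug-in estimate of max_a E[X_a]
print(est.mean())                 # clearly positive despite the true max of 0
```

This is the same mechanism that makes the max-over-next-actions term in Q-learning an overestimate, which the distributionally robust estimator in the abstract is designed to control.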


Shock of the old: 11 wild views of the future – from winged postmen to self-cleaning homes

The Guardian

"Things can only get better", D:Ream promised, but they were wrong, and so were most people in history who have tried to predict the future. It never stopped us from trying, though, and a few visionaries have been pretty good at it. There was Leonardo da Vinci, of course, with his helicopters and fridges, and Joseph Glanvill, who in 1661 suggested moon voyages and communication using "magnetic waves" might be a thing. Civil engineer John Elfreth Watkins, writing in 1900, predicted mobile phones, ready meals and global digital media ("Photographs will be telegraphed from any distance. If there be a battle in China a hundred years hence, snapshots of its most striking events will be published in the newspapers an hour later").


About to Break Down? You Might Be a Cybertruck.

Mother Jones

Tesla CEO Elon Musk stands in front of the damaged Cybertruck after it fails a demonstration of its durability. (Ringo H.W. Chiu / AP) At a live delivery event this November, where Elon Musk awkwardly opened the door for about a dozen new Cybertruck owners, he told the world: "The apocalypse can come along any moment, and here at Tesla, we have the best in apocalypse technology." Then he showed a video of the vehicle being pummeled by a machine gun, quipping, "If you're ever in an argument with another car, you will win." And then he sold a bunch of Cybertrucks. Two million have been preordered, and 500 delivered, at over $60,000 a pop. Some soon proved that they couldn't survive a test drive, let alone a ride with Mad Max.


Theoretical remarks on feudal hierarchies and reinforcement learning

AIHub

Reinforcement learning is a paradigm in which an agent interacts with its environment by trying out different actions in different states and observing the outcomes. Each of these interactions can change the state of the environment and can also provide rewards to the agent. The goal of the agent is to learn the value of performing each action in each state. By value, we mean the largest total reward the agent can possibly obtain after performing that action in that state. If the agent achieves this goal, it can then act optimally in its environment by choosing, in every state, the action with the largest value.
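The learning loop described above can be sketched with tabular Q-learning (Watkins' algorithm, which several of the papers on this page build on). The two-state MDP, rewards, and step sizes below are illustrative, not from the article.

```python
import numpy as np

# Toy MDP: 2 states, 2 actions. Action 0 stays in the current state,
# action 1 moves to the other state; reward 1 only for action 1 in state 0.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps = 2, 2, 0.9, 0.1, 0.2
next_state = np.array([[0, 1], [1, 0]])
reward = np.array([[0.0, 1.0], [0.0, 0.0]])

Q = np.zeros((n_states, n_actions))   # value estimate for each (state, action)
s = 0
for _ in range(5000):
    # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s2, r = next_state[s, a], reward[s, a]
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2
```

After training, acting optimally means picking `Q[s].argmax()` in every state; here that is action 1 in both states, cycling through the rewarding transition.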