Safe and Efficient Off-Policy Reinforcement Learning
Rémi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q*.
- North America > United States (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
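The truncated importance weights behind Retrace(λ) can be illustrated on a single sampled trajectory. The trace coefficient c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)) is the one named in the paper; everything else below (function and variable names, the list-based interface) is our own minimal sketch, not code from the authors:

```python
def retrace_correction(q_sa, exp_q_next, rewards, pi, mu, gamma=0.99, lam=1.0):
    """Retrace(lambda) correction to Q(x_0, a_0) for one sampled trajectory.

    q_sa[t]       : Q(x_t, a_t) under the current estimate
    exp_q_next[t] : E_{a ~ pi} Q(x_{t+1}, a), expectation under the target policy
    rewards[t]    : reward r_t
    pi[t], mu[t]  : target / behaviour probabilities of the action actually taken
    """
    total, discount, trace = 0.0, 1.0, 1.0
    for t in range(len(rewards)):
        if t > 0:
            # truncated importance weight: c_t = lambda * min(1, pi_t / mu_t)
            trace *= lam * min(1.0, pi[t] / mu[t])
            discount *= gamma
        # TD error corrected toward the target policy's expected value
        td = rewards[t] + gamma * exp_q_next[t] - q_sa[t]
        total += discount * trace * td
    return total
```

Because the weights are truncated at 1, the variance stays bounded however off-policy μ is, while near on-policy (π ≈ μ, λ = 1) the full multi-step return is used.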
Zap Q-Learning
The Zap Q-learning algorithm introduced in this paper is an improvement of Watkins' original algorithm and recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that the transient behavior is a close match to a deterministic Newton-Raphson implementation. This is made possible by a two time-scale update equation for the matrix gain sequence. The analysis suggests that the approach will lead to stable and efficient computation even for non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Florida > Alachua County > Gainesville (0.14)
- North America > United States > New York > Tompkins County > Ithaca (0.04)
- (2 more...)
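The two-time-scale matrix-gain idea can be illustrated on a toy stochastic root-finding problem. This is a generic Newton-Raphson-flavoured recursion in the spirit the abstract describes, not the paper's exact Zap Q-learning update: the matrix estimate is refreshed on a faster step-size schedule than the parameter vector, and each parameter step is preconditioned by its inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[2.0, 0.3], [0.1, 1.5]])  # unknown linearization (toy)
b = np.array([1.0, -2.0])
theta = np.zeros(2)
A_hat = np.eye(2)  # running matrix-gain estimate

for n in range(1, 5001):
    alpha = 1.0 / n        # slow step size for the parameter
    beta = 1.0 / n**0.85   # faster step size for the matrix estimate
    A_obs = A_true + 0.1 * rng.standard_normal((2, 2))  # noisy observation
    A_hat += beta * (A_obs - A_hat)             # fast time scale: track the mean
    f = A_obs @ theta + b                       # noisy error signal, f(theta*) = 0
    theta -= alpha * np.linalg.solve(A_hat, f)  # Newton-Raphson-style step
```

Preconditioning by the tracked matrix is what drives the asymptotic variance down relative to a scalar-gain recursion on the same problem.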
The Computation of Generalized Embeddings for Underwater Acoustic Target Recognition using Contrastive Learning
Hummel, Hilde I., Gansekoele, Arwin, Bhulai, Sandjai, van der Mei, Rob
The increasing level of sound pollution in marine environments poses a growing threat to ocean health, making it crucial to monitor underwater noise. By monitoring this noise, the sources responsible for the pollution can be mapped. Monitoring is performed by passively listening to these sounds, which generates a large volume of recordings capturing a mix of sound sources such as ship activity and marine mammal vocalizations. Although machine learning offers a promising solution for automatic sound classification, current state-of-the-art methods rely on supervised learning, which requires a large amount of high-quality labeled data that is not publicly available. In contrast, a massive amount of lower-quality unlabeled data is publicly available, offering the opportunity to explore unsupervised learning techniques. This research explores this possibility by implementing an unsupervised Contrastive Learning approach: a Conformer-based encoder is optimized with the Variance-Invariance-Covariance Regularization loss function on this lower-quality unlabeled data and then transferred to the labeled data. Through classification tasks involving recognizing ship types and marine mammal vocalizations, our method is shown to produce robust and generalized embeddings. This demonstrates the potential of unsupervised methods for various automatic underwater acoustic analysis tasks.
- North America > Canada (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Germany (0.04)
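The Variance-Invariance-Covariance Regularization objective named in the abstract can be sketched in NumPy. The encoder is omitted, and the term weights are the illustrative 25/25/1 defaults from the original VICReg paper, not values reported by this work:

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """VICReg loss for two (N, D) embedding batches of the same samples."""
    n, d = z_a.shape
    # invariance: embeddings of two views of the same sample should match
    sim = np.mean((z_a - z_b) ** 2)

    # variance: keep each dimension's std above gamma, preventing collapse
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))
    var = var_term(z_a) + var_term(z_b)

    # covariance: push off-diagonal covariances to zero, decorrelating dims
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d
    cov = cov_term(z_a) + cov_term(z_b)

    return sim_w * sim + var_w * var + cov_w * cov
```

Because none of the three terms needs negative pairs, the objective can be driven entirely by large volumes of unlabeled recordings, which is the point of the approach described above.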
Valentine's Day dangers: Dating app killers lure love seekers in unsuspecting ways
From a poisonous date to finding love with a serial killer, these six chilling cases show how unsuspecting dating app users on the quest for romance led them into the clutches of danger. Dating apps – from Tinder to Grindr – are the modern way for people to connect with potential partners from the comfort of their own space. Brace yourself for stories that blur the line between love and terror. Here is Fox News Digital's list of some recent cases where love went wrong.
- North America > United States > New York (0.06)
- North America > United States > Pennsylvania (0.06)
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.05)
- (5 more...)
- Law > Criminal Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology (1.00)
- Government (0.97)
Regularized Q-learning through Robust Averaging
Schmitt-Förster, Peter, Sutter, Tobias
We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
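For reference, the Watkins-style tabular update that the abstract uses as its cost baseline can be sketched as follows. The max over next-state actions is exactly the term whose estimation bias 2RA Q-learning replaces with a distributionally robust estimator; the sketch below is the classical update, not the paper's estimator:

```python
import numpy as np

def watkins_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Watkins Q-learning step on table Q of shape (S, A).

    The np.max term is a biased estimate of the maximum expected value
    (E[max] >= max E), which is the weakness 2RA Q-learning targets.
    """
    td = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td
    return Q
```

Since the robust estimator in the paper admits a closed-form solution, its per-iteration cost stays comparable to this single max-and-update step.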
Shock of the old: 11 wild views of the future – from winged postmen to self-cleaning homes
"Things can only get better", D:Ream promised, but they were wrong, and so were most people in history who have tried to predict the future. It never stopped us from trying, though, and a few visionaries have been pretty good at it. There was Leonardo da Vinci, of course, with his helicopters and fridges, and Joseph Glanvill, who in 1661 suggested moon voyages and communication using "magnetic waves" might be a thing. Civil engineer John Elfreth Watkins, writing in 1900, predicted mobile phones, ready meals and global digital media ("Photographs will be telegraphed from any distance. If there be a battle in China a hundred years hence, snapshots of its most striking events will be published in the newspapers an hour later").
- Asia > China (0.25)
- Europe > France (0.06)
- Oceania > Australia > New South Wales (0.05)
- (2 more...)
- Transportation > Passenger (0.49)
- Media > News (0.36)
- Transportation > Air (0.35)
- Transportation > Ground > Road (0.30)
About to Break Down? You Might Be a Cybertruck.
Tesla CEO Elon Musk stands in front of the damaged Cybertruck after it fails a demonstration of its durability. (Ringo H.W. Chiu / AP) At a live delivery event this November, where Elon Musk awkwardly opened the door for about a dozen new Cybertruck owners, he told the world: "The apocalypse can come along any moment, and here at Tesla, we have the best in apocalypse technology." Then he showed a video of the vehicle being pummeled by a machine gun, quipping, "If you're ever in an argument with another car, you will win." And then he sold a bunch of Cybertrucks. Two million have been preordered--and 500 delivered--for over $60,000 a pop. Some soon proved that they couldn't survive a test drive, let alone a ride with Mad Max.
- Leisure & Entertainment (0.73)
- Media > Film (0.50)
Theoretical remarks on feudal hierarchies and reinforcement learning
Reinforcement learning is a paradigm in which an agent interacts with its environment by trying out different actions in different states and observing the outcomes. Each interaction can change the state of the environment and can also yield a reward. The goal of the agent is to learn the value of performing each action in each state; by value, we mean the largest total reward the agent can obtain after performing that action in that state. Once the agent has learned these values, it can act optimally in its environment by choosing, in every state, the action with the largest value.
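The idea of value as the largest obtainable return, and of acting greedily with respect to it, can be made concrete with a tiny worked example. The two-state deterministic MDP below is our own illustration, iterated with the Bellman optimality update:

```python
import numpy as np

# Toy deterministic MDP with 2 states and 2 actions (illustrative numbers):
# taking action a in state s moves to next_state[s, a] and pays reward[s, a].
next_state = np.array([[0, 1], [0, 1]])
reward = np.array([[0.0, 1.0], [2.0, 0.0]])
gamma = 0.9  # discount factor on future rewards

Q = np.zeros((2, 2))
for _ in range(500):
    # Bellman optimality update: Q(s, a) = r(s, a) + gamma * max_a' Q(s', a')
    Q = reward + gamma * np.max(Q[next_state], axis=-1)

# Acting optimally: in each state, pick the action with the biggest value.
policy = np.argmax(Q, axis=1)
```

Here the greedy policy alternates between the two states, collecting the rewards of 1 and 2 forever; each Q entry is exactly the discounted sum of that infinite reward stream.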