Logic Journal of IGPL Advance Access published online on September 9, 2009

Logic Journal of IGPL, doi:10.1093/jigpal/jzp049

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Recurrent policy gradients

Daan Wierstra and Alexander Förster

IDSIA, Manno–Lugano, Switzerland.
E-mail: daan{at}idsia.ch; alexander{at}idsia.ch

Jan Peters

Max Planck Institute for Biological Cybernetics, Tübingen, Germany.
E-mail: mail{at}jan-peters.net

Jürgen Schmidhuber

IDSIA, Manno–Lugano, Switzerland; TU München Institut für Informatik, Garching bei München, Germany; University of Lugano, Faculty of Informatics, Lugano, Switzerland.
E-mail: juergen{at}idsia.ch


Abstract

Reinforcement learning for partially observable Markov decision problems (POMDPs) is a challenge, as it requires policies with an internal state. Traditional approaches handle this requirement poorly and usually make strong assumptions about the problem domain, such as perfect system models, state estimators and a Markovian hidden system. Recurrent neural networks (RNNs) offer a natural framework for policy learning with hidden state and require only a few limiting assumptions. As they can be trained well using gradient descent, they are suited to policy gradient approaches.

In this paper, we present a policy gradient method, the Recurrent Policy Gradient, which is a model-free reinforcement learning method. It is aimed at training limited-memory stochastic policies on problems which require long-term memories of past observations. The approach involves approximating a policy gradient for a recurrent neural network by backpropagating return-weighted characteristic eligibilities through time. Using a "Long Short-Term Memory" RNN architecture, we are able to outperform previous RL methods on three important benchmark tasks. Furthermore, we show that using history-dependent baselines helps to reduce estimation variance significantly, thus enabling our approach to tackle more challenging, highly stochastic environments.
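The idea of backpropagating return-weighted eligibilities through time can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single scalar tanh recurrent unit in place of an LSTM, a Bernoulli (two-action) policy, undiscounted returns-to-go, and a constant baseline in place of a history-dependent one. All function and parameter names (`forward`, `surrogate`, `recurrent_policy_gradient`, `w_x`, `w_h`, `w_o`) are hypothetical.

```python
import math

def forward(params, observations):
    """Run a scalar recurrent policy over one episode.

    Assumed toy model: h_t = tanh(w_x * x_t + w_h * h_{t-1}),
    P(a_t = 1 | history) = sigmoid(w_o * h_t).
    """
    w_x, w_h, w_o = params
    h = 0.0
    hidden, probs = [], []
    for x in observations:
        h = math.tanh(w_x * x + w_h * h)
        hidden.append(h)
        probs.append(1.0 / (1.0 + math.exp(-w_o * h)))
    return hidden, probs

def surrogate(params, observations, actions, weights):
    """Sum over t of weight_t * log pi(a_t | history up to t)."""
    _, probs = forward(params, observations)
    return sum(w * math.log(p if a == 1 else 1.0 - p)
               for w, p, a in zip(weights, probs, actions))

def recurrent_policy_gradient(params, observations, actions, rewards, baseline=0.0):
    """Single-episode gradient estimate via backpropagation through time.

    Returns the gradient w.r.t. (w_x, w_h, w_o) and the per-step weights
    (return-to-go minus the assumed constant baseline).
    """
    w_x, w_h, w_o = params
    hidden, probs = forward(params, observations)
    steps = len(observations)

    # Undiscounted return-to-go for every time step.
    returns = [0.0] * steps
    running = 0.0
    for t in reversed(range(steps)):
        running += rewards[t]
        returns[t] = running
    weights = [r - baseline for r in returns]

    g_x = g_h = g_o = 0.0
    dh_next = 0.0  # gradient reaching h_t from later time steps
    for t in reversed(range(steps)):
        # Eligibility of the Bernoulli-sigmoid output: d log pi / d logit = a - p.
        dlogit = weights[t] * (actions[t] - probs[t])
        g_o += dlogit * hidden[t]
        dh = dlogit * w_o + dh_next
        dpre = dh * (1.0 - hidden[t] ** 2)  # through the tanh nonlinearity
        h_prev = hidden[t - 1] if t > 0 else 0.0
        g_x += dpre * observations[t]
        g_h += dpre * h_prev
        dh_next = dpre * w_h
    return [g_x, g_h, g_o], weights
```

In this sketch, averaging the returned gradient over many sampled episodes and taking a step along it would give a basic stochastic gradient ascent update. Subtracting a baseline leaves the expected gradient unchanged while typically lowering its variance, which is the motivation the abstract gives for baselines.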

Key Words: Recurrent Neural Networks • Policy Gradient Methods • Reinforcement Learning • Partially Observable Markov Decision Problems (POMDPs)

