Logic Journal of IGPL Advance Access published online on September 9, 2009

Logic Journal of IGPL, doi:10.1093/jigpal/jzp049
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Recurrent policy gradients

Daan Wierstra and Alexander Förster

IDSIA, Manno–Lugano, Switzerland.
E-mail: daan{at}idsia.ch; alexander{at}idsia.ch

Jan Peters

Max Planck Institute for Biological Cybernetics, Tübingen, Germany.
E-mail: mail{at}jan-peters.net

Jürgen Schmidhuber

IDSIA, Manno–Lugano, Switzerland; TU München Institut für Informatik, Garching bei München, Germany; University of Lugano, Faculty of Informatics, Lugano, Switzerland.
E-mail: juergen{at}idsia.ch

Reinforcement learning for partially observable Markov decision problems (POMDPs) is challenging because it requires policies with an internal state. Traditional approaches suffer significantly from this requirement and usually make strong assumptions about the problem domain, such as perfect system models, state estimators and a Markovian hidden system. Recurrent neural networks (RNNs) offer a natural framework for policy learning with hidden state and require few limiting assumptions. Because they can be trained well using gradient descent, they are well suited to policy gradient approaches.

In this paper, we present a policy gradient method, the Recurrent Policy Gradient, which is a model-free reinforcement learning method. It is aimed at training limited-memory stochastic policies on problems that require long-term memories of past observations. The approach involves approximating a policy gradient for a recurrent neural network by backpropagating return-weighted characteristic eligibilities through time. Using a “Long Short-Term Memory” RNN architecture, we are able to outperform previous RL methods on three important benchmark tasks. Furthermore, we show that using history-dependent baselines helps reduce estimation variance significantly, thus enabling our approach to tackle more challenging, highly stochastic environments.
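
As a rough illustration of what backpropagating return-weighted eligibilities through time can look like, the following is a minimal sketch and not the implementation used in the paper. It assumes a toy policy with a single tanh hidden unit and a Bernoulli (two-action) output instead of an LSTM, and it takes the per-step weights (return minus baseline) as given. All names (`returns_to_go`, `forward`, `surrogate`, `bptt_gradient`) and the parameter layout `(w_x, w_h, w_o)` are hypothetical.

```python
import math


def returns_to_go(rewards, gamma):
    """Discounted return R_t = r_t + gamma * R_{t+1} for every time step."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def forward(params, observations):
    """Run the toy recurrent policy; return hidden states and P(action = 1)."""
    w_x, w_h, w_o = params
    h, hs, ps = 0.0, [], []
    for x in observations:
        h = math.tanh(w_x * x + w_h * h)              # recurrent hidden state
        hs.append(h)
        ps.append(1.0 / (1.0 + math.exp(-w_o * h)))   # Bernoulli action probability
    return hs, ps


def surrogate(params, observations, actions, weights):
    """Sum over t of weight_t * log pi(a_t | history); its gradient is the estimate."""
    _, ps = forward(params, observations)
    return sum(w * math.log(p if a == 1 else 1.0 - p)
               for p, a, w in zip(ps, actions, weights))


def bptt_gradient(params, observations, actions, weights):
    """Backpropagate the weighted eligibilities through time for one episode."""
    w_x, w_h, w_o = params
    hs, ps = forward(params, observations)
    g_x = g_h = g_o = 0.0
    dh_next = 0.0  # gradient arriving from later time steps
    for t in reversed(range(len(observations))):
        # Eligibility of the logit, d log pi / dz = a - p, scaled by the weight.
        dz = weights[t] * (actions[t] - ps[t])
        g_o += dz * hs[t]
        dh = dz * w_o + dh_next
        dpre = dh * (1.0 - hs[t] ** 2)                # through the tanh
        h_prev = hs[t - 1] if t > 0 else 0.0
        g_x += dpre * observations[t]
        g_h += dpre * h_prev
        dh_next = dpre * w_h
    return g_x, g_h, g_o


if __name__ == "__main__":
    obs = [1.0, -0.5, 0.2, 0.7]
    acts = [1, 0, 1, 1]
    # Assumption: a constant baseline of 0.5 stands in for a learned,
    # history-dependent baseline.
    weights = [r - 0.5 for r in returns_to_go([0.0, 1.0, 0.0, 2.0], 0.9)]
    print(bptt_gradient((0.3, -0.5, 0.8), obs, acts, weights))
```

In a full method, the gradient would be averaged over many sampled episodes before each parameter update, and the constant baseline would be replaced by one that depends on the observation history.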

Key Words: Recurrent Neural Networks • Policy Gradient Methods • Reinforcement Learning • Partially Observable Markov Decision Problems (POMDPs)




