Page Title: Scalable agent architecture for distributed training

  • This webpage makes use of the TITLE meta tag - this is good for search engine optimization.

Page Description: Deep Reinforcement Learning (DeepRL) has achieved remarkable success in a range of tasks, from continuous control problems in robotics to playing games like Go and Atari. The improvements seen in these domains have so far been limited to individual tasks where a separate agent has been tuned and trained for each task.

  • This webpage makes use of the DESCRIPTION meta tag - this is good for search engine optimization.

Page Keywords:

  • This webpage DOES NOT make use of the KEYWORDS meta tag - whilst search engines nowadays do not put too much emphasis on this meta tag including them in your website does no harm.

Page Text: Scalable agent architecture for distributed training February 5, 2018 Scalable agent architecture for distributed training February 5, 2018 Deep Reinforcement Learning (DeepRL) has achieved remarkable success in a range of tasks, from continuous control problems in robotics to playing games like Go and Atari. The improvements seen in these domains have so far been limited to individual tasks where a separate agent has been tuned and trained for each task. In our most recent work, we explore the challenge of training a single agent on many tasks. Today we are releasing DMLab-30, a set of new tasks that span a large variety of challenges in a visually unified environment with a common action space. Training an agent to perform well on many tasks requires massive throughput and making efficient use of every data point. To this end, we have developed a new, highly scalable agent architecture for distributed training called Importance Weighted Actor-Learner Architecture that uses a new off-policy correction algorithm called V-trace. DMLab-30 DMLab-30 is a collection of new levels designed using our open source RL environment DeepMind Lab . These environments enable any DeepRL researcher to test systems on a large spectrum of interesting tasks either individually or in a multi-task setting. The tasks are designed to be as varied as possible. They differ in the goals they target, from learning, to memory, to navigation. They vary visually, from brightly coloured, modern-styled texture, to the subtle brown and greens of a desert at dawn, midday, or by night. And they contain physically different settings, from open, mountainous terrain, to right-angled mazes, to open, circular rooms. In addition, some of the environments include ‘bots’, with their own, internal, goal-oriented behaviours. Equally importantly, the goals and rewards differ across the different levels, from following language commands and using keys to open doors, foraging mushrooms, to plotting and following a complex irreversible path. However, at a basic level, the environments are all the same in terms of their action and observation space allowing a single agent to be trained to act in every environment in this highly varied set. More details about the environments can be found on the DeepMind Lab GitHub page . Importance-Weighted Actor-Learner Architectures In order to tackle the challenging DMLab-30 suite, we developed a new distributed agent called Importance Weighted Actor-Learner Architecture that maximises data throughput using an efficient distributed architecture with TensorFlow . Importance Weighted Actor-Learner Architecture is inspired by the popular A3C architecture which uses multiple distributed actors to learn the agent’s parameters. In models like this, each of the actors uses a clone of the policy parameters to act in the environment. Periodically, actors pause their exploration to share the gradients they have computed with a central parameter server that applies updates (see figure below). Importance Weighted Actor-Learner Architecture actors on the other hand are not used to calculate gradients. Instead, they are just used to collect experience which is passed to a central learner that computes gradients, resulting in a model that has completely independent actors and learners. To take advantage of the scale of modern computing systems, Importance Weighted Actor-Learner Architectures can be implemented using a single learner machine or multiple learners performing synchronous updates between themselves. Separating the learning and acting in this way also has the advantage of increasing the throughput of the whole system since the actors no longer need to wait for the learning step like in architectures such as batched A2C. This allows us to train Importance Weighted Actor-Learner Architectures on interesting environments without suffering from variance in frame rendering-time or time consuming task restarts. Learning is continuous with Importance Weighted Actor-Learner Architectures, unlike other architectures that need to pause at each learning step However, decoupling the acting and learning causes the policy in the actor to lag behind the learner. In order to compensate for this difference we introduce a principled off-policy advantage actor critic formulation called V-trace which compensates for the trajectories obtained by actors being off policy. The details of the algorithm and its analysis can be found in our paper . Thanks to the optimised model of Importance Weighted Actor-Learner Architecture, it can process one-to-two orders of magnitude more experience compared to similar agents, making learning in challenging environments possible. We have compared Importance Weighted Actor-Learner Architectures with several popular actor-critic methods and have seen significant speed-ups. Additionally, the throughput using Importance Weighted Actor-Learner Architectures scales almost linearly with increasing number of actors and learners which shows that both the distributed agent model and the V-trace algorithm can handle very large scale experiments, even on the order of thousands of machines. When it was tested on the DMLab-30 levels, Importance Weighted Actor-Learner Architecture was 10 times more data efficient and achieved double the final score compared to distributed A3C.  Moreover, Importance Weighted Actor-Learner Architectures showed positive transfer from training in multi-task settings compared to training in single-task setting. Notes Explore DMLab-30 here . This work was done by Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg and Koray Kavukcuoglu Authors Hubert Soyer, Drew Purves, Lasse Espeholt * External authors

  • This webpage has 834 words which is between the recommended minimum of 250 words and the recommended maximum of 2500 words - GOOD WORK.

Header tags:

  • It appears that you are using header tags - this is a GOOD thing!

Spelling errors:

  • This webpage has 2 words which may be misspelt.

Possibly mis-spelt word: Scalable

Suggestion: Salable
Suggestion: Callable
Suggestion: Calculable
Suggestion: Scaleless
Suggestion: Scrabble

Possibly mis-spelt word: DeepRL

Suggestion: Deep

Broken links:

  • This webpage has no broken links that we can detect - GOOD WORK.

Broken image links:

  • This webpage has no broken image links that we can detect - GOOD WORK.

CSS over tables for layout?:

  • It appears that this page uses DIVs for layout this is a GOOD thing!

Last modified date:

  • We were unable to detect what date this page was last modified

Images that are being re-sized:

  • This webpage has no images that are being re-sized by the browser - GOOD WORK.

Images that are being re-sized:

  • This webpage has no images that are missing their width and height - GOOD WORK.

Mobile friendly:

  • After testing this webpage it appears NOT to be mobile friendly - this is NOT a good thing!

Links with no anchor text:

  • This webpage has no links that are missing anchor text - GOOD WORK.

W3C Validation:

Print friendly?:

  • It appears that the webpage does NOT use CSS stylesheets to provide print functionality - this is a BAD thing.

GZIP Compression enabled?:

  • It appears that the serrver does NOT have GZIP Compression enabled - this is a NOT a good thing!