Abstract
Solving high-dimensional, continuous robotic tasks is a challenging optimization problem. Model-based methods that rely on zero-order optimizers like the cross-entropy method (CEM) have so far shown strong performance and are considered state-of-the-art in the model-based reinforcement learning community. However, this success comes at the cost of high computational complexity, making these methods unsuitable for real-time control. In this paper, we propose a technique to jointly optimize the trajectory and distill a policy, which is essential for fast execution on real robotic systems. Our method builds upon standard approaches, such as a guidance cost and dataset aggregation, and introduces a novel adaptive factor that prevents the optimizer from collapsing to the learner's behavior at the beginning of training. The extracted policies reach unprecedented performance on challenging tasks such as making a humanoid stand up and opening a door without reward shaping.
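To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a CEM-style planner whose cost includes a policy-guidance term, combined with DAgger-style dataset aggregation for policy distillation. The `Policy` interface (`rollout`, `fit`), `env_cost`, and the linear ramp used as the adaptive factor are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cem_with_policy_guidance(env_cost, policy, state, horizon=20, pop=64,
                             elites=8, iters=5, act_dim=4, guidance_weight=0.1):
    """Plan an action sequence with CEM; penalize deviation from the policy."""
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    # Assumed interface: the policy proposes an action sequence from `state`.
    policy_actions = policy.rollout(state, horizon)          # (horizon, act_dim)
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, horizon, act_dim)
        task_cost = np.array([env_cost(state, a) for a in samples])
        # Guidance cost: keep elite samples close to the current policy.
        guide_cost = ((samples - policy_actions) ** 2).sum(axis=(1, 2))
        total = task_cost + guidance_weight * guide_cost
        elite_idx = np.argsort(total)[:elites]
        mean = samples[elite_idx].mean(axis=0)
        std = samples[elite_idx].std(axis=0) + 1e-6
    return mean  # planned expert action sequence

def train_iteration(env_cost, policy, states, dataset, epoch, total_epochs):
    """DAgger-style aggregation: relabel visited states with expert actions,
    then fit the policy on the growing dataset."""
    # Placeholder adaptive factor: the guidance weight ramps up over training,
    # so the optimizer is not pulled toward the untrained policy early on.
    w = 0.5 * epoch / total_epochs
    for s in states:
        plan = cem_with_policy_guidance(env_cost, policy, s, guidance_weight=w)
        dataset.append((s, plan[0]))      # aggregate (state, expert action) pairs
    policy.fit(dataset)                   # behavior cloning on aggregated data
```

The key design point illustrated here is that the guidance weight starts near zero and grows, so early planning is driven purely by the task cost while later planning stays consistent with the distilled policy.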
Media
Experiments Videos
Fetch Pick and Place
In the (sparse) Fetch Pick and Place task, the policy guided iCEM expert reaches 100% of the goals.
Our policy reaches many of the goals as well.
The SAC baseline is not able to learn a useful policy.
Door
In the (sparse) door opening task, the iCEM expert manages to open the door without any problems.
Our policy also manages to open the door most of the time.
The SAC baseline does not find the right solution.
Humanoid Standup
In the humanoid stand up task, the iCEM expert manages to stand up and balance most of the time over the full task horizon.
Our policy manages to stand up but does not recover from most falls, owing to the many possible ways of falling.
The SAC baseline does not manage to stand up; it only sits up, which is one of the local optima of the cost function.