Conference Paper at ICLR 2025

Noise-conditioned Energy-based Annealed Rewards (NEAR)

A Generative Framework for Imitation Learning from Observation

1 Cognitive Robotics, TU Delft  2 Intelligent Autonomous Systems, TU Darmstadt
3 Hessian Center for Artificial Intelligence (Hessian.ai)   4 Center for Cognitive Science, TU Darmstadt
5 German Research Center for AI (DFKI)

Collage of policy snapshots

Abstract

This paper introduces a new imitation learning framework based on energy-based generative models capable of learning complex, physics-dependent robot motion policies from state-only expert motion trajectories. Our algorithm, called Noise-conditioned Energy-based Annealed Rewards (NEAR), constructs several perturbed versions of the expert's motion data distribution and learns smooth, well-defined representations of the data distribution's energy function using denoising score matching. We propose to use these learnt energy functions as reward functions to learn imitation policies via reinforcement learning. We also present a strategy to gradually switch between the learnt energy functions, ensuring that the learnt rewards are always well-defined in the manifold of policy-generated samples. We evaluate our algorithm on complex humanoid tasks such as locomotion and martial arts and compare it with state-only adversarial imitation learning algorithms like Adversarial Motion Priors (AMP). Our framework sidesteps the optimisation challenges of adversarial imitation learning techniques and produces results comparable to AMP in several quantitative metrics across multiple imitation settings.

Videos of NEAR Policies

This demo shows the policies learnt using NEAR in several contact-rich, physics-dependent imitation tasks. We use the open-source Isaac Gym benchmark environment and compare our method with Adversarial Motion Priors (AMP) [2]. We also train the agent in a goal-conditioned setting to achieve stylised global goals such as target reaching, and with temporally and spatially composed learnt rewards to learn hybrid motions that are not directly present in the expert dataset.


Learnt Rewards Only

Walking · Running · Crane Pose · Left Punch · Mummy Style Walking · Spin Kick

Composed Rewards

Target Reaching (walking) · Target Reaching (running) · Target Reaching & Punching · Walking & Waving

Energy Functions as Reward Functions

NEAR uses energy-based generative models to learn data-informed reward functions for imitation learning (IL). Given a dataset of the expert's state-only features $\mathcal{M} \equiv \{ x \}$, we train a modified version of Noise-conditioned Score Networks [1] to learn a noise-conditional parameterised energy function $e_{\theta}(x', \sigma)$, where $e_{\theta}: \mathbb{R}^D \times \mathbb{R}_{>0} \rightarrow \mathbb{R}$.

$e_{\theta}$ approximates the energy of samples $x'$ in a perturbed data distribution obtained by locally perturbing each sample $x$ in the expert's data distribution $p_D$ with Gaussian noise, i.e., $x' \sim \mathcal{N}(x, \sigma^2 I)$. Our paper leverages the fact that the energy of a sample is a scalar-valued measure of its closeness to $p_D$, meaning that it can be used as a reward signal to guide a policy to generate imitation motions.
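To make the training objective concrete, below is a minimal PyTorch-style sketch of denoising score matching with a noise-conditioned energy network. The `EnergyNet` architecture, the way $\sigma$ is fed to the network, and the noise scales are illustrative assumptions rather than the paper's modified NCSN; the key point is that the score used in the DSM loss is the negative input-gradient of the learnt energy.

```python
import math

import torch
import torch.nn as nn


class EnergyNet(nn.Module):
    """Illustrative sigma-conditioned energy network e_theta(x, sigma) -> scalar."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x, sigma):
        # Condition on the noise level by appending log(sigma) to the input features.
        cond = torch.log(sigma) * torch.ones(x.shape[0], 1)
        return self.net(torch.cat([x, cond], dim=-1)).squeeze(-1)


def dsm_loss(energy, x, sigmas):
    """Denoising score matching across noise levels (Song & Ermon, 2019), with the
    score taken as the negative input-gradient of the learnt energy."""
    losses = []
    for sigma in sigmas:
        noise = torch.randn_like(x) * sigma
        x_tilde = (x + noise).requires_grad_(True)
        e = energy(x_tilde, sigma)
        # Score of the energy model: s_theta(x~, sigma) = -grad_x e_theta(x~, sigma).
        score = -torch.autograd.grad(e.sum(), x_tilde, create_graph=True)[0]
        target = -noise / sigma ** 2  # score of the Gaussian perturbation kernel
        losses.append(0.5 * sigma ** 2 * ((score - target) ** 2).sum(dim=-1).mean())
    return torch.stack(losses).mean()


# Example usage on placeholder expert features of dimension D = 64.
D = 64
energy = EnergyNet(D)
sigmas = torch.exp(torch.linspace(math.log(1.0), math.log(0.01), 10))  # geometric noise scales
expert_features = torch.randn(128, D)  # stand-in for the expert dataset M = {x}
dsm_loss(energy, expert_features, sigmas).backward()
```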

Energy-based reward functions are better suited than adversarial reward functions (the current state of the art in observation-based IL) for several reasons. They are not prone to issues like instability or mode collapse, they capture the expert's distribution more accurately, and they are smooth in the sample space. This means that the learnt rewards are always unambiguous and offer a smooth signal for improvement. Below, we compare the mechanisms of energy-based and adversarial imitation learning through a 2D toy task.




A comparison of reward functions (probability density approximations) learnt in a 2D target-reaching imitation task (left). In this task, an agent aims to reach a goal and expert demonstrations ($p_D$) pass through an L-shaped maze. The learnt reward function is expected to encourage the agent to pass through the maze. In the middle, we show $\texttt{rew}(s' \vert s)$ for all reachable states around a state $s$ (green circle) at different training epochs. On the right, we show an illustration of the non-smooth reward landscape of adversarial IL. The energy-based reward is a smooth (with continuous gradients), accurate representation of $p_D$ and is constant regardless of the distribution of policy-generated motions ($p_G$). In contrast, the adversarial reward is non-smooth and prone to instability. Additionally, it changes depending on $p_G$ (the discriminator tends to minimise policy predictions) and can provide non-stationary reward signals.

Training Pipeline

NEAR uses a two-step training procedure. We first learn the reward function using NCSN and then use this reward function to train a policy with GPU-accelerated reinforcement learning. Optionally, an environment-supplied task reward and goal can also be incorporated into the problem to learn stylised motions like target reaching.
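The sketch below shows how the frozen energy function could then be queried as a reward for a batch of policy transitions and optionally blended with an environment-supplied task reward for goal-conditioned training. The sign convention (low energy means close to the expert data), the exponential squashing, and the weights are illustrative assumptions, not NEAR's exact reward formulation.

```python
import torch


@torch.no_grad()
def near_style_reward(energy, features, sigma, task_reward=None,
                      w_style=1.0, w_task=0.5, beta=1.0):
    """Illustrative per-transition reward built from a frozen energy function.

    energy      : trained sigma-conditioned energy network e_theta(x, sigma)
    features    : (N, D) state-transition features from the current rollout
    sigma       : noise level currently selected by the annealing schedule
    task_reward : optional (N,) environment-supplied reward (e.g. target reaching)
    """
    e = energy(features, sigma)  # (N,) energies of the policy-generated samples
    # Assumed convention: p(x) ~ exp(-e(x)), i.e. low energy = close to expert data.
    # If the trained model uses the opposite convention, drop the negation.
    r_style = torch.exp(-beta * e)  # bounded style reward in (0, 1]
    if task_reward is None:
        return w_style * r_style
    return w_style * r_style + w_task * task_reward
```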

We also introduce an annealing method inspired by Annealed Langevin Dynamics [1] to gradually switch between the learnt energy functions depending on the agent's progress in the imitation task. Because these energy functions are dilated to different degrees, they can become nearly constant-valued depending on the level of dilation and the manifold of samples generated by the current policy. Annealing ensures that the reward function used in reinforcement learning is always well-defined, non-constant, and smooth in the manifold of policy-generated samples.
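A minimal sketch of such a schedule is shown below, assuming a simple threshold on the agent's average style reward as the progress signal; this criterion is a hypothetical stand-in for the paper's actual switching rule.

```python
class AnnealingSchedule:
    """Illustrative progress-based switching between noise levels sigma_1 > ... > sigma_L."""

    def __init__(self, sigmas, threshold=0.7):
        self.sigmas = list(sigmas)  # ordered from the most to the least perturbed
        self.level = 0              # start with the most dilated energy function
        self.threshold = threshold

    @property
    def sigma(self):
        return self.sigmas[self.level]

    def update(self, mean_style_reward):
        # Move to a sharper (less perturbed) energy function once the policy
        # reliably earns a high style reward under the current one.
        if mean_style_reward > self.threshold and self.level < len(self.sigmas) - 1:
            self.level += 1
        return self.sigma
```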




Dataset

Below we show examples of the input data required for these algorithms. We use open-source humanoid motion-capture clips for a variety of complex tasks. The input to NEAR is simply a set of state-transition features $\{ (s, s') \}$ from these motions, where each state $s$ is a vector of the Cartesian positions and velocities of the character's joints; a feature-construction sketch follows the clips below.


Walking · Running · Crane Pose · Left Punch · Mummy Style Walk · Spin Kick
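For concreteness, here is a sketch of how such state-transition features could be assembled from motion-capture frames. The exact feature set (choice of joints, reference frame, normalisation) is an assumption that follows only the general description above, not the released data-processing code.

```python
import numpy as np


def transition_features(joint_pos, joint_vel):
    """Build state-transition features {(s, s')} from per-frame joint data.

    joint_pos : (T, J, 3) Cartesian joint positions over T motion-capture frames
    joint_vel : (T, J, 3) Cartesian joint velocities over T frames
    returns   : (T - 1, 12 * J) array of concatenated (s_t, s_{t+1}) features
    """
    T = joint_pos.shape[0]
    states = np.concatenate([joint_pos.reshape(T, -1),
                             joint_vel.reshape(T, -1)], axis=-1)  # s_t = [positions, velocities]
    return np.concatenate([states[:-1], states[1:]], axis=-1)     # pair consecutive states
```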

BibTeX

@inproceedings{diwan2025noise,
  title={Noise-conditioned Energy-based Annealed Rewards (NEAR): A Generative Framework for Imitation Learning from Observation},
  author={Diwan, Anish Abhijit and Urain, Julen and Peters, Jan and Kober, Jens},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025},
}

References

  1. Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems. 2019;32.
  2. Peng XB, Ma Z, Abbeel P, Levine S, Kanazawa A. AMP: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (TOG). 2021;40(4):1-20.