Maximum Entropy Reinforcement Learning
When learning to act from interactions in a complex environment, it is often beneficial to keep your options open. A policy that strongly prefers a single action tends to over-commit to trajectories that are promising in the short term but might not be globally optimal. Such a policy also fails to explore and might miss out on higher returns in the future. Mathematically, this translates to learning a policy that has high entropy. This post is an intuitive walkthrough of the maximum entropy RL results from [Ziebart et al. (2008); Ziebart et al. (2010)]. We will look at the definition of soft value functions, the soft version of the Bellman equations, and lead up to basic results that are the starting points for several popular RL frameworks (SAC, generative IRL, etc.).
Soft Value Functions
By definition, the value of a state \(s\) (under policy \(\pi\)) in an MDP is the expected discounted return under the policy, starting from that state. The value of an action is defined similarly, except that the first action is given to us (and is not necessarily the one prescribed by the policy). It turns out that the state value is just the expected action value.
\[\begin{align*} V_{\pi}(s) &= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \vert s_0 = s \right] \\ Q_{\pi}(s, a) &= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \vert s_0 = s, a_0 = a\right] \\\\ V_{\pi}(s) &= \sum_{a} \pi(a \vert s) Q_{\pi}(s, a) = \mathbb{E}_{a \sim \pi} \left[ Q_{\pi}(s, a) \right] \end{align*}\]
In MaxEntRL we wish to maximise the step-wise entropy of the policy (at each state). The entropy of the policy at a state \(s\) is defined by
\[H(\pi(. \vert s)) = \mathbb{E}_{a \sim \pi(. \vert s)} \left[ - \log \pi(a \vert s) \right]\]
Given this criterion, the definition of the return changes, which in turn changes the definition of the value functions. The entropy-regularised value functions are called “soft” value functions (in some papers these are written as \(\tilde{V}_{\pi}\) and \(\tilde{Q}_{\pi}\)). Here \(\alpha\) is a trade-off parameter that controls the influence of the entropy term.
\[\begin{align*} V_{\pi}^{\text{soft}}(s) &= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) + \underbrace{\gamma^t \alpha H(\pi(. \vert s_t))}_{\text{entropy}} \vert s_0 = s \right] \\ Q_{\pi}^{\text{soft}}(s, a) &= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) + \underbrace{\sum_{t=1}^{\infty} \gamma^t \alpha H(\pi(. \vert s_t))}_{\text{notice summation from } t=1} \vert s_0 = s, a_0 = a \right] \\ V_{\pi}^{\text{soft}}(s) &= \mathbb{E}_{a \sim \pi} \left[ Q_{\pi}^{\text{soft}}(s, a) + \underbrace{\alpha H(\pi(. \vert s))}_{\text{entropy of } \pi \text{ at current } s} \right] \\\\ V_{\pi}^{\text{soft}}(s) &= \mathbb{E}_{a \sim \pi} \left[ Q_{\pi}^{\text{soft}}(s, a) - \alpha \log \pi(a \vert s) \right] \\\\ \end{align*}\]
In the soft action value definition, we only include the entropy contribution from the next state onwards, since the first action \(a\) is already given to us; no matter what the policy says, its entropy contribution at \(s\) should be zero. In the final expression, \(\alpha H(\pi(. \vert s))\) is hence the extra entropy contribution at state \(s\) that comes from the state value function (since the entropy contribution at state \(s\) from the action value function is zero). From the next state onwards, the entropy contribution is already accounted for inside \(Q_{\pi}^{\text{soft}}\).
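As a quick sanity check, here is a tiny numerical example (Python/numpy, with a made-up four-action distribution and made-up \(Q^{\text{soft}}\) values at a single state) verifying that the two forms of \(V_{\pi}^{\text{soft}}\) above agree.

```python
# Tiny check that E_a[Q^soft + alpha*H(pi)] == E_a[Q^soft - alpha*log pi(a|s)]
# for an arbitrary discrete action distribution at a single state.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5
q_soft = rng.normal(size=4)            # made-up Q^soft(s, a) for 4 actions
pi = rng.dirichlet(np.ones(4))         # made-up pi(a|s)

entropy = -(pi * np.log(pi)).sum()
v_form1 = (pi * q_soft).sum() + alpha * entropy
v_form2 = (pi * (q_soft - alpha * np.log(pi))).sum()
print(np.isclose(v_form1, v_form2))    # True
```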
Soft Bellman Equations
We can also derive a soft version of the Bellman equations using these soft value functions.
Traditionally,
\[\begin{align*} V_{\pi}(s) &= \mathbb{E}_{a \sim \pi} \left[ Q_{\pi}(s, a) \right] \\ Q_{\pi}(s, a) &= R(s, a) + \gamma \: \mathbb{E}_{s' \sim P(s' \vert s,a)} \left[ V_{\pi}(s') \right] \end{align*}\]
In the soft case:
\[\begin{align*} V_{\pi}^{\text{soft}}(s) &= \mathbb{E}_{a \sim \pi} \left[ Q_{\pi}^{\text{soft}}(s, a) - \alpha \log \pi(a \vert s) \right] \\ Q_{\pi}^{\text{soft}}(s, a) &= R(s, a) + \gamma \: \mathbb{E}_{s' \sim P(s' \vert s,a)} \left[ V_{\pi}^{\text{soft}}(s') \right] \end{align*}\]
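As an illustration, here is a minimal sketch (Python/numpy, with \(\alpha = 1\), a made-up random tabular MDP, and a made-up fixed stochastic policy) of soft policy evaluation: iterating the soft Bellman backup above until \(Q_{\pi}^{\text{soft}}\) converges.

```python
# Soft policy evaluation on a small random tabular MDP (alpha = 1):
# repeatedly apply
#   V^soft(s)    = E_{a ~ pi}[Q^soft(s, a) - log pi(a|s)]
#   Q^soft(s, a) = R(s, a) + gamma * E_{s'}[V^soft(s')]
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    V = (pi * (Q - np.log(pi))).sum(axis=1)    # soft state values under pi
    Q = R + gamma * P @ V                      # soft Bellman backup
print(Q)
```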
MaxEntRL: Optimal Policy & Value Functions
From the soft Bellman equations and the definitions of the value functions, we know that for any policy \(\pi\) 1,
\[\begin{align*} V_{\pi}^{\text{soft}}(s) &= \mathbb{E}_{a \sim \pi} \left[ Q_{\pi}^{\text{soft}}(s, a) - \log \pi(a \vert s) \right] \\\\ Q_{\pi}^{\text{soft}}(s, a) &= R(s, a) + \gamma \: \mathbb{E}_{s' \sim P(s' \vert s,a)} \left[\mathbb{E}_{a' \sim \pi} \left[ Q_{\pi}^{\text{soft}}(s', a') - \log \pi(a' \vert s') \right] \right] \end{align*}\]
On solving the MaxEntIRL Lagrangian, we obtain a closed-form expression for the optimal policy and the related value functions. For an optimal policy \(\pi^*\) 2,
\[\begin{align*} \pi^*(a \vert s) &= \exp(Q^*(s,a) - V^*(s)) \\\\ &= \frac{\exp(Q^*(s,a))}{Z^*(s)} \\\\ \text{where } Z^*(s) &= \sum_{a'} \exp(Q^*(s,a')) \text{ enforces } \sum_{a} \pi(a \vert s) = 1 \end{align*}\] 3
Given this optimal policy, it can then be shown that (proof)
\[\begin{align*} V^*(s) &= \log \sum_{a} \exp(Q^*(s,a)) \\\\ Q^*(s,a) &= R(s,a) + \gamma \: \mathbb{E}_{s' \sim P(s' \vert s,a)} \left[ V^*(s') \right] \\ Q^*(s,a) &= R(s,a) + \gamma \: \mathbb{E}_{s' \sim P(s' \vert s,a)} \left[ \log \sum_{a'} \exp(Q^*(s',a')) \right] \end{align*}\]
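To make this concrete, below is a minimal sketch (Python/numpy plus scipy's `logsumexp`, on a made-up random tabular MDP) of soft Q-iteration: repeatedly apply the optimal soft Bellman backup above, then read off \(\pi^*\) in closed form.

```python
# Soft Q-iteration on a small random tabular MDP:
#   Q(s, a) <- R(s, a) + gamma * E_{s'}[ log sum_{a'} exp(Q(s', a')) ]
# then pi*(a|s) = exp(Q(s, a) - V(s)) with V(s) = log sum_a exp(Q(s, a)).
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    V = logsumexp(Q, axis=1)          # V*(s) = log sum_a exp(Q(s, a))
    Q = R + gamma * P @ V             # optimal soft Bellman backup

pi_star = np.exp(Q - logsumexp(Q, axis=1, keepdims=True))
print(pi_star.sum(axis=1))            # each row sums to 1
```

Since \(\gamma < 1\), the backup is a contraction, so the iteration converges to the optimal soft \(Q\).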
MaxEntRL In Continuous Settings
\(\pi^*\) can be computed in closed form when \(\vert \mathcal{A} \vert < \infty\). But in continuous action spaces, \(Z^*(s) = \int_{a} \exp(Q^*(s,a)) \, da\). This integral cannot be computed directly when \(\vert \mathcal{A} \vert = \infty\). It can be written as an expectation \(Z^*(s) = \mathbb{E}_{a \sim \pi^*} \left[ \frac{\exp(Q^*(s,a))}{\pi^*(a \vert s)} \right]\) and approximated by sampling. However, this procedure is computationally expensive.
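For intuition, here is a minimal sketch (Python/numpy, with a made-up 1-D toy \(Q^*\) at a fixed state) of that sampling-based estimate; it uses a Gaussian proposal in place of \(\pi^*\), which is fine because the importance-sampling identity holds for any full-support proposal (see the derivation expansion below).

```python
# Importance-sampling estimate of Z*(s) = \int exp(Q*(s, a)) da
# for a toy 1-D Q*(s, a) = -(a - 1)^2, whose exact Z* is sqrt(pi).
import numpy as np

rng = np.random.default_rng(0)

def q_star(a):
    return -(a - 1.0) ** 2             # made-up stand-in for Q*(s, a)

mu, sigma = 0.0, 2.0                   # Gaussian proposal

def proposal_pdf(a):
    return np.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

a = rng.normal(mu, sigma, size=100_000)
z_hat = np.mean(np.exp(q_star(a)) / proposal_pdf(a))  # E_{a~q}[exp(Q*) / q(a)]
print(z_hat, np.sqrt(np.pi))                          # estimate vs. exact value
```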
In continuous spaces, popular algorithms often use variational inference to approximate \(\pi^*\). The policy optimisation problem then becomes
\[\pi^*(a \vert s) = \underset{\pi' \in \: \Pi}{\arg\min} \: \text{KL} \left( \pi'(. \vert s) \: \Big\Vert \: \frac{\exp(Q^{\text{old}}(s,.))}{Z^{\text{old}}(s)} \right)\]
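In SAC-style algorithms this KL objective is minimised over a parametric family with reparameterised gradients: dropping the constant \(\log Z^{\text{old}}(s)\) and reintroducing the temperature \(\alpha\), it is equivalent to minimising \(\mathbb{E}_{a \sim \pi'}\left[\alpha \log \pi'(a \vert s) - Q^{\text{old}}(s,a)\right]\). Below is a minimal sketch of that idea, assuming PyTorch, a single state, a 1-D Gaussian policy, and a made-up toy stand-in for \(Q^{\text{old}}\); it is an illustration, not the actual SAC update.

```python
# Variational policy fit: minimise E_{a ~ pi}[alpha * log pi(a|s) - Q(s, a)]
# for a single state, a 1-D Gaussian policy, and a toy Q peaked at a = 1.
# The target density is exp(Q/alpha)/Z = N(1, alpha/2), so we expect
# mu -> 1 and sigma -> sqrt(alpha/2).
import torch

torch.manual_seed(0)
alpha = 0.2

def q_old(a):
    return -(a - 1.0) ** 2             # made-up stand-in for Q^old(s, a)

mu = torch.zeros(1, requires_grad=True)
log_std = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_std], lr=1e-2)

for _ in range(2000):
    dist = torch.distributions.Normal(mu, log_std.exp())
    a = dist.rsample((256,))                       # reparameterised samples
    loss = (alpha * dist.log_prob(a) - q_old(a)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item(), log_std.exp().item())             # ~1.0, ~0.32
```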
Derivation Expansions
\(V^{\text{soft}}\) In Terms Of \(Q^{\text{soft}}\)
\[\begin{align*} V_{\pi}^{\text{soft}}(s) &= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) + \underbrace{\gamma^t \alpha H(\pi(. \vert s_t))}_{\text{entropy}} \vert s_0 = s \right] \\ &= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) + \sum_{t=1}^{\infty} \gamma^t \alpha H(\pi(. \vert s_t)) + \gamma^0 \alpha H(\pi(. \vert s)) \: \Big\vert \: s_0 = s \right] \\ &= \mathbb{E}_{a \sim \pi} \left[ Q_{\pi}^{\text{soft}}(s, a) + \alpha H(\pi(. \vert s)) \right] \end{align*}\]
In the regular RL case, the state value function is the expected action value function. In the entropy-regularised case, it is the expected action value function plus \(\alpha\) times the entropy of the policy at the current state.
Equivalent Forms Of \(\pi^*\)
Given the definitions of \(V^*\) and \(Z^*\),
\[\begin{align*} V^*(s) &= \log Z^*(s) \implies Z^*(s) = \exp(V^*(s)) \\\\ \pi^*(a \vert s) &= \frac{\exp(Q^*(s,a))}{\exp(V^*(s))} = \exp(Q^*(s,a) - V^*(s)) \end{align*}\]
Computing \(Z^*(s)\) With Importance Sampling
\[\begin{align*} Z^*(s) &= \int_{a} \exp(Q^*(s,a)) \, da \\ &= \int_{a} \frac{\pi^*(a \vert s)}{\pi^*(a \vert s)} \exp(Q^*(s,a)) \, da \\ &= \mathbb{E}_{a \sim \pi^*} \left[ \frac{\exp(Q^*(s,a))}{\pi^*(a \vert s)} \right] \end{align*}\]
Note: the sampling distribution does not have to be \(\pi^*\), but the variance of the estimate depends on how well the sampling distribution covers the regions where \(\exp(Q^*(s,a))\) is large.
References
Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning.
Ziebart, B. D., Bagnell, J. A., and Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy.
1. Most papers assume that the reward is not a function of \(s'\) and hence take it out of the expectation (reward: \(R(s,a)\)). We also drop \(\alpha\) without any loss of generality.
2. We drop the superscript “soft” for better readability, but the value functions are all for the soft case.
3. Here \(a'\) means all actions instead of the “next action”.