1 Introduction
In Reinforcement Learning (RL;
Sutton & Barto 1998), policy evaluation refers to the problem of estimating the value function – a mapping from states to their long-term discounted return under a given policy – using sampled observations of the system dynamics and reward. Policy evaluation is important both for assessing the quality of a policy and as a sub-procedure for policy optimization (Sutton & Barto, 1998). For systems with large or continuous state spaces, an exact computation of the value function is often impossible. Instead, an approximate value function is sought using various function-approximation techniques (Sutton & Barto 1998; a.k.a. approximate dynamic programming; Bertsekas 2012). In this approach, the parameters of the value-function approximation are tuned using machine-learning inspired methods, often based on the
temporal-difference idea (TD; Sutton & Barto 1998). The method generating the sampled data leads to two different types of policy evaluation. In the on-policy case, the samples are generated by the target policy – the policy under evaluation – while in the off-policy setting, a different behavior policy generates the data. In the on-policy setting, TD methods are well understood, with classic convergence guarantees and approximation-error bounds, based on a contraction property of the projected Bellman operator underlying TD (Bertsekas & Tsitsiklis, 1996). For the off-policy case, however, standard TD methods no longer maintain this contraction property, the error bounds do not hold, and these methods may even diverge (Baird, 1995).
Recently, Sutton et al. (2015) proposed the emphatic TD (ETD) algorithm: a modification of the TD idea that can be shown to converge off-policy (Yu, 2015). In this paper, we show that the projected Bellman operator underlying ETD also possesses a contraction property, which allows us to derive approximation-error bounds for ETD.
In recent years, several different off-policy policy-evaluation algorithms have been proposed and analyzed, such as importance-sampling based least-squares TD (Yu, 2012), gradient-based TD (Sutton et al., 2009), and ETD (Sutton et al., 2015). While these algorithms were shown to converge, to our knowledge there are no guarantees on the error of the converged solution. The only exception that we are aware of is a contraction-based argument for importance-sampling based LSTD, under the restrictive assumption that the behavior and target policies are very similar (Bertsekas & Yu, 2009). This paper presents the first approximation-error bounds for off-policy policy evaluation under general target and behavior policies.
2 Preliminaries
We consider an MDP $\mathcal{M} = (S, A, P, R, \gamma, \xi_0)$, where $S$ is the state space, $A$ is the action space, $P(s'|s,a)$ is the transition probability matrix, $R(s,a)$ is the reward function, $\gamma \in (0,1)$ is the discount factor, and $\xi_0$ is the initial state distribution. Given a target policy $\pi$, our goal is to evaluate the value function:
$$V^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, s_0 = s\Big].$$
Temporal difference (TD) methods (Sutton & Barto, 1998) approximate the value function by
$$\tilde{V}(s) = \phi(s)^{\top} w,$$
where $\phi(s) \in \mathbb{R}^n$ are state features and $w \in \mathbb{R}^n$ are weights, and use sampling to find a suitable $w$. Let $\mu$ denote a behavior policy that generates the samples according to $a_t \sim \mu(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. We denote by $\rho(s,a)$ the ratio $\pi(a|s)/\mu(a|s)$, and we assume, similarly to Sutton et al. (2015), that $\mu$ and $\pi$ are such that $\rho(s,a)$ is well-defined (i.e., $\mu(a|s) > 0$ whenever $\pi(a|s) > 0$) for all $s \in S$ and $a \in A$.
Let $T^{\pi}$ denote the Bellman operator for policy $\pi$, given by
$$T^{\pi} v = R^{\pi} + \gamma P^{\pi} v,$$
where $R^{\pi} \in \mathbb{R}^{|S|}$ and $P^{\pi} \in \mathbb{R}^{|S| \times |S|}$ are the reward vector and transition matrix induced by policy $\pi$, and let $\Phi \in \mathbb{R}^{|S| \times n}$ denote the matrix whose rows are the state feature vectors $\phi(s)^{\top}$. Let $d_{\pi}$ and $d_{\mu}$ denote the stationary distributions over states induced by the policies $\pi$ and $\mu$, respectively. For some $q$ satisfying $q > 0$ element-wise, we denote by $\Pi_q$ the projection onto the subspace spanned by the columns of $\Phi$, with respect to the $q$-weighted Euclidean norm. Similarly to Sutton et al. (2015), we divide the analysis into the 'pure bootstrapping' case $\lambda = 0$, and the more general case with $\lambda \in (0, 1]$. The ETD(0) algorithm iteratively updates the weight vector according to:
$$w_{t+1} = w_t + \alpha_t F_t \rho_t \delta_t \phi(s_t), \qquad F_t = \gamma \rho_{t-1} F_{t-1} + 1,$$
where $\alpha_t$ is a step size, $\rho_t = \rho(s_t, a_t)$, and $\delta_t = r_t + \gamma \phi(s_{t+1})^{\top} w_t - \phi(s_t)^{\top} w_t$ is the TD error.
The emphatic weight vector $f$ is defined by
$$f^{\top} = d_{\mu}^{\top} \left(I - \gamma P^{\pi}\right)^{-1}. \qquad (1)$$
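As a numerical sketch (with illustrative numbers, not taken from the paper), the weights of Eq. (1) can be computed by solving the linear system $f = (I - \gamma (P^{\pi})^{\top})^{-1} d_{\mu}$:

```python
import numpy as np

# Sketch: the emphatic weights of Eq. (1), f^T = d_mu^T (I - gamma P)^{-1},
# for a small hypothetical MDP. P is the transition matrix induced by the
# target policy; d_mu is the behavior policy's stationary distribution.
gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])          # target-policy transition matrix (illustrative)
d_mu = np.array([0.4, 0.6])         # behavior stationary distribution (illustrative)

f = np.linalg.solve(np.eye(2) - gamma * P.T, d_mu)   # f = (I - gamma P^T)^{-1} d_mu

# Two consequences of Eq. (1): f >= d_mu elementwise, since
# (I - gamma P)^{-1} = I + gamma P + ... is entrywise >= I,
# and sum(f) = 1 / (1 - gamma), since d_mu sums to 1.
assert np.all(f >= d_mu)
assert np.isclose(f.sum(), 1.0 / (1.0 - gamma))
```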
The ETD($\lambda$) algorithm iteratively updates the weight vector according to
$$w_{t+1} = w_t + \alpha_t \delta_t e_t, \qquad e_t = \rho_t \left(\gamma \lambda e_{t-1} + M_t \phi(s_t)\right), \qquad M_t = \lambda i(s_t) + (1 - \lambda) F_t, \qquad F_t = \gamma \rho_{t-1} F_{t-1} + i(s_t),$$
where $i(s)$ is a known given function signifying the importance of the state. Note that Sutton et al. (2015) consider a state-dependent discount factor $\gamma(s)$ and bootstrapping parameter $\lambda(s)$, while in this paper we consider the special case where $\gamma$ and $\lambda$ are constant.
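The update above can be sketched as a simulation loop. This is a minimal illustration under made-up choices (a two-state MDP where action $a$ moves deterministically to state $a$, hypothetical policies, features, and rewards), not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([[0.7, 0.3], [0.7, 0.3]])   # target policy pi(a|s)  (illustrative)
mu = np.array([[0.5, 0.5], [0.5, 0.5]])   # behavior policy mu(a|s) (illustrative)
phi = np.array([[1.0], [2.0]])            # one feature per state   (illustrative)
reward = np.array([0.0, 1.0])             # reward for entering a state (assumption)
gamma, lam, alpha = 0.9, 0.5, 0.005
interest = np.ones(2)                     # i(s) = 1 for all states

w = np.zeros(1)
s = 0
F, e, rho_prev = 0.0, np.zeros(1), 0.0    # so that F_0 = i(s_0) on the first step
for _ in range(10000):
    a = rng.choice(2, p=mu[s])
    s_next = a                                   # deterministic transition: action a -> state a
    rho = pi[s, a] / mu[s, a]                    # importance-sampling ratio rho_t
    F = gamma * rho_prev * F + interest[s]       # follow-on trace F_t
    M = lam * interest[s] + (1 - lam) * F        # emphasis M_t
    e = rho * (gamma * lam * e + M * phi[s])     # emphatic eligibility trace e_t
    delta = reward[s_next] + gamma * phi[s_next] @ w - phi[s] @ w
    w = w + alpha * delta * e
    rho_prev, s = rho, s_next

assert np.all(np.isfinite(w))
```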
The emphatic weight vector $f$ is defined by
$$f^{\top} = d_{\mu}^{\top} \left(I - \gamma P^{\pi}\right)^{-1} \left(I - \gamma \lambda P^{\pi}\right), \qquad (2)$$
where, for the analysis below, we take a unit interest function, $i(s) = 1$ for all $s$; the general case replaces $d_{\mu}$ with $d_{\mu,i}(s) = d_{\mu}(s)\, i(s)$.
Notice that in the case of general $\lambda$, the Bellman operator is:
$$T^{(\lambda)} v = \left(I - \gamma \lambda P^{\pi}\right)^{-1} \left(R^{\pi} + \gamma (1 - \lambda) P^{\pi} v\right). \qquad (3)$$
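As a quick sanity check on Eq. (3) (with illustrative numbers): the true value function $V^{\pi}$, which satisfies $V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}$, is a fixed point of $T^{(\lambda)}$ for every $\lambda$:

```python
import numpy as np

gamma, lam = 0.9, 0.5
P = np.array([[0.5, 0.5], [0.2, 0.8]])   # target-policy transition matrix (illustrative)
R = np.array([1.0, 0.0])                 # target-policy reward vector (illustrative)

V = np.linalg.solve(np.eye(2) - gamma * P, R)   # true value function: V = R + gamma P V

def T_lam(v):
    # Eq. (3): T^(lambda) v = (I - gamma*lam*P)^{-1} (R + gamma*(1-lam)*P v)
    return np.linalg.solve(np.eye(2) - gamma * lam * P,
                           R + gamma * (1 - lam) * P @ v)

# V^pi is a fixed point of T^(lambda) for any lambda.
assert np.allclose(T_lam(V), V)
```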
Mahmood et al. (2015) show that ETD converges to some $w^*$ that is a solution of the projected fixed-point equation:
$$\Phi w^* = \Pi_f T^{(\lambda)} \Phi w^*.$$
In this paper, we establish that the projected Bellman operator $\Pi_f T^{(\lambda)}$ is a contraction, which allows us to bound the error $\left\|\Phi w^* - V^{\pi}\right\|_f$.
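For the $\lambda = 0$ case, this projected fixed point can be computed in closed form: $\Phi w^* = \Pi_f T^{\pi} \Phi w^*$ is equivalent to the linear system $\Phi^{\top} F (R^{\pi} + \gamma P^{\pi} \Phi w^* - \Phi w^*) = 0$ with $F = \operatorname{diag}(f)$. A sketch on a random instance (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 2
gamma = 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # random target transitions
R = rng.random(n)                                           # random rewards
d_mu = rng.random(n); d_mu /= d_mu.sum()                    # random behavior distribution
Phi = rng.random((n, k))                                    # random features

f = np.linalg.solve(np.eye(n) - gamma * P.T, d_mu)          # Eq. (1)
F = np.diag(f)

# Phi w* = Pi_f (R + gamma P Phi w*)  <=>  Phi^T F (R + gamma P Phi w* - Phi w*) = 0
w_star = np.linalg.solve(Phi.T @ F @ (np.eye(n) - gamma * P) @ Phi, Phi.T @ F @ R)

Proj = Phi @ np.linalg.solve(Phi.T @ F @ Phi, Phi.T @ F)    # f-weighted projection matrix
assert np.allclose(Proj @ (R + gamma * P @ Phi @ w_star), Phi @ w_star)
```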
3 Results
We start from ETD(0). It is well known that $T^{\pi}$ is a $\gamma$-contraction with respect to the $d_{\pi}$-weighted Euclidean norm (Bertsekas & Tsitsiklis, 1996). However, it is not immediate that the composition $\Pi_f T^{\pi}$ is a contraction in any norm. Indeed, for the TD(0) algorithm (Sutton & Barto, 1998), a similar representation as a projected Bellman operator holds, but it may be shown that in the off-policy setting the algorithm diverges (Baird, 1995).
The following theorem shows that for ETD(0), the projected Bellman operator $\Pi_f T^{\pi}$ is indeed a contraction.
Theorem 1.
Denote $\beta \doteq \sqrt{\gamma \left(1 - \min_s \frac{d_{\mu}(s)}{f(s)}\right)}$. Then $\Pi_f T^{\pi}$ is a $\beta$-contraction with respect to the Euclidean $f$-weighted norm, namely, for all $v_1, v_2 \in \mathbb{R}^{|S|}$,
$$\left\|\Pi_f T^{\pi} v_1 - \Pi_f T^{\pi} v_2\right\|_f \le \beta \left\|v_1 - v_2\right\|_f.$$
Proof.
Let $v_1, v_2 \in \mathbb{R}^{|S|}$. We have
$$\begin{aligned}
\left\|T^{\pi} v_1 - T^{\pi} v_2\right\|_f^2 &= \gamma^2 \sum_s f(s) \Big(\sum_{s'} P^{\pi}(s'|s) \left(v_1(s') - v_2(s')\right)\Big)^2 \\
&\overset{(a)}{\le} \gamma^2 \sum_s f(s) \sum_{s'} P^{\pi}(s'|s) \left(v_1(s') - v_2(s')\right)^2 \\
&\overset{(b)}{=} \gamma \sum_{s'} \left(f(s') - d_{\mu}(s')\right) \left(v_1(s') - v_2(s')\right)^2, \qquad (4)
\end{aligned}$$
where (a) follows from Jensen's inequality, applied to the probability distribution $P^{\pi}(\cdot|s)$:
$$\Big(\sum_{s'} P^{\pi}(s'|s)\, x(s')\Big)^2 \le \sum_{s'} P^{\pi}(s'|s)\, x(s')^2, \qquad (5)$$
and (b) is by the definition of $f$ in (1), which gives $\sum_s f(s) P^{\pi}(s'|s) = \left(f(s') - d_{\mu}(s')\right)/\gamma$.
Notice that for every $s$:
$$f(s) - d_{\mu}(s) = f(s)\left(1 - \frac{d_{\mu}(s)}{f(s)}\right) \le f(s)\left(1 - \min_{s'} \frac{d_{\mu}(s')}{f(s')}\right). \qquad (6)$$
Therefore:
$$\left\|T^{\pi} v_1 - T^{\pi} v_2\right\|_f^2 \le \gamma \left(1 - \min_s \frac{d_{\mu}(s)}{f(s)}\right) \left\|v_1 - v_2\right\|_f^2 = \beta^2 \left\|v_1 - v_2\right\|_f^2, \qquad (7)$$
and:
$$\left\|T^{\pi} v_1 - T^{\pi} v_2\right\|_f \le \beta \left\|v_1 - v_2\right\|_f. \qquad (8)$$
Hence, $T^{\pi}$ is a $\beta$-contraction. Since $\Pi_f$ is a non-expansion in the $f$-weighted norm (Bertsekas & Tsitsiklis, 1996), $\Pi_f T^{\pi}$ is a $\beta$-contraction as well. ∎
Notice that $\beta$ obtains values ranging from $\sqrt{\gamma}$ (when there is a state visited by the target policy, but not the behavior policy, so that $\min_s d_{\mu}(s)/f(s) = 0$), to $\gamma$ (when the two policies are identical, in which case $f = d_{\mu}/(1-\gamma)$ and $\min_s d_{\mu}(s)/f(s) = 1 - \gamma$). In the latter case we obtain the classical bound: $\beta = \gamma$. This result resembles that of Kolter (2011), who used the discrepancy between the behavior and the target policy to bound the TD error.
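Theorem 1 also lends itself to a numerical check: since $\Pi_f T^{\pi}$ is affine with linear part $\gamma \Pi_f P^{\pi}$, its contraction modulus in the $f$-norm equals the spectral norm of $F^{1/2} (\gamma \Pi_f P^{\pi}) F^{-1/2}$, with $F = \operatorname{diag}(f)$. A sketch on a random instance (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
gamma = 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # random target transitions
d_mu = rng.random(n); d_mu /= d_mu.sum()                    # random behavior distribution
Phi = rng.random((n, 2))                                    # random features

f = np.linalg.solve(np.eye(n) - gamma * P.T, d_mu)          # Eq. (1)
beta = np.sqrt(gamma * (1 - np.min(d_mu / f)))              # modulus from Theorem 1

F = np.diag(f)
Proj = Phi @ np.linalg.solve(Phi.T @ F @ Phi, Phi.T @ F)    # f-weighted projection
S = np.diag(np.sqrt(f))
# sup ||A x||_f / ||x||_f equals the spectral norm of S A S^{-1}.
op_norm = np.linalg.norm(S @ (gamma * Proj @ P) @ np.linalg.inv(S), 2)
assert op_norm <= beta + 1e-8 and beta < 1
```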
An immediate consequence of Theorem 1 is the following error bound, based on Lemma 6.9 of Bertsekas & Tsitsiklis (1996).
Corollary 1.
We have
$$\left\|\Phi w^* - V^{\pi}\right\|_f \le \frac{1}{\sqrt{1 - \beta^2}} \left\|\Pi_f V^{\pi} - V^{\pi}\right\|_f.$$
In a sense, the error $\left\|\Pi_f V^{\pi} - V^{\pi}\right\|_f$ is the best approximation we can hope for within the capability of our linear approximation architecture. Corollary 1 guarantees that we are not too far away from it.
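Corollary 1 can likewise be verified numerically. The sketch below (random instance, illustrative only) computes the true value function, the ETD(0) fixed point, and both sides of the bound:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2
gamma = 0.95
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # random target transitions
R = rng.random(n)
d_mu = rng.random(n); d_mu /= d_mu.sum()
Phi = rng.random((n, k))

f = np.linalg.solve(np.eye(n) - gamma * P.T, d_mu)          # Eq. (1)
F = np.diag(f)
beta = np.sqrt(gamma * (1 - np.min(d_mu / f)))              # Theorem 1 modulus

V = np.linalg.solve(np.eye(n) - gamma * P, R)               # true value function V^pi
w_star = np.linalg.solve(Phi.T @ F @ (np.eye(n) - gamma * P) @ Phi,
                         Phi.T @ F @ R)                     # ETD(0) fixed point
Proj = Phi @ np.linalg.solve(Phi.T @ F @ Phi, Phi.T @ F)    # f-weighted projection

f_norm = lambda x: np.sqrt(x @ F @ x)
lhs = f_norm(Phi @ w_star - V)                              # actual error
rhs = f_norm(Proj @ V - V) / np.sqrt(1 - beta**2)           # Corollary 1 bound
assert lhs <= rhs + 1e-8
```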
Now we move on to the analysis of ETD($\lambda$):
Theorem 2.
$\Pi_f T^{(\lambda)}$ is a $\tilde{\beta}$-contraction with respect to the Euclidean $f$-weighted norm, where $\tilde{\gamma} \doteq \frac{\gamma(1-\lambda)}{1 - \gamma\lambda}$ and $\tilde{\beta} \doteq \sqrt{\tilde{\gamma} \left(1 - \min_s \frac{d_{\mu}(s)}{f(s)}\right)}$. Namely,
$$\left\|\Pi_f T^{(\lambda)} v_1 - \Pi_f T^{(\lambda)} v_2\right\|_f \le \tilde{\beta} \left\|v_1 - v_2\right\|_f.$$
Proof.
The proof is almost identical to the proof of Theorem 1, only now we cannot apply Jensen's inequality directly, since the rows of $\gamma(1-\lambda)\left(I - \gamma\lambda P^{\pi}\right)^{-1} P^{\pi}$, the linear part of $T^{(\lambda)}$, do not sum to $1$. However:
$$\gamma(1-\lambda)\left(I - \gamma\lambda P^{\pi}\right)^{-1} P^{\pi} \mathbf{1} = \frac{\gamma(1-\lambda)}{1 - \gamma\lambda}\mathbf{1} = \tilde{\gamma}\mathbf{1}, \qquad (9)$$
and each entry of $\left(I - \gamma\lambda P^{\pi}\right)^{-1} P^{\pi}$ is non-negative. Therefore the normalized matrix $\bar{P} \doteq \tilde{\gamma}^{-1}\gamma(1-\lambda)\left(I - \gamma\lambda P^{\pi}\right)^{-1} P^{\pi}$ is stochastic, and Jensen's inequality will hold for it. Let $v_1, v_2 \in \mathbb{R}^{|S|}$. We have
$$\begin{aligned}
\left\|T^{(\lambda)} v_1 - T^{(\lambda)} v_2\right\|_f^2 &= \tilde{\gamma}^2 \sum_s f(s) \Big(\sum_{s'} \bar{P}(s'|s) \left(v_1(s') - v_2(s')\right)\Big)^2 \\
&\overset{(a)}{\le} \tilde{\gamma}^2 \sum_s f(s) \sum_{s'} \bar{P}(s'|s) \left(v_1(s') - v_2(s')\right)^2 \\
&\overset{(b)}{=} \tilde{\gamma} \sum_{s'} \left(f(s') - d_{\mu}(s')\right) \left(v_1(s') - v_2(s')\right)^2, \qquad (10)
\end{aligned}$$
where (a) follows from Jensen's inequality and (b) from Equation (2), which gives $\sum_s f(s) \bar{P}(s'|s) = \left(f(s') - d_{\mu}(s')\right)/\tilde{\gamma}$.
Therefore:
$$\left\|T^{(\lambda)} v_1 - T^{(\lambda)} v_2\right\|_f^2 \le \tilde{\gamma} \left(1 - \min_s \frac{d_{\mu}(s)}{f(s)}\right) \left\|v_1 - v_2\right\|_f^2 = \tilde{\beta}^2 \left\|v_1 - v_2\right\|_f^2, \qquad (11)$$
and:
$$\left\|T^{(\lambda)} v_1 - T^{(\lambda)} v_2\right\|_f \le \tilde{\beta} \left\|v_1 - v_2\right\|_f. \qquad (12)$$
Hence, $T^{(\lambda)}$ is a $\tilde{\beta}$-contraction. Since $\Pi_f$ is a non-expansion in the $f$-weighted norm (Bertsekas & Tsitsiklis, 1996), $\Pi_f T^{(\lambda)}$ is a $\tilde{\beta}$-contraction as well. ∎
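Theorem 2 admits the same kind of numerical check as Theorem 1, now with the linear part $\gamma(1-\lambda)(I - \gamma\lambda P^{\pi})^{-1} P^{\pi}$ of $T^{(\lambda)}$ and the weights of Eq. (2). A sketch on a random instance (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
gamma, lam = 0.9, 0.5
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # random target transitions
d_mu = rng.random(n); d_mu /= d_mu.sum()
Phi = rng.random((n, 2))

g = np.linalg.solve(np.eye(n) - gamma * P.T, d_mu)
f = g - gamma * lam * (P.T @ g)          # Eq. (2): f^T = d_mu^T (I-gamma P)^{-1} (I-gamma lam P)
gamma_t = gamma * (1 - lam) / (1 - gamma * lam)             # gamma-tilde
beta_t = np.sqrt(gamma_t * (1 - np.min(d_mu / f)))          # modulus from Theorem 2

F = np.diag(f)
Proj = Phi @ np.linalg.solve(Phi.T @ F @ Phi, Phi.T @ F)    # f-weighted projection
A = gamma * (1 - lam) * np.linalg.solve(np.eye(n) - gamma * lam * P, P)  # linear part of T^(lambda)
S = np.diag(np.sqrt(f))
op_norm = np.linalg.norm(S @ (Proj @ A) @ np.linalg.inv(S), 2)
assert op_norm <= beta_t + 1e-8
```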
As before, Theorem 2 leads to the following error bound, based on Theorem 1 of Tsitsiklis & Van Roy (1997).
Corollary 2.
We have
$$\left\|\Phi w^* - V^{\pi}\right\|_f \le \frac{1}{\sqrt{1 - \tilde{\beta}^2}} \left\|\Pi_f V^{\pi} - V^{\pi}\right\|_f.$$
We now show in an example that our contraction modulus bounds are tight.
Example
Consider an MDP with two states: Left and Right. In each state there are two identical actions leading to either Left or Right deterministically. The behavior policy will choose Right with probability $1 - \epsilon$, and the target policy will choose Left with probability $1$. Calculating the quantities of interest (ordering the states as Left, Right):
$$d_{\mu} = \left(\epsilon,\; 1 - \epsilon\right), \qquad f = \left(\frac{\epsilon + \gamma(1 - \epsilon)}{1 - \gamma},\; 1 - \epsilon\right).$$
So for $\lambda = 0$:
$$\beta = \sqrt{\gamma \left(1 - \frac{\epsilon(1-\gamma)}{\epsilon(1-\gamma) + \gamma}\right)} = \frac{\gamma}{\sqrt{\gamma + \epsilon(1-\gamma)}},$$
and for small $\epsilon$ we obtain that $\beta \approx \sqrt{\gamma}$.
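A short numerical check of this example (assuming, as in the construction above, that the target policy always moves Left while the behavior policy moves Right with probability $1 - \epsilon$; $\gamma$ and $\epsilon$ are illustrative):

```python
import numpy as np

gamma, eps = 0.9, 1e-4
d_mu = np.array([eps, 1 - eps])          # behavior stationary distribution (Left, Right)
P = np.array([[1.0, 0.0],                # target policy: always move Left
              [1.0, 0.0]])
f = np.linalg.solve(np.eye(2) - gamma * P.T, d_mu)   # Eq. (1)
beta = np.sqrt(gamma * (1 - np.min(d_mu / f)))
# As eps -> 0 the modulus approaches its worst-case value sqrt(gamma).
assert beta < np.sqrt(gamma)
assert np.isclose(beta, np.sqrt(gamma), atol=1e-3)
```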
4 Discussion
Interestingly, the ETD error bounds in Corollaries 1 and 2 are more conservative than the error bounds for standard on-policy TD (Bertsekas & Tsitsiklis, 1996; Tsitsiklis & Van Roy, 1997) by a square-root factor in the contraction modulus, which may approach $\sqrt{\gamma}$ instead of $\gamma$. Thus, it appears that there is a price to pay for off-policy convergence. Future work should address the implications of the different norms in these bounds (the $f$-weighted norm here, versus the $d_{\pi}$-weighted norm in the on-policy bounds).
Nevertheless, we believe that the results in this paper motivate ETD (or its least-squares counterpart; Yu 2015) as the method of choice for off-policy policy-evaluation in MDPs.
References
- Baird (1995) Baird, L. Residual algorithms: Reinforcement learning with function approximation. In ICML, 1995.
- Bertsekas (2012) Bertsekas, D. Dynamic Programming and Optimal Control, Vol II. Athena Scientific, 4th edition, 2012.
- Bertsekas & Tsitsiklis (1996) Bertsekas, D. and Tsitsiklis, J. Neuro-Dynamic Programming. Athena Scientific, 1996.
- Bertsekas & Yu (2009) Bertsekas, D. and Yu, H. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227(1):27–50, 2009.
- Kolter (2011) Kolter, J. Z. The fixed points of off-policy TD. In Advances in Neural Information Processing Systems, pp. 2169–2177, 2011.
- Mahmood et al. (2015) Mahmood, A. R., Yu, H., White, M., and Sutton, R. S. Emphatic Temporal-Difference Learning. ArXiv e-prints, 2015.
- Sutton & Barto (1998) Sutton, R. S. and Barto, A. Reinforcement learning: An introduction. Cambridge Univ Press, 1998.
- Sutton et al. (2009) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML, 2009.
- Sutton et al. (2015) Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. CoRR, abs/1503.04269, 2015.
- Tsitsiklis & Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
- Yu (2012) Yu, H. Least squares temporal difference methods: An analysis under general conditions. SIAM Journal on Control and Optimization, 50(6):3310–3343, 2012.
- Yu (2015) Yu, H. On convergence of emphatic temporal-difference learning. In COLT, 2015.