Convex Q Learning in a Stochastic Environment: Extended Version
Published as an arXiv preprint, 2023
Recommended citation: Lu, Fan, and Sean Meyn. "Convex Q Learning in a Stochastic Environment: Extended Version." arXiv preprint arXiv:2309.05105 (2023).
The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation. The algorithms and theory rest on a relaxation of a dual of Manne’s celebrated linear-programming characterization of optimal control. The first set of contributions concerns properties of the relaxation, described as a deterministic convex program: conditions are identified under which the program admits a bounded solution, along with a significant relationship between the solution of the new convex program and the solution of standard Q-learning (a schematic of the program is sketched after the list below). The second set of contributions concerns algorithm design and analysis:
- A direct model-free method for approximating the convex program for Q-learning shares key properties with its ideal counterpart; in particular, a bounded solution is ensured under a simple condition on the basis functions;
- The proposed algorithms are convergent, and new techniques are introduced to obtain the rate of convergence in a mean-square sense;
- The approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering “relative” dynamic programming equations;
- The theory is illustrated with an application to a classical inventory control problem.

This is an extended version of an article to appear in the forthcoming IEEE Conference on Decision and Control.
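For orientation, the convex program at the heart of the approach can be sketched in the discounted-cost setting as follows. This is a schematic only: the notation (a parameterized Q-function $Q^\theta$, its minimum over actions $\underline{Q}^\theta(x) = \min_u Q^\theta(x,u)$, a positive weighting measure $\mu$, one-step cost $c$, and discount factor $\gamma$) is assumed here for illustration rather than quoted from the paper:

$$
\max_{\theta}\ \langle \mu, Q^\theta \rangle
\quad \text{subject to} \quad
Q^\theta(x,u) \le c(x,u) + \gamma\, \mathsf{E}\bigl[\underline{Q}^\theta(X_{k+1}) \mid X_k = x,\ U_k = u\bigr] \ \ \text{for all } (x,u).
$$

For a linear parameterization, $Q^\theta$ is linear in $\theta$ and $\underline{Q}^\theta$ is concave in $\theta$ (a minimum of linear functions), so the constraint set is convex and the relaxation is a convex program rather than a fixed-point equation.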
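As a concrete toy illustration, the exact (unrelaxed, tabular) form of this linear program can be solved directly when the model is known. The sketch below assumes a small randomly generated finite MDP and uses cvxpy; every name in it (`P`, `c`, `mu`, `gamma`) is a placeholder, and it is not the model-free, function-approximation algorithm analyzed in the paper, which replaces the explicit constraints with sampled data:

```python
# A minimal sketch, assuming a tiny finite MDP with a *known* model; this
# solves a Manne-style LP in its Q-function form exactly. It illustrates
# the program above, not the paper's model-free algorithm.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
nX, nU, gamma = 4, 2, 0.9                 # states, actions, discount factor

# Placeholder model: transition kernel P[u][x, x'] and cost c[x, u].
P = rng.random((nU, nX, nX))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((nX, nU))
mu = np.full((nX, nU), 1.0 / (nX * nU))   # positive weighting over (x, u)

Q = cp.Variable((nX, nU))                 # Q-function values
V = cp.Variable(nX)                       # stands in for min_u Q(x, u)

# V(x) <= Q(x, u) for every u, so V lower-bounds the minimum over actions.
constraints = [V <= Q[:, u] for u in range(nU)]
# Relaxed Bellman inequality: Q(x, u) <= c(x, u) + gamma * E[V(X_{k+1})].
constraints += [Q[:, u] <= c[:, u] + gamma * (P[u] @ V) for u in range(nU)]

# Maximizing <mu, Q> pushes the inequalities to equality at the optimum,
# so the solver returns Q* on this toy problem.
problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(mu, Q))), constraints)
problem.solve()
print("Estimated Q*:\n", Q.value)
```

The auxiliary variable `V` is a standard LP device for the nonlinear term $\min_u Q(x,u)$: because the objective pushes $Q$, and hence the right-hand sides, upward, `V` meets the minimum over actions at the optimum.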