Publications

Enhancing Wound Healing via Deep Reinforcement Learning for Optimal Therapeutics

Published in Royal Society Open Science (To Appear), 2024

Finding the optimal treatment strategy to accelerate wound healing is of utmost importance, but it presents a formidable challenge due to the intrinsic nonlinear nature of the process. We propose an adaptive closed-loop control framework that incorporates deep learning, optimal control, and reinforcement learning to accelerate wound healing. Our approach adaptively learns a linear representation of the nonlinear wound healing dynamics using deep learning and interactively trains a deep reinforcement learning (DRL) agent to track the optimal signal derived from this representation, without the need for intricate mathematical modeling. It not only reduces the wound healing time by 45.56% compared to no treatment, but also offers a safer and more economical treatment strategy. The proposed methodology shows significant potential for expediting wound healing by effectively integrating perception, predictive modeling, and optimal adaptive control.
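
To make the tracking component concrete, here is a minimal sketch of a reward that penalizes deviation of a simulated healing state from a reference trajectory, written as a gymnasium-style environment. Everything in it (the logistic dynamics, the linear reference, the effort penalty) is a placeholder assumption rather than the paper's wound model; a standard DRL agent could then be trained against env.step.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class HealingTrackingEnv(gym.Env):
    """Toy closed-loop tracking environment.

    Illustrative only: the logistic "healing" dynamics, the linear reference
    trajectory, and the treatment bounds are placeholder assumptions, not the
    wound model or controller used in the paper.
    """

    def __init__(self, horizon=100):
        super().__init__()
        self.horizon = horizon
        self.action_space = spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)
        self.reference = np.linspace(0.05, 1.0, horizon)  # desired healing progress

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.x = 0, 0.05  # small initial healing progress
        return np.array([self.x, self.reference[0]], dtype=np.float32), {}

    def step(self, action):
        u = float(np.clip(action[0], 0.0, 1.0))  # treatment intensity
        # Placeholder nonlinear dynamics: logistic growth modulated by treatment.
        self.x += 0.05 * (1.0 + u) * self.x * (1.0 - self.x)
        self.t += 1
        ref = self.reference[min(self.t, self.horizon - 1)]
        # Tracking reward: stay close to the reference, penalize treatment effort.
        reward = -(self.x - ref) ** 2 - 0.1 * u ** 2
        terminated = bool(self.x >= 1.0)
        truncated = bool(self.t >= self.horizon)
        obs = np.array([self.x, ref], dtype=np.float32)
        return obs, reward, terminated, truncated, {}
```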

Recommended citation: Lu, Fan, Ksenia Zlobina, Nicholas Rondoni, Sam Teymoori, and Marcella Gomez. "Enhancing Wound Healing via Deep Reinforcement Learning for Optimal Therapeutics." Royal Society Open Science (To Appear).

Convex Q Learning in a Stochastic Environment: Extended Version

Published in arXiv preprint, 2023

The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation. The algorithms and theory rest on a relaxation of a dual of Manne’s celebrated linear programming characterization of optimal control. The first set of contributions concerns properties of the relaxation, described as a deterministic convex program: we identify conditions for a bounded solution and a significant relationship between the solution to the new convex program and the solution to standard Q-learning. The second set of contributions concerns algorithm design and analysis: 1. A direct model-free method for approximating the convex program for Q-learning shares properties with its ideal. In particular, a bounded solution is ensured subject to a simple property of the basis functions; 2. The proposed algorithms are convergent, and new techniques are introduced to obtain the rate of convergence in a mean-square sense; 3. The approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering “relative” dynamic programming equations; 4. The theory is illustrated with an application to a classical inventory control problem. This is an extended version of an article to appear in the forthcoming IEEE Conference on Decision and Control.
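
For intuition about what such a deterministic convex program can look like, here is a tabular toy version of the Q-learning LP: maximize a weighted sum of Q subject to the Bellman inequality. The random MDP, the uniform weighting, and the use of a model-based solver are my illustrative assumptions (the paper's algorithms are model-free and use function approximation); the sketch assumes the cvxpy package.

```python
import numpy as np
import cvxpy as cp

# Toy tabular MDP (illustrative, not from the paper): 3 states, 2 actions,
# discounted-cost minimization.
rng = np.random.default_rng(0)
n_x, n_u, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_u))  # P[x, u, :] = transition law
c = rng.uniform(0.0, 1.0, size=(n_x, n_u))        # one-step cost

Q = cp.Variable((n_x, n_u))
Q_under = cp.min(Q, axis=1)  # value function induced by Q: min over actions

# Bellman inequality: Q(x,u) <= c(x,u) + gamma * E[ min_u' Q(X', u') | x, u ].
constraints = [
    Q[x, u] <= c[x, u] + gamma * cp.sum(cp.multiply(P[x, u], Q_under))
    for x in range(n_x)
    for u in range(n_u)
]

# Any feasible Q is a lower bound on Q*; maximizing a positive-weighted sum of
# its entries therefore recovers Q* exactly in this tabular case.
prob = cp.Problem(cp.Maximize(cp.sum(Q)), constraints)
prob.solve()
print(np.round(Q.value, 3))
```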

Recommended citation: Lu, Fan, and Sean Meyn. "Convex Q Learning in a Stochastic Environment: Extended Version." arXiv preprint arXiv:2309.05105 (2023).

Convex Analytic Theory for Convex Q-Learning

Published in 2022 IEEE 61st Conference on Decision and Control (CDC), 2022

In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on the linear programming (LP) formulation of optimal control of Manne. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter, while building bridges with the former. The main contributions follow: 1. The dual of convex Q-learning is not precisely Manne’s LP or a version of logistic Q-learning, but has similar structure that reveals the need for regularization to avoid over-fitting. 2. A sufficient condition is obtained for a bounded solution to the Q-learning LP. 3. Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous time model. The challenge is addressed using state-dependent sampling. The theory is illustrated with applications to examples from OpenAI gym. It is shown that convex Q-learning is successful in cases where standard Q-learning diverges, such as the LQR problem.

Recommended citation: Lu, Fan, et al. "Convex analytic theory for convex Q-learning." 2022 IEEE 61st Conference on Decision and Control (CDC). IEEE, 2022.

Sufficient Exploration for Convex Q-learning

Published in arXiv preprint, 2022

In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on the linear programming (LP) formulation of optimal control of Manne. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter, while building bridges with the former. The main contributions follow: (i) The dual of convex Q-learning is not precisely Manne’s LP or a version of logistic Q-learning, but has similar structure that reveals the need for regularization to avoid over-fitting. (ii) A sufficient condition is obtained for a bounded solution to the Q-learning LP. (iii) Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous time model. The challenge is addressed using state-dependent sampling. The theory is illustrated with applications to examples from OpenAI gym. It is shown that convex Q-learning is successful in cases where standard Q-learning diverges, such as the LQR problem.

Recommended citation: Lu, Fan, Prashant Mehta, Sean Meyn, and Gergely Neu. "Sufficient Exploration for Convex Q-learning." arXiv preprint arXiv:2210.09409 (2022).

Model-Free Characterizations of the Hamilton-Jacobi-Bellman Equation and Convex Q-Learning in Continuous Time

Published in arXiv preprint, 2022

Convex Q-learning is a recent approach to reinforcement learning, motivated by the possibility of a firmer theory for convergence, and the possibility of making use of greater a priori knowledge regarding policy or value function structure. This paper explores algorithm design in the continuous time domain, with a finite-horizon optimal control objective. The main contributions are: (i) Algorithm design is based on a new Q-ODE, which defines a model-free characterization of the Hamilton-Jacobi-Bellman equation. (ii) The Q-ODE motivates a new formulation of convex Q-learning that avoids the approximations appearing in prior work. The Bellman error used in the algorithm is defined by filtered measurements, which is beneficial in the presence of measurement noise. (iii) A characterization of boundedness of the constraint region is obtained through a non-trivial extension of recent results from the discrete time setting. (iv) The theory is illustrated in an application to resource allocation for distributed energy resources, for which the theory is ideally suited.
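
For orientation, recall the finite-horizon Hamilton-Jacobi-Bellman (HJB) equation that such a Q-ODE characterizes in model-free terms; the notation below is generic and mine, and the paper's Q-ODE and filtered Bellman error differ in detail:

\[
-\partial_t J^*(x,t) \;=\; \min_{u}\,\bigl[\, c(x,u) + \nabla_x J^*(x,t) \cdot f(x,u) \,\bigr],
\qquad J^*(x,T) = V_T(x),
\]

for dynamics \(\dot x = f(x,u)\), running cost \(c\), and terminal cost \(V_T\). Defining a continuous-time analogue of the Q-function, \(H^*(x,u,t) = c(x,u) + \partial_t J^*(x,t) + \nabla_x J^*(x,t)\cdot f(x,u)\), the HJB equation reads \(\min_u H^*(x,u,t) = 0\), and along any trajectory

\[
\frac{d}{dt} J^*\bigl(x(t),t\bigr) \;=\; H^*\bigl(x(t),u(t),t\bigr) - c\bigl(x(t),u(t)\bigr),
\]

where the model \(f\) enters only through the observed velocity \(\dot x(t)\); identities of this kind are what make a model-free, trajectory-based (and, with filtering, noise-robust) characterization possible.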

Recommended citation: Lu, F., Mathias, J., Meyn, S., and Kalsi, K. "Model-Free Characterizations of the Hamilton-Jacobi-Bellman Equation and Convex Q-Learning in Continuous Time." arXiv preprint arXiv:2210.08131 (2022).

Convex Q-Learning

Published in 2021 American Control Conference (ACC), 2021

It is well known that the extension of Watkins’ algorithm to general function approximation settings is challenging: does the “projected Bellman equation” have a solution? If so, is the solution useful in the sense of generating a good policy? And, if the preceding questions are answered in the affirmative, is the algorithm consistent? These questions are unanswered even in the special case of Q-function approximations that are linear in the parameter. The challenge seems paradoxical, given the long history of convex analytic approaches to dynamic programming. Our main contributions are summarized as follows: (i) A new class of convex Q-learning algorithms is introduced based on a convex relaxation of the Bellman equation. Convergence is established under general conditions for linear function approximation. (ii) A batch implementation appears similar to LSPI and DQN algorithms, but the difference is substantial: while convex Q-learning solves a convex program that approximates the Bellman equation, theory for DQN is no stronger than for Watkins’ algorithm with function approximation. These results are obtained for deterministic nonlinear systems with a total cost criterion. Extensions are proposed.
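
In broad strokes (the notation is mine and simplified to the deterministic, total-cost setting), the relaxation replaces the Bellman equation with a Bellman inequality that is convex in a linearly parameterized Q-function:

\[
\max_{\theta}\ \langle \mu, Q^\theta \rangle
\quad \text{subject to} \quad
Q^\theta(x,u) \;\le\; c(x,u) + \underline{Q}^\theta\!\bigl(\mathrm{F}(x,u)\bigr)
\ \ \text{for all } (x,u),
\]

where \(Q^\theta = \theta^\top \psi\) for basis functions \(\psi\), \(\underline{Q}^\theta(x) = \min_u Q^\theta(x,u)\), \(\mathrm{F}\) is the state transition map, and \(\mu\) is a positive weighting measure. Since \(\underline{Q}^\theta\) is concave in \(\theta\), the constraint set is convex, and under suitable conditions any feasible \(Q^\theta\) lower-bounds \(Q^*\), so maximizing the weighted objective pushes the approximation toward \(Q^*\).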

Recommended citation: Lu, Fan, Prashant G. Mehta, Sean P. Meyn, and Gergely Neu. "Convex Q-learning." In 2021 American Control Conference (ACC), pp. 4749-4756. IEEE, 2021.

Zap Q-Learning With Nonlinear Function Approximation

Published in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020

Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting, and optimal stopping. This paper introduces a new framework for analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to choice of function approximation architecture.
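
For context, Zap algorithms are matrix-gain, two-time-scale instances of stochastic approximation; in generic notation (mine, with one common sign convention):

\[
\theta_{n+1} = \theta_n - \alpha_{n+1}\, \widehat A_{n+1}^{-1}\, f(\theta_n, \Phi_{n+1}),
\qquad
\widehat A_{n+1} = \widehat A_n + \gamma_{n+1}\bigl( \partial_\theta f(\theta_n, \Phi_{n+1}) - \widehat A_n \bigr),
\]

where \(f\) is the update direction (a temporal-difference term in Q-learning), \(\Phi_n\) is the observed state-action process, and \(\gamma_n/\alpha_n \to \infty\) so that \(\widehat A_n\) tracks the Jacobian of the mean flow on the faster time scale. The recursion approximates a Newton-Raphson flow, which is the source of both the acceleration and the stability theory; nonlinear (e.g., neural network) function approximation enters through \(f\) and its parameter Jacobian.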

Recommended citation: Chen, Shuhang, Adithya M. Devraj, Fan Lu, Ana Busic, and Sean Meyn. "Zap Q-learning with nonlinear function approximation." Advances in Neural Information Processing Systems 33 (2020): 16879-16890.

Adaptive leader-follower formation control and obstacle avoidance via deep reinforcement learning

Published in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019

We propose a deep reinforcement learning (DRL) methodology for the tracking, obstacle avoidance, and formation control of nonholonomic robots. By separating vision-based control into a perception module and a controller module, we can train a DRL agent without sophisticated physics or 3D modeling. In addition, the modular framework avoids costly retraining of an image-to-action end-to-end neural network, and provides flexibility in transferring the controller to different robots. First, we train a convolutional neural network (CNN) to accurately localize in an indoor setting with dynamic foreground/background. Then, we design a new DRL algorithm named Momentum Policy Gradient (MPG) for continuous control tasks and prove its convergence. We also show that MPG is robust at tracking varying leader movements and can naturally be extended to problems of formation control. Leveraging reward shaping, features such as collision and obstacle avoidance can be easily integrated into a DRL controller.
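
The perception/controller split can be pictured as two independently trainable networks that communicate only through a low-dimensional relative-pose estimate. The sketch below is a generic schematic under assumed interfaces (image size, pose parameterization, and action bounds are placeholders, not the paper's networks), written with PyTorch:

```python
import torch
import torch.nn as nn

class PerceptionCNN(nn.Module):
    """Camera image -> (x, y, heading) of the leader relative to the follower."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 3)

    def forward(self, img):
        return self.head(self.features(img))

class Controller(nn.Module):
    """Relative pose -> (linear velocity, angular velocity) commands."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())

    def forward(self, pose):
        return self.net(pose)

# Because the modules share only the 3-D pose interface, the controller can be
# retrained or transferred to a different robot without touching perception.
perception, controller = PerceptionCNN(), Controller()
img = torch.randn(1, 3, 64, 64)      # dummy camera frame
action = controller(perception(img))
print(action.shape)                   # torch.Size([1, 2])
```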

Recommended citation: Zhou, Yanlin, Fan Lu, George Pu, Xiyao Ma, Runhan Sun, Hsi-Yuan Chen, and Xiaolin Li. "Adaptive leader-follower formation control and obstacle avoidance via deep reinforcement learning." In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4273-4280. IEEE, 2019.

Modular Platooning and Formation Control

Published in ICML 2019 Workshop RL4RealLife, 2019

Moving vehicles in formation, or platooning, can dramatically increase road capacity. While traditional control methods can manage fleets of vehicles, they do not address the issues of dynamic road conditions and scalability (i.e., sophisticated control law redesign and physics modeling). We propose a modular framework that avoids costly retraining of an image-to-action neural network and provides flexibility in transferring to different robots/cars, while also being more transparent than previous approaches. First, a convolutional neural network is trained to localize in an indoor setting with dynamic foreground/background. Then, we design a new deep reinforcement learning algorithm named Momentum Policy Gradient (MPG) for continuous control tasks and prove its convergence. MPG is successfully applied to the platooning problem with obstacle avoidance and intra-group collision avoidance.

Recommended citation: Zhou, Yanlin, George Pu, Fan Lu, Xiyao Ma, and Xiaolin Li. "Modular Platooning and Formation Control." (2019).

RetailNet: Enhancing Retails of Perishable Products with Multiple Selling Strategies via Pair-Wise Multi-Q Learning

Published in ICML 2019 Workshop RL4RealLife, 2019

We propose RetailNet, an end-to-end reinforcement learning (RL)-based neural network, to achieve efficient selling strategies for perishable products in order to maximize retailers’ long-term profit. We design a pair-wise multi-Q network for Q-value estimation to model each state-action pair and to capture the interdependence between actions. Generalized Advantage Estimation (GAE) and an entropy term are incorporated into the loss function to balance the tradeoff between exploitation and exploration. Experiments show that RetailNet efficiently produces near-optimal solutions, providing practitioners with valuable guidance on their inventory replenishment, pricing, and product display strategies in the retailing industry.
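
As a reminder of how the two ingredients named above usually enter a loss (the pair-wise multi-Q architecture itself is not reproduced here, and the paper's exact combination may differ):

\[
\hat A_t = \sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\]

\[
\mathcal{L}(\theta) = -\,\mathbb{E}\bigl[\log \pi_\theta(a_t \mid s_t)\,\hat A_t\bigr]
\;-\; \beta\, \mathbb{E}\bigl[\mathcal{H}\bigl(\pi_\theta(\cdot \mid s_t)\bigr)\bigr],
\]

where \(\lambda\) trades bias against variance in the advantage estimate and the entropy weight \(\beta\) encourages exploration.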

Recommended citation: Ma, Xiyao, Fan Lu, Xiajun Amy Pan, Yanlin Zhou, and Xiaolin Andy Li. "RetailNet: Enhancing Retails of Perishable Products with Multiple Selling Strategies via Pair-Wise Multi-Q Learning." (2019).

Convex Q-Learning: Theory and Applications

Published in University of Florida ProQuest Dissertations Publishing (ProQuest No. 30424301), 2023

Reinforcement learning has proven to be a highly effective technique for decision-making in complex and dynamic environments. One of the most widely used algorithms in this field is Q-learning, which enables agents to learn a policy by iteratively updating estimates of the Q-function. However, Q-learning has its limitations, particularly in handling high-dimensional state spaces. To address these challenges, recent research efforts have focused on developing new formulations of reinforcement learning algorithms that are more efficient and amenable to analysis. One promising approach is to design algorithms based on the linear programming (LP) formulation of optimal control, which leverages the reliability and speed of convex optimization. Using general linear function approximation methods and the LP approach to dynamic programming, this dissertation proposes a new class of Q-learning algorithms called convex Q-learning, along with a sequence of theoretical results. The dissertation also presents a theoretical analysis of the convergence properties of convex Q-learning, demonstrating that the algorithm converges to the optimal policy with high probability. It explores the practical applications of convex Q-learning in various domains, including robotics, inventory control, and resource allocation for distributed energy resources (DERs).

Recommended citation: Lu, F., 2023. Convex Q-Learning: Theory and Applications (Doctoral dissertation, University of Florida).