We propose a method for quantifying the similarity of learned reward functions without performing policy learning and evaluation.
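To make the idea concrete, here is a minimal sketch of one way such a comparison can work, assuming both reward functions are callables over sampled (state, action, next state) transitions. The Pearson-correlation pseudometric below is an illustrative stand-in, not necessarily the paper's exact formulation; the point is only that the distance is computed from reward samples, with no policy training anywhere.

```python
import numpy as np

def reward_distance(reward_a, reward_b, transitions):
    """Illustrative pseudometric between two reward functions:
    evaluate both on a shared batch of sampled transitions and
    turn their Pearson correlation into a distance in [0, 1].
    No policy is learned or evaluated at any point."""
    r_a = np.array([reward_a(s, a, s2) for s, a, s2 in transitions])
    r_b = np.array([reward_b(s, a, s2) for s, a, s2 in transitions])
    rho = np.clip(np.corrcoef(r_a, r_b)[0, 1], -1.0, 1.0)
    return np.sqrt(0.5 * (1.0 - rho))  # 0 = identical up to positive affine rescaling

# Hypothetical usage: compare a learned reward against ground truth
# on randomly sampled 1-D transitions.
rng = np.random.default_rng(0)
samples = [(s, a, s + a) for s, a in rng.normal(size=(1000, 2))]
true_r = lambda s, a, s2: s2 - s
learned = lambda s, a, s2: 2.0 * (s2 - s) + 1.0  # affine transform of true_r
print(reward_distance(learned, true_r, samples))  # ~0.0
```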
Why should MuZero perform better than other DRL algorithms?
One potential method for validating autonomous vehicles is to evaluate them in a simulator, which requires highly realistic models of human driving behavior. Prior work learned human driver models with generative adversarial imitation learning, but trained them in a single-agent environment; as a result, the learned policies fail when many of them are executed simultaneously. This work addresses the problem by performing training in a multi-agent setting.
We learn UAV collision avoidance policies directly from a simulator with a Deep Q-Network (DQN). This approach not only solves for policies more quickly than value iteration, but also arrives at safer and more efficient solutions by learning a direct mapping from the continuous state space to action values instead of discretizing it.
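A minimal sketch of the underlying machinery, assuming a small fully connected Q-network over a hypothetical 5-dimensional continuous encounter state and 3 discrete advisories; the paper's actual state encoding, architecture, and hyperparameters are assumptions here:

```python
import copy
import torch
import torch.nn as nn

# Hypothetical encoding: a 5-D continuous encounter state
# (e.g. relative position, velocity, bearing) and 3 discrete
# advisories (e.g. climb, descend, maintain).
STATE_DIM, N_ACTIONS = 5, 3

# The network consumes raw continuous states directly, so no
# grid discretization of the state space is needed.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),        # one Q-value per advisory
)
target_net = copy.deepcopy(q_net)    # periodically re-synced copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s2, done, gamma=0.99):
    """One TD update on a replay batch of simulated transitions."""
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s2).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```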
Tips and heuristics for training Deep Q-Networks