come up with a simple dataset and see if the DQN can correctly learn values for it
an example is a contextual bandit problem with two possible states and two actions, where one action gives reward +1 and the other -1
generally, an RL method should work on 1-step and 2-step MDPs, both with and without random rewards and transitions (a minimal sanity-check sketch is below)
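a minimal sketch of such a check in Python; the agent interface here (act / observe / q_values) is hypothetical, so swap in whatever your DQN implementation exposes:

```python
import numpy as np

class TwoStateBandit:
    """Two states, two actions; action 0 always gives +1, action 1 always gives -1.
    Every episode is one step long, so the correct values are Q(s, 0) = +1 and
    Q(s, 1) = -1 for both states."""
    def reset(self):
        self.state = np.random.randint(2)
        return np.eye(2)[self.state].astype(np.float32)  # one-hot state

    def step(self, action):
        reward = 1.0 if action == 0 else -1.0
        return None, reward, True  # next_state, reward, terminal

def sanity_check(agent, episodes=5000):
    env = TwoStateBandit()
    for _ in range(episodes):
        s = env.reset()
        a = agent.act(s)
        _, r, done = env.step(a)
        agent.observe(s, a, r, next_state=None, terminal=done)
    # after training, the learned values should be close to +1 / -1
    for state_id in range(2):
        s = np.eye(2)[state_id].astype(np.float32)
        print(f"state {state_id}: Q = {agent.q_values(s)}")  # expect roughly [+1, -1]
```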
updating the target network
check the freeze rate, i.e., how often the target network is synced to the online network
this should typically be somewhere between every 100 updates and every 40,000 updates (see the sketch below)
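a minimal sketch of a hard target-network sync in PyTorch (assumes online_net and target_net share the same architecture):

```python
import copy
import torch.nn as nn

def make_target(online_net: nn.Module) -> nn.Module:
    target_net = copy.deepcopy(online_net)
    for p in target_net.parameters():
        p.requires_grad_(False)  # the target network is never trained directly
    return target_net

def maybe_sync_target(online_net, target_net, update_step, freeze_interval=1_000):
    # freeze_interval is the knob discussed above: typically 100 to 40,000 updates
    if update_step % freeze_interval == 0:
        target_net.load_state_dict(online_net.state_dict())
```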
batch size
this can be hard to set and depends on CPU vs GPU as well as the state dimensionality
start with 32 and increase
the goal should be for each experience to be replayed some target number of times on average
this average is determined by the replay memory size, the batch size, and how often you update (a back-of-the-envelope check is below)
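a rough back-of-the-envelope check, assuming one transition is added per environment step and uniform sampling from the buffer (the numbers below are made up for illustration):

```python
def avg_replays_per_transition(env_steps, updates, batch_size):
    """Total experiences drawn for training divided by total transitions collected."""
    return updates * batch_size / env_steps

# e.g. one gradient update every 4 environment steps with batch size 32:
print(avg_replays_per_transition(env_steps=1_000_000,
                                 updates=250_000,
                                 batch_size=32))  # -> 8.0 replays on average
```

the replay memory size then bounds how long a transition stays available before it is evicted.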
replay memory
typical replay memory sizes range from 1,000 to 10,000,000 transitions (a minimal buffer sketch is below)
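a minimal uniform replay memory, as a sketch (the capacity and batch size are illustrative defaults):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```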
action selection
softmax vs ε-greedy
softmax typically works better if the temperature is tuned well (both are sketched below)
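a sketch of both strategies over a vector of Q-values (the epsilon and temperature values are placeholders to be tuned):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore uniformly
    return int(np.argmax(q_values))              # otherwise act greedily

def softmax_action(q_values, temperature=1.0):
    # lower temperature -> greedier; this is the knob that needs tuning
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    logits -= logits.max()  # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```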
learning rate
generally try values between 0.01 and 0.00001, searching on a log scale (see the sketch below)
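one way to search that range, sampling on a log scale (a sketch; the count and seed are arbitrary):

```python
import numpy as np

def sample_learning_rates(n=10, low=1e-5, high=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    return 10 ** rng.uniform(np.log10(low), np.log10(high), size=n)

print(sample_learning_rates())  # candidate learning rates spread evenly in log space
```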
state sampling
only relevant when the initial-state distribution can be controlled
if the important rewards are common then uniform sampling of the initial state might work
if the important rewards are rare, then you’ll need to oversample the states that lead to them
to do this in a principled manner, you need to know the relative probability of sampling each state under the original MDP versus the proposal MDP (i.e., you need to do importance sampling); a reweighting sketch is below
prioritized sampling can also help if you are dealing with rare, significant rewards
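a sketch of the importance-weighting idea; the initial-state densities p_orig and p_proposal are assumptions here, and you need to know or estimate them for your setup:

```python
import numpy as np

def importance_weights(sampled_states, p_orig, p_proposal):
    """w(s) = p_orig(s) / p_proposal(s): down-weights states you oversampled
    so the oversampling does not bias the learned values."""
    return np.array([p_orig(s) / p_proposal(s) for s in sampled_states])

def weighted_td_loss(td_errors, weights):
    # multiply each squared TD error by its importance weight before averaging
    return float(np.mean(weights * np.square(td_errors)))
```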
normalizing input
mean-center the inputs and scale them to unit variance
to get the statistics, if the input distribution is stationary, collect on the order of 100,000 samples and compute the mean and standard deviation
alternatively, if the valid range of the input is known, you can use that instead:
subtract the midpoint of the range (the average of the endpoints)
and divide by half the total range, which maps inputs to roughly [-1, 1] (both options are sketched below)
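a sketch of both options (the 1e-8 floor just avoids division by zero):

```python
import numpy as np

def stats_from_samples(samples):
    """Option 1: for a stationary input distribution, estimate the mean and
    standard deviation from a large batch of samples (e.g. ~100,000)."""
    samples = np.asarray(samples, dtype=np.float32)
    return samples.mean(axis=0), samples.std(axis=0) + 1e-8

def normalize_with_stats(x, mean, std):
    return (x - mean) / std

def normalize_with_range(x, low, high):
    """Option 2: if the valid range [low, high] is known, subtract the midpoint
    and divide by half the range, mapping inputs into roughly [-1, 1]."""
    midpoint = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return (x - midpoint) / half_range
```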
max episode length
when an episode is cut off by the step limit rather than ending in a terminal state, be careful not to label that last transition as terminal (you should still bootstrap from the next state)
the episode-length cap will also affect the distribution over the state space (see the sketch below)
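a sketch of how to store transitions so that a step-limit cutoff is not confused with a real terminal state (the env/agent/memory interfaces are assumptions):

```python
MAX_EPISODE_LENGTH = 1000  # illustrative cap

def run_episode(env, agent, memory):
    state, t = env.reset(), 0
    while True:
        action = agent.act(state)
        next_state, reward, terminal = env.step(action)
        truncated = (t + 1 >= MAX_EPISODE_LENGTH) and not terminal
        # only a true terminal state suppresses bootstrapping; a transition cut
        # off by the step limit is stored as non-terminal
        memory.push(state, action, reward, next_state, terminal)
        if terminal or truncated:
            break
        state, t = next_state, t + 1
```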
terminal states
you need to handle terminal transitions so that the value of the terminal next state is taken to be zero, i.e., the target is just the reward rather than reward plus the discounted maximum next-state Q-value (see the sketch below)
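a PyTorch sketch of the target computation with the bootstrap term masked out at terminal transitions (tensor shapes assumed to be [batch]):

```python
import torch

def dqn_targets(rewards, next_states, terminals, target_net, gamma=0.99):
    """target = r                           if the transition is terminal
              = r + gamma * max_a Q'(s', a) otherwise"""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values  # [batch]
    return rewards + gamma * (1.0 - terminals.float()) * next_q
```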