I've been working on modeling some phenomena involving real-time control in an environment with inherent rewards (specifically, playing a 'pong'-like game), and it's increasingly looking like reinforcement learning on its own won't cut it computationally (I'm currently using a temporal-difference backpropagation neural network).
One possible supplementary learning mechanism is to have the model also predict the environment's future state, from which it can learn in a supervised manner using standard feed-forward backpropagation.
My current thinking on synthesizing these mechanisms is to have the input layer feed into a hidden layer, which in turn feeds into both a reward-predicting layer and a separate state-predicting layer. In training this net, I simply change the weights via reinforcement learning first and then change them again to account for the state-prediction error via backprop.
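To make the idea concrete, here's a rough NumPy sketch of what I mean. All of the sizes, the learning rate, and the stand-in linear environment are arbitrary placeholders for illustration, not my actual pong setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes here are placeholders; the real pong state encoding differs.
n_in = n_state = 4
n_hid = 16
gamma, lr = 0.9, 0.02

W_h = rng.normal(0, 0.3, (n_hid, n_in))      # shared input -> hidden
W_v = rng.normal(0, 0.3, (1, n_hid))         # hidden -> reward (value) head
W_s = rng.normal(0, 0.3, (n_state, n_hid))   # hidden -> next-state head

def forward(x):
    h = np.tanh(W_h @ x)
    return h, W_v[0] @ h, W_s @ h            # hidden, value, predicted next state

def train_step(x, r, x_next):
    """RL update first (TD(0) through the value head), then a supervised
    backprop update through the state-prediction head, as described above."""
    global W_h, W_v, W_s
    h, v, _ = forward(x)
    _, v_next, _ = forward(x_next)
    td_err = r + gamma * v_next - v          # TD(0) error
    delta_h = W_v[0] * (1 - h**2)            # backprop through tanh
    W_v += lr * td_err * h[None, :]
    W_h += lr * td_err * np.outer(delta_h, x)

    h, _, s_pred = forward(x)                # recompute with updated weights
    s_err = x_next - s_pred                  # supervised state-prediction error
    delta_h = (W_s.T @ s_err) * (1 - h**2)
    W_s += lr * np.outer(s_err, h)
    W_h += lr * np.outer(delta_h, x)
    return td_err, s_err

# Stand-in environment: contracting linear dynamics, reward = first component.
A = 0.8 * np.eye(n_state) + 0.05 * rng.normal(size=(n_state, n_state))

errs = []
x = rng.uniform(-1, 1, n_state)
for t in range(4000):
    if t % 25 == 0:                          # restart so states don't decay away
        x = rng.uniform(-1, 1, n_state)
    x_next = A @ x
    _, s_err = train_step(x, x_next[0], x_next)
    errs.append(float(s_err @ s_err))
    x = x_next
```

Note that both updates in `train_step` touch the shared weights `W_h`, which is exactly the part of the design I'm unsure about.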
So my question is this: Are there any problems you can foresee arising from this architecture? Additionally, has this combination of learning mechanisms been done before, and if so was it done in a similar way?
I'm not sure I fully understand your design; perhaps you can clarify what you want your network to learn, why TD-learning "isn't cutting it", and what you mean by 'reinforcement' and 'prediction' learning. In particular, TD-learning is itself a reinforcement learning method, and it already bases its reward updates on predicted (and not just observed) outcomes. You seem to be describing reinforcement and prediction learning as orthogonal mechanisms, though, so again, I'm not sure I understand correctly.
As a general suggestion, you might consider using an Elman/Jordan network (i.e. a simple recurrent neural network, or RNN). Rather than relying on knowledge of only the current state to make a prediction about the next state, an RNN can learn to recognize sequences of events. This is especially useful for predicting future states in a task that unfolds over time (e.g. Elman, 1990). I suggest this mostly because you say your task is a 'real-time control' task, but without more knowledge of your task I don't really know if this is appropriate.
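To illustrate what I mean by context units, here is a minimal Elman-style step in NumPy. The sizes and weights are arbitrary and the net is untrained; this only shows the recurrent wiring:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 3, 8                          # arbitrary illustrative sizes

W_xh = rng.normal(0, 0.3, (n_hid, n_in))    # input -> hidden
W_ch = rng.normal(0, 0.3, (n_hid, n_hid))   # context (previous hidden) -> hidden
W_hy = rng.normal(0, 0.3, (n_in, n_hid))    # hidden -> predicted next input

def elman_step(x, context):
    # The hidden layer sees the current input plus a copy of its own
    # previous activation, so predictions can depend on the sequence so far.
    h = np.tanh(W_xh @ x + W_ch @ context)
    return h, W_hy @ h

context = np.zeros(n_hid)
for x in np.eye(n_in):                      # a short toy input sequence
    context, y_pred = elman_step(x, context)
```

The only difference from a plain feed-forward net is the `context` copy of the previous hidden activation, which is what lets the network carry sequence information forward.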
As for your suggestion of using two different learning mechanisms to modify a single set of weights: I don't have a definitive answer, but it seems counterintuitive to me. You're applying two different optimization procedures to a single set of parameters. If the procedures disagree, your network will probably never settle on stable connection weights. If they agree (i.e., converge on the same answer), then I'm not certain you're adding any value by having two learning mechanisms.
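To illustrate the disagreement case with a deliberately contrived toy: a single weight alternately pulled toward two incompatible targets settles at a compromise that satisfies neither objective.

```python
# One weight, two objectives that disagree: step A pulls w toward +2,
# step B pulls w toward -2. Alternating the two gradient steps leaves w
# parked at a compromise (about -0.105 here) that satisfies neither target.
w, lr = 0.0, 0.1
history = []
for _ in range(200):
    w += lr * (2.0 - w)      # "reinforcement" objective: w should be +2
    w += lr * (-2.0 - w)     # "prediction" objective: w should be -2
    history.append(w)
```

Your actual gradients won't be this cleanly opposed, of course, but the toy shows the failure mode I'm worried about.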
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211. Retrieved from http://synapse.cs.byu.edu/~dan/678/papers/Recurrent/Elman.pdf