Sergio Hernandez, a Spanish mathematician, recently shared some very interesting results on the OpenAI Gym environments, which are based on a relatively unknown paper published by Dr. Wissner-Gross, a physicist trained at MIT. What is impressive about Wissner-Gross's meta-heuristic is that it is succinctly described by three equations which try to maximize the future freedom of your agent. In this analysis, I summarize the method, present its strengths and weaknesses, and attempt to improve it by making an important modification to one of the equations.
In the following summary of Wissner-Gross's meta-heuristic, it's assumed that the agent has access to an approximate or exact simulator. A close reading of the original paper [1] will show that this assumption is actually necessary.
For any open thermodynamic system, we treat the phase-space paths $x(t)$ taken by the system over the time interval $0 \leq t \leq \tau$ as microstates and partition them into macrostates $\{X_i\}_{i \in I}$ using the equivalence relation [1]:
$$x(t) \sim x'(t) \iff x(0) = x'(0) \tag{1}$$
As a result, we can identify each macrostate $X$ with a unique present system state $x(0)$. This defines a notion of causality over a time interval.
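We can define the causal path entropy $S_c$ of a macrostate $X$ with the associated present system state $x(0)$ as the path integral:
$$S_c(X, \tau) = -k_B \int_{x(t)} \Pr(x(t) \mid x(0)) \ln \Pr(x(t) \mid x(0)) \, \mathcal{D}x(t) \tag{2}$$
where we have:
$$\Pr(x(t) \mid x(0)) = \int_{x^*(t)} \Pr(x(t), x^*(t) \mid x(0)) \, \mathcal{D}x^*(t) \tag{3}$$
In (3) we integrate over all possible paths $x^*(t)$ taken by the open system's environment. In practice, this integral is intractable and we must resort to approximations and the use of a sampling algorithm like Hamiltonian Monte Carlo [3].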
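To make the sampling idea concrete, here is a minimal sketch that estimates a causal path entropy as the negative average log-probability of sampled paths. Everything in it is an assumption made for illustration: the simulator is a hypothetical one-dimensional random walk in a box, $k_B$ is set to 1, there is no environment to marginalize over, and plain Monte Carlo sampling stands in for Hamiltonian Monte Carlo.

```python
import math
import random


def valid_moves(x, length):
    """Moves in {-1, +1} that keep the walker inside [0, length - 1]."""
    return [dx for dx in (-1, 1) if 0 <= x + dx < length]


def sample_path_log_prob(x0, tau, length, rng):
    """Simulate one tau-step path from x0 and return ln Pr(path | x0)."""
    x, log_prob = x0, 0.0
    for _ in range(tau):
        moves = valid_moves(x, length)
        # The toy walker picks uniformly among the moves that stay in the box.
        log_prob -= math.log(len(moves))
        x += rng.choice(moves)
    return log_prob


def causal_path_entropy_mc(x0, tau, length, n_samples=2000, seed=0):
    """Monte Carlo estimate of -E[ln Pr(path | x0)], with k_B = 1."""
    rng = random.Random(seed)
    total = sum(
        sample_path_log_prob(x0, tau, length, rng) for _ in range(n_samples)
    )
    return -total / n_samples


if __name__ == "__main__":
    print("center:", causal_path_entropy_mc(10, 5, 21))
    print("wall:  ", causal_path_entropy_mc(0, 5, 21))
```

In this toy world a walker that starts far from the walls has two choices at every step, so the estimate equals `tau * ln 2`, while a walker that starts at a wall has a forced first move and therefore a lower entropy.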
A path-based causal entropic force $F$ may be expressed as:
$$F(X_0, \tau) = T_c \nabla_X S_c(X, \tau) \big|_{X_0} \tag{4}$$
where $T_c$ and $\tau$ are two free parameters. This force brings us closer to macrostates that maximize $S_c(X, \tau)$. In essence, the combination of equations (2), (3) and (4) maximizes the number of future options of our agent. This isn't very different from what most people try to do in life, but this meta-heuristic does have very important limitations.
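As a rough illustration of how such a force steers an agent, the sketch below computes the path entropy exactly for a hypothetical one-dimensional random walk in a box and replaces the gradient with a finite difference between neighbouring positions. The box size, the uniform choice among valid moves, $k_B = 1$, the default temperature of 1 and the greedy agent are all assumptions of this example, not part of the original paper.

```python
import math
from functools import lru_cache

LENGTH = 11  # hypothetical box with positions 0..10


def valid_moves(x):
    """Moves in {-1, +1} that keep the walker inside the box."""
    return [dx for dx in (-1, 1) if 0 <= x + dx < LENGTH]


@lru_cache(maxsize=None)
def path_entropy(x, tau):
    """Exact entropy (k_B = 1) of tau-step paths from x, via the chain rule."""
    if tau == 0:
        return 0.0
    moves = valid_moves(x)
    # Entropy of the first step plus the expected entropy of the remainder.
    future = sum(path_entropy(x + dx, tau - 1) for dx in moves) / len(moves)
    return math.log(len(moves)) + future


def causal_entropic_force(x, tau, t_c=1.0):
    """Finite-difference stand-in for t_c times the entropy gradient at x."""
    lo, hi = max(x - 1, 0), min(x + 1, LENGTH - 1)
    return t_c * (path_entropy(hi, tau) - path_entropy(lo, tau)) / (hi - lo)


def follow_force(x, tau, steps, tol=1e-12):
    """Greedy agent: step along the force, stay put when it vanishes."""
    for _ in range(steps):
        f = causal_entropic_force(x, tau)
        if abs(f) > tol:
            x += 1 if f > 0 else -1
    return x


if __name__ == "__main__":
    for start in (0, 10):
        print(start, "->", follow_force(start, tau=4, steps=10))
```

With these settings the force points away from both walls and vanishes at the center of the box, so the greedy agent drifts to the middle and stays there. This is the same behaviour that the football example later in this post calls into question.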
The Causal Entropic Forces paper makes the implicit assumption that we have access to a reliable simulator of future states. In the case of the OpenAI Gym environments this isn't a problem because environment simulators are provided, but in general it's a hard problem. Two useful approaches to this problem are suggested by [4] and [5] using recurrent neural networks.
Maximizing your number of future options is not always a good idea. Sometimes fewer options are better, provided that these are more useful options. This is why, for example, football players don't always rush to the center of the pitch, although from that position they would maximize their number of future states, i.e. possible positions on the pitch.
In the next section I would like to show that it's possible to find a practical solution to the second limitation by modifying (3).
Assuming that a recurrent neural network is used to define potential macrostates, it's reasonable to assume that our agent's understanding of the future evolves with time and therefore macrostates are a function of time. So we have time-dependent macrostates rather than fixed ones. In other words, our simulator, which might be an RNN, will probably change its parameters and even its topology over time.
In order to resolve the second limitation and encourage the agent to make confident decisions, I propose that we replace with where:
This not only has the added value of simplifying calculations but also allows us to disentangle the relative contributions of utility and uncertainty. It must also be noted that the two expressions in (5) can be calculated in parallel although the uncertainty calculation is more computationally expensive.
If we assume that the agent's perception of the future doesn't change much, it might perceive some future states to be ideal. This is consistent with the empirical observation that many people believe certain accomplishments would bring them genuine happiness. In other words, if the state space is compact and approximately time-invariant, the agent's optimal future macrostate converges to a fixed point [6].
While the notion of Causal Path Utility just occurred to me today, I believe that this is a very promising approach which I shall follow up with concrete implementations very soon.
[1] Causal Entropic Forces (A. D. Wissner-Gross & C. E. Freer. 2013. Physical Review Letters.)
[2] Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (Yarin Gal & Zoubin Ghahramani. 2016. ICML.)
[3] Stochastic Gradient Hamiltonian Monte Carlo (Tianqi Chen, Emily Fox & Carlos Guestrin. 2014. ICML.)
[4] Recurrent Environment Simulators (Silvia Chiappa et al. 2017. ICLR.)
[5] On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models (J. Schmidhuber. 2015.)
[6] Fixed Point Theorems with Applications to Economics and Game Theory (Kim C. Border. 1985. Cambridge University Press.)