Post by Steve Draper on Dec 18, 2013 15:47:15 GMT -8
I recently found I had some unexpected losses in certain games on Tiltyard, notably Reversi. After some investigation I found a considerable gain in playing strength (in Reversi specifically) from tuning the exploration bias in the UCT selection formula down somewhat. As I had spent some time tuning this parameter prior to the Coursera competition this was not entirely expected, so I have since done a little more investigation to try to get a feel for the optimal setting across a range of games.
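For reference, the selection rule in question is the standard UCT formula with a tunable exploration constant; a minimal sketch in Python (names are illustrative, not my actual engine code):

```python
import math

def uct_value(child_mean, child_visits, parent_visits, exploration_bias):
    """Standard UCT selection value: exploitation plus a tunable
    exploration term. exploration_bias is the constant under discussion;
    the textbook value is sqrt(2), but in practice it appears to be
    quite game-dependent."""
    if child_visits == 0:
        return float('inf')  # unvisited children are tried first
    exploitation = child_mean
    exploration = exploration_bias * math.sqrt(
        math.log(parent_visits) / child_visits)
    return exploitation + exploration
```

Turning the bias down shrinks the exploration term relative to the observed mean, concentrating visits on the currently best-looking lines.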
What I found was that there is no clear metric I can derive from meta-gaming simulation that correlates well with the empirically optimal value for this parameter (there are some that correlate well for MANY games, but with some outlying counter-examples). As such I'm currently testing a formula that works 'ok' for most games, and resigning myself to having to tune the parameter via learning (once I have the game isomorphism detection code to allow me to persist data and apply it to the correct games).
What correlation there is seems to boil down to deep game trees responding better to a smaller exploration bias. My theory is that in deep trees the 'signals' from the eventual terminal nodes necessarily come from longer, and therefore lower-confidence, probes. Such lower-confidence estimates of node value benefit more from a higher density of visits, which amounts to a bit less exploration across the tree (and hence a lower exploration bias).
However, I'm wondering about better ways to tune this value dynamically, possibly even within the exploration of different branches of the game tree, and almost certainly across different roles in asymmetric games (e.g. sheep & wolf, where I suspect applying a different bias at sheep decision nodes than at wolf ones may be productive).
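The per-role idea could be as simple as keying the bias on whichever role chooses at the node; a hypothetical sketch (the role names and the specific bias values are purely illustrative):

```python
# Hypothetical per-role exploration biases for an asymmetric game.
# The numbers here are placeholders, not tuned values.
ROLE_BIAS = {
    'sheep': 0.3,
    'wolf': 0.7,
}

def bias_for_node(role_to_move, default_bias=0.5):
    """Look up the exploration bias for the role choosing at this node,
    falling back to a single global constant for any other role."""
    return ROLE_BIAS.get(role_to_move, default_bias)
```

The selection step would then pass `bias_for_node(node.role_to_move)` into the UCT formula instead of one global constant.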
If my working theory about more diffuse signals from longer probe paths is correct, it suggests back-propagating a 'distance to terminal state' measure up the MCTS tree, and perhaps varying the bias depending on it.
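One way to sketch this: during backpropagation, each node keeps a running mean of how far below it the playouts terminated, and the bias is damped as that distance grows (the `scale` constant and the damping form are assumptions for illustration, not something I've validated):

```python
class Node:
    def __init__(self):
        self.visits = 0
        self.total_score = 0.0
        # Running mean of observed distances to terminal states below this node.
        self.avg_depth_to_terminal = 0.0

def backpropagate(path, score):
    """Walk from the end of the playout path back to the root, updating
    score statistics and the running mean distance-to-terminal at each node."""
    for distance, node in enumerate(reversed(path)):
        node.visits += 1
        node.total_score += score
        # Incremental update of the mean observed distance.
        node.avg_depth_to_terminal += (
            (distance - node.avg_depth_to_terminal) / node.visits)

def adjusted_bias(node, base_bias, scale=0.05):
    """Shrink the exploration bias where signals travel further:
    the deeper the average terminal state below a node, the lower the bias.
    scale is a purely illustrative tuning constant."""
    return base_bias / (1.0 + scale * node.avg_depth_to_terminal)
```

Selection would then use `adjusted_bias(node, base_bias)` in place of the fixed constant, so subtrees with long, diffuse probes get a denser, more exploitative visit pattern.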
Anyone have any insights to offer in this area?