|
Post by maciej on Aug 4, 2014 14:50:41 GMT -8
I am probably one of the few not to have gone the propnet way for Monte Carlo simulations. Instead, my player uses an interpreter written entirely from scratch. It's in C++ with some memory tricks, but I still think I am far behind optimized propnets (I mean really far). I would like to decide whether to keep my interpreter or drop it for good. Could someone from the current top four, preferably Sancho, give me some numbers to compare against? I mean the average number of complete depth charges (simulations from the initial state to the end) for some sample games, on one thread, on an off-the-shelf computer. We could use the selection of games from the competition. If the numbers are an order of magnitude better, then I suppose there is no point in not switching to a propnet. I am aware that my results can be better when the propnet is enormously big (chess), or in hex, which somehow suits my method.
|
|
|
Post by Sam Schreiber on Aug 4, 2014 17:38:49 GMT -8
TurboTurtle is no longer in the top four, but I'll give you the performance numbers anyway, in case they're useful. All of these are measured in terms of the number of UCT full game simulations done from the initial game state, per second, using a propnet-based state machine. Each game simulation consists of a guided search through the in-memory portion of the game tree, followed by random play until a terminal state is reached. All games used are those available in the base game repository at games.ggp.org/base, under the given name.
reversi: ~2900 simulations/second
pentago: ~30000 simulations/second
ticTacToe: ~360000 simulations/second
speedChess: ~275 simulations/second
connectFour: ~45000 simulations/second
dotsAndBoxes: ~26000 simulations/second
cephalopodMicro: ~12000 simulations/second
englishDraughts: ~9000 simulations/second
nineBoardTicTacToe: ~30000 simulations/second
edit: Please note, these numbers are for 4 threads; I don't have single-thread benchmarks handy at the moment, though connectFour on a single thread is ~23000 simulations/second for TurboTurtle, so that's probably a good baseline for estimating the others.
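For anyone unfamiliar with the terminology, the "depth charge" being benchmarked in this thread is just a random playout from a state to a terminal state. Here is a minimal sketch of such a benchmark loop; the `StateMachine` interface and the toy game are purely illustrative stand-ins (not the GGP-Base API or any player's actual backend), just to show what "simulations per second" is counting.

```java
import java.util.Random;

// Minimal sketch of a depth-charge benchmark: play random moves from the
// initial state until a terminal state is reached, and count how many
// complete playouts fit in a time budget. The StateMachine interface here
// is a hypothetical stand-in for whatever propnet/interpreter is measured.
public class DepthChargeBenchmark {
    interface StateMachine {
        int initialState();
        boolean isTerminal(int state);
        int[] legalMoves(int state);
        int next(int state, int move);
    }

    // Toy "game": count up from 0 by 1 or 2; terminal at 20 or above.
    static final StateMachine TOY = new StateMachine() {
        public int initialState() { return 0; }
        public boolean isTerminal(int s) { return s >= 20; }
        public int[] legalMoves(int s) { return new int[] {1, 2}; }
        public int next(int s, int m) { return s + m; }
    };

    // Run random playouts until the deadline; return the completed count.
    static long depthCharges(StateMachine sm, long millis, Random rnd) {
        long deadline = System.currentTimeMillis() + millis;
        long charges = 0;
        while (System.currentTimeMillis() < deadline) {
            int state = sm.initialState();
            while (!sm.isTerminal(state)) {
                int[] moves = sm.legalMoves(state);
                state = sm.next(state, moves[rnd.nextInt(moves.length)]);
            }
            charges++;
        }
        return charges;
    }

    public static void main(String[] args) {
        long n = depthCharges(TOY, 100, new Random(42));
        System.out.println("depth charges in 100ms: " + n);
    }
}
```

Dividing the count by the elapsed time gives the simulations/second figures quoted in this thread.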
|
|
|
Post by Steve Draper on Aug 4, 2014 18:00:09 GMT -8
I will get you single-thread numbers for Sancho's state machine tomorrow (it's in the middle of running a test set currently, which I don't want to interrupt). I'll use the same set of games Sam posted above to keep it consistent - if there are specific games you want me to include in addition, let me know...
|
|
|
Post by alandau on Aug 4, 2014 21:25:39 GMT -8
The following are single-threaded results and reflect just the pure state machine operations of running depth charges and gathering goal values, not in the context of running the player (e.g. building or traversing a game tree).
First are numbers for the state machine used in Alloy 0.6, which has been around in basically its present form for the past three years. This was my second attempt at a propnet-based state machine; my first attempt was quite naive and was not clearly better than the GGP-Base ProverStateMachine.
reversi: ~250 simulations/second
pentago: ~1600 simulations/second
ticTacToe: ~28000 simulations/second
speedChess: ~45 simulations/second
connectFour: ~11000 simulations/second
dotsAndBoxes: ~1300 simulations/second
cephalopodMicro: ~950 simulations/second
englishDraughts: ~700 simulations/second
nineBoardTicTacToe: ~2400 simulations/second
And now numbers for the more recent propnet state machine used in Alloy 0.8, as used in the most recent competition:
reversi: ~14000 simulations/second
pentago: ~48000 simulations/second
ticTacToe: ~740000 simulations/second
speedChess: ~900 simulations/second
connectFour: ~140000 simulations/second
dotsAndBoxes: ~42000 simulations/second
cephalopodMicro: ~20000 simulations/second
englishDraughts: ~10000 simulations/second
nineBoardTicTacToe: ~52000 simulations/second
However, I can only get these kinds of numbers by running batches of simultaneous depth charges in ways that seem to reduce their efficacy. Sam, you'll recall the results of the GPU experiment at last year's competition...
|
|
|
Post by maciej on Aug 5, 2014 3:27:41 GMT -8
Thank you guys. I didn't necessarily mean the top four, just a fast state machine. Alex, are you sure that 140K simulations/sec in connectFour is not a mistake? That's just about 7 microseconds per simulation; impressive.
Ok, so my approach is from 1.5 to 300 times slower. I guess it's not worth continuing with it anymore.
|
|
|
Post by rxe on Aug 5, 2014 5:54:57 GMT -8
Indeed, very impressive. galvanise's propnet state machine is quite a bit slower, for anyone who is interested (~1500 simulations/second per core for englishDraughts, ~140 for speedChess, and ~350 for reversi), but I never got around to fully optimizing it and generally focused on throwing more cores at the problem!
Alex, "I can only get these kinds of numbers by running batches of simultaneous depth charges in ways that seem to reduce their efficacy." - sorry, I think I'm lacking some context, but I would love to understand this statement a bit more, please.
|
|
|
Post by Steve Draper on Aug 5, 2014 7:16:20 GMT -8
Alex's numbers are indeed very impressive. I assume what he says about batching means he is in effect running several states through the propnet at once (synchronously), similar to the suggestion made in the thread on GPU-based propnets where each state is encoded in one bit of a word, so <word size> states propagate at once. Whether that is the case or not, I too would love to read more about whatever it is Alex is doing!
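The bit-packed scheme Steve is speculating about can be sketched very simply: pack one boolean per state into each bit of a 64-bit word, so a single AND/OR/NOT over words evaluates the same gate for 64 independent states at once. The tiny two-gate network below is purely illustrative (it is not Alex's actual implementation, which is unknown in this thread).

```java
// Sketch of bit-parallel ("batched") propnet propagation: bit i of every
// long holds the value of that component for state i, so one word-wide
// bitwise operation propagates a gate for 64 states simultaneously.
public class BitParallelPropnet {
    // Each gate evaluation covers 64 states at once.
    static long andGate(long a, long b) { return a & b; }
    static long orGate(long a, long b)  { return a | b; }
    static long notGate(long a)         { return ~a; }

    public static void main(String[] args) {
        // Inputs: bit i holds the input value for state i.
        long inA = 0b1100L;  // input A is true in states 2 and 3
        long inB = 0b1010L;  // input B is true in states 1 and 3
        // One propagation step computes (A AND B) OR (NOT A) for all 64
        // states with just three word-wide operations.
        long out = orGate(andGate(inA, inB), notGate(inA));
        // State 3: A=1, B=1, so (1 AND 1) OR (NOT 1) = 1.
        System.out.println((out >> 3) & 1L); // prints 1
    }
}
```

The per-gate cost is the same as for a single state, which is where the apparent throughput gain comes from; the catch, as Alex notes, is that all 64 playouts must advance in lockstep, which can reduce their efficacy for tree search.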
Anyway, here are Sancho's numbers. These are depth charges from the initial state, single-threaded (no MCTS tree involved). Sancho actually runs several of these in parallel on a rollout thread pool to feed its MCTS tree (which runs on yet another thread), but these numbers reflect pure propnet performance only.
reversi: ~5100 simulations/second
pentago: ~21000 simulations/second
ticTacToe: ~163000 simulations/second
speedChess: ~6200 simulations/second
connectFour: ~37000 simulations/second
dotsAndBoxes: ~49000 simulations/second
cephalopodMicro: ~17000 simulations/second
englishDraughts: ~11000 simulations/second
nineBoardTicTacToe: ~14000 simulations/second
All code involved is Java, running on the Oracle HotSpot JVM. The machine is a single-CPU Windows PC (i7 4770, stock clocked).
Edit - it's interesting, looking at various people's figures, how we differ in which games are relatively faster. It would be nice to understand why that is. My only current theory is that some people are running the entire propnet, in topological order, for each state change, whereas others are propagating only components specifically known to have changed. The first approach has less overhead in figuring out which components to recalculate [and less branching as a result], whereas the second economizes on the total number of calculations performed. The second approach should yield relatively better results as propnet complexity and/or move locality increases (since a higher proportion of the net is then unchanged for any given state change). Sancho is in the second category, which (on my hypothesis, anyway) is why it is relatively poor at ticTacToe but strong at speedChess [reversi is an outlier because Sancho factors the goal calculation out into a separate state machine that it only runs once a terminal state is encountered].
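The trade-off Steve describes can be illustrated on a toy network. The sketch below (all names illustrative, not from Sancho) uses two independent chains of pass-through gates to mimic move locality: a move flips one chain's input and leaves the other untouched. Full propagation re-evaluates everything; differential propagation evaluates only the chain that changed.

```java
// Sketch of the two propagation strategies: full recomputation in
// topological order vs. propagating only components whose inputs changed.
public class PropagationStrategies {
    // Two independent chains of pass-through gates standing in for the
    // disjoint regions of a propnet that a local move does/doesn't touch.
    static boolean[] chainA = new boolean[100];
    static boolean[] chainB = new boolean[100];
    static int evals; // gate evaluations performed by the last call

    // Evaluate a chain in topological order (each gate copies its input).
    static void evalChain(boolean[] chain) {
        for (int i = 1; i < chain.length; i++) { chain[i] = chain[i - 1]; evals++; }
    }

    // Full propagation: recompute every component after any input change.
    static int full(boolean newA) {
        evals = 0;
        chainA[0] = newA;
        evalChain(chainA);
        evalChain(chainB);
        return evals;
    }

    // Differential propagation: re-evaluate only if the input actually
    // changed; the untouched chain costs nothing.
    static int differential(boolean newA) {
        evals = 0;
        if (chainA[0] != newA) { chainA[0] = newA; evalChain(chainA); }
        return evals;
    }

    public static void main(String[] args) {
        System.out.println("full:         " + full(true) + " gate evaluations");  // 198
        System.out.println("differential: " + differential(false) + " gate evaluations"); // 99
    }
}
```

A real differential implementation would also stop early inside a chain once a gate's output stops changing; either way, the saving grows with the fraction of the net a move leaves untouched, which is Steve's point about propnet size and move locality.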
|
|
|
Post by maciej on Aug 5, 2014 9:03:15 GMT -8
I can get close in ticTacToe, connectFour and nineBoardTicTacToe, but in some other games I am two orders of magnitude behind. I am aware that we are talking about trees, where a linear increase in simulation throughput usually buys at most a logarithmic increase in depth, but I have noticed that sometimes it is crucial (especially in tactical positions which simply need to be calculated out before the opponent does). I've been thinking about employing GPGPU, especially since I completed the CUDA MOOC on Udacity, but I don't have a good idea for tackling this problem yet. While I see much more potential in GPU propnets than in any symbol-based interpreter, this is still not an ideal scenario for CUDA, which favors memory predictability, lots of independent calculations, and little or no branching logic. I think propagation is not parallel enough (components' outputs depend on other components' inputs) to be handled effectively, but I am sure someone will come up with a neat idea sooner or later.
|
|
|
Post by alandau on Apr 16, 2015 18:40:16 GMT -8
I came across a paper today on the performance of non-propnet methods: cgi.cse.unsw.edu.au/~mit/GGP/GIGA-13-Proceedings.pdf#page=55 (starts on page 55) The best provers they found (those used in Fluxplayer and CadiaPlayer) gave about 933 state expansions (not rollouts) per second on Connect Four, and 12,000 on Tic-Tac-Toe. The paper does mention propnets, but it looks like they used the old PropNetFactory instead of the OPNF because they were having problems with most of the games they tried. (It was also a particularly hard selection of games for propnet generation; they included Amazons and Othello-2007.)
|
|