rxe
Junior Member
Posts: 61
|
Post by rxe on Mar 4, 2016 11:49:22 GMT -8
1-ply is fair if pitting a value network against a policy network (my preference was to train a policy network). A deeper search is advantageous for seeing solutions towards the end of a Breakthrough-like game, regardless of the evaluation function.
> What is a "policy network"? I only found references to it in the AlphaGo paper abstract, but I don't have the paper as it doesn't seem to be freely available (does anybody have it?).
Sorry, didn't realise it was AlphaGo terminology. livestream.com/oxuni/StracheyLectureDrDemisHassabis starting at around 39 minutes gives a pretty decent overview without having to read the paper.
|
|
|
Post by Andrew Rose on Mar 5, 2016 2:46:18 GMT -8
> I'm still unclear when you guys think the NN training should happen.
Training for Breakthrough is to happen between now and when we compete. Training for the other game (not yet announced) is to happen in the 7 days between its announcement and the competition.
|
|
qfwfq
New Member
Posts: 29
|
Post by qfwfq on Mar 7, 2016 12:11:36 GMT -8
Since we train offline, we'll need the GDL to _not_ be obfuscated in order to use propositions as NN input features.
|
|
|
Post by Andrew Rose on Mar 8, 2016 3:07:44 GMT -8
AFAIK a policy network is one which directly selects the next move to play (rather than providing a heuristic evaluation function for a position and then relying on a 1-ply "search" to select the move leading to the position with the highest heuristic evaluation).
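In code terms the difference is roughly this (just a sketch - the state/move/network interfaces are placeholders, not any particular framework's API):

def select_with_value_net(state, legal_moves, apply_move, value_net):
    # 1-ply "search": evaluate every successor position and pick the best-scoring one.
    best_move, best_score = None, float('-inf')
    for move in legal_moves(state):
        score = value_net.evaluate(apply_move(state, move))
        if score > best_score:
            best_move, best_score = move, score
    return best_move

def select_with_policy_net(state, legal_moves, policy_net):
    # Policy network: ask for a distribution over moves and take the most probable legal one.
    probs = policy_net.move_probabilities(state)
    return max(legal_moves(state), key=lambda m: probs[m])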
|
|
|
Post by Andrew Rose on Mar 8, 2016 3:12:32 GMT -8
qfwfq, you can always de-obfuscate back to some canonical representation. (That's what Sancho does anyway, as a side-effect of its game recognition code.) Also, even without obfuscation, if you're using GGP-base you'll need to make some modifications to ensure that propositions are iterated over in the same order on every match. (I sort into alphabetical order, which also helps when dumping out a state for debugging purposes. Any stable ordering will do.)
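For what it's worth, the stable-ordering bit amounts to no more than this (a sketch; the names are placeholders, not GGP-base or Sancho code):

def make_input_index(base_proposition_names):
    # Map each base proposition name to a fixed NN input slot, identical on every match.
    ordered = sorted(base_proposition_names)   # alphabetical; any stable ordering will do
    return {name: i for i, name in enumerate(ordered)}

def state_to_features(true_propositions, input_index):
    # Binary feature vector: 1.0 where the proposition holds in the state, else 0.0.
    features = [0.0] * len(input_index)
    for name in true_propositions:
        features[input_index[name]] = 1.0
    return features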
|
|
qfwfq
New Member
Posts: 29
|
Post by qfwfq on Mar 8, 2016 13:37:21 GMT -8
Andrew Rose, for this first experiment I would rather spend the time on the NN than writing GDL de-obfuscation code.
|
|
|
Post by alandau on Mar 16, 2016 20:46:02 GMT -8
I have too much else to do (and too little experience with neural networks) to join in. But I'd be happy to be the neutral third party selecting the second game.
|
|
rxe
Junior Member
Posts: 61
|
Post by rxe on Mar 28, 2016 17:58:41 GMT -8
Has anyone had any success with their NN players? Unfortunately I haven't made much progress (almost none), but I have been playing with NNs in general.
I am having a hard time coming up with a way of generating useful training data. My current idea is to pre-train a policy network with mostly random playouts, and then add an extra layer and train further with, say, 10k MCTS samples.
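To make that concrete, the shape of the idea is something like this (current Keras syntax; the layer sizes, NUM_FEATURES/NUM_MOVES and the random placeholder arrays are all made up for illustration):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

NUM_FEATURES = 128   # size of the state feature vector (illustrative)
NUM_MOVES = 64       # size of the grounded-legal output vector (illustrative)

# Placeholder data: in reality these come from random playouts / MCTS labelling.
random_states = np.random.rand(100000, NUM_FEATURES)
random_moves = np.eye(NUM_MOVES)[np.random.randint(NUM_MOVES, size=100000)]
mcts_states = np.random.rand(10000, NUM_FEATURES)
mcts_moves = np.eye(NUM_MOVES)[np.random.randint(NUM_MOVES, size=10000)]

# Stage 1: pre-train the policy net on (state, chosen-move) pairs from mostly-random playouts.
model = Sequential()
model.add(Dense(256, activation='relu', input_shape=(NUM_FEATURES,)))
model.add(Dense(256, activation='relu'))
model.add(Dense(NUM_MOVES, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(random_states, random_moves, epochs=5)

# Stage 2: freeze what was learned, splice in an extra layer, fine-tune on the MCTS samples.
for layer in model.layers:
    layer.trainable = False
model.pop()                                        # drop the old softmax head
model.add(Dense(256, activation='relu'))           # the extra layer
model.add(Dense(NUM_MOVES, activation='softmax'))  # fresh head
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(mcts_states, mcts_moves, epochs=5)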
|
|
|
Post by Andrew Rose on Mar 29, 2016 0:18:28 GMT -8
Richard,
When I get to it, I'm planning to use reinforcement learning to produce the (ever-changing) training data.
However, I too am struggling to make initial progress. I thought I'd begin with training a simple MLP NN to play Tic-Tac-Toe. To make my task super-easy, I generate all the correct answers up-front and use that as the training data. I don't even split into training/validation/test sets. I train the network on the full set of (~5K) data points and then evaluate it at those same points. Trying to get a network that's (a) stable and (b) does a good job of learning has proved surprisingly hard.
My current network takes ~10K training iterations (where each iteration trains all states that have displayed an error > 3 at any point in the last 100 training iterations) by which point the average error is < 1 and the maximum error is ~10 (which is my arbitrary cut-off for declaring it reasonably well trained).
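Roughly, the setup is the following - expressed in Keras purely for brevity (it's not my actual code), with made-up layer sizes and random placeholder data standing in for the real pre-generated TTT evaluations:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

NUM_STATES = 5000   # roughly the number of pre-generated TTT training points

# X: one row of features per state (here 9 cells x 3 one-hot values = 27 inputs).
# y: the known-correct evaluation of each state for player 1 (0, 0.5 or 1).
# Random placeholders below; the real data is generated up-front.
X = np.random.randint(0, 2, size=(NUM_STATES, 27)).astype('float32')
y = np.random.choice([0.0, 0.5, 1.0], size=NUM_STATES)

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(27,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='sgd', loss='mse')

ERROR_THRESHOLD = 3.0    # on the 0-100 evaluation scale
MAX_ERROR_TARGET = 10.0  # the arbitrary "reasonably well trained" cut-off
RECENCY_WINDOW = 100     # iterations

last_bad = np.zeros(NUM_STATES, dtype=int)

for iteration in range(10000):
    errors = np.abs(model.predict(X, verbose=0).ravel() - y) * 100.0
    if errors.max() <= MAX_ERROR_TARGET:
        break                                        # stable enough - stop
    last_bad[errors > ERROR_THRESHOLD] = iteration
    active = last_bad >= iteration - RECENCY_WINDOW  # states that were 'bad' recently
    model.fit(X[active], y[active], epochs=1, verbose=0)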
I read a somewhat discouraging SO post yesterday which pointed out that, because the position evaluation function isn't smooth in its inputs, approaches like this are really just training a big lookup table. They have no ability to generalise. In order to make the position evaluation function smooth, its inputs would need to be more useful features than just the contents of each cell. I'm not completely sure to what extent this is necessarily true, or how the recent progress in deep neural networks might change it. Can they effectively extract their own features?
Lots of reading done. Clearly lots more to go.
I can't really believe this is an impossible task. If they've managed to train a deep neural network to play many Atari 2600 games better than humans (and they have) then surely we can get it to work for abstract strategy games too.
|
|
rxe
Junior Member
Posts: 61
|
Post by rxe on Mar 29, 2016 12:55:45 GMT -8
Thanks Andrew. Yeah, I am also thinking about trying some simpler game.
As a general question about your approach: what do you use as your output vector? Are you using a single score? And if so, what do you use as the evaluation score: 1.0/0.5/0 for win/draw/loss, or something you fabricate?
My policy net is a convnet whose output vector is all the grounded legals, with a softmax layer assigning a probability to each (courtesy of the Keras/Theano frameworks for the implementation). The training output is a one-hot vector.
I am still figuring out the best way to combine players into one network, but currently I'm looking at condensing all the legals so that there is only one set of legals for both players (using reflection of the board if necessary). The input is a tensor for a 2D board, with 2 feature channels for player and opponent, and maybe some other information that might be interesting (I am sort of dreaming ahead here).
This is a work in progress; I haven't even got as far as mapping this out for Breakthrough. TTT would be vastly easier, as I don't have to reflect the board and training data is easy to come by.
Automating this for the general case of any game is going to be a *lot* of work.
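For what it's worth, a cut-down sketch of the sort of model I mean (current Keras syntax, not my actual code, and the sizes are illustrative):

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

BOARD_SIZE = 8     # e.g. breakthrough
NUM_LEGALS = 192   # illustrative size of the combined grounded-legal output vector

model = Sequential()
# Input: a BOARD_SIZE x BOARD_SIZE board with 2 feature channels (my pieces, opponent pieces).
model.add(Conv2D(32, (3, 3), padding='same', activation='relu',
                 input_shape=(BOARD_SIZE, BOARD_SIZE, 2)))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(Flatten())
# One output per grounded legal move; softmax turns them into a probability distribution.
model.add(Dense(NUM_LEGALS, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
# Training targets are one-hot vectors with a 1 in the slot of the move actually chosen.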
|
|
|
Post by Andrew Rose on Mar 29, 2016 14:28:39 GMT -8
I'm training a position evaluation function. (Since we've been promised fixed-sum games, at least for now, I'm using a single output node which is the evaluation for the 1st player listed in the GDL.) Yes, I'm using 1/0.5/0. I'd picked up from your comments above that you were training a policy network instead. For now at least, I'm going to stick with training an evaluation function. I can wrap my head around it more easily, and I have a gut feel that, whilst there might be efficiencies from running a single state through the network and then effectively getting scores for all possible moves, it might make the training harder because the network would effectively have to learn the game rules too. I might dabble with it later - it isn't a particularly big change.
I'm reluctant to get too far into hand-extracting features (symmetries, etc.) because, as you say, they don't generalise well. However, I did hand-code a feature for TTT which better than halved the learning time.
This evening I've added a (crude) cooling schedule which has also had a significant impact. In 7K iterations, I can now train it so that it has stabilised with no state having an error in its evaluation of more than 3 (or 0.03 using the units above), and training for another 100 cycles doesn't push it out of that happy state. However, that's still supervised learning. In reinforcement learning, those 7K iterations are just a single pass through the algorithm. (In reality, you don't train all the way to this stage because the target values are known to be noisy anyway, so there's no point. However, it's interesting (and alarming) how long it takes even when the targets are completely defined.)
This paper storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf (along with the 2013 version) seems to imply that this can be done - and done well. With a single network architecture, using reinforcement learning and a deep neural network, they've managed to learn how to play generic Atari video games from the raw pixel values. Surely, then, it must be possible to get something working for GGP.
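The "cooling schedule" is nothing clever, by the way - just decaying the learning rate as training progresses, along the lines of the following (constants made up):

def cooled_learning_rate(iteration, initial_lr=0.1, decay=0.9995, min_lr=0.001):
    # Exponentially decaying learning rate, floored at min_lr.
    return max(min_lr, initial_lr * decay ** iteration)

# e.g. inside the training loop:
#   lr = cooled_learning_rate(iteration)
#   ...use lr for this iteration's weight updates...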
|
|
rxe
Junior Member
Posts: 61
|
Post by rxe on Mar 31, 2016 2:49:39 GMT -8
Sounds like you are getting somewhere. Switching to TTT I started to make progress very quickly. Training examples are easy to come by, and there is no action symmetry to worry about. With a trained policy net for TTT and TTTLarge, it seems to be playing perfect games: winning 95% of games against random and drawing against an MCTS player. For TTT I had 4k training samples, so it seemed to generalise quite well.
I also tried cittaceot, but it didn't fare so well. It does win against random 100% of the time, but loses 99% of the time to an MCTS player. I think this was mainly due to awful training data (to generate an "ideal" move I generated 100k random samples, only performing ~1000 MCTS iterations for each sample). On the upside it has learned the rules of a fairly complex game, and beats random! Incorporating the NN with search would make it a lot stronger too.
Conceptually, generating pro-like training data via MCTS doesn't feel like a viable solution. 45 seconds per move for breakthrough and 100k samples would take an eternity (100,000 samples at 45 seconds each is 4.5 million seconds - roughly 52 days). And even then MCTS players are crap at breakthrough.
Thanks for the link, I was somewhat familiar with the idea. DQN/reinforcement learning or something akin to it really does feel appealing right now (I just wish I had more free time to explore/learn more about it).
|
|
|
Post by Andrew Rose on Mar 31, 2016 6:33:11 GMT -8
Richard - would you be willing to describe your network architecture in more detail? How many layers does it have? Do you have fully connected layers as well as convolution layers? What tile sizes are you using for the convolution layers and what stride lengths? Do you have any (non-linear) transfer functions and, if so, where in the network are they? Are you using any pooling? (Presumably not. Whilst position isn't relevant for image classification, it's crucial for games.) Anything else interesting or relevant?
And, come to think of it, how are you identifying co-ordinate systems for your convolution layers? Are you just hacking it by hand at the moment?
|
|
rxe
Junior Member
Posts: 61
|
Post by rxe on Mar 31, 2016 7:21:07 GMT -8
Rather than answer your questions one by one, I've made the entire model available at https://gist.github.com/richemslie/1b0453049e9ad58914d439bbb72c4b7c
Look at get_model(). See the Keras docs for more info if it's not self-explanatory (it was for me).
To be honest, I barely thought about this model; it was just the first thing that came to mind (I was familiar with the models people have tried for Go and Chess). I wanted it to be small enough that it didn't take a long time to train (a few seconds per epoch). I haven't tried any other models yet.
For TTT: 3-4 epochs to hit about 60% accuracy. Overfitting started at about 10 epochs.
The weird padding at the start was TTT specific, as TTT is a tiny 3x3 board, which isn't great for 3x3 filters. I ran the same model for the other games; I probably should have tailored it more, but it seemed to train OK anyway.
I am currently working on improving training data gathering, as that is the biggest bottleneck (20 hours to gather the cittaceot data, and then about 1 minute to train).
The co-ordinate handling is hacked to work right now, but is at least somewhat parametrised (see the data part at the top of the file). The analysis code for galvanise does establish most of this information, so I am hopeful I can keep it data driven and reuse galvanise's analysis.
Hope this helps.
|
|
rxe
Junior Member
Posts: 61
|
Post by rxe on Apr 6, 2016 4:11:34 GMT -8
Well fwiw, I have successfully trained both policy and score networks for breakthrough. 'Successful' being a relative term, as they are both terrible. In fact they play at about the level they were trained at, which was 300k samples generated from a 1-second Monte Carlo player (i.e. no tree). The score network with 1-ply lookahead is better than the policy network (since it does a single min-max). I hope/plan to improve it now by having it play against itself.
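The self-play loop I have in mind is roughly the following (just a sketch of the general idea; play_game and retrain stand in for code I haven't written yet):

def self_play_improve(value_net, play_game, retrain, num_rounds=10, games_per_round=1000):
    # Each round: play games with the current net, label every visited state with the
    # final result (1/0.5/0 from player 1's point of view), retrain on the fresh data, repeat.
    for _ in range(num_rounds):
        states, results = [], []
        for _ in range(games_per_round):
            game_states, result = play_game(value_net)
            states.extend(game_states)
            results.extend([result] * len(game_states))
        value_net = retrain(value_net, states, results)
    return value_net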
Are people still on for a competition (maybe on a different date, now that there's no Coursera course)?
|
|