Post by steadyeddie on Jan 27, 2016 13:10:48 GMT -8
I knew through a friend at work that Demis was working on neural nets to tackle Go, but I didn't think he'd be done so quickly. Deep neural net + MCTS, it looks like. www.bbc.co.uk/news/technology-35420579
qfwfq
New Member
Posts: 29
Post by qfwfq on Jan 29, 2016 13:33:32 GMT -8
This is exciting news!
Not yet directly applicable to GGP, as much of the learning was supervised, but even once purely unsupervised methods are made to work, they would likely require huge computational resources. To encourage people to toy with machine learning for GGP, perhaps we should have a GGP tournament variant where metagaming time is very long (e.g. 24 hours per game, with a very limited set of games).
Post by alandau on Jan 31, 2016 13:08:15 GMT -8
Somewhere in the media coverage I found a link to this paper (not sure if it was used directly): arxiv.org/abs/1410.5401. That seems very interesting from a general computer science point of view. The claim is that you can set up a neural network to use additional memory like a Turing machine's tape, but with a setup where the "program" portion is fully differentiable and thus amenable to hill-climbing techniques. They use supervised learning to find setups that work for tasks like copying or sorting inputs, and then they empirically observe the network's behavior to show that it looks like the behavior of a simple program that a human might write.
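For anyone who doesn't want to wade through the whole paper: the trick that makes the memory trainable by gradient descent is that reads (and writes) are "soft". A read isn't a lookup at a single address; it's a convex combination of every memory row (paraphrasing the paper's read operation):

r_t = sum_i w_t(i) * M_t(i),   with w_t(i) >= 0 and sum_i w_t(i) = 1

so the result varies smoothly with the weighting w_t, and the weightings are themselves produced by differentiable addressing, which is what makes hill-climbing on the whole setup possible.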
Post by steadyeddie on Feb 7, 2016 15:53:36 GMT -8
On and off (mostly off) I've been tinkering with getting SteadyEddie to use neural nets. I've used Neuroph (I took a version over the summer that didn't require Java 1.8) and supervised learning. They've got some decent example apps that I plumbed my states into, combined with evals from SteadyEddie MCTS runs (OK, so not that accurate, blind leading the blind, etc.). I then replaced SE rollouts with NN evals using the trained network, and boom, I had a really, really weak player, but one that looked like it'd seen the game I trained it on before, without a single rollout to learn from. So _something_ is possible here, fairly easily.
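In case anyone wants to try the same thing, the leaf-evaluation side really is tiny. A minimal sketch (the boolean[] base-proposition encoding is just how I happen to feed states in, and the class is made up for illustration; the Neuroph calls are the standard ones):

```java
import org.neuroph.core.NeuralNetwork;

// Minimal sketch: use a trained Neuroph net as the leaf evaluator in place
// of a random rollout. Assumes the state is encoded as one input per base
// proposition (1.0 = true, 0.0 = false) and a single sigmoid output.
public class NeuralLeafEvaluator {
    private final NeuralNetwork net;

    public NeuralLeafEvaluator(String trainedNetFile) {
        this.net = NeuralNetwork.createFromFile(trainedNetFile);
    }

    // Returns an estimated score on GGP's 0-100 goal scale.
    public double evaluate(boolean[] baseProps) {
        double[] inputs = new double[baseProps.length];
        for (int i = 0; i < baseProps.length; i++) {
            inputs[i] = baseProps[i] ? 1.0 : 0.0;
        }
        net.setInput(inputs);
        net.calculate();
        return 100.0 * net.getOutput()[0];   // sigmoid output in [0,1] -> 0-100
    }
}
```

Then, wherever the MCTS code would have backed up a rollout result, back up evaluate(stateProps) instead.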
Now the big issue for me is not technical, though that I'm sure would be a whole bag of fun. The issue is one of motivation, as I have a stronger draw to continue to attempt to make SteadyEddie competitive on Tiltyard; so many features, so little time. As it stands, I can't release my NN (NeuralNed) onto Tiltyard until it has learned all the games, and that would take AGES. Plus I'm not sure how I feel about learning/knowledge of games ahead of time, it's not very "general game playing".
Would anyone be interested in me scheduling a mini-tournament for NNs (I expect it'd be _very_ mini, mechanism unclear, maybe just me hosting locally, I dunno) playing a single game, or perhaps a small set, at some point some months in the future?
Post by Andrew Rose on Feb 8, 2016 2:57:11 GMT -8
(As we discussed last week, but recapping here and widening the audience...)
I've been doing from-scratch "temporal difference learning" - i.e. not doing supervised learning against values learned elsewhere. Still at very early stages. Current status is roughly as follows.
- I've only looked at the Stanford version of Breakthrough Small (although there's nothing specific to this in my code).
- I trained a neural network (1 input, 1 hidden, 1 output layer, sigmoid activation function) for ~6 hours (using Neuroph, so basically completely unoptimized).
- During play, I do full 4-ply minimax (without alpha-beta, I just haven't written that yet) and use the NN as an evaluation function at the cut-off point.
The result is a player that's roughly equivalent in strength to an MCTS player doing 50K rollouts.
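For concreteness, the search itself is just textbook fixed-depth minimax with the net at the horizon. A minimal sketch (the GameState interface is a hypothetical stand-in for the real state machine; evaluate() is the trained net's forward pass, rescaled to the 0-100 goal range):

```java
import java.util.List;

// Hypothetical stand-in for the real state machine's interface.
interface GameState {
    boolean isTerminal();
    double goalForUs();              // exact GGP goal value (0-100) at a terminal state
    List<GameState> successors();    // states reachable in one joint move
}

// Fixed-depth minimax with the NN as the evaluation function at the cut-off.
abstract class NeuralMinimax {
    // Forward pass of the trained net, rescaled to the 0-100 goal range.
    abstract double evaluate(GameState state);

    double minimax(GameState state, int depth, boolean maximising) {
        if (state.isTerminal()) {
            return state.goalForUs();   // exact value; no NN needed here
        }
        if (depth == 0) {
            return evaluate(state);     // NN estimate at the search horizon
        }
        double best = maximising ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
        for (GameState child : state.successors()) {
            double value = minimax(child, depth - 1, !maximising);
            best = maximising ? Math.max(best, value) : Math.min(best, value);
        }
        return best;
    }
}
```

At the root, you run this for each legal move and play the one with the best score.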
Some thoughts for future directions.
- Raw performance: GPU for training. (It's easier to batch up operations for bulk GPU processing when doing NN training than during regular MCTS rollouts, where latency renders the GPU useless.)
- Alpha-beta during playout.
- Instead of minimax, use the evaluation function to guide Sancho's normal processing?
- Feature detection. (At the moment, the inputs to the NN are the base propositions. But could we use e.g. our piece detection, etc. to provide extra inputs?)
- Topology modification. My NN has an odd topology. (The hidden layer is *much* larger than conventional wisdom recommends. This could be causing over-fitting.)
- For co-ordinate-based games, add a convolution layer?
Yes, I'd be interested in playing your NN player.
How about using the full-sized version of Breakthrough from the base repository, i.e. the one that Tiltyard uses (www.ggp.org/view/tiltyard/games/base/breakthrough/v0/)? Yes, some months in the future sounds good.
rxe
Junior Member
Posts: 61
Post by rxe on Feb 8, 2016 4:00:26 GMT -8
Sounds like fun, I would be up for this. I have never implemented an NN, so it will be a challenge. I agree with Andrew: full-sized Breakthrough could be a good game choice, as it has always been a challenging game for vanilla MCTS. To be clear though, are we talking about NN (or some other learning algorithm) players, without MCTS? I like the idea of a policy network (which would map game states to move probabilities) as well as a value (evaluation function) network. A policy network would work well (I think, I'm a bit clueless) for trimming the width of an MCTS tree.
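To sketch what I mean by trimming the width (just my guess at the idea: simplified AlphaGo-style selection, with a hypothetical Node interface): each child carries a prior from the policy network, and the selection rule adds a prior-weighted exploration bonus, so low-prior moves hardly ever get visited:

```java
import java.util.List;

// Hypothetical tree-node interface for the sketch.
interface Node {
    List<Node> children();
    int visits();
    double totalValue();   // sum of backed-up values
    double prior();        // policy net's probability for the move into this node
}

class PolicyGuidedSelection {
    private static final double C_PUCT = 1.5;   // exploration constant (tunable)

    // Pick the child to descend into: exploitation plus a prior-weighted
    // exploration bonus, so moves the policy net likes get explored first.
    static Node selectChild(Node parent) {
        Node best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Node child : parent.children()) {
            double exploit = child.visits() == 0 ? 0.0
                    : child.totalValue() / child.visits();
            double explore = C_PUCT * child.prior()
                    * Math.sqrt(parent.visits()) / (1 + child.visits());
            double score = exploit + explore;
            if (score > bestScore) {
                bestScore = score;
                best = child;
            }
        }
        return best;
    }
}
```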
qfwfq
New Member
Posts: 29
|
Post by qfwfq on Feb 10, 2016 15:08:33 GMT -8
I'd also be interested... at least in theory... but not sure if I can find the time, yet.
Post by Andrew Rose on Mar 1, 2016 13:00:41 GMT -8
The next round of the GGP Coursera course runs from March 28th to June 6th. How about we aim to run our NN-powered tournament at the same time as the end-of-course competition?
qfwfq
New Member
Posts: 29
|
Post by qfwfq on Mar 2, 2016 12:10:29 GMT -8
What rules? We can't train the network during metagaming.
Should we stick to GGP for everything (including the state machine) and just allow using an NN pre-trained on Breakthrough? (This is my preference.) Or do we want to focus on the NN and allow more freedom (e.g. adding Breakthrough-specific feature extractors in the code, etc.)?
I suggest that, for this first time, we leave the NN training process unrestricted (e.g. no time limit) and allow freedom in how the NN is used (e.g. a pure NN solution vs. NN+MCTS).
Breakthrough only, or also a second game like SpeedChess?
Post by Andrew Rose on Mar 2, 2016 14:24:32 GMT -8
I was certainly hoping to keep everything as general as possible, but just happen to be training my NN for Breakthrough. That said, for a first pass, I was probably expecting to lose some generality (e.g. perhaps my code will only work for 2-player games with no simultaneous play). I would certainly have thought hand-coding Breakthrough-specific features would be disallowed, although coding up automatic feature extractors which happened to find useful Breakthrough features (and would also find similar useful features in several other existing GGP games) might perhaps be just within the boundary of what's reasonable.
On the NN vs. NN+MCTS question, I'd be inclined away from NN+MCTS initially. That's mostly because I suspect MCTS alone might still whoop a pure NN, so testing NN+MCTS might mostly reveal who has the best MCTS rather than really help make progress with NNs. That said, I've already found that adding limited look-ahead to a pure NN helps considerably, so I'm swithering on this question.
I wouldn't want more than 2 games. And if we were going to add a 2nd game, I'd probably not do a chess-related thing (since we have Breakthrough already). I'd suggest taking the opportunity to make it slightly more general and do something like C4, Hex or 9BTTT.
I'm not really bothered though.
Post by steadyeddie on Mar 3, 2016 18:09:06 GMT -8
OK, as "organiser" (I use the term loosely), I'm going for the following rules (unless someone objects, in which case I'm listening): - NN only. Totally agree with ARR, that otherwise it'll be who has the best MCTS. There is no way to police it, so if anyone turns up with an MCTS or hybrid they'll win. - 2 games. Breakthrough and one other. I'll choose and be honest, construct an over engineered voting mechanism that makes it random for everyone, or ask a non player to choose. It'll be: - 2 player - zero sum - non-simultaneous
But in order to make it interesting and general, I propose we choose the second game with 7 days to go, so that we're testing the training process as well. Votes for/against?
rxe
Junior Member
Posts: 61
Post by rxe on Mar 4, 2016 1:21:37 GMT -8
I think NN with a 1-ply search maximum is the only way to pit NN against NN fairly.
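Concretely, I mean something like this and nothing deeper (all the types here are hypothetical stand-ins for a player's real ones):

```java
import java.util.List;

// Hypothetical stand-ins for the player's real types.
interface Move {}
interface GameState {
    List<Move> legalMoves();
    GameState apply(Move move);
}
interface Evaluator {
    double evaluate(GameState state);   // the value network's forward pass
}

class OnePly {
    // 1-ply selection: evaluate each legal move's successor and take the best.
    static Move pickMove(GameState state, Evaluator evaluator) {
        Move best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Move move : state.legalMoves()) {
            double value = evaluator.evaluate(state.apply(move));
            if (value > bestValue) {
                bestValue = value;
                best = move;
            }
        }
        return best;
    }
}
```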
I agree with what everyone has said, so:
+1 for GGP everything, i.e. the NN should only be architected/trained via the GDL and the meta-time analysis that players normally do, and NOT via hand-coded features for a specific game.
+1 for the dates.
+1 for a 2nd game (2-player, zero-sum, non-simultaneous) with 7 days of meta-time (effectively).
qfwfq
New Member
Posts: 29
|
Post by qfwfq on Mar 4, 2016 9:09:09 GMT -8
Good point about avoiding MCTS to focus on comparing the NNs. 1-ply isn't necessarily the only fair method; any number of plies would be fair provided we all use the same value. But I'm fine either way.
I'm still unclear on when you guys think the NN training should happen. If it's during the few minutes of meta-time, I suspect that would be too short to train a decent NN.
rxe
Junior Member
Posts: 61
|
Post by rxe on Mar 4, 2016 9:26:32 GMT -8
1-ply is fair if pitting a value network against a policy network (my preference was to train a policy network). A deeper search is advantageous for seeing solutions towards the end of a Breakthrough-like game, regardless of the evaluation function. But I'm not bothered if we go for more than 1 ply. I assumed NN training was offline: train the network, store it, and retrieve it at meta-time.
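To illustrate the contrast with the value-network version I sketched earlier: a 1-ply policy-network player doesn't need to evaluate successor states at all, it just takes the argmax of the net's move distribution (hypothetical types again):

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the player's real types.
interface Move {}
interface GameState {
    List<Move> legalMoves();
}
interface PolicyNet {
    // Probability distribution over the legal moves in this state.
    Map<Move, Double> moveProbabilities(GameState state);
}

class PolicyOnePly {
    // Pick the legal move the policy network rates most probable.
    static Move pickMove(GameState state, PolicyNet policy) {
        Map<Move, Double> probs = policy.moveProbabilities(state);
        Move best = null;
        double bestProb = -1.0;
        for (Move move : state.legalMoves()) {
            double p = probs.getOrDefault(move, 0.0);
            if (p > bestProb) {
                bestProb = p;
                best = move;
            }
        }
        return best;
    }
}
```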
qfwfq
New Member
Posts: 29
|
Post by qfwfq on Mar 4, 2016 11:22:09 GMT -8
What is a "policy network"? I only found references to it in the AlphaGo paper abstract, but I don't have the paper, as it doesn't seem to be freely available (does anybody have it?).