Post by steadyeddie on Dec 18, 2016 6:31:58 GMT -8
My dream is to create a player with as few as possible, or better still zero, bits of code specific to particular kinds of games. I will not, for example, add piece heuristics to my player.
I mentioned above that I didn't have great results when I tried to use states to calculate RAVE scores. And that fact put me off doing what I intended to do with my GGP player from the start, which is to get it to play in the way I play chess as best I can. Clearly that involves a lot of decision making that is based on state, rather than moves.
Sadly events overtook me somewhat, and the AlphaGo people showed us what is possible with a non-general game player. Their Nature paper is actually fairly intelligible, even to me. Whereas, sorry academic chaps, the normal GGP ones I find unnecessarily obfuscated. (If someone could translate the Woodstock paper into something I can understand, that'd be grand.)
First, here are some ideas that didn't work.
I tried using "neuroph", a popular Java-based neural network library. It took too long to train: a whole night for something a bit iffy. A bit iffy might be OK in conjunction with MCTS, but a whole night, not so much. I intended to train it slowly and store the results, but that felt like cheating. So with reluctance I figured I'd have to try something a little more tractable, and having a maths background I know the way you usually do this is by assuming linearity, because linear systems are usually soluble. And maybe the outrageous simplification is not so important, as I trust MCTS to sort out the bumps.
So, I imagined a "neural network" with linear functions between the nodes. Of course a composition of linear maps is itself linear, so any intermediate layers collapse away, and we are really just talking about a dot product between the input vector and a weight vector. And there are standard ways of finding the "least squares" fit for such a system.
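To make this concrete, here's a minimal Java sketch of what the collapsed "network" amounts to. Because everything is linear, all you are left with is one weight vector, and since a GGP state is a set of true/false propositions the dot product degenerates to summing the weights of the propositions that are on. (LinearEval and its names are illustrative, not my actual code.)

```java
import java.util.BitSet;

// One weight per base proposition; with linear "activations" the whole
// network collapses to this single vector plus a bias term.
final class LinearEval {
    private final double[] weights;
    private final double bias;

    LinearEval(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    // State encoded as one bit per proposition: the dot product is just
    // the sum of the weights of the propositions that are currently true.
    double evaluate(BitSet state) {
        double score = bias;
        for (int i = state.nextSetBit(0); i >= 0; i = state.nextSetBit(i + 1)) {
            score += weights[i];
        }
        return score;
    }
}
```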
Least squares of what? Well, I've tried a few options. One option is to use the positions from the MCTS tree. These are more accurate, as they are averaged over multiple rollouts. Unfortunately the tree is never really that deep, so between those positions most of the state doesn't change. Instead I tried positions from rollouts. But which ones? Taking them all leads to too many positions that are nearly the same, so currently I sample at a rate of 1 in 100. And I'm confused about whether to include terminal positions or not. I currently don't include them, but I'm pretty sure that's not right.
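The sampling itself is nothing clever; a sketch along these lines, with made-up names, where each sampled state gets the rollout's final score attached as its regression target once the rollout finishes:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Random;

// Collects roughly 1 in 100 non-terminal rollout positions as training data.
final class RolloutSampler {
    private static final int SAMPLE_RATE = 100;
    private final Random rng = new Random();
    private final List<BitSet> pending = new ArrayList<>();
    final List<BitSet> states = new ArrayList<>();
    final List<Double> targets = new ArrayList<>();

    // Call once per non-terminal state visited during a rollout.
    void maybeSample(BitSet state) {
        if (rng.nextInt(SAMPLE_RATE) == 0) {
            pending.add((BitSet) state.clone());
        }
    }

    // Call when the rollout terminates: every state sampled from this
    // rollout gets the rollout's final score as its least-squares target.
    void finishRollout(double finalScore) {
        for (BitSet s : pending) {
            states.add(s);
            targets.add(finalScore);
        }
        pending.clear();
    }
}
```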
And how to do least squares? QR decomposition is one such way. Unfortunately it does not cope at all well with columns that never change, and there are often many of those. You can preprocess the data to remove the constant columns, but it's a bit annoying and prone to error. Instead I use SVD, and that worked out a lot better: the constant columns just show up as zero singular values on the diagonal, the pseudo-inverse ignores them, and the weights drop out. I used Apache's QR and SVD routines, but I found that the expensive decomposition work they do inside their constructors did bad things to rollouts running at the same time. So I hacked the time-intensive initialization out into a separate routine, and the JVM issues went away.
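For reference, the SVD fit with Apache Commons Math comes out roughly like this. Note that the expensive decomposition runs inside the SingularValueDecomposition constructor, which is exactly the work I had to hive off; this sketch just shows the fit itself.

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.RealVector;
import org.apache.commons.math3.linear.SingularValueDecomposition;

final class SvdFit {
    // rows: one 0/1 feature per proposition for each sampled state;
    // targets: the rollout scores. Returns the least-squares weights.
    // Constant columns produce zero singular values, which the
    // pseudo-inverse based solver simply ignores.
    static double[] fit(double[][] rows, double[] targets) {
        SingularValueDecomposition svd =
                new SingularValueDecomposition(new Array2DRowRealMatrix(rows, false));
        RealVector weights = svd.getSolver().solve(new ArrayRealVector(targets, false));
        return weights.toArray();
    }
}
```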
But though I got decent enough results with that, I wasn't convinced. I think the issue is that the computation takes so long that too few positions can be processed to make a decent eval function. It's pretty good, but not brilliant.
So instead I go for a "nudge" approach, similar to (I think) the way a neural network is trained. Basically you feed in each position, and for each one you nudge the weight of every active state element up or down so that the resulting output moves closer to the actual rollout result. Then you hope that, with enough data, the genuinely important elements get "+1"'d statistically more often, until they settle at the values they should have. And ta-da, all your latches and piece heuristics drop out?
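In code the nudge is, I believe, just the classic delta rule; a sketch, where the learning rate and the names are made up:

```java
import java.util.BitSet;

// Incremental "nudge" training: one small weight update per sampled position.
final class NudgeTrainer {
    private final double[] weights;
    private final double learningRate; // e.g. 0.01, needs tuning

    NudgeTrainer(int numPropositions, double learningRate) {
        this.weights = new double[numPropositions];
        this.learningRate = learningRate;
    }

    // error = rollout result minus current prediction; every proposition
    // that is on in this state gets nudged by learningRate * error.
    void nudge(BitSet state, double target) {
        double prediction = 0.0;
        for (int i = state.nextSetBit(0); i >= 0; i = state.nextSetBit(i + 1)) {
            prediction += weights[i];
        }
        double error = target - prediction;
        for (int i = state.nextSetBit(0); i >= 0; i = state.nextSetBit(i + 1)) {
            weights[i] += learningRate * error;
        }
    }
}
```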
Well, in theory. But I still have a problem with the RAVE eval: the move version still works better. I don't know why; perhaps it's bugged. But what about the rollouts? In theory that's the most exciting bit, because as you use a better policy for the rollout, surely the rollouts get better, and hence the data you get out gets better, in a wonderful reinforcement-learning loop?
Well, one problem is the huge drop in rollout count. I'm afraid I'm rather brute force about this: at each step of the rollout I look at all the legal moves and work out, for each one, what changes it would make to the state. This is heavy on computation, roughly proportional to the number of moves per ply, though in some games (C4, for example) a lot is clawed back because the rollout is shorter when better moves are played.
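The rollout step then looks something like the sketch below, reusing the LinearEval idea from earlier. GameState and Move are stand-ins for whatever the underlying state machine provides; the point is simply that every legal move costs one state update plus one evaluation.

```java
import java.util.BitSet;
import java.util.List;

// Stand-in interfaces for the GGP state machine.
interface Move {}
interface GameState {
    List<Move> legalMoves();
    GameState apply(Move m);
    BitSet propositions();
}

final class GreedyRolloutPolicy {
    private final LinearEval eval;

    GreedyRolloutPolicy(LinearEval eval) {
        this.eval = eval;
    }

    // Score every successor state and play the best: cost per step is
    // roughly (moves per ply) * (one state update + one evaluation),
    // which is where the rollout count collapses.
    Move choose(GameState state) {
        Move best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Move m : state.legalMoves()) {
            double score = eval.evaluate(state.apply(m).propositions());
            if (score > bestScore) {
                bestScore = score;
                best = m;
            }
        }
        return best;
    }
}
```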
Also, my rollouts now suffer from patzer-sees-a-check in spades. I think I need to look 2 ply down the tree, but that's going to be even worse computationally. Oh, and it's not clear, now I've got a "nudge" approach, whether I need things to be linear at all.
So, a work in progress.
There you go, 16 posts containing all the SteadyEddie design. Hope you enjoyed them.