Post by Steve Draper on Jan 15, 2014 11:34:46 GMT -8
I've been idly wondering for a while about the possibility of a GPGPU-based propnet implementation over OpenCL or similar. Until very recently I thought it was impractical due to locality-of-reference issues that would result in impractically large kernels. However, a different thought occurred to me today, and I'm now thinking it might just possibly be practical after all.
Consider the following steps:
1) Construct the propnet topology in a standard way (just need the set of components and their connections - not something that actually runs, necessarily)
2) Using De Morgan's law transformations, convert all ANDs to ORs (introducing NOTs on the inputs and output as needed) - after this transformation your only multi-input components are ORs
3) Number the components from 1...N where N is the total number of components
4) A state of the network is now represented by a bit-vector of size N (the output state of every component)
5) For each component produce a bit-vector with a 1 at the position of every component that inputs into it (will only have a single bit set for all apart from the ORs)
6) Concatenate the bit-vectors from (5) into a full connectivity matrix. Note that calculating the next state of any component is then either an operation on a single bit (anything except an OR) or a vector dot product: an OR's output is 1 exactly when the dot product of its connectivity vector with the state vector is non-zero.
7) Construct a dependency numbering of the components as follows:
7.1) Beginning with the inputs to the network (DOES props, base state props) number all such components with a '1'.
7.2) Repeat, at each step numbering all components whose inputs are already numbered with the next number (2,3...) until all components are numbered
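As a toy illustration of steps (4)-(6) - the component numbering, connectivity rows, and gate types here are all hypothetical - the bit-vector representation and the OR-as-dot-product evaluation might look like this in Python:

```python
# Toy network, already De Morgan-converted so the only multi-input gates
# are ORs (hypothetical example; components 0 and 1 are network inputs).
N = 5
connectivity = [
    0b00000,  # 0: network input (DOES / base prop), no internal inputs
    0b00000,  # 1: network input
    0b00011,  # 2: OR with inputs {0, 1}
    0b00100,  # 3: NOT with input {2}
    0b01100,  # 4: OR with inputs {2, 3}
]
is_not = [False, False, False, True, False]

def next_value(state, i):
    """Next output of component i, given the current state bit-vector.

    The bit-vector 'dot product' reduces to a bitwise AND plus a
    non-zero test; a NOT simply inverts its single input.
    """
    fired = (connectivity[i] & state) != 0
    return (not fired) if is_not[i] else fired

state = 0b00001                       # only input 0 is true
state |= next_value(state, 2) << 2    # OR(0,1) fires
state |= next_value(state, 4) << 4    # OR(2,3) fires (input 2 is now set)
```

The "dot product" of two bit-vectors over booleans reduces to a bitwise AND followed by a non-zero test, which is presumably how a GPU thread (or a CPU sketch like this) would actually evaluate it.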
The numbering in (7) is such that the next states of all components with a given number can be calculated once all those with the preceding number are done, and (crucially) every gate with the same number can be done in parallel. Produce a bit-vector for each number (so an array of bit-vectors) which tells you what set of components can be computed in parallel at each step of next-state propagation (this is a partitioning of the component set).
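The numbering can be computed as a simple wavefront over the topology, assuming the combinational part of the network is acyclic (the transitions break the propnet's cycles). A sketch, using a hypothetical toy topology:

```python
# Map each component to the list of components feeding it (hypothetical
# toy network; 0 and 1 are the DOES / base-prop inputs).
inputs = {
    0: [],       # network input
    1: [],       # network input
    2: [0, 1],   # OR(0, 1)
    3: [2],      # NOT(2)
    4: [2, 3],   # OR(2, 3)
}

def dependency_levels(inputs):
    """Step (7): inputs get number 1; each later number covers every
    still-unnumbered component whose inputs are all already numbered."""
    level = {}
    current = 1
    while len(level) < len(inputs):
        frontier = [c for c, ins in inputs.items()
                    if c not in level and all(i in level for i in ins)]
        for c in frontier:
            level[c] = current
        current += 1
    return level

levels = dependency_levels(inputs)

# One bit-vector per number: which components can be computed in parallel.
masks = {}
for c, lvl in levels.items():
    masks[lvl] = masks.get(lvl, 0) | (1 << c)
```

Note that component 4 lands on its own number here even though it is an OR like component 2, because one of its inputs arrives via the NOT - the numbering tracks depth, not gate type.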
Assuming the cardinality of the bit vectors in the array is reasonably large (i.e. - you can generally calculate many components in parallel) then each one forms the basis for one GPU kernel invocation wherein a GPU thread computes the next state for a single component as a vector dot product (or alternatively frame it as a matrix multiply and give it to an off-the-shelf GPU matrix package).
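Framed as a matrix-vector product (the "off-the-shelf matrix package" option), one propagation step for a single dependency number might look like the NumPy sketch below; the connectivity matrix C, the NOT mask, and the level mask are all hypothetical toy data:

```python
import numpy as np

# Hypothetical 5-component network: row i of C marks the inputs of i.
N = 5
C = np.zeros((N, N), dtype=np.uint8)
C[2, [0, 1]] = 1                      # component 2: OR(0, 1)
C[3, 2] = 1                           # component 3: NOT(2)
C[4, [2, 3]] = 1                      # component 4: OR(2, 3)
not_mask = np.array([False, False, False, True, False])

def propagate_level(state, level_mask):
    """Recompute only the components in level_mask, via one matrix product."""
    fired = (C @ state) > 0                    # OR semantics: any input true?
    fired = np.where(not_mask, ~fired, fired)  # invert outputs of NOT gates
    return np.where(level_mask, fired, state.astype(bool)).astype(np.uint8)

state = np.array([1, 0, 0, 0, 0], dtype=np.uint8)   # only input 0 is true
level2 = np.array([False, False, True, False, False])
state = propagate_level(state, level2)              # component 2 turns on
```

On a real GPU you would hand C (or the sub-matrix for one level) to a matrix-multiply kernel, with one thread per component computing its row's dot product.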
Does this have legs do you think...?