Post by Steve Draper on Jan 15, 2014 11:34:46 GMT -8
I've been idly wondering for a while about the possibility of a GPGPU-based propnet implementation over OpenCL or similar. Until very recently I thought it was impractical due to locality-of-reference issues that would result in impractically large kernels. However, a different thought occurred to me today, and I'm now thinking it might just possibly be practical after all.
Consider the following steps:
1) Construct the propnet topology in a standard way (just need the set of components and their connections - not something that actually runs, necessarily)
2) Using De Morgan's law transformations, convert all ANDs to ORs (introducing NOTs on the inputs and output as needed) - after this transformation your only multi-input components are ORs
3) Number the components from 1...N where N is the total number of components
4) A state of the network is now represented by a bit-vector of size N (the output state of every component)
5) For each component produce a bit-vector with a 1 at the position of every component that inputs into it (will only have a single bit set for all apart from the ORs)
6) Concatenate the bit-vectors from (5) into a full connectivity matrix. Note that calculating the next state of any component is then either an operation on a single bit (anything except an OR) or a vector dot product: an OR's output is 1 exactly when the dot product of its connectivity vector with the state vector is non-zero.
7) Construct a dependency numbering of the components as follows:
7.1) Beginning with the inputs to the network (DOES props, base state props) number all such components with a '1'.
7.2) Repeat, at each step numbering all components whose inputs are already numbered with the next number (2,3...) until all components are numbered
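As a toy illustration of steps (4)-(6) - the component numbering, connectivity rows, and gate types here are all hypothetical - the bit-vector representation and the OR-as-dot-product evaluation might look like this in Python:

```python
# Toy network, already De Morgan-converted so the only multi-input gates
# are ORs (hypothetical example; components 0 and 1 are network inputs).
N = 5
connectivity = [
    0b00000,  # 0: network input (DOES / base prop), no internal inputs
    0b00000,  # 1: network input
    0b00011,  # 2: OR with inputs {0, 1}
    0b00100,  # 3: NOT with input {2}
    0b01100,  # 4: OR with inputs {2, 3}
]
is_not = [False, False, False, True, False]

def next_value(state, i):
    """Next output of component i, given the current state bit-vector.

    The bit-vector 'dot product' reduces to a bitwise AND plus a
    non-zero test; a NOT simply inverts its single input.
    """
    fired = (connectivity[i] & state) != 0
    return (not fired) if is_not[i] else fired

state = 0b00001                       # only input 0 is true
state |= next_value(state, 2) << 2    # OR(0,1) fires
state |= next_value(state, 4) << 4    # OR(2,3) fires (input 2 is now set)
```

The "dot product" of two bit-vectors over booleans reduces to a bitwise AND followed by a non-zero test, which is presumably how a GPU thread (or a CPU sketch like this) would actually evaluate it.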
The numbering in (7) is such that the next states of all components with a given number can be calculated once all those with the preceding number are done, and (crucially) every gate with the same number can be done in parallel. Produce a bit-vector for each number (so an array of bit-vectors) which tells you what set of components can be computed in parallel at each step of next-state propagation (this is a partitioning of the component set).
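The numbering can be computed as a simple wavefront over the topology, assuming the combinational part of the network is acyclic (the transitions break the propnet's cycles). A sketch, using a hypothetical toy topology:

```python
# Map each component to the list of components feeding it (hypothetical
# toy network; 0 and 1 are the DOES / base-prop inputs).
inputs = {
    0: [],       # network input
    1: [],       # network input
    2: [0, 1],   # OR(0, 1)
    3: [2],      # NOT(2)
    4: [2, 3],   # OR(2, 3)
}

def dependency_levels(inputs):
    """Step (7): inputs get number 1; each later number covers every
    still-unnumbered component whose inputs are all already numbered."""
    level = {}
    current = 1
    while len(level) < len(inputs):
        frontier = [c for c, ins in inputs.items()
                    if c not in level and all(i in level for i in ins)]
        for c in frontier:
            level[c] = current
        current += 1
    return level

levels = dependency_levels(inputs)

# One bit-vector per number: which components can be computed in parallel.
masks = {}
for c, lvl in levels.items():
    masks[lvl] = masks.get(lvl, 0) | (1 << c)
```

Note that component 4 lands on its own number here even though it is an OR like component 2, because one of its inputs arrives via the NOT - the numbering tracks depth, not gate type.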
Assuming the cardinality of the bit vectors in the array is reasonably large (i.e. - you can generally calculate many components in parallel) then each one forms the basis for one GPU kernel invocation wherein a GPU thread computes the next state for a single component as a vector dot product (or alternatively frame it as a matrix multiply and give it to an off-the-shelf GPU matrix package).
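Framed as a matrix-vector product (the "off-the-shelf matrix package" option), one propagation step for a single dependency number might look like the NumPy sketch below; the connectivity matrix C, the NOT mask, and the level mask are all hypothetical toy data:

```python
import numpy as np

# Hypothetical 5-component network: row i of C marks the inputs of i.
N = 5
C = np.zeros((N, N), dtype=np.uint8)
C[2, [0, 1]] = 1                      # component 2: OR(0, 1)
C[3, 2] = 1                           # component 3: NOT(2)
C[4, [2, 3]] = 1                      # component 4: OR(2, 3)
not_mask = np.array([False, False, False, True, False])

def propagate_level(state, level_mask):
    """Recompute only the components in level_mask, via one matrix product."""
    fired = (C @ state) > 0                    # OR semantics: any input true?
    fired = np.where(not_mask, ~fired, fired)  # invert outputs of NOT gates
    return np.where(level_mask, fired, state.astype(bool)).astype(np.uint8)

state = np.array([1, 0, 0, 0, 0], dtype=np.uint8)   # only input 0 is true
level2 = np.array([False, False, True, False, False])
state = propagate_level(state, level2)              # component 2 turns on
```

On a real GPU you would hand C (or the sub-matrix for one level) to a matrix-multiply kernel, with one thread per component computing its row's dot product.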
Does this have legs do you think...?