Post by steadyeddie on Dec 18, 2016 6:17:37 GMT -8
SteadyEddie (in competitions) now runs at about 120,000-250,000 (sometimes 300,000) real rollouts per second. OK, there are complex games where this doesn't hold, but it's true often. Why? Because I've recently started experimenting with the EC2 m4.16xlarge instance, which has 64 cores, so each rollout thread only needs to bring in ~2,000 rollouts per second, and at that point the control thread starts to become the bottleneck (see below).
A detour into EC2 instances (as of Dec 2016). The 64 core systems are new, and for cost effectiveness I recommend spot instances, where you can get an 80+% discount if you are prepared for your instance to die at any time. The finals day cost me $3 for a 64 core box. Even more exciting is the x1.32xlarge instance with 128 cores. But be careful with that one: the spot price really does go up to over $10 an hour (it is usually $1/hour), so you'll need deep pockets or be very careful. Oh, and you have to ask nicely for them too. I said I was exploring ultra high end machine learning performance and that was good enough. I'm sure in years to come 128 cores will look like peanuts, but for now, wow, 128 cores. You need to run there at least once, and hope you get some nice games dealt to you from Tiltyard.
Writing a pipeline which needs to push 150,000 items per second from one thread out to 50, and another which funnels the output of 50 threads back into a single thread, has proven to be an engineering challenge.
First I used "synchronized", which was woefully inadequate for the task. I seem to recall the low 10,000s of lock grabs per second maxed out the control thread. So I tried the java.util.concurrent package classes, which seemed likely to be the right tool for the job. They too maxed out before 100,000 rollouts per second total.
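For anyone wanting to reproduce the experiment, the java.util.concurrent attempt might have looked roughly like this minimal sketch. The class name, worker count, item type, and the "rollout" (a trivial doubling) are all stand-ins for illustration, not the actual engine code; the fan-out/fan-in shape is the point:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockingPipeline {

    /** Fan work out to nWorkers threads and funnel results back; returns the summed results. */
    static long run(int nItems, int nWorkers) {
        BlockingQueue<Integer> work = new ArrayBlockingQueue<>(1024);
        BlockingQueue<Integer> results = new ArrayBlockingQueue<>(1024);

        Thread[] workers = new Thread[nWorkers];
        for (int i = 0; i < nWorkers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        int item = work.take();
                        if (item < 0) break;      // poison pill: shut this worker down
                        results.put(item * 2);    // stand-in for performing a rollout
                    }
                } catch (InterruptedException ignored) { }
            });
            workers[i].start();
        }

        try {
            for (int i = 0; i < nItems; i++) work.put(i);      // control thread: fan out
            for (int i = 0; i < nWorkers; i++) work.put(-1);   // one pill per worker

            long sum = 0;
            for (int i = 0; i < nItems; i++) sum += results.take();  // fan in
            for (Thread t : workers) t.join();
            return sum;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 4));
    }
}
```

Every `put`/`take` here contends on the queue's internal lock, which is why a design like this hits a ceiling long before 150,000 items per second across 50 threads.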
So eventually I wrote simple classes to pass the work items using atomic operations, and overlaid them with hand-written queueing classes. If you do this yourself you will need to use the "volatile" keyword to make sure a value written by one thread actually becomes visible to other threads. Very exciting!
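To give a flavour of the idea (this is my own minimal single-producer/single-consumer sketch, not the actual classes from the player): a fixed ring buffer where each side advances its own volatile index. The volatile write to `tail` publishes the slot to the consumer, and the volatile write to `head` frees it for the producer, so neither thread ever takes a lock:

```java
// Minimal lock-free single-producer/single-consumer ring buffer.
// Capacity must be a power of two so index wrap-around is a cheap mask.
public class SpscQueue {
    private final long[] buffer;
    private final int mask;
    private volatile long head = 0; // next slot the consumer will read
    private volatile long tail = 0; // next slot the producer will write

    public SpscQueue(int capacityPowerOfTwo) {
        buffer = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    /** Producer side only. Returns false if the queue is full. */
    public boolean offer(long value) {
        long t = tail;
        if (t - head == buffer.length) return false;  // full
        buffer[(int) (t & mask)] = value;
        tail = t + 1;  // volatile write: publishes the slot to the consumer
        return true;
    }

    /** Consumer side only. Returns null if the queue is empty. */
    public Long poll() {
        long h = head;
        if (h == tail) return null;                   // empty
        long v = buffer[(int) (h & mask)];
        head = h + 1;  // volatile write: frees the slot for the producer
        return v;
    }

    public static void main(String[] args) throws InterruptedException {
        SpscQueue q = new SpscQueue(1024);
        Thread producer = new Thread(() -> {
            for (long i = 1; i <= 100_000; i++)
                while (!q.offer(i)) Thread.onSpinWait();  // spin until a slot frees up
        });
        producer.start();
        long sum = 0;
        for (int n = 0; n < 100_000; ) {
            Long v = q.poll();
            if (v != null) { sum += v; n++; } else Thread.onSpinWait();
        }
        producer.join();
        System.out.println(sum);
    }
}
```

Note this only works with exactly one producer and one consumer per queue; the fan-out/fan-in pipeline needs one such queue per rollout thread in each direction, with the control thread polling across them.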
Going forward it seems as if I may be ahead of the game for now, but better designs are clearly available.