Summary

The last time we worked together on a project was when we were freshmen in CS 2, writing an alpha-beta pruning algorithm to play Othello for the final tournament of the class. Our algorithm was pretty good, but not great - it ended up maybe 10th out of about 50, and one of our biggest problems was that if we tried to increase our search depth in the decision tree, we would run out of computation time. We decided to try to parallelize the algorithm using the GPU and compare its performance to the serialized version on several different board sizes and search depths.


Background

Games like Othello, Chess, and Go have not yet been solved due to their high complexity, and the decision trees used to play them cannot be searched to very high depths because of the games’ high branching factors. The minimax algorithm, for example, generates a complete decision tree down to a specified depth. Alpha-beta pruning is a technique built on top of minimax that reduces the number of branches to be searched by determining which branches cannot affect the outcome anyway, and “pruning” these away. Even with alpha-beta pruning, though, search depths are still very limited, as the time needed to run the algorithm increases by approximately one order of magnitude for every one-level increase in depth. Thus, we thought that it might be interesting to see if we could parallelize the algorithm to make it run faster. Because the decision tree generated is huge and has many nodes to search through, dividing up the work between multiple threads in the GPU seemed like a promising way to speed up the computation.
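To make the pruning rule concrete, here is a minimal C++ sketch of alpha-beta search over an explicit game tree. The Node type and tree shape are illustrative stand-ins, not our Othello code:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// A generic game-tree node: a leaf carries a static score,
// an interior node carries its children.
struct Node {
    int score = 0;               // static evaluation, used at leaves
    std::vector<Node> children;  // empty => leaf
};

int alphabeta(const Node &n, int depth, int alpha, int beta, bool maximizing) {
    if (depth == 0 || n.children.empty()) return n.score;
    if (maximizing) {
        int best = std::numeric_limits<int>::min();
        for (const Node &c : n.children) {
            best = std::max(best, alphabeta(c, depth - 1, alpha, beta, false));
            alpha = std::max(alpha, best);
            if (beta <= alpha) break;  // prune: the minimizer already has a better option elsewhere
        }
        return best;
    } else {
        int best = std::numeric_limits<int>::max();
        for (const Node &c : n.children) {
            best = std::min(best, alphabeta(c, depth - 1, alpha, beta, true));
            beta = std::min(beta, best);
            if (beta <= alpha) break;  // prune
        }
        return best;
    }
}
```

The `break` statements are the whole point: bounds discovered in earlier branches let later branches be skipped, which is exactly the sequential dependency that makes GPU parallelization tricky.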

However, there are aspects to alpha-beta pruning that pose particular challenges to parallelization on the GPU. The search is performed depth-first, and results from earlier branches are used to determine whether later ones should be examined or not. Parallelizing necessarily sacrifices some of this information, since if we want to search multiple branches in parallel those branches do not have the bounds from each other to work with. Because of this, we may actually end up searching branches that would have been pruned in the serialized version. In theory, a thread searching one branch could update the other threads with its bounds when it finishes, but in practice this is difficult to implement on the GPU as it would require blocks to be able to break other blocks out of recursive function calls.


Approach

In spite of the aforementioned difficulties, we decided to attempt to parallelize and optimize the alpha-beta pruning algorithm because we were curious whether it would still be faster than the serialized version, and if so, by how much. Although this is not a massively parallelizable application (such as matrix transpose, for instance), we thought it would still be interesting to see what kind of performance improvements we could make. We first implemented the standard serialized alpha-beta pruning algorithm (built on top of a basic Othello board implementation from CS 2), along with a heuristic function for scoring a board state. The heuristic considers the total number of pieces, mobility (the number of moves we can make versus the number our opponent can make), board weights (positions such as the corners and edges are desirable), and “frontier” pieces, which are particularly vulnerable to capture. For example, here is the function used to count the number of moves for a given side and current board state:


      int Board::countMoves(Side side) {
          int count = 0;
          for (int i = 0; i < BOARD_SIZE; i++) {
              for (int j = 0; j < BOARD_SIZE; j++) {
                  Move move(i, j);
                  if (checkMove(&move, side)) count++;
              }
          }
          return count;
      }
      

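To illustrate how a scorer might combine the four components described above, here is a hedged sketch. The feature names and weights are made up for illustration; they are not the coefficients our actual heuristic uses:

```cpp
// Illustrative only: feature struct and weights are hypothetical,
// not the project's real scoring coefficients.
struct BoardFeatures {
    int pieceDiff;     // (our pieces) - (opponent's pieces)
    int mobilityDiff;  // countMoves(us) - countMoves(opponent)
    int weightScore;   // sum of positional weights (corners, edges favored)
    int frontierDiff;  // (opponent's frontier pieces) - (ours); frontier pieces are vulnerable
};

int scoreBoard(const BoardFeatures &f) {
    // A simple linear combination; real weights would be tuned by experiment.
    return 1 * f.pieceDiff + 4 * f.mobilityDiff + 3 * f.weightScore + 2 * f.frontierDiff;
}
```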
We then moved on to parallelizing the algorithm. The first thing we had to decide was how to parallelize the depth-first search of the decision tree. We chose to start with the PV-split algorithm, and then optimize on top of it.

Please see the master branch of our GitHub repo.

The PV-split algorithm works by first searching down one branch of the tree on the CPU, and then parallelizing the remaining nodes to be searched on the GPU. This allows the GPU branches to use the bounds from the first CPU-searched branch. Our first attempt to get a working parallel algorithm involved using one block of threads per child node, and then searching down the subtree rooted at that child node with the block. The results from this are summarized in the chart (Table 1) and graph (Figure 2) below.
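The root-level control flow of PV-split can be sketched as follows. Here `pvSplitRoot` and its `search` callback are hypothetical stand-ins: `search` represents a full alpha-beta search of one child's subtree, and on the GPU the loop over the remaining children would become one block per child, each seeded with the bound from the CPU-searched first child:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Sketch of PV-split at the root of the tree (names are illustrative).
int pvSplitRoot(const std::vector<int> &children,
                const std::function<int(int child, int alpha, int beta)> &search) {
    const int INF = std::numeric_limits<int>::max();
    // Step 1: the CPU searches the principal variation (first child) serially.
    int alpha = search(children[0], -INF, INF);
    // Step 2: the remaining children are searched (in parallel on the GPU),
    // each starting with the bound obtained in step 1.
    int best = alpha;
    for (std::size_t i = 1; i < children.size(); i++) {
        best = std::max(best, search(children[i], alpha, INF));
    }
    return best;
}
```

The tradeoff is visible in the sketch: the parallel searches share the first child's bound, but not the bounds each of them discovers along the way.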


We see an increase in performance as the board grows larger: with a more complex board there is more work to be done, so the GPU is able to gain more of an advantage over the CPU. The drop-off in performance as the depth increases is likely due to the fact that at each level of the tree, all of the nodes (except the one explored by the CPU) have their subtrees explored by only one block of threads. Thus, as the depth increases, we have GPU threads searching increasingly large subtrees - even though there may be many threads searching in parallel, each one has to search its subtree serially. We address this issue in one of our later optimizations.

Benchmarking

We used mx to benchmark our results. Each board size and search depth combination was run several times across several days to ensure reliable results, since the performance of mx tends to vary.

As with our problem sets, we used CUDA events to time the performance of the players. We define the CPU and GPU runtimes as the total time Player and GPUPlayer, respectively, spend making moves until the game board is done (there are no more available moves). For example, this is how we time the CPU:


      START_TIMER();
      while (!board->isDone()) {
          // get the current player's move
          if (turn == BLACK) {
              m = player1->doMove(m);
          }
          else {
              m = player2->doMove(m);
          }

          if (!board->checkMove(m, turn)) {
              cout << "Illegal move made: " << turn << " address: " << m << endl;
          }

          // make move once it is determined to be legal
          board->doMove(m, turn);

          // switch players
          if (turn == BLACK) {
              turn = WHITE;
          }
          else {
              turn = BLACK;
          }
      }
      STOP_RECORD_TIMER(cpu_ms);
      

Note that descriptions of Player and GPUPlayer can be found below.
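One plausible definition of the START_TIMER and STOP_RECORD_TIMER macros, built on the CUDA event API, is sketched below; the project's actual macros may differ in detail:

```cuda
#include <cuda_runtime.h>

// Hedged sketch: timing macros based on CUDA events. cudaEventElapsedTime
// reports milliseconds between the two recorded events.
cudaEvent_t start_event, stop_event;

#define START_TIMER() {                                        \
    cudaEventCreate(&start_event);                             \
    cudaEventCreate(&stop_event);                              \
    cudaEventRecord(start_event);                              \
}

#define STOP_RECORD_TIMER(ms_out) {                            \
    cudaEventRecord(stop_event);                               \
    cudaEventSynchronize(stop_event);                          \
    cudaEventElapsedTime(&ms_out, start_event, stop_event);    \
    cudaEventDestroy(start_event);                             \
    cudaEventDestroy(stop_event);                              \
}
```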


Optimizations

Of course, there were still many aspects of our naive parallel alpha-beta pruning algorithm that could potentially be parallelized further. Here are some things that we tried.

Dynamic Parallelism of Scoring Game Boards

Please see the dynamic branch of our GitHub repo.

One optimization we implemented was using dynamic parallelism for scoring the game board of a given node. We followed the approach of the serialized version, with the exception of the countMoves and getFrontierScore functions, which we replaced by kernels. For example, the following shows the replacement of the countMoves function (as shown above in the Approach section) by the cudaCountMovesKernel kernel:


      __global__
      void cudaCountMovesKernel(DeviceBoard *board, Side maximizer, int *maximizerScore) {

          unsigned int index = blockDim.x * blockIdx.x + threadIdx.x;
          int maximizerSum = 0;
          while (index < BOARD_SIZE * BOARD_SIZE) {
              int x = index % BOARD_SIZE;
              int y = index / BOARD_SIZE;
              Move move(x, y);
              maximizerSum += board->checkMove(&move, maximizer);
              index += blockDim.x * gridDim.x;
          }

          // warp shuffle reduction
          maximizerSum = warpReduceSum(maximizerSum);
          if (threadIdx.x == 0) {
              *maximizerScore = maximizerSum;
          }
      }
      

In each kernel, we use warp shuffle reduction instead of atomicAdd to reduce the serialization of multiple threads adding to the global score variable. Additionally, to minimize the overhead of calling multiple kernels, we later rewrote the cudaCountMovesKernel kernel shown above to calculate the number of moves for both the maximizer and minimizer with one kernel call.
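A warpReduceSum helper along these lines (one plausible implementation, using the warp shuffle intrinsics; on older CUDA versions, __shfl_down without the mask argument) would be:

```cuda
// Hedged sketch of the warpReduceSum helper used in the kernel above.
// Assumes the kernel runs with a full 32-thread warp; after the loop,
// lane 0 holds the sum across the warp.
__device__ int warpReduceSum(int val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}
```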

We launch our kernels with one block of 32 threads to limit use of the GPU's thread block scheduler, which would otherwise reduce performance, while still keeping a full warp of threads busy. Finally, to give our kernels the chance to run concurrently, we launch them into separate streams; however, actual concurrency of the kernels is not guaranteed [2]. Despite these uncertainties, we saw noticeable differences between the GPU speedup factors of the dynamic parallelism and naive approaches. Specifically, for larger boards and higher depths, dynamic parallelism results in a large speedup in comparison to the naive implementation. This behavior is to be expected: for larger boards and higher depths, there are more positions on the game board over which to parallelize the score computation, so we are better able to hide the overhead of the extra kernel calls. The results from dynamic parallelism of the scoring are summarized in the chart (Table 2) and graph (Figure 3) below.
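The stream setup might look like the following fragment; cudaGetFrontierScoreKernel is a hypothetical name for the frontier-scoring kernel mentioned above, and the arguments are illustrative:

```cuda
// Hedged sketch: launch the two scoring kernels into separate streams
// so they can potentially overlap, then wait for both.
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaCountMovesKernel<<<1, 32, 0, s1>>>(board, maximizer, maximizerScore);
cudaGetFrontierScoreKernel<<<1, 32, 0, s2>>>(board, maximizer, frontierScore);
cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```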


Full Dynamic Parallelism

Please see the full dynamic branch of our GitHub repo.

We also tried to fully parallelize the decision tree search by using dynamic parallelism to launch a new kernel for each node to be searched, instead of using one block to fully search a subtree. This involved replacing the recursive __device__ search function with a recursive __global__ function and unwrapping some of our classes, but the overall structure remained the same as in the naive algorithm: the CPU searches one branch of the tree, and the GPU uses the resulting bounds to search the remaining nodes in parallel. The possible tradeoff of this approach is the overhead of launching so many kernels, since the naive version makes far fewer kernel calls, but it should provide speedups at higher depths because of the aforementioned issue of GPU threads having to search entire subtrees serially.
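Structurally, a recursive __global__ search with dynamic parallelism looks like the following skeleton; Node, evaluate, and the score reduction are simplified placeholders rather than our actual code:

```cuda
// Skeleton of device-side recursion via CUDA dynamic parallelism
// (compute capability 3.5+, compiled with -rdc=true). Types and
// helpers here are illustrative placeholders.
__global__ void searchKernel(Node *node, int depth, int alpha, int beta) {
    if (depth == 0 || node->numChildren == 0) {
        node->score = evaluate(node);
        return;
    }
    // Launch one child kernel per child node, from the device.
    for (int i = 0; i < node->numChildren; i++) {
        searchKernel<<<1, 1>>>(node->children[i], depth - 1, alpha, beta);
    }
    cudaDeviceSynchronize();  // device-side wait for child kernels (pre-CUDA-12 semantics)
    // ... reduce the children's scores into node->score,
    //     swapping and negating bounds as alpha-beta requires ...
}
```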

We found that the fully parallelized approach behaved roughly as expected. For instance, on a 32x32 board, for a depth 1 tree we got:

> ./testgame
      Starting CPU game...
      CPU Game completed.
      Black score: 586
      White score: 438

      Starting GPU game...
      GPU Game completed.
      Black score: 586
      White score: 438

      CPU time: 25287.4 milliseconds
      GPU time: 9014.88 milliseconds
      Speedup factor: 2.80507
      

while for a depth 2 tree on the 32x32 board we got:

> ./testgame
      Starting CPU game...
      CPU Game completed.
      Black score: 842
      White score: 182

      Starting GPU game...
      GPU Game completed.
      Black score: 842
      White score: 182

      CPU time: 256678 milliseconds
      GPU time: 88172.8 milliseconds
      Speedup factor: 2.91109
      

Compared to the naive approach (4.17291x speedup for depth 1 tree, 1.41801x speedup for depth 2 tree), the fully parallelized algorithm performs worse for depth 1 and better for depth 2. This seems reasonable, as for depth 1, each block of threads in the naive version only has to search one node anyway, so the fully parallelized version adds the overhead of launching a kernel for each one of these nodes. At depth 2, though, we see a performance improvement because the fully parallelized algorithm is able to parallelize searching subtrees while the naive version is still searching them serially.

However, this approach had some unexpected behavior which we are still not sure how to explain. For depths 1 and 2 on a 16x16 board, the result of the game was the same as the CPU and naive GPU versions. When we tried to increase the depth to 3, though, suddenly the results became nondeterministic - for example, in one of the runs we got the following results:

> ./testgame
      Starting CPU game...
      CPU Game completed.
      Black score: 202
      White score: 54

      Starting GPU game...
      GPU Game completed.
      Black score: 177
      White score: 79

      CPU time: 52733.2 milliseconds
      GPU time: 30744 milliseconds
      Speedup factor: 1.71524
      

and in another we got these:

> ./testgame
      Starting CPU game...
      CPU Game completed.
      Black score: 202
      White score: 54

      Starting GPU game...
      GPU Game completed.
      Black score: 172
      White score: 84

      CPU time: 52623.9 milliseconds
      GPU time: 35965.2 milliseconds
      Speedup factor: 1.46319
      

We’re uncertain why this happens; recursive code is already quite difficult to debug, let alone on the GPU. In spite of this, though, we think that the fully parallelized algorithm is still worth considering because it performs significantly better than the naive algorithm at higher depths (notice that on a 16x16 board at depth 3, the naive algorithm was actually slower than the CPU version). One possibility would be to combine it with the parallelized scoring kernels, which could improve performance even more. Optimally, of course, we would figure out why the behavior becomes unpredictable, but for lack of a better solution, this one is still pretty good.


Conclusion

While our parallelized GPU algorithm was not faster than the CPU for an 8x8 board, it was significantly better than the CPU for larger board sizes. With additional optimizations such as dynamic parallelism of scoring game boards and full dynamic parallelism, we can achieve up to a 10x speedup in comparison to the CPU. Additionally, due to limited CPU/GPU resources we did not run our algorithm at higher depths for the 32x32 and 64x64 game boards, but we would expect even greater speedups there [1]. Thus, using parallelization, we have made substantial improvements over the serialized version.


For Those Who Are Curious...

Players

ExamplePlayer

From a vector of all available moves for a given game board, ExamplePlayer selects a random move.

Player

Uses the alpha-beta pruning algorithm to determine the best move for a specified depth in the search tree.

GPUPlayer

Uses a parallelized version of the alpha-beta pruning algorithm to determine the best move for a specified depth in the search tree.


Naive GPU Speedup

Each game was played with the following players:

CPU: Player (Black) vs. ExamplePlayer (White)
GPU: GPUPlayer (Black) vs. ExamplePlayer (White)

Board Size: 8 x 8 (Depth 1)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 45
      White score: 19

      Starting GPU game...
      GPU Game completed.
      Black score: 45
      White score: 19

      CPU time: 21.8582 milliseconds
      GPU time: 36.6678 milliseconds
      Speedup factor: 0.596116
      
Board Size: 8 x 8 (Depth 2)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 33
      White score: 0

      Starting GPU game...
      GPU Game completed.
      Black score: 33
      White score: 0

      CPU time: 64.9054 milliseconds
      GPU time: 90.4028 milliseconds
      Speedup factor: 0.717958
      
Board Size: 8 x 8 (Depth 3)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 45
      White score: 19

      Starting GPU game...
      GPU Game completed.
      Black score: 45
      White score: 19

      CPU time: 493.099 milliseconds
      GPU time: 1149.2 milliseconds
      Speedup factor: 0.429081
      
Board Size: 8 x 8 (Depth 4)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 54
      White score: 10

      Starting GPU game...
      GPU Game completed.
      Black score: 54
      White score: 10

      CPU time: 2854.51 milliseconds
      GPU time: 4893.72 milliseconds
      Speedup factor: 0.5833
      
Board Size: 8 x 8 (Depth 5)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 56
      White score: 7

      Starting GPU game...
      GPU Game completed.
      Black score: 56
      White score: 7

      CPU time: 45272.3 milliseconds
      GPU time: 35420.2 milliseconds
      Speedup factor: 1.27815
      
Board Size: 8 x 8 (Depth 6)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 37
      White score: 0

      Starting GPU game...
      GPU Game completed.
      Black score: 37
      White score: 0

      CPU time: 76664 milliseconds
      GPU time: 98745.8 milliseconds
      Speedup factor: 0.776377
      
Board Size: 16 x 16 (Depth 1)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 131
      White score: 125

      Starting GPU game...
      GPU Game completed.
      Black score: 131
      White score: 125

      CPU time: 618.67 milliseconds
      GPU time: 426.073 milliseconds
      Speedup factor: 1.45203
      
Board Size: 16 x 16 (Depth 2)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 170
      White score: 86

      Starting GPU game...
      GPU Game completed.
      Black score: 170
      White score: 86

      CPU time: 5400.06 milliseconds
      GPU time: 4884.32 milliseconds
      Speedup factor: 1.10559
      
Board Size: 16 x 16 (Depth 3)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 202
      White score: 54

      Starting GPU game...
      GPU Game completed.
      Black score: 202
      White score: 54

      CPU time: 53614.5 milliseconds
      GPU time: 67410.6 milliseconds
      Speedup factor: 0.795342
      
Board Size: 32 x 32 (Depth 1)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 586
      White score: 438

      Starting GPU game...
      GPU Game completed.
      Black score: 586
      White score: 438

      CPU time: 25252.4 milliseconds
      GPU time: 6051.52 milliseconds
      Speedup factor: 4.17291
      
Board Size: 32 x 32 (Depth 2)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 842
      White score: 182

      Starting GPU game...
      GPU Game completed.
      Black score: 842
      White score: 182

      CPU time: 258019 milliseconds
      GPU time: 181959 milliseconds
      Speedup factor: 1.41801
      
Board Size: 64 x 64 (Depth 1)
> ./testgame 
      Starting CPU game...
      CPU Game completed.
      Black score: 3230
      White score: 866

      Starting GPU game...
      GPU Game completed.
      Black score: 3230
      White score: 866

      CPU time: 788325 milliseconds
      GPU time: 174815 milliseconds
      Speedup factor: 4.50948
      

Dynamic Parallelism of Scoring GPU Times

Board Size: 16 x 16 (Depth 1)
> ./testgame 
      Starting GPU game...
      GPU Game completed.
      Black score: 131
      White score: 125

      GPU time: 342.449 milliseconds
      
Board Size: 16 x 16 (Depth 2)
> ./testgame 
      Starting GPU game...
      GPU Game completed.
      Black score: 170
      White score: 86

      GPU time: 4083.5 milliseconds
      
Board Size: 16 x 16 (Depth 3)
> ./testgame 
      Starting GPU game...
      GPU Game completed.
      Black score: 202
      White score: 54

      GPU time: 58454 milliseconds
      
Board Size: 32 x 32 (Depth 1)
> ./testgame 
      Starting GPU game...
      GPU Game completed.
      Black score: 586
      White score: 438

      GPU time: 4706.16 milliseconds
      
Board Size: 32 x 32 (Depth 2)
> ./testgame 
      Starting GPU game...
      GPU Game completed.
      Black score: 842
      White score: 182

      GPU time: 113887 milliseconds
      
Board Size: 64 x 64 (Depth 1)
> ./testgame 
      Starting GPU game...
      GPU Game completed.
      Black score: 3230
      White score: 866

      GPU time: 78667.8 milliseconds
      

References

1. D. Strnad and N. Guid, "Parallel alpha-beta algorithm on the GPU," CIT, vol. 19, no. 4, pp. 269-274, 2011.
2. Nvidia, “CUDA Dynamic Parallelism Programming Guide,” 2012.

Authors and Contributors

Alexandra Cong (@a-cong) and Cindy Lin (@xcindylin)