![]()
Parallel Solution of the Subset-sum Problem: An Empirical Study
Display Full Text |
Download Full Text
329.93 kB PDF file
We investigate the parallelization of an algorithm on three very different architectures. These are: a 128-processor Cray XMT massively multithreaded machine, a 16-processor IBM x3755 shared memory machine and a 240-core NVIDIA FX 5800 graphics processor unit (GPU). The problem we use in our investigation is the well-known subset-sum problem. While this is known to be NP-complete, it is solvable in pseudo-polynomial time, i.e., time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. The hypothesis that we wish to test is that the Cray, with its specialized hardware and large uniform shared memory, is suitable for very large problems, the IBM x3755 is suitable for intermediate sized problems and the NVIDIA FX 5800 can give superior performance only for problems that fit within its modest internal memory.
We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word-level locking that is available on this architecture. For the other two machines we present an alternating word algorithm that can implement an efficient solution. The timings of our respective codes were carefully measured over a comprehensive range of problem sizes. On the Cray XMT we observe very good scaling for large problems and see sustained performance as the problem size increases. However this machine has poor scaling for small problem sizes; it performs best for problem sizes of 1012 bits or more. The IBM x3755 performs very well on medium sized problems, but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 1011 bits. The NVIDIA GPU performs well for problems whose tables t within its 4GB device memory. This corresponds to tables of size approximately 1010/.
The experimental measurements support our hypothesis and show that each machine gives superior performance for distinct ranges of problem sizes and demonstrate the strengths and weaknesses of the three architectures.
Document number: osu1305898281
Permalink: http://rave.ohiolink.edu/etdc/view?acc_num=osu1305898281
This ETD has been downloaded 240 times (through March 2013)
© 2011, all rights reserved.
This open access ETD is published by
Ohio State University and OhioLINK.