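Radix sort:
===========
recursive implementation:

#include <algorithm>

// one possible partition_by_bit (order inside each group is not preserved):
// moves all elements with ZERO bit at position pos to the left of
// elements with the same bit set to ONE,
// returns index of the first element with bit ONE
int partition_by_bit( int* a, int b, int e, int pos ) {
    int* q = std::partition( a+b, a+e,
                             [pos]( int v ) { return ( (v >> pos) & 1 ) == 0; } );
    return static_cast<int>( q - a );
}

void RRS( int* a, int b, int e, int pos ) {
    if ( pos == -1 ) return; // all bits used
    int q = partition_by_bit( a, b, e, pos );
    RRS( a, b, q, pos-1 );   // elements with bit ZERO
    RRS( a, q, e, pos-1 );   // elements with bit ONE
}

// values are assumed non-negative: the top bit of an int is the sign bit,
// so negative values would end up after the non-negative ones
RRS( a, 0, size, sizeof(int)*8 -1 )

Example with 3-bit values:

101 111 011 001 110 100      to start, RRS( a, 0, 6, 3-1 )

011 001 - 101 111 110 100    // split by 2nd bit
-LEFT--   -----RIGHT-----

001 - 011                    // split LEFT by 1st bit
-L1-  -L2-
and
100 101 - 111 110            // split RIGHT by 1st bit
--R1---   --R2---

L1: 001          // 0th bit is ONE, nothing goes left: q=b
L2: 011          // 0th bit is ONE, nothing goes left: q=b
R1: 100 - 101    // split by 0th bit
R2: 110 - 111    // split by 0th bit

Notice that the structure of the algorithm is identical to quicksort: partition, then recurse on two independent halves. We know how to parallelize it.

Straight radix sort:
====================
using digits as blocks:

1) looking at the last digit
16 82 89 63 16a 79 72 75 19 44      (16a is a second element equal to 16)

                                    19
        72              16a         79
        82  63  44  75  16          89
bins
0   1   2   3   4   5   6   7   8   9
0   0   2   1   1   1   2   0   0   3     <- counts
0   0   2   3   4   5   7   7   7   10    <- running sums of the counts

read from bins in bottom up order
82 72 63 44 75 16 16a 89 79 19

2) looking at the next digit (from last)

    19                      79
    16a                     75  89
    16          44      63  72  82
bins
0   1   2   3   4   5   6   7   8   9

read from bins in bottom up order
16 16a 19 44 63 72 75 79 82 89
DONE.

The order is stable: if two elements are equal, they appear in the sorted array in the same order as in the original (16 stays ahead of 16a).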
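The two digit passes above can be sketched in code. This is only an illustration under assumptions: the names `Item`, `digit_pass` and `straight_radix_sort` are made up for this sketch, and every element carries a label so that equal keys (16 and 16a) can be told apart after sorting.

```cpp
#include <string>
#include <vector>

// A record with a sort key and a label; the label plays the role of the
// "a" in 16a and lets us check that equal keys keep their original order.
struct Item {
    int key;
    std::string label;
};

// One stable counting pass on the decimal digit selected by div (1, 10, 100, ...).
void digit_pass(std::vector<Item>& a, int div) {
    std::vector<int> bins(10, 0);
    for (const Item& x : a) ++bins[(x.key / div) % 10];    // count per digit
    for (int d = 1; d < 10; ++d) bins[d] += bins[d - 1];   // running sums
    std::vector<Item> ta(a.size());                        // temporary array
    // Walk right to left: the last element of a bin takes the bin's last slot,
    // so elements with the same digit keep their relative order.
    for (int i = static_cast<int>(a.size()) - 1; i >= 0; --i)
        ta[--bins[(a[i].key / div) % 10]] = a[i];
    a.swap(ta);
}

// Straight radix sort (least significant digit first) for keys in [0, 10^digits).
void straight_radix_sort(std::vector<Item>& a, int digits) {
    int div = 1;
    for (int p = 0; p < digits; ++p, div *= 10) digit_pass(a, div);
}
```

Calling `digit_pass(a, 1)` on the ten elements of the example gives 82 72 63 44 75 16 16a 89 79 19, the same order as reading the bins bottom up, and after the second pass 16 is still ahead of 16a.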
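Typically one will use blocks of bits instead of digits (faster). A sketch for non-negative values, where width is the number of bits in a block:

////////////////////////////////////////////////////////////////////////////////
#include <algorithm>
#include <vector>

// block of len bits of v starting at bit pos (v must be non-negative)
int bits( int v, int pos, int len ) { return (v >> pos) % (1<<len); }

void radix_sort( int* a, int size, int width ) {
    int num_bins   = 1 << width;
    int total_bits = sizeof(int)*8;
    int num_passes = ( total_bits + width - 1 ) / width;
    std::vector<int> ta( size );                      // temporary array
    for ( int pass=0; pass<num_passes; ++pass ) {
        std::vector<int> count( num_bins, 0 );
        for ( int i=0; i<size; ++i ) {                // count elements per bin
            ++count[ bits( a[i], pass*width, width ) ];
        }
        for ( int j=1; j<num_bins; ++j ) {            // running sums
            count[j] += count[j-1];
        }
        for ( int i=size-1; i>=0; --i ) {             // right to left keeps the order stable
            int bin_index = bits( a[i], pass*width, width );
            ta[ --count[bin_index] ] = a[i];
        }
        std::copy( ta.begin(), ta.end(), a );
    }
}
////////////////////////////////////////////////////////////////////////////////

Parallel straight radix sort:
=============================
* Here is an example: one pass on the last digit, 4 threads, each thread gets a chunk of 5 consecutive elements.

13 09 95 84 71 | 29 64 80 05 60 | 91 29 76 37 97 | 26 52 87 14 84    <- array
 3  9  5  4  1 |  9  4  0  5  0 |  1  9  6  7  7 |  6  2  7  4  4    <- last digit

each thread uses its own array of bins:
per thread bins
index  0  1  2  3  4  5  6  7  8  9
T1:    0  1  0  1  1  1  0  0  0  1
T2:    2  0  0  0  1  1  0  0  0  1
T3:    0  1  0  0  0  0  1  2  0  1
T4:    0  0  1  0  2  0  1  1  0  0

Gcounts ( Gcounts[i] = T1[i]+T2[i]+T3[i]+T4[i] )
index  0  1  2  3  4  5  6  7  8  9
       2  2  1  1  4  2  2  3  0  3

Fcounts ( exclusive scan of Gcounts )
index  0  1  2  3  4  5  6  7  8  9
       0  2  4  5  6 10 12 14 17 17

Fcounts[i] is the first index for the group of elements that match digit (bit pattern) "i".

Now each thread adjusts its bins from the first step by looking at Fcounts and at the bins of the previous threads:

for ( int i=0; i<10; ++i ) { // single threaded solution needs "if"s
    // T4 goes first: each line needs the ORIGINAL counts of the threads before it
    if ( T4[i] != 0 ) { T4[i] = Fcounts[i]+T1[i]+T2[i]+T3[i]; }
    if ( T3[i] != 0 ) { T3[i] = Fcounts[i]+T1[i]+T2[i]; }
    if ( T2[i] != 0 ) { T2[i] = Fcounts[i]+T1[i]; }
    if ( T1[i] != 0 ) { T1[i] = Fcounts[i]; }
}

index  0  1  2  3  4  5  6  7  8  9
T1:    0  2  0  5  6 10  0  0  0 17
T2:    0  0  0  0  7 11  0  0  0 18
T3:    0  3  0  0  0  0 12 14  0 19
T4:    0  0  4  0  8  0 13 16  0  0

(T2[0] = 0 is a real position: 80 and 60 go to positions 0 and 1)

if each thread is doing its own part then the "if"s are not needed: a thread with no element for digit i never looks at its bin i.
Each thread then scans its chunk left to right, writes every element to the position stored in the bin of its digit, and increments that bin.

examples (T1[4], T2[4], T3[4] on the right-hand sides are the original counts):
T1 updates position 4 as follows: T1[4] = Fcounts[4] = 6
so that the element ending with 4 found by T1 will be placed at position 6 (there is 1 such element)
T2 updates position 4 as follows: T2[4] = Fcounts[4] + T1[4] = 6+1 = 7
so that the element ending with 4 found by T2 will be placed at position 7 (there is 1 such element)
T3 skips position 4 (it has no element ending with 4)
T4 updates position 4 as follows: T4[4] = Fcounts[4] + T1[4] + T2[4] + T3[4] = 6+1+1+0 = 8
so that the elements ending with 4 found by T4 will be placed at positions 8 and 9 (there are 2 such elements)

Note that the stability requirement is satisfied: threads are given chunks from left to right, so T2 will be placing 64 after T1's 84. Also 14 and 84 found by T4 will be in the right order, since T4 scans its chunk left to right.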
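The four-thread example can be checked with a small sketch. It is an illustration only: the helper names `digit`, `count_chunks`, `counts_to_offsets` and `scatter` are hypothetical, decimal digits stand in for bit blocks, and the chunks are processed one after another where a real implementation would give each chunk to its own thread. The scan here also fills the bins that a thread never reads, which is harmless and removes the need for the "if"s.

```cpp
#include <vector>

// Decimal digit of v selected by div (1 = last digit, 10 = next one, ...).
int digit(int v, int div) { return (v / div) % 10; }

// Step 1: per-chunk counts. bins[t][d] = number of elements of chunk t with digit d.
// Chunk t covers indices [t*chunk, min((t+1)*chunk, n)).
std::vector<std::vector<int>> count_chunks(const std::vector<int>& a, int chunk, int div) {
    int n = static_cast<int>(a.size());
    int threads = (n + chunk - 1) / chunk;
    std::vector<std::vector<int>> bins(threads, std::vector<int>(10, 0));
    for (int t = 0; t < threads; ++t)                  // each t could be its own thread
        for (int i = t * chunk; i < n && i < (t + 1) * chunk; ++i)
            ++bins[t][digit(a[i], div)];
    return bins;
}

// Steps 2 and 3 fused: replace every count by a starting position.
// Afterwards bins[t][d] = Fcounts[d] + counts of chunks 0..t-1 for digit d.
void counts_to_offsets(std::vector<std::vector<int>>& bins) {
    int next = 0;                                      // running exclusive scan
    for (int d = 0; d < 10; ++d)
        for (auto& b : bins) {                         // chunks in left-to-right order
            int c = b[d];
            b[d] = next;
            next += c;
        }
}

// Step 4: every chunk walks its elements left to right and writes each one
// to the next free position of its digit. offsets is taken by value because
// the positions are incremented while scattering.
std::vector<int> scatter(const std::vector<int>& a, int chunk, int div,
                         std::vector<std::vector<int>> offsets) {
    int n = static_cast<int>(a.size());
    std::vector<int> out(n);
    for (int t = 0; t < static_cast<int>(offsets.size()); ++t)   // each t could be its own thread
        for (int i = t * chunk; i < n && i < (t + 1) * chunk; ++i)
            out[offsets[t][digit(a[i], div)]++] = a[i];
    return out;
}
```

For the array of the example with `chunk = 5` and `div = 1`, `counts_to_offsets` yields 6, 7 and 8 as the starting positions for digit 4 in T1, T2 and T4, matching the hand calculation, and `scatter` places 84, 64, 14, 84 in that order at positions 6 to 9. Different chunks never write to the same position, so the scatter needs no locking.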