DFS Refresher
=============
Depth-first search is an exhaustive search algorithm on trees or graphs.
DFS is often used to traverse all nodes, not just to find the first node
that satisfies some goal criteria.

    DFTraversal( node n )           // traversal - tree
    {
        visit( n );                 // do some work
        for all children c of n
        {
            DFTraversal( c );
        }
    }

    bool DFSearch( node n )         // search - tree
    {
        if ( is_goal( n ) )
            return true;
        for all children c of n
        {
            if ( DFSearch( c ) )
                return true;
        }
        return false;
    }

Each graph may be expanded into a tree, but because of cycles a node can
be visited more than once. Depending on the problem, visiting the same
node again may or may not be a duplicate calculation. Example: A* may
visit the same node again, but on a different path. For the next example
we assume visiting the same node is a duplicate calculation that should
be avoided:

    bool visited[V] = { false };    // init visited array to false

    DFTraversal( node n )           // traversal - graph
    {
        visited[ n ] = true;
        visit( n );                 // do some work
        for all children c of n
        {
            if ( ! visited[ c ] )
                DFTraversal( c );
        }
    }

    bool DFSearch( node n )         // search - graph
    {
        visited[ n ] = true;
        if ( is_goal( n ) )
            return true;
        for all children c of n
        {
            if ( ! visited[ c ] )
            {
                if ( DFSearch( c ) )
                    return true;
            }
        }
        return false;
    }

As always, an iterative version before parallelizing:

    bool  * visited;
    int  ** adj;                    // adjacency matrix
    int     V;                      // number of nodes in graph
    stack   S;                      // nodes or indices

    void DFTraversal()
    {
        for ( int k = 0; k < V; ++k )
            visited[k] = false;
        for ( int k = 0; k < V; ++k )
        {
            S.push( k );            // push ALL nodes
        }
        while ( S.size() > 0 )
        {
            int k = S.pop();
            if ( !visited[k] )
            {
                visited[k] = true;
                /* Do work */
                for ( int i = 0; i < V; ++i )
                {
                    if ( adj[k][i] )
                        S.push( i );
                }
            }
        }
    }

The second loop pushes all nodes - it may be skipped if the graph is
connected. Note that if the order of traversal is NOT important, then

    for ( int i = 0; i < V; ++i ) { /* Do work */ }

will be sufficient.

How to parallelize: first notice the same code structure as in quicksort
and the first radix sort.
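Before moving to the parallel version, here is the sequential iterative
traversal above as a minimal, runnable C++ sketch. The function name
dfs_order and the idea of returning the visit order (instead of the
abstract "do work") are illustrative additions, not part of the notes;
nodes are pushed in reverse so lower-numbered nodes are popped first.

```cpp
#include <stack>
#include <vector>

// Iterative DFS traversal over an adjacency matrix, as in the
// pseudocode above. Returns the order in which nodes are visited.
std::vector<int> dfs_order(const std::vector<std::vector<int>>& adj) {
    int V = (int)adj.size();
    std::vector<bool> visited(V, false);
    std::vector<int> order;
    std::stack<int> S;
    // Push ALL nodes so disconnected components are also covered.
    for (int k = V - 1; k >= 0; --k) S.push(k);
    while (!S.empty()) {
        int k = S.top();
        S.pop();
        if (!visited[k]) {
            visited[k] = true;
            order.push_back(k);              // "do work"
            for (int i = V - 1; i >= 0; --i) // reverse, so neighbor 0 pops first
                if (adj[k][i]) S.push(i);
        }
    }
    return order;
}
```

For a 4-node graph with edges 0->1, 0->2, 1->3 the traversal goes depth
first: 0, then 1, then 1's child 3, and only then back to 2.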
There is only one minor difference - simultaneous access to the array
visited, both reads and writes. To synchronize we may:

1) Add a single lock for the whole array. But since the section of code
   that needs to be synchronized is very small, such a solution will
   become a bottleneck (unless the graph is sparse).

2) Notice that some threads will be reading while some others will be
   writing, so a readers-writers algorithm seems appropriate. But as
   many authors have pointed out, the readers-writers pattern is almost
   identical to a single-lock solution when the reader's section is very
   small - which is exactly the case here.

3) Modulo locks. Allocate a fixed number of locks (the number of threads
   is a good start) and use lock[ index % num_locks ].

4) Note that this code would be an error:

    j = index % num_locks;
    lock( lock[j] );
    isVisited = visited[ index ];
    unlock( lock[j] );
    if ( !isVisited )
    {
        lock( lock[j] );
        visited[ index ] = 1;
        unlock( lock[j] );
        /* Body of if statement */
    }

   since more than one thread can execute the first block, find the node
   unvisited, and then visit it (a duplicate visit). A simple solution
   uses an extra local variable so that the test and the set happen
   under the same lock:

    doVisit = 0;
    j = index % num_locks;
    lock( lock[j] );
    if ( !visited[ index ] )
    {
        doVisit = 1;
        visited[ index ] = 1;
    }
    unlock( lock[j] );
    if ( doVisit )
    {
        /* visit code */
        doVisit = 0;    // prepare for the next iteration
    }

BFS
===
Breadth-first traversal uses a queue instead of a stack, therefore all
previously discussed topics apply to BFS as well.

There is one more topic to discuss
==================================
Q: Is parallel DFS actually depth-first?
A: Not really. Say node 1 is expanded first, and its children 2 and 3
are inserted into the stack:

        1
       / \          stack: 2, 3
      2   3

Then 2 threads get nodes 2 and 3 correspondingly and expand them into
4, 5, 6, 7:

        1
       / \          stack: 4, 5, 6, 7
      2   3

This is more like BFS. The same happens with BFS: a deeper node may be
visited before a node with a smaller depth. Exercise: find an
interleaving that produces this effect.
Back to search
==============
As was pointed out, the difference is that search algorithms have an
"early exit" option: stop when the required node is visited (the goal is
found). We can implement early exit using a special dummy node that is
inserted into the container by

1) a thread that found a goal, or
2) a thread that found a dummy node;

in both cases the thread exits after inserting the dummy node. This is
similar to the quicksort termination scheme.

There is yet one more topic to discuss. A refresher first - properties
of DFS and BFS:

DFS - very small memory requirement; the first goal found is not
      guaranteed to be the best.
BFS - very large memory consumption; the first goal found is guaranteed
      to be the best, i.e. BFS is an optimal algorithm.

Notation:
    b - branching factor
    d - depth of the optimal solution
    m - maximum depth of the tree; d <= m, and m may be infinity
    t - number of threads

Properties of searches:
    DFS: memory (b-1)*m
         time   b^m     - the whole tree of height m may be traversed
    BFS: memory b^(d+1) - all nodes from the level after the solution
                          may be in the queue
         time   b^d     - the whole tree of height d (up to the
                          solution) may be traversed

Now for the parallel versions.

DFS: the small memory requirement property stays. Explanation - because
DFS uses a stack, the deepest node will always be chosen, therefore at
each level each thread can leave at most (b-1) unvisited nodes in the
stack, so the total size of the stack is no more than (b-1) * t * m.
Since b and t are fixed numbers, we can claim a linear memory
requirement.

BFS: check optimality. With the previously proposed implementation we do
NOT have optimality. Assume a binary tree, 2 threads, and 2 goal nodes:
6 and 8. Node 6 is at level 3 and node 8 at level 4, so 6 is the optimal
goal. First, node 1 is expanded and each thread gets a node.
Thread IDs are in parentheses:

            1
           / \
       (1)2   3(2)

Now thread 2 gets stuck, while thread 1 quickly processes nodes 4 and 5
in BFS manner:

            1
           / \
          2   3(2)
         / \
        4   5
                        queue: 8, 9, 10, 11

Thread 1 pops 8:

            1
           / \
          2   3(2)
         / \
        4   5
       /
     8(1)
                        queue: 9, 10, 11

8 is a goal, so thread 1 inserts T (the termination node) and quits:

                        queue: 9, 10, 11, T

Thread 2 continues:
1) adds 6 and 7 AFTER T:    queue: 9, 10, 11, T, 6, 7
2) processes 9, 10, 11 (children 18, 19, 20, 21, 22, 23):
                            queue: T, 6, 7, 18, 19, 20, 21, 22, 23
3) reads T and terminates - node 6 was never processed.

END of the example of non-optimality of BFS with a simple queue.

Solution - use a priority queue with priority = depth. Explanation: a
shallower node such as 6 is now dequeued before T, so when 6 is
finished, 12 will be inserted in front of T. T itself may need a
priority - it should be the actual priority (level) of the goal that
was found.

Uniform cost search
===================
Just switch the priority function of the priority queue to
priority = sum of the costs of the edges to the root.

A*
==
A* is basically uniform cost search with
priority = (sum of the costs of the edges to the root) +
           (heuristic estimate of the cost to the goal).
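The priority-queue fix above (priority = depth, i.e. uniform cost search
with unit edge costs) can be sketched as follows. The function name
shallowest_goal and the tree encoding (node k has children listed in a
vector) are illustrative; the test below uses the binary-heap numbering
of the example, where node k has children 2k and 2k+1.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Best-first search with priority = depth: the priority queue guarantees
// that a shallower goal is dequeued before any deeper one, restoring the
// optimality that the plain-queue interleaving above lost.
int shallowest_goal(int root,
                    const std::vector<std::vector<int>>& children,
                    const std::vector<bool>& is_goal) {
    using Entry = std::pair<int, int>;   // (depth, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    pq.push({0, root});
    while (!pq.empty()) {
        auto [d, n] = pq.top();
        pq.pop();
        if (is_goal[n]) return n;        // the first goal popped is the shallowest
        for (int c : children[n]) pq.push({d + 1, c});
    }
    return -1;                           // no goal reachable
}
```

On the tree from the example (goals 6 and 8), the queue dequeues 6
before 8 regardless of insertion order, so the optimal goal is returned.
Switching the priority to the accumulated edge cost gives uniform cost
search, and adding a heuristic term gives A*.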