남범석 bnam@skku


§ Chapter 11. Introduction to Parallel Computing
• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
• http://parallelcomp.uw.hu/

§ Search possible solutions subject to constraints.

§ A discrete optimization problem can be expressed as a tuple (S, f)
• S is the set of all feasible solutions
• f is the cost function

§ Goal: Find a feasible solution xopt such that
• f(xopt) <= f(x) for all x in S (i.e., for all feasible x)

§ 8-puzzle problem

• One of the nine positions in the 3x3 grid is empty (the blank).
• A tile can be moved into the blank position from a position adjacent to it, thus creating a blank in the tile’s original position.
• The goal is to move from a given initial configuration to the final configuration in a minimum number of moves.

[Figure: 8-puzzle problem. (a) Initial configuration; (b) final configuration; (c) a sequence of moves leading from the initial to the final configuration.]

§ Goal: Find a feasible solution xopt such that
• f(xopt) <= f(x) for all x in S (i.e., for all feasible x)
• This can be reformulated as finding a minimum-cost path in a graph from an initial node to a goal node.
• This graph is called a state space.
• Often, it is possible to estimate the cost to reach the goal state from an intermediate state.
– E.g., the sum of the Manhattan distances of each tile from its goal position in the 8-puzzle grid.
• For the search to guarantee an optimal solution, the estimate should be an underestimate (it must never exceed the true remaining cost).
• A heuristic that always underestimates is called an admissible heuristic.

§ NP-hard
• The feasible space S is typically very large (e.g., Go).
• Often, we can find suboptimal solutions in polynomial time.

§ Consider real-time problems
• robot motion planning
• speech recognition
• task scheduling

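§ The state space of a 0/1 (binary) integer programming problem is a tree, while that of the 8-puzzle is a graph.
• A graph can be unfolded into a tree, at the cost of duplicating nodes that are reachable by more than one path.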

§ Search algorithms
• Depth First Search
• Breadth First Search
• Best First Search
• Branch and Bound
– Use cost to determine expansion
• Iterative Deepening A*
– Use cost + heuristic value to determine expansion
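As one illustration of "use cost to determine expansion", here is a minimal sketch of depth-first branch and bound on a small tree. The tree, the callback names `children` and `is_goal`, and the assumption of non-negative edge costs are all hypothetical choices for the example.

```python
import math

def dfbb(node, cost_so_far, best, children, is_goal):
    """Depth-first branch and bound.

    Prunes any path whose cost already reaches the best solution cost
    known so far. Assumes a finite tree with non-negative edge costs.
    Returns the cheapest goal cost found (or `best` if none is cheaper).
    """
    if cost_so_far >= best:
        return best                      # bound: cannot improve on best
    if is_goal(node):
        return cost_so_far               # new, cheaper solution
    for child, step_cost in children(node):
        best = dfbb(child, cost_so_far + step_cost, best, children, is_goal)
    return best

# Hypothetical search tree: node -> list of (child, edge cost).
TREE = {"r": [("a", 1), ("b", 2)], "a": [("g1", 5)], "b": [("g2", 1)],
        "g1": [], "g2": []}

best_cost = dfbb("r", 0, math.inf, lambda n: TREE[n], lambda n: n.startswith("g"))
print(best_cost)  # 3 (path r -> b -> g2)
```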

§ Due to the nature of the problem, speedup can vary greatly from one execution to another.

§ Ideally, the speedup equals the number of processors p.

§ Two anomaly types:
• Acceleration: speedup greater than p
• Deceleration: speedup less than p
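A toy model can make the two anomalies concrete. It assumes (purely for illustration) that the search reduces to scanning `n_leaves` leaves left to right, that each of `p` processors scans one contiguous chunk in lockstep, and that everyone stops as soon as any processor reaches the goal leaf.

```python
def serial_steps(goal_leaf):
    """Leaves examined by a left-to-right serial search (leaves are 0-indexed)."""
    return goal_leaf + 1

def parallel_steps(goal_leaf, n_leaves, p):
    """Lockstep steps until the processor owning the goal's chunk reaches it.

    Assumes n_leaves is divisible by p.
    """
    chunk = n_leaves // p
    return goal_leaf % chunk + 1

def speedup(goal_leaf, n_leaves, p):
    return serial_steps(goal_leaf) / parallel_steps(goal_leaf, n_leaves, p)

print(speedup(8, 16, 2))  # 9.0: acceleration anomaly (more than p = 2)
print(speedup(7, 16, 2))  # 1.0: deceleration anomaly (less than p = 2)
```

The only difference between the two runs is where the goal happens to sit relative to the partition boundary.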

§ How is the search space partitioned across processors?

§ Different subtrees can be searched concurrently.

§ However, subtrees can be very different in size.

§ It is difficult to estimate the size of a subtree rooted at a node.

§ Dynamic load balancing is required.

§ The critical issue is the distribution of the search space.

§ Static partitioning of unstructured trees leads to poor load balancing.

§ Each processor performs DFS on a disjoint subtree.

§ Unexplored sections are stored in the stack.

§ When a processor finishes its own subtree, it requests unexplored sections of the tree from other processors.

§ The donor pops a section off its stack and gives it to the requesting processor.

§ Work is split by splitting the stack
• How much work should you give to another processor?
• We do not want either of the split stacks to be too small
• Strategies
– Cut-off depth: nodes beyond this depth are not given away, so that very small pieces of work are not sent
– Send only nodes near the bottom of the stack
– Send nodes near the cut-off depth
– Send 1/2 of the nodes between the bottom of the stack and the cut-off depth
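The last strategy (send half of the nodes between the bottom of the stack and the cut-off depth) might look like the sketch below. The stack layout, one list of untried nodes per depth with index 0 at the bottom, is an assumption for illustration.

```python
def split_stack(stack, cutoff_depth):
    """Split a DFS stack into (kept, donated).

    stack[d] holds the untried nodes at depth d; index 0 is the bottom of
    the stack (closest to the root). Half of the untried nodes at each
    depth below cutoff_depth are donated; deeper levels are never given away.
    """
    kept, donated = [], []
    for depth, nodes in enumerate(stack):
        if depth < cutoff_depth:
            half = len(nodes) // 2
            donated.append(nodes[:half])
            kept.append(nodes[half:])
        else:
            donated.append([])
            kept.append(list(nodes))
    return kept, donated

stack = [["a", "b"], ["c", "d", "e", "f"], ["g", "h"]]
kept, donated = split_stack(stack, cutoff_depth=2)
print(donated)  # [['a'], ['c', 'd'], []]
print(kept)     # [['b'], ['e', 'f'], ['g', 'h']]
```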

§ Determining a donor processor
• Who do you request more work from?

§ Asynchronous round robin
• Each processor maintains a local counter and makes requests in a round-robin fashion.

§ Global round robin
• The system maintains a global counter and requests are made in a round-robin fashion, globally.

§ Random polling
• Request work from a randomly selected processor.
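The three donor-selection schemes differ only in where the "next target" state lives. The sketch below is a single-process illustration with hypothetical class and function names; it assumes at least two processors, and skipping oneself as a target is a choice made here rather than something the slides specify.

```python
import random

class AsyncRoundRobin:
    """Each processor keeps its own local counter."""
    def __init__(self, rank, p):
        self.rank, self.p = rank, p
        self.target = (rank + 1) % p

    def next_donor(self):
        donor = self.target
        self.target = (self.target + 1) % self.p
        if self.target == self.rank:          # never ask yourself
            self.target = (self.target + 1) % self.p
        return donor

class GlobalRoundRobin:
    """One counter shared by all processors (a contention point in practice)."""
    def __init__(self, p):
        self.p, self.target = p, 0

    def next_donor(self, rank):
        donor = self.target
        self.target = (self.target + 1) % self.p
        return self.next_donor(rank) if donor == rank else donor

def random_polling(rank, p, rng=random):
    """Pick uniformly among the other p - 1 processors."""
    donor = rng.randrange(p - 1)
    return donor if donor < rank else donor + 1

a = AsyncRoundRobin(rank=1, p=4)
print([a.next_donor() for _ in range(4)])  # [2, 3, 0, 2]
```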

[Figure panels: Global Round Robin, Asynchronous Round Robin, Random Polling]

§ How do we know when every processor is done?

§ Dijkstra's Token Termination Detection
• Assume all processors are organized in a logical ring
• When idle, send an idle token to the next processor
• When P0 receives the idle token back, all processors are done

§ Tree-Based Termination Detection
• Associate a weight of 1 with the initial work load
• Assign portions of the weight along with portions of the work
• When finished, give the weight portion back
• When processor 0 has a weight of 1 again, all processors are done

§ All processors are organized in a logical ring.

§ When processor P0 goes idle, it passes an idle token to P1.

§ If Pi has the token and Pi is idle, it passes the token to Pi+1.

§ If Pj sends work to processor Pi and j > i, then Pj is marked busy (the token may already have passed Pi).

§ If Pi is marked busy, the token is set to busy before it is sent to Pi+1, and Pi's mark is then cleared.

§ If Pi is not marked, the token is passed unchanged.

§ When processor P0 receives an idle token and is itself idle, the computation has terminated; if the token comes back busy, P0 starts another round.
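One trip of the token around the ring can be simulated in a few lines. This is a simplified, single-process sketch: the `marked` list and the function name are hypothetical, message passing is not modeled, and each processor is assumed to forward the token only once it has become idle.

```python
def token_round(marked):
    """Pass the token once around the ring P0 -> P1 -> ... -> P0.

    marked[i] is True if Pi sent work to a lower-numbered processor since
    it last forwarded the token. Returns True if the token comes back to
    P0 still 'idle', i.e. termination can be declared.
    """
    token_busy = False
    for i in range(len(marked)):
        if marked[i]:
            token_busy = True      # a marked processor taints the token
        marked[i] = False          # forwarding the token clears the mark
    return not token_busy

marked = [False, False, True, False]   # P2 sent work back to P0 or P1
print(token_round(marked))  # False: the token returned busy, try again
print(token_round(marked))  # True: a clean round, safe to terminate
```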

[Figure: token passing on a ring of five processors (0 to 4). Legend: active, inactive, token, work.]

§ Problem: Fast token and slow work
• Suppose process i sends work to process i + 4
• Suppose the work message takes a long time to get there
• In the meantime, process i becomes idle
• Process i now receives an idle token
• Process i passes on the idle token
• Process i + 4 receives the idle token before the work message
• Process i + 4 will also pass on an idle token
• The idle token will now arrive at P0, signaling termination while the work message is still in flight

§ Solution: Message Counts
• Send message counts along with the token
• Initially, all processes are idle and have a message count of 0
• A process increments its count whenever it sends a message and decrements it whenever it receives one
– The sum of all message counts is zero iff all messages have been delivered
• The token sums the message counts as it is passed.

§ If P0 is idle, it sends an idle token with its message count to P1

§ If a process sends or receives a message, it is marked busy

§ Pi keeps the token as long as it is still working. When it runs out of work:
• If Pi is marked busy, change the token to busy
• Otherwise the token color is unchanged
• Add Pi's message count to the token
• Forward the token
• Change Pi's mark back to idle

§ If P0 receives a busy token, try again.

§ If P0 receives an idle token
• The token has passed through only idle processes
• However, a message may still be in flight
– In that case the token’s message count will be non-zero
• If the message count is zero, terminate all processes
• Otherwise try again
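The message-count variant can be sketched the same way as a plain token round, with the token now carrying a running sum. The names `marked` and `counts` are hypothetical, and the simulation assumes each process forwards the token only after it has run out of work.

```python
def token_round_with_counts(marked, counts):
    """One trip of the token around the ring, started by P0.

    marked[i]: Pi sent or received a message since it last forwarded the token.
    counts[i]: messages sent by Pi minus messages received by Pi.
    Returns True only if the token returns idle AND the summed count is zero.
    """
    token_busy, token_count = False, 0
    for i in range(len(marked)):
        token_busy = token_busy or marked[i]
        token_count += counts[i]
        marked[i] = False
    return (not token_busy) and token_count == 0

# P1 has sent work to P4, but the message is still in flight.
marked = [False, True, False, False, False]
counts = [0, 1, 0, 0, 0]
print(token_round_with_counts(marked, counts))  # False: busy token
print(token_round_with_counts(marked, counts))  # False: idle token, but count is 1

counts[4] -= 1          # the message finally arrives at P4
marked[4] = True
print(token_round_with_counts(marked, counts))  # False: P4 is marked busy
print(token_round_with_counts(marked, counts))  # True: idle token, count 0
```

The second round is exactly the "fast token, slow work" case: every process looks idle, and only the non-zero count prevents a premature termination.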

§ Associate weights with individual work pieces.

§ Initially, P0 has all the work and a weight of one.

§ When work is partitioned, the weight is split.

§ When a processor finishes its piece of work, it sends the weight back to its parent (the processor that gave it the work).

§ Terminate when the weight at P0 becomes 1 again.
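A minimal sketch of the weight bookkeeping follows. The `Worker` class is hypothetical, the even split is just one possible choice, and the example assumes children finish before their parents. Exact rationals (`fractions.Fraction`) are used here so that repeated halving does not lose precision, which would be a concern with floating-point weights.

```python
from fractions import Fraction

class Worker:
    def __init__(self, weight=0):
        self.weight = Fraction(weight)
        self.parent = None

    def give_work_to(self, other):
        """Hand over half of the current weight along with part of the work."""
        half = self.weight / 2
        self.weight -= half
        other.weight += half
        other.parent = self

    def finish(self):
        """Return the held weight to the processor the work came from."""
        if self.parent is not None:
            self.parent.weight += self.weight
            self.weight = Fraction(0)

p0, p1, p2 = Worker(1), Worker(), Worker()
p0.give_work_to(p1)      # p0: 1/2, p1: 1/2
p1.give_work_to(p2)      # p1: 1/4, p2: 1/4
p2.finish()              # p1: 1/2
p1.finish()              # p0: 1
print(p0.weight == 1)    # True: all work is done
```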

§ Heuristic is used to direct the search

§ Maintains 2 lists
• Open
– Nodes generated but not yet expanded
– Sorted by heuristic value
• Closed
– Expanded nodes
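A sequential sketch of best-first search with the two lists is given below, in the A* style where nodes are ordered by cost so far plus heuristic estimate. The graph, the `neighbors` and `h` callbacks, and the use of a binary heap for the open list are assumptions for illustration. With `h` returning 0 everywhere, as in the usage example, the search degenerates to uniform-cost search.

```python
import heapq

def best_first_search(start, goal, neighbors, h):
    """Return the cost of the path found from start to goal, or None.

    neighbors(n) yields (successor, edge_cost); h(n) estimates the cost
    remaining from n to the goal.
    """
    open_list = [(h(start), 0, start)]   # entries are (g + h, g, node)
    closed = set()                       # nodes already expanded
    while open_list:
        _, g, node = heapq.heappop(open_list)   # most promising node
        if node == goal:
            return g
        if node in closed:
            continue
        closed.add(node)
        for succ, cost in neighbors(node):
            if succ not in closed:
                heapq.heappush(open_list, (g + cost + h(succ), g + cost, succ))
    return None

# Hypothetical weighted graph.
GRAPH = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)],
         "C": [("D", 1)], "D": []}

print(best_first_search("A", "D", lambda n: GRAPH[n], lambda n: 0))  # 3
```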

§ Concurrent processors pick the most promising nodes from the open list
• Newly generated nodes are placed back on the open list

§ Centralized Strategy

[Figure: centralized strategy. A global open list is maintained at a designated processor. Each processor repeatedly locks the list, picks the current best node from the list, unlocks the list, expands the node to generate successors, then locks the list again, places the generated nodes in the list, and unlocks it.]

§ Termination condition
• A processor may find a solution that is not the best solution, so the search cannot simply stop at the first solution found.

§ Centralization leads to congestion
• The open list must be locked when accessed
• Each processor locks this queue, extracts the best node, and unlocks it.
• Successors of the node are generated, their heuristic values are estimated, and the nodes are inserted into the open list as necessary after appropriate locking.

§ The open list is a point of contention.
• How to avoid the contention?

§ Let each processor maintain its own open list

§ Initially, the search space is statically divided across these open lists.

§ Processors concurrently operate on these open lists.

§ The heuristic values in these lists may diverge significantly.

§ We must periodically balance the quality of nodes in each list.

§ A number of balancing strategies based on ring, blackboard, or random communications are possible.

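§ Random
• Periodically send some of the best nodes to a random processor

§ Ring
• Periodically exchange best nodes with neighbors

§ Blackboard
• Select the best node from the local open list and compare it with the best node on a shared blackboard; nodes are sent to or taken from the blackboard when the two differ widely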

§ Problem: node replication
• Graph search involves a closed list, where the major operation is a lookup

§ Possible solution:
• Assign each node to a processor using a hash function
• Whenever a node is generated, check to see if it has already been searched
• If a node does not exist in the closed list, it is inserted into the open list at the target of the hash function.
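The hash-based assignment can be sketched as follows. The `owner`, `Processor`, and `route` names are hypothetical, and `zlib.crc32` over the state's `repr` is just one deterministic hash chosen for the example. Checking the open list as well as the closed list is an extra step added here to avoid queuing the same state twice.

```python
import zlib

def owner(state, p):
    """Map a state to one of p processors with a deterministic hash."""
    return zlib.crc32(repr(state).encode()) % p

class Processor:
    def __init__(self):
        self.open = []        # states waiting to be expanded
        self.closed = set()   # states already expanded

    def receive(self, state):
        """Queue the state unless it was already expanded or is already queued."""
        if state in self.closed or state in self.open:
            return False
        self.open.append(state)
        return True

def route(state, processors):
    """Send a newly generated state to the processor that owns it."""
    return processors[owner(state, len(processors))].receive(state)

procs = [Processor() for _ in range(4)]
state = (1, 2, 3, 4, 5, 6, 7, 0, 8)
print(route(state, procs))  # True: first time this state is seen
print(route(state, procs))  # False: the duplicate is caught at its owner
```

Because every copy of a state hashes to the same processor, the duplicate check is a purely local lookup.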
