남범석 bnam@skku


Post on 03-Nov-2021


Page 1: 남범석 bnam@skku

남범석

Page 2: 남범석 bnam@skku

§ Chapter 11 of Introduction to Parallel Computing
• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
• http://parallelcomp.uw.hu/

Page 3: 남범석 bnam@skku

§ Search possible solutions subject to constraints.

§ A discrete optimization problem can be expressed as (S, f)
• S is the set of all feasible solutions
• f is the cost function

§ Goal: Find a feasible solution xopt such that
• f(xopt) <= f(x) for all x in S (i.e., for all feasible x)

Page 4: 남범석 bnam@skku

§ 8-puzzle problem

• One of the nine positions in the 3x3 grid is empty (the blank).
• A tile can be moved into the blank position from a position adjacent to it, thus creating a blank in the tile's original position.

• The goal is to move from a given initial position to the final position in a minimum number of moves.

Page 5: 남범석 bnam@skku

§ 8-puzzle problem

[Figure: (a) Initial configuration, (b) Final configuration, (c) A sequence of moves]

Page 6: 남범석 bnam@skku

§ Goal: Find a feasible solution xopt such that
• f(xopt) <= f(x) for all x in S (i.e., for all feasible x)
• As a search problem: find a minimum-cost path in a graph from an initial node to a goal node.
• This graph is called a state space.
• Often, it is possible to estimate the cost to reach the goal state from an intermediate state.
– E.g., the sum of the Manhattan distances of the tiles from their goal positions in the 8-puzzle grid.
• For optimality, the estimate should be an underestimate, i.e., it should never exceed the true remaining cost.
• Such an underestimating heuristic is called an admissible heuristic.
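The Manhattan-distance estimate mentioned above can be sketched as follows. This is a minimal illustration, not code from the slides: the tuple encoding of a board (row by row, with 0 for the blank) and the function name `manhattan_distance` are assumptions made for the example.

```python
def manhattan_distance(state, goal, side=3):
    """Sum of |row - goal_row| + |col - goal_col| over all tiles (blank excluded).

    state and goal are tuples of length side*side listing the board row by
    row, with 0 standing for the blank (an assumed encoding).
    """
    goal_pos = {tile: divmod(i, side) for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue  # the blank is not a tile, so it does not contribute
        row, col = divmod(i, side)
        goal_row, goal_col = goal_pos[tile]
        total += abs(row - goal_row) + abs(col - goal_col)
    return total
```

Each move shifts exactly one tile by one position, so this sum can never exceed the number of moves still needed, which is why it is an underestimate.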

Page 7: 남범석 bnam@skku

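§ Discrete optimization problems are generally NP-hard
• The feasible space S is typically very large (e.g., Go).
• Often, we can find suboptimal solutions in polynomial time.

§ Consider real-time problems, where a good-enough solution is needed within a time limit
• robot motion planning
• speech recognition
• task scheduling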

Page 8: 남범석 bnam@skku

§ The state space of a binary (0/1) problem is a tree, while that of the 8-puzzle is a graph.
• A graph can be unfolded into a tree, at the cost of replicating nodes that are reachable along several paths.

§ Search algorithms
• Depth First Search
• Breadth First Search
• Best First Search
• Branch and Bound
– Use cost to determine expansion
• Iterative Deepening A*
– Use cost + heuristic value to determine expansion
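To illustrate "cost + heuristic value" driving the expansion, here is a small sequential sketch of Iterative Deepening A*. The callback names (`is_goal`, `successors`, `h`) and the convention that `successors(node)` yields `(child, step_cost)` pairs are assumptions for this example only.

```python
def ida_star(start, is_goal, successors, h):
    """Return (cost, path) of a cheapest path to a goal, or None if none exists.

    Repeats a depth-first search, each time pruning nodes whose
    f = g + h exceeds the current bound, then raises the bound to the
    smallest f value that was pruned.
    """
    def dfs(path, g, bound):
        node = path[-1]
        f = g + h(node)
        if f > bound:
            return f, None              # report the f value that overshot
        if is_goal(node):
            return g, list(path)
        smallest_overshoot = float("inf")
        for child, step_cost in successors(node):
            if child in path:           # do not revisit nodes on the current path
                continue
            path.append(child)
            value, found = dfs(path, g + step_cost, bound)
            path.pop()
            if found is not None:
                return value, found
            smallest_overshoot = min(smallest_overshoot, value)
        return smallest_overshoot, None

    bound = h(start)
    while True:
        value, found = dfs([start], 0, bound)
        if found is not None:
            return value, found
        if value == float("inf"):
            return None                 # nothing was pruned, so no goal is reachable
        bound = value
```

With an admissible `h`, the first goal reached within the bound is a minimum-cost solution, because every smaller bound has already been exhausted without success.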

Page 9: 남범석 bnam@skku

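§ Due to the nature of the problem, the speedup of parallel search can vary greatly.

§ Ideally, the speedup equals the number of processors.

§ Two anomaly types:
• Acceleration: the speedup exceeds the number of processors, because the parallel search expands fewer nodes than the sequential search.
• Deceleration: the speedup falls below the number of processors, because the parallel search expands more nodes than the sequential search.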

Page 10: 남범석 bnam@skku
Page 11: 남범석 bnam@skku
Page 12: 남범석 bnam@skku

§ How is the search space partitioned across processors?

§ Different subtrees can be searched concurrently.

§ However, subtrees can be very different in size.

§ It is difficult to estimate the size of a subtree rooted at a node.

§ Dynamic load balancing is required.

Page 13: 남범석 bnam@skku

§ The critical issue is the distribution of the search space among processors.

§ Static partitioning of unstructured trees leads to poor load balancing.

Page 14: 남범석 bnam@skku

§ Each processor performs DFS on a disjoint subtree.

§ Unexplored sections of the subtree are stored in the processor's stack.

§ When a processor finishes its subtree, it requests unexplored sections of the tree from other processors.

§ The donor pops a section off its stack and sends it to the requesting processor.

Page 15: 남범석 bnam@skku

§ Work is split by splitting the stack
• How much work should you give to another processor?
• We do not want either of the split stacks to be too small.
• Strategies
– Cut-off depth: nodes beyond this depth are not given away, so that tiny pieces of work are not shipped
– Send only nodes near the bottom of the stack
– Send nodes near the cut-off depth
– Send 1/2 of the nodes between the bottom of the stack and the cut-off depth
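The "send half of the nodes between the bottom and the cut-off" strategy might look like the sketch below. The stack layout is an assumption made for illustration: `stack[d]` lists the untried sibling nodes at depth `d`, with index 0 as the bottom of the stack, and `split_stack` is a hypothetical helper name.

```python
def split_stack(stack, cutoff_depth):
    """Split a DFS stack into (kept, donated) halves.

    stack[d] holds the untried alternatives at depth d (index 0 is the
    bottom). Levels at depth >= cutoff_depth are never given away.
    """
    kept, donated = [], []
    for depth, alternatives in enumerate(stack):
        if depth < cutoff_depth:
            half = len(alternatives) // 2
            donated.append(alternatives[:half])   # first half goes to the requester
            kept.append(alternatives[half:])      # the rest stays with the donor
        else:
            donated.append([])                    # too deep: not worth sending
            kept.append(list(alternatives))
    return kept, donated
```

Taking half of every level below the cut-off is one way to keep both resulting stacks from being too small; the other listed strategies would change only which levels are eligible.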

Page 16: 남범석 bnam@skku

§ Determining a donor processor
• Who do you request more work from?

§ Asynchronous round robin
• Each processor maintains a local counter and makes requests in a round-robin fashion.

§ Global round robin
• The system maintains a global counter and requests are made in a round-robin fashion, globally.

§ Random polling
• Request work from a randomly selected processor.
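The three donor-selection schemes differ only in where the counter lives. The sketch below is a single-process illustration under assumed names (`AsyncRoundRobin`, `GlobalRoundRobin`, `random_polling`); it assumes at least two processors and skips the requester itself, a detail the slides do not specify.

```python
import random

def advance(counter, num_procs, requester):
    """Return (donor, next_counter), skipping the requester itself."""
    donor = counter % num_procs
    if donor == requester:
        donor = (donor + 1) % num_procs
    return donor, (donor + 1) % num_procs

class AsyncRoundRobin:
    """Each processor keeps its own private counter."""
    def __init__(self, rank, num_procs):
        self.rank, self.num_procs = rank, num_procs
        self.counter = (rank + 1) % num_procs

    def next_donor(self):
        donor, self.counter = advance(self.counter, self.num_procs, self.rank)
        return donor

class GlobalRoundRobin:
    """One counter shared by all processors (a real system must serialize access)."""
    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.counter = 0

    def next_donor(self, requester):
        donor, self.counter = advance(self.counter, self.num_procs, requester)
        return donor

def random_polling(requester, num_procs, rng=random):
    """Pick any other processor uniformly at random."""
    return rng.choice([r for r in range(num_procs) if r != requester])
```

The shared counter in the global scheme spreads requests evenly but becomes a point of contention, whereas the other two need no shared state.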

Page 17: 남범석 bnam@skku

[Figure panels: Global Round Robin, Asynchronous Round Robin, Random Polling]

Page 18: 남범석 bnam@skku

§ How do we know when every processor is done?

§ Dijkstra's Token Termination Detection
• Assume all processors are organized in a logical ring
• When idle, send an idle token to the next processor
• When the idle token is received again, all done

§ Tree-Based Termination Detection
• Associate a weight of 1 with the initial work load
• Assign portions of the weight along with portions of the work
• When finished, give the weight portion back
• When processor 0 has a weight of 1 again, all done

Page 19: 남범석 bnam@skku

§ All processors are organized in a logical ring.

§ When processor P0 goes idle, it passes an idle token to P1.

§ If Pi has the token and Pi is idle, it passes the token to Pi+1.

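§ If Pj sends work to processor Pi and j > i, then Pj (the sender) is marked busy, because the token may already have passed Pi.

§ If Pi is marked busy, the token is set to busy before it is sent to Pi+1.

§ If Pi is not marked busy, the token is passed unchanged.

§ After passing the token on, Pi clears its busy mark.

§ The search terminates when processor P0 receives an idle token and is itself idle.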
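A toy simulation of the token rules may help. It is a sketch under simplifying assumptions that are not in the slides: every processor has run out of work by the time the token reaches it, no new work is sent during the rounds, and `marks[i]` records whether Pi sent work to a lower-numbered processor since the token last passed it. All names are hypothetical.

```python
IDLE, BUSY = "idle", "busy"

def circulate_token(marks):
    """Send the token once around the ring P0 -> P1 -> ... -> P0.

    Returns the token's value when it arrives back at P0 and clears
    each processor's mark as the token leaves it.
    """
    marks[0] = IDLE              # P0 starts a round with a fresh idle token
    token = IDLE
    for i in range(1, len(marks)):
        if marks[i] == BUSY:
            token = BUSY         # a marked processor taints the token
        marks[i] = IDLE          # the mark is cleared once the token moves on
    return token

def rounds_until_termination(marks, max_rounds=10):
    """Number of token rounds P0 needs before it sees an idle token."""
    for round_no in range(1, max_rounds + 1):
        if circulate_token(marks) == IDLE:
            return round_no
    return None
```

If P3 handed work to P1 after the token had passed P1, the first round comes back busy and P0 has to start a second round before it can declare termination.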

Page 20: 남범석 bnam@skku

[Figure: token passing on a ring of five processors (0 to 4), shown in five steps. Legend: active, inactive, token, work]

Page 21: 남범석 bnam@skku

§ Problem: Fast token and slow work
• Suppose process i sends work to process i + 4
• Suppose the work message takes a long time to get there
• In the meantime, process i becomes idle.
• Process i now receives an idle token
• Process i passes on the idle token
• Process i + 4 receives the idle token before the work message
• Process i + 4 will also pass on an idle token
• The idle token will now arrive at P0, falsely signaling termination

Page 22: 남범석 bnam@skku

§ Solution: Message Counts
• Send message counts along with the token
• Initially, all processes are idle and have a message count of 0
• A process increments its count whenever it sends a message and decrements its count whenever it receives one
– The sum of all message counts is zero iff all messages have been delivered
• The token sums the message counts as it is passed.

Page 23: 남범석 bnam@skku

§ If P0 is idle, it sends an idle token with its message count to P1

§ If a process sends or receives a message, it turns busy

§ Pi keeps the token as long as it is busy. When it turns idle:
• If Pi has been busy since it last forwarded the token, change the token to busy.
• Otherwise the token color is unchanged
• Add Pi's message count to the token
• Forward the token
• Change Pi's state to idle

§ If P0 receives a busy token, try again.

§ If P0 receives an idle token
• The token has passed through only idle processes
• However, a message may still be in flight
– In that case the token's message count will be non-zero
• If the message count is zero, terminate all processes
• Otherwise try again
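The message-count rule can be exercised with a small single-process sketch. The class and function names (`Proc`, `token_round`, `can_terminate`) are hypothetical, and the simulation assumes every process is out of work when the token reaches it.

```python
class Proc:
    def __init__(self):
        self.count = 0          # messages sent minus messages received
        self.was_busy = False   # sent or received since the token last passed

    def send(self):
        self.count += 1
        self.was_busy = True

    def receive(self):
        self.count -= 1
        self.was_busy = True

def token_round(procs):
    """Pass the token once around the ring; return (token_is_busy, count_sum)."""
    token_busy, token_count = False, 0
    for proc in procs:
        if proc.was_busy:
            token_busy = True           # the token turns busy
        token_count += proc.count       # the token accumulates message counts
        proc.was_busy = False           # state resets once the token is forwarded
    return token_busy, token_count

def can_terminate(procs):
    """P0's decision after one full round of the token."""
    token_busy, token_count = token_round(procs)
    return not token_busy and token_count == 0
```

A work message that is still in flight leaves the sum at +1, so P0 keeps retrying even though every process it visited looked idle.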

Page 24: 남범석 bnam@skku

§ Associate weights with individual work pieces.

§ Initially, P0 has all the work and a weight of one.

§ When work is partitioned, the weight is split.

§ When a processor finishes its work, it sends its weight back to the processor it received the work from.

§ Terminate when the weight at P0 becomes 1 again.
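The weight bookkeeping can be sketched as below. Splitting the weight in half on every transfer and letting each worker receive work only once are simplifying assumptions for this example, and `Worker` is a hypothetical class. `fractions.Fraction` is used so that repeatedly halved weights stay exact instead of being rounded away.

```python
from fractions import Fraction

class Worker:
    def __init__(self, weight=Fraction(0)):
        self.weight = weight
        self.parent = None      # the processor this worker got its work from

    def give_work_to(self, other):
        """Hand over half of the current weight together with the work."""
        half = self.weight / 2
        self.weight -= half
        other.weight += half
        other.parent = self

    def finish(self):
        """Return the held weight to the parent when the work is done."""
        if self.parent is not None:
            self.parent.weight += self.weight
            self.weight = Fraction(0)
```

In this sketch P0 would check `weight == 1` once its own work is finished; any weight still held elsewhere means some processor is still working.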

Page 25: 남범석 bnam@skku
Page 26: 남범석 bnam@skku

§ A heuristic is used to direct the search

§ Maintains 2 lists
• Open
– Nodes not yet searched
– Sorted by heuristic value
• Closed
– Expanded nodes

§ Concurrent processors pick the most promising nodes from the open list
• Newly generated nodes are placed back on the open list

§ Centralized Strategy

Page 27: 남범석 bnam@skku

[Figure: centralized strategy. A global list is maintained at a designated processor. Each processor repeatedly locks the list, picks the best node from the list, unlocks the list, expands the node to generate successors, then locks the list again, places the generated nodes in the list, and unlocks the list.]

Page 28: 남범석 bnam@skku

§ Termination condition
• A processor may find a solution but not the best solution, so the search cannot simply stop at the first solution found.

§ Centralization leads to congestion
• The open list must be locked when accessed
• Each processor locks this queue, extracts the best node, and unlocks it.
• Successors of the node are generated, their heuristic functions estimated, and the nodes inserted into the open list as necessary after appropriate locking.

§ The open list is a point of contention.
• How to avoid the contention?
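The lock, extract, unlock cycle can be sketched with a heap guarded by a mutex. This is a shared-memory illustration with Python threads, not the slides' implementation; `SharedOpenList` is a hypothetical name and entries are assumed to be `(heuristic_value, node)` tuples.

```python
import heapq
import threading

class SharedOpenList:
    """A single open list shared by all workers and protected by one lock."""

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()

    def push_all(self, entries):
        """Insert newly generated (heuristic_value, node) entries."""
        with self._lock:                      # lock the list
            for entry in entries:
                heapq.heappush(self._heap, entry)
        # the lock is released on leaving the with-block

    def pop_best(self):
        """Remove and return the entry with the smallest heuristic value."""
        with self._lock:
            return heapq.heappop(self._heap) if self._heap else None
```

Every `push_all` and `pop_best` from every worker goes through the same lock, which is exactly the contention the slide refers to.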

Page 29: 남범석 bnam@skku

§ Let each processor maintain its own open list

§ Initially, the search space is statically divided across these open lists.

§ Processors concurrently operate on these open lists.

§ The heuristic values in these lists may diverge significantly.

§ We must periodically balance the quality of nodes in each list.

§ A number of balancing strategies based on ring, blackboard, or random communications are possible.

Page 30: 남범석 bnam@skku

§ Random
• Periodically send some of the best nodes to a random processor

§ Ring
• Periodically exchange best nodes with neighbors

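§ Blackboard
• Select the best node from the local open list and compare it with the best node on a shared blackboard, moving nodes to or from the blackboard when their quality differs significantly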
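As an illustration of the ring strategy, the sketch below performs one exchange step over per-processor open lists held as heaps of heuristic values. Sending in one direction only (to the right-hand neighbor) and the name `ring_exchange` are simplifications assumed for the example.

```python
import heapq

def ring_exchange(open_lists, k=1):
    """Each processor i sends its k best nodes to processor (i + 1) mod p.

    open_lists[i] is processor i's open list, kept as a heapq heap in
    which a smaller value means a more promising node.
    """
    p = len(open_lists)
    outgoing = []
    for heap in open_lists:
        send_count = min(k, len(heap))
        outgoing.append([heapq.heappop(heap) for _ in range(send_count)])
    # Deliver only after everyone has chosen, so a node moves one hop per step.
    for i, nodes in enumerate(outgoing):
        for node in nodes:
            heapq.heappush(open_lists[(i + 1) % p], node)
```

Repeating this step periodically lets good nodes drift around the ring, which evens out the quality of the nodes each processor is working on.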

Page 31: 남범석 bnam@skku

§ Problem: node replication
• Graph search involves a closed list, where the major operation is a lookup

§ Possible solution:
• Assign each node to a processor using a hash function
• Whenever a node is generated, check to see if it has already been searched
• If a node does not exist in the closed list, it is inserted into the open list at the target of the hash function.
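The hash-based assignment can be sketched as follows. The names `owner`, `Processor`, and `route` are hypothetical; SHA-256 over `repr(state)` is just one deterministic choice of hash function, and also checking the open list for duplicates is an extra assumption beyond the closed-list check described above.

```python
import hashlib

def owner(state, num_procs):
    """Map a state to the processor responsible for it."""
    digest = hashlib.sha256(repr(state).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_procs

class Processor:
    def __init__(self):
        self.open = []        # nodes waiting to be expanded
        self.closed = set()   # nodes already expanded

    def receive(self, state):
        """Insert a generated state unless this processor has seen it already."""
        if state in self.closed or state in self.open:
            return False
        self.open.append(state)
        return True

    def expand_next(self):
        state = self.open.pop()
        self.closed.add(state)
        return state

def route(state, procs):
    """Send a newly generated state to its owner; True if it was new."""
    return procs[owner(state, len(procs))].receive(state)
```

Because every copy of a state hashes to the same processor, the duplicate check is a purely local lookup and no global closed list is needed.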