parallelisation+comment

Upload: prajwal-aradhya

Post on 07-Apr-2018


TRANSCRIPT

  • 8/3/2019 Parallelisation+Comment (page 1/3; original slides dated 12/7/09)

    Parallelisation

    More is better?

    Intuitively, increasing the number of processors decreases the time taken to complete a task (as long as the task can be shared effectively between the processors).

    Think of it as the old "If it takes 10 minutes for 1 person to wash 1 car, how many minutes will it take for 10 people to wash 1 car?" problem. In theory it should only take a tenth of the time, i.e. 1 minute, to complete the entire task.

    Scale-up

    The question could have been re-phrased another way: how long will it take 10 people to wash 10 cars? (Answer: the same time, 10 minutes.) That is, as the overall task gets larger, increasing the number of workers allows the larger task to be completed in the same time. This is an example of scale-up.

    Unfortunately, the 10 workers may get in each other's way, and this may cause the time taken to be slightly greater than 10 minutes, or the number of cars washed to be less than 10.
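    The car-wash arithmetic can be sketched directly. A minimal sketch (the function names are illustrative, not from the slides):

```python
def ideal_parallel_time(serial_time, workers):
    """Ideal speed-up: the task divides perfectly among the workers."""
    return serial_time / workers

def scaled_time(time_per_unit, units, workers):
    """Ideal scale-up: the workload grows along with the worker count."""
    return time_per_unit * units / workers

# 1 person, 1 car: 10 minutes; 10 people, 1 car: ideally 1 minute.
print(ideal_parallel_time(10, 10))   # 1.0
# 10 people, 10 cars: the same 10 minutes as 1 person washing 1 car.
print(scaled_time(10, 10, 10))       # 10.0
```

    Real systems fall short of these ideals exactly because of the interference described above.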

    Architecture

    [Diagram: three multiprocessor architectures built from CPUs, memory modules (M), disks and an interconnecting network — shared nothing (each CPU has its own memory and disk, connected to the others only by the network), shared memory (all CPUs access a common memory over the network, each with its own disk), and shared disk (each CPU has private memory, but all disks are reachable over the network).]


    Which is best?

    In all three architectures, the shared component can become a bottleneck: as more work needs to be done, each CPU will start to contend/interfere with the other CPUs for the shared component. Any system that allows the processors to spend more time working and less time waiting will be better.

    Shared nothing architectures only contend for network time and are therefore less likely to suffer from interference than the other two architectures.

    Data partitioning

    Parallel query evaluation involves giving each individual processor a task to complete which, when added to the work of the other processors, yields the total query result.

    For example: We want to know who lives in a detached house.

    We have the following two relations:

    R1 (name, dateOfBirth, houseType*) — name and dateOfBirth are indexed.
    R2 (houseType, numberRooms, detached)

    R1 contains 5 million individuals and R2 contains 50 house types.

    For a single processor, R2 would be loaded into memory and then R1 would be scanned through for matching houseTypes.
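    The single-processor plan can be sketched as follows; the toy rows stand in for the 5-million-row R1 and 50-row R2 of the example:

```python
# (houseType, numberRooms, detached) — the small relation, loaded into memory
r2 = [("semi", 5, False), ("det3", 6, True)]
# (name, dateOfBirth, houseType) — the large relation, scanned once
r1 = [("ann", "1970-01-01", "det3"),
      ("bob", "1980-02-02", "semi")]

detached_types = {h for (h, _, d) in r2 if d}   # R2 fits in memory
matches = [name for (name, _, h) in r1 if h in detached_types]
print(matches)   # ['ann']
```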

    In a multiprocessor environment, each processor would load R2 into memory, BUT would only take PART of R1. For example, with 26 processors, each processor could take the partition of R1 whose names start with a particular letter of the alphabet. This strategy should produce a fairly even load for each processor and keep the interference to a minimum (except for the initial

    demand for R2)
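    This partitioning strategy can be sketched as follows; the buckets stand in for processors, and the data is illustrative:

```python
from collections import defaultdict

r2 = [("semi", 5, False), ("det3", 6, True)]
r1 = [("ann", "1970-01-01", "det3"), ("bob", "1980-02-02", "semi"),
      ("al",  "1990-03-03", "det3")]

# Partition R1 on the first letter of name; each "processor" (bucket)
# also gets its own replicated copy of the small relation R2.
partitions = defaultdict(list)
for row in r1:
    partitions[row[0][0]].append(row)

detached_types = {h for (h, _, d) in r2 if d}   # replicated to all workers

# Each partition is processed independently; the union of the partial
# results is the full answer.
result = sorted(name for part in partitions.values()
                for (name, _, h) in part if h in detached_types)
print(result)   # ['al', 'ann']
```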


    Producing effective parallel query plans requires that:

    • the interference is kept to a minimum (distribute the small relation to all processors);

    • the work is balanced over the processors such that they all complete their tasks in roughly the same amount of time (to maintain scale-up).
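    The balance requirement can be quantified with a simple skew metric. A sketch, using the common "largest partition over mean partition" ratio (the function name is illustrative):

```python
def skew(partition_sizes):
    """Ratio of the largest partition to the mean partition size.
    1.0 means perfectly balanced; larger values mean more skew."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

print(skew([100, 100, 100, 100]))   # 1.0 — balanced, scale-up maintained
print(skew([370, 10, 10, 10]))      # 3.7 — one worker does almost everything
```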


    Partitioning methods

    Hash: a hash function is chosen which will split the data evenly into a pre-defined number of buckets (partitions).

    Round-Robin: each tuple is delivered to a different processor in turn (very much like a card dealer dealing cards to a group of players). This is a particularly effective method when the data is stored on a striped RAID system.

    Range: each processor receives a partition based on a particular range for a chosen set of fields, e.g. all tuples where the surname starts A-B, etc. Range partitioning suffers badly from skew, where each range does not contain a roughly similar number of tuples (as in the example above).
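    The three methods can be sketched side by side; the surnames and range boundaries below are illustrative:

```python
from collections import defaultdict

tuples = ["Adams", "Brown", "Clark", "Baker", "Evans", "Allen"]
n = 3  # number of partitions / processors

def hash_partition(rows, n):
    """Hash: the bucket is chosen by hashing the tuple's key.
    (The exact spread depends on the hash function.)"""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row) % n].append(row)
    return parts

def round_robin_partition(rows, n):
    """Round-robin: deal tuples out in turn, like cards to players."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def range_partition(rows, boundaries):
    """Range: the bucket is chosen by comparing the key against the
    boundaries. Skewed data means skewed partitions."""
    parts = defaultdict(list)
    for row in rows:
        parts[sum(row >= b for b in boundaries)].append(row)
    return parts

print(round_robin_partition(tuples, n)[0])        # ['Adams', 'Baker']
# With boundaries ["C", "M"] most surnames fall below "C": skew.
print({k: len(v) for k, v in range_partition(tuples, ["C", "M"]).items()})
```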

    Pipelined parallelism

    In the absence of parallelism, each relational operator has to generate the complete relation before starting on the next operation.

    Parallelism allows operations to be split into sub-tasks which can be carried out independently.

    Each task processes a tuple and immediately passes it on to the next task without having to store the complete relation.

    Pipelining allows tuples to come out of the entire process earlier. However, pipelining can be hampered by operations which require the full relation to be known before progressing, e.g. sorting and grouping. These operations block pipelining.

    Run Test1 and Test2 of the task simulator to see the effect of pipelining and the advantage of splitting the select task (T1).
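    Pipelined and blocking operators can be sketched with Python generators: each pipelined operator consumes and yields one tuple at a time, while a blocking operator must see every tuple before it can emit any:

```python
def scan(rows):                      # source operator
    for row in rows:
        yield row

def select(rows, pred):              # pipelined: passes each tuple on at once
    for row in rows:
        if pred(row):
            yield row

def sort(rows, key):                 # blocking: must see the full input first
    yield from sorted(rows, key=key)

rows = [("bob", 3), ("ann", 6), ("cal", 5)]
pipeline = select(scan(rows), lambda r: r[1] > 4)   # tuples stream through
blocked = sort(pipeline, key=lambda r: r[0])        # blocks the pipeline
print(list(blocked))   # [('ann', 6), ('cal', 5)]
```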

    Query Optimisation

    Parallel query optimisation involves combining partitioning and pipelining to overcome bottlenecks in the query plan, usually created by slow tasks such as comparisons and joins.

    Consider the following relations and SQL statement:
    R1 (name, dateOfBirth, houseType*) — name and dateOfBirth are indexed.
    R2 (houseType, numberRooms, detached)

    SELECT name, dateOfBirth, numberRooms FROM R1, R2 WHERE R1.houseType = R2.houseType AND detached = true

    The relational algebra for this might be:

    π name, dateOfBirth, numberRooms ( R1 ⋈ houseType=houseType π houseType, numberRooms ( σ detached=true (R2) ) )
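    The algebra can be walked through operator by operator; the toy relation contents below are illustrative, not from the slides:

```python
r1 = [("ann", "1970-01-01", "det3"), ("bob", "1980-02-02", "semi")]
r2 = [("semi", 5, False), ("det3", 6, True)]

# σ detached=true (R2)
selected = [t for t in r2 if t[2]]
# π houseType, numberRooms — kept as a lookup table for the join
projected = {h: n for (h, n, _) in selected}
# R1 ⋈ houseType=houseType, then the outer π name, dateOfBirth, numberRooms
result = [(name, dob, projected[h]) for (name, dob, h) in r1
          if h in projected]
print(result)   # [('ann', '1970-01-01', 6)]
```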

    [Diagram: two alternative query trees for this statement. Both apply σ detached=true and π houseType, numberRooms to R2 below the join ⋈ houseType=houseType with R1, with the final π name, dateOfBirth, numberRooms at the root; they differ in how the selection and projection on R2 are ordered and combined beneath the join.]