Discovering and Understanding Performance Bottlenecks in Transactional Applications
Ferad Zyulkyarov (1,2), Srdjan Stipic (1,2), Tim Harris (3), Osman S. Unsal (1), Adrián Cristal (1,4), Ibrahim Hur (1), Mateo Valero (1,2)
(1) BSC-Microsoft Research Centre
(2) Universitat Politècnica de Catalunya
(3) Microsoft Research Cambridge
(4) IIIA - Artificial Intelligence Research Institute CSIC - Spanish National Research Council
19th International Conference on Parallel Architectures and Compilation Techniques
11-15 September 2010, Vienna

Abstract the TM Implementation
Thread 1:
for (i = 0; i < N; i++) { atomic { x[i]++; } }
Thread 2:
for (i = 0; i < N; i++) { atomic { y[i]++; } }
Accesses to different arrays.
We can observe overheads inherent to the TM implementation.
We are not interested in such bottlenecks.

Abstract the TM Implementation
Thread 1:
for (i = 0; i < N; i++) { atomic { x[i]++; } }
Thread 2:
for (i = 0; i < N; i++) { atomic { x[i]++; } }
Accesses to the same array.
Contention: a bottleneck common to all implementations of the TM programming model.
We are interested in this kind of bottleneck.

Can We Find This Kind of Bottleneck?
atomic {
    statement1;
    statement2;
    statement3;
    statement4;
}
Abort rate: 80%
Where do aborts happen?
Which variables conflict?
Are there false conflicts?

Can We Find This Kind of Bottleneck?
atomic {
    statement1;
    statement2;
    statement3;
    statement4;
}
counter1=0;
counter2=0;
counter3=0;
counter4=0;

Can We Find This Kind of Bottleneck?
atomic {
    statement1;
    statement2;
    statement3;
    statement4;
}
counter1=1;
counter2=0;
counter3=0;
counter4=0;

Can We Find This Kind of Bottleneck?
atomic {
    statement1;
    statement2;
    statement3;
    statement4;
}
counter1=1;
counter2=1;
counter3=0;
counter4=0;
Conflict between statement2 and statement4.
Goal: profiling techniques to find bottlenecks (important conflicting locations) and to explain why these conflicts happen.

Outline
Profiling Techniques
Implementation
Case Studies

Profiling Techniques
Visualizing transactions
Conflict point discovery
Identifying conflicting data structures

Transaction Visualizer (Genome)
[Figure: transaction timeline for Genome, with regions labeled "Garbage Collection" and "Wait on barrier"]
14% aborts.
When do these aborts happen?
Aborts occur at the first and last atomic blocks in program order.

Aborts Graph (Bayes)
[Figure: aborts graph over atomic blocks AB1 to AB15; labels in the figure: "93% Aborts", "73%", "20%"]

Number of Aborts vs Wasted Work
atomic { counter++; }             Aborts = 9, Wasted Work = 10%
atomic { hashtable.Rehash(); }    Aborts = 1, Wasted Work = 90%
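The point of this slide can be reproduced with a little arithmetic. The sketch below is not the authors' profiler; it assumes a hypothetical log with one (atomic block, time lost) pair per abort, and the per-abort costs (1 unit and 81 units) are invented so that the totals match the 10% and 90% shown above.

```python
from collections import defaultdict


def wasted_work_report(aborts):
    """Summarise aborts per atomic block.

    `aborts` is an iterable of (block_name, time_lost) pairs, one per aborted
    transaction attempt. Returns {block_name: (abort_count, wasted_share)}.
    """
    counts = defaultdict(int)
    lost = defaultdict(float)
    for block, time_lost in aborts:
        counts[block] += 1
        lost[block] += time_lost
    total = sum(lost.values())
    return {
        block: (counts[block], lost[block] / total if total else 0.0)
        for block in counts
    }


# Hypothetical costs: each aborted counter++ attempt loses 1 time unit,
# the single aborted Rehash() attempt loses 81 units.
aborts = [("counter++", 1.0)] * 9 + [("hashtable.Rehash()", 81.0)]
report = wasted_work_report(aborts)
for block, (count, share) in report.items():
    print(f"{block}: aborts={count}, wasted work={share:.0%}")
```

Ranking atomic blocks by abort count alone would put counter++ first, while ranking by time lost puts Rehash() first, which is the distinction the slide draws.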

Conflict Point Discovery
File:Line         #Conf.  Method    Line
Hashtable.cs:51   152     Add       if (_container[hashCode]…
Hashtable.cs:48   62      Add       uint hashCode = HashSdbm(…
Hashtable.cs:53   5       Add       _container[hashCode] = n …
Hashtable.cs:83   5       Add       while (entry != null) …
ArrayList.cs:79   3       Contains  for (int i = 0; i < count; i++ )
ArrayList.cs:52   1       Add       if (count == capacity - 1) …
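A table like the one above can be produced by counting conflicts per source location. This is a minimal sketch under assumptions: the record layout (`file`, `line` keys) and the sample log are hypothetical, not the format used by the authors' tool.

```python
from collections import Counter


def conflict_points(conflicts):
    """Return [((file, line), count), ...] sorted by count, highest first."""
    counts = Counter((c["file"], c["line"]) for c in conflicts)
    return counts.most_common()


# Hypothetical conflict log; a real profiler would emit one record per abort.
log = (
    [{"file": "Hashtable.cs", "line": 51}] * 3
    + [{"file": "Hashtable.cs", "line": 48}] * 2
    + [{"file": "ArrayList.cs", "line": 79}]
)
ranking = conflict_points(log)
for (file, line), count in ranking:
    print(f"{file}:{line}  {count}")
```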

Conflicts Context
increment() {
    counter++;    // All conflicts happen here.
}
probability80 {
    probability = random() % 100;
    if (probability < 80) {
        atomic { increment(); }
    }
}
probability20 {
    probability = random() % 100;
    if (probability >= 80) {
        atomic { increment(); }
    }
}
Thread 1 and Thread 2 each run:
for (int i = 0; i < 100; i++) {
    probability80();
    probability20();
}
Bottom-up view
+ increment (100%)
|---- probability80 (80%)
|---- probability20 (20%)
Top-down view
+ main (100%)
|---- probability80 (80%)
      |---- increment (80%)
|---- probability20 (20%)
      |---- increment (20%)
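Both views can be derived from the call stacks recorded at each conflict. The sketch below assumes a hypothetical input of one call stack per conflict, outermost frame first, and an idealised 80/20 split; it merges the stacks into a prefix tree for the top-down view and merges the reversed stacks for the bottom-up view. Unlike the slide, the bottom-up tree here also keeps `main` as the outermost caller.

```python
def build_tree(stacks):
    """Merge call stacks into a prefix tree; each node is [count, children]."""
    root = {}
    for stack in stacks:
        level = root
        for frame in stack:
            node = level.setdefault(frame, [0, {}])
            node[0] += 1
            level = node[1]
    return root


def render(tree, total, depth=0):
    """Return indented 'frame (percent%)' lines for the tree."""
    lines = []
    for frame, (count, children) in tree.items():
        lines.append("  " * depth + f"{frame} ({100 * count // total}%)")
        lines.extend(render(children, total, depth + 1))
    return lines


# Hypothetical conflict stacks: 80 via probability80, 20 via probability20.
stacks = (
    [["main", "probability80", "increment"]] * 80
    + [["main", "probability20", "increment"]] * 20
)
top_down = build_tree(stacks)
bottom_up = build_tree([list(reversed(s)) for s in stacks])

print("\n".join(render(top_down, len(stacks))))
print("\n".join(render(bottom_up, len(stacks))))
```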

Identifying Multiple Conflicts from a Single Run
Thread 1:
atomic {
    obj1.x = t1;
    obj2.x = t2;
    obj3.x = t3;
    ...
    ...
    ...
}
Thread 2:
atomic {
    ...
    ...
    ...
    obj1.x = t1;
    obj2.x = t2;
    obj3.x = t3;
}
Conflict detected at 1st iteration
Conflict detected at 2nd iteration
Conflict detected at 3rd iteration

Identifying Conflicting Objects
List list = new List();
list.Add(1);
list.Add(2);
list.Add(3);
...
atomic {
    list.Replace(3, 33);
}
[Figure: the list and its elements 1, 2, 3, with addresses 0x08, 0x10, 0x18, 0x20]
GC, DbgEng: Object Addr 0x20, GC Root 0x08, Variable Name (list)
Memory Allocator, DbgEng: Instr Addr 0x446290, List.cs:1
Per-Object View
+ List.cs:1 “list” (42%)
|--- ChangeNode (20%)
     +---- Replace (12%)
     +---- Add (8%)
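One way to picture the step from a conflicting object address to a named root is a reachability search over the heap. The slide does not say how the GC performs this step, so the sketch below is only a generic breadth-first search; the `roots` and `references` tables are hypothetical, and the assumption that the list sits at 0x08 with its nodes at 0x10, 0x18 and 0x20 is a reading of the figure.

```python
from collections import deque


def roots_reaching(roots, references, target):
    """Return the names of the roots from which `target` is reachable.

    `roots` maps a variable name to an object address; `references` maps an
    object address to the addresses it points to.
    """
    found = []
    for name, root_addr in roots.items():
        seen = {root_addr}
        queue = deque([root_addr])
        while queue:
            addr = queue.popleft()
            if addr == target:
                found.append(name)
                break
            for child in references.get(addr, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return found


# Hypothetical heap: "list" at 0x08 links to nodes at 0x10, 0x18, 0x20.
roots = {"list": 0x08, "other": 0x40}
references = {0x08: [0x10], 0x10: [0x18], 0x18: [0x20], 0x40: []}
owners = roots_reaching(roots, references, 0x20)
print(owners)
```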

Outline
Profiling Techniques
Implementation
- Bartok
- The data that we collect
- Probe effect and profiling
Case Studies

Bartok
• C# to x86 research compiler with language-level support for TM
• STM
– Eager versioning (i.e. in-place update)
– Detects write-write conflicts eagerly (i.e. immediately)
– Detects read-write conflicts lazily (i.e. at commit)
– Detects conflicts at object granularity
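The four STM properties listed for Bartok can be illustrated with a toy model. This is not Bartok's implementation: it is a single-threaded sketch with invented names (`TObject`, `Transaction`, `Conflict`) that interleaves transactions by hand, and it only shows where each kind of conflict would surface.

```python
class Conflict(Exception):
    """Raised when a transaction must abort."""


class TObject:
    """A transactional object; conflicts are tracked per object, not per field."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self.version = 0   # bumped on every committed write
        self.owner = None  # transaction currently holding write ownership


class Transaction:
    def __init__(self, name):
        self.name = name
        self.read_versions = {}  # TObject -> version seen at first read
        self.undo_log = []       # (object, field, old value) for rollback
        self.owned = []          # objects this transaction has written

    def read(self, obj, field):
        self.read_versions.setdefault(obj, obj.version)
        return obj.fields[field]

    def write(self, obj, field, value):
        # Eager write-write detection: fail at the write itself.
        if obj.owner is not None and obj.owner is not self:
            self.abort()
            raise Conflict(f"{self.name}: write-write conflict")
        if obj.owner is None:
            obj.owner = self
            self.owned.append(obj)
        self.undo_log.append((obj, field, obj.fields[field]))
        obj.fields[field] = value  # eager versioning: update in place

    def commit(self):
        # Lazy read-write detection: the read set is validated only now.
        for obj, seen in self.read_versions.items():
            if obj.version != seen or obj.owner not in (None, self):
                self.abort()
                raise Conflict(f"{self.name}: read-write conflict")
        for obj in self.owned:
            obj.version += 1
            obj.owner = None
        self._reset()

    def abort(self):
        for obj, field, old in reversed(self.undo_log):
            obj.fields[field] = old  # roll back in-place updates
        for obj in self.owned:
            obj.owner = None
        self._reset()

    def _reset(self):
        self.read_versions, self.undo_log, self.owned = {}, [], []


shared = TObject(x=0, y=0)
t1, t2 = Transaction("T1"), Transaction("T2")
t1.write(shared, "x", 1)
try:
    t2.write(shared, "y", 2)  # different field, same object
    eager_conflict = False
except Conflict:
    eager_conflict = True
t1.commit()

t3, t4 = Transaction("T3"), Transaction("T4")
t3.read(shared, "x")
t4.write(shared, "y", 5)
t4.commit()
try:
    t3.commit()  # the object changed after T3 read it
    lazy_conflict = False
except Conflict:
    lazy_conflict = True
```

In this model T2 aborts at its write even though it touches a different field than T1, which is what object granularity implies, while T3 only learns of its conflict when it tries to commit.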

Profiling Data That We Collect
• Timestamp
– TX start
– TX commit or TX abort
• Read and write set size
• On abort
– The instructions of the read and write operations involved in the conflict
– The conflicting memory address
– The call stack
• Process data offline or during GC
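The list above maps naturally onto a per-transaction record. The layout below is a hypothetical sketch: the class and field names are invented for illustration and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConflictInfo:
    """Data kept only for aborted transactions (hypothetical field names)."""
    read_instruction: int    # address of the conflicting read instruction
    write_instruction: int   # address of the conflicting write instruction
    address: int             # conflicting memory address
    call_stack: List[str]    # outermost frame first


@dataclass
class TxRecord:
    """One record per transaction attempt."""
    start: float             # timestamp at TX start
    end: float               # timestamp at TX commit or TX abort
    committed: bool
    read_set_size: int
    write_set_size: int
    conflict: Optional[ConflictInfo] = None

    @property
    def duration(self) -> float:
        return self.end - self.start


# Example values are made up for illustration.
record = TxRecord(
    start=10.0,
    end=12.5,
    committed=False,
    read_set_size=4,
    write_set_size=1,
    conflict=ConflictInfo(
        read_instruction=0x1000,
        write_instruction=0x1010,
        address=0x20,
        call_stack=["main", "probability80", "increment"],
    ),
)
print(record.duration, record.conflict.call_stack[-1])
```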

Probe Effect and Overheads
Normalized Execution Time (average 0.25)
Threads  Bayes  Genome  Intruder  Labyrinth  Vacation  WormBench
1        0.59   0.27    0.29      0.07       0.26      0.29
2        0.45   0.30    0.39      0.03       0.24      0.05
4        0.01   0.21    0.55      0.01       0.18      0.08
8        0.02   0.18    1.19      0.16       0.19      0.11
Normalized Abort Rates (average 0.016)
Threads  Bayes  Genome  Intruder  Labyrinth  Vacation  WormBench
2        0.00   0.00    0.00      0.00       0.00      0.00
4        0.11   0.00    0.01      0.00       0.00      0.00
8        0.12   0.00    0.02      0.00       0.00      0.00

Outline
Profiling Techniques
Implementation
Case Studies

Case Studies
Bayes
Intruder
Labyrinth

Bayes
Wrapper object for function arguments:
public class FindBestTaskArg {
    public int toId;
    public Learner learnerPtr;
    public Query[] queries;
    public Vector queryVectorPtr;
    public Vector parentQueryVectorPtr;
    public int numTotalParent;
    public float basePenalty;
    public float baseLogLikelihood;
    public Bitmap bitmapPtr;
    public Queue workQueuePtr;
    public Vector aQueryVectorPtr;
    public Vector bQueryVectorPtr;
}
Create wrapper object:
FindBestTaskArg arg = new FindBestTaskArg();
arg.learnerPtr = learnerPtr;
arg.queries = queries;
arg.queryVectorPtr = queryVectorPtr;
arg.parentQueryVectorPtr = parentQueryVectorPtr;
arg.bitmapPtr = visitedBitmapPtr;
arg.workQueuePtr = workQueuePtr;
arg.aQueryVectorPtr = aQueryVectorPtr;
arg.bQueryVectorPtr = bQueryVectorPtr;

Bayes
Call the function using the wrapper object:
atomic {
    FindBestInsertTask(arg);
}
98% of wasted work is due to the wrapper object.
2 threads: 24% execution time
4 threads: 80% execution time

Bayes – Solution
atomic {
    FindBestInsertTaskArg(
        toId,
        learnerPtr,
        queries,
        queryVectorPtr,
        parentQueryVectorPtr,
        numTotalParent,
        basePenalty,
        baseLogLikelihood,
        bitmapPtr,
        workQueuePtr,
        aQueryVectorPtr,
        bQueryVectorPtr
    );
}
Pass the arguments directly and avoid using the wrapper object.

Intruder – Map Data Structure
[Figure: map data structure; labels in the figure: "Network Stream", "Assembled packet fragments"]

Intruder – Map Data Structure
[Figure: map data structure; labels in the figure: "Network Stream", "Assembled packet fragments"]
Aborts caused 68% wasted work.
Replaced with a chaining hashtable.

Intruder – Moving Code
Write-write conflicts are detected eagerly.
atomic {
    Decoded decodedPtr = new Decoded();
    char[] data = new char[length];
    Array.Copy(packetPtr.Data, data, length);
    decodedPtr.flowId = flowId;
    decodedPtr.data = data;
}
this.decodedQueuePtr.Push(decodedPtr);
More to roll back, more wasted work.
Little to roll back, less wasted work.
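The contrast between "more to roll back" and "little to roll back" can be put into numbers with a simple model. The sketch assumes that an eagerly detected write-write conflict aborts the transaction at the contended statement, so everything executed before it is wasted. The statement names and costs are hypothetical, and treating the queue push as the contended statement is an assumption.

```python
def wasted_on_conflict(statements, contended):
    """Cost of the work rolled back when `contended` aborts the transaction.

    `statements` is the transaction body as (name, cost) pairs in program
    order; the conflict is assumed to be detected when `contended` executes.
    """
    wasted = 0
    for name, cost in statements:
        if name == contended:
            return wasted
        wasted += cost
    raise ValueError(f"{contended!r} is not in the transaction body")


# Hypothetical costs in arbitrary time units.
push_late = [("alloc", 1), ("copy", 10), ("set_fields", 2), ("push", 1)]
push_early = [("alloc", 1), ("push", 1), ("copy", 10), ("set_fields", 2)]

late = wasted_on_conflict(push_late, "push")
early = wasted_on_conflict(push_early, "push")
print(late, early)
```

Under these assumptions the same conflict costs 13 units when the contended statement comes last and 1 unit when it comes right after the allocation.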

Labyrinth
atomic {
    localGrid.CopyFrom(globalGrid);
    if (this.PdoExpansion(myGrid, myExpansionQueue, src, dst)) {
        pointVector = PdoTraceback(grid, myGrid, dst, bendCost);
        success = true;
        raced = grid.addPathOfOffsets(pointVector);
    }
}
2 threads: 80% wasted work
4 threads: 98% wasted work
Watson, PACT’07: it is safe if localGrid is not up to date.
Don’t instrument CopyFrom with transactional reads and writes.

Summary
• Design principles
– Abstract the underlying TM system
– Report results at the source language constructs
– Low instrumentation probe effect and overhead
• Profiling techniques
– Visualizing transactions
– Conflict point discovery
– Identifying conflicting data structures

PPoPP’2010
Debugging Programs that use Atomic Blocks and Transactional Memory
ICS’2009
QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory
PPoPP’2008
Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server
End