Integrating and Optimizing Transactional Memory in a Data Mining Middleware
Vignesh Ravi and Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University, Columbus, Ohio 43210
Outline
• Motivation
• Software Transactional Memory
• Shared-Memory Parallelization Schemes
• Transactional Locking II (TL2)
• Hybrid Replicated STM Scheme
• FREERIDE Processing Structure
• Experimental Results
• Conclusions
April 20, 2023 2
Motivation
• Availability of large data for analysis
  – On the scale of tera- and petabytes
• Advent of multi-core and many-core architectures
  – Intel's Polaris 80-core chip
  – Larrabee many-core architecture
• Programmability challenge
  – Coarse-grained: performance is not sufficient
  – Fine-grained: better left to experts
• Need for a transparent, scalable shared-memory parallelization technique
Software Transactional Memory (STM)
• Maps concurrent transactions in databases to concurrent thread operations
• Programmer:
  – Identifies critical sections
  – Tags them as transactions
  – Launches multiple threads
• Transactions run as atomic, isolated operations
• Data races are handled automatically
• Guarantees absence of deadlock!
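The programmer-facing pattern described above can be sketched as follows. This is a minimal illustration, not a real STM: the `Transaction` context manager and its single global lock are hypothetical stand-ins for an STM runtime, which would instead execute the tagged section optimistically and retry on conflict.

```python
import threading

# Hypothetical stand-in for an STM runtime: one global lock emulates the
# atomicity and isolation that a real STM provides optimistically.
_stm_lock = threading.Lock()

class Transaction:
    """Tag a critical section as a transaction (context manager)."""
    def __enter__(self):
        _stm_lock.acquire()
        return self

    def __exit__(self, *exc):
        _stm_lock.release()
        return False

counter = {"value": 0}

def worker(n):
    for _ in range(n):
        with Transaction():          # the tagged section runs atomically
            counter["value"] += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["value"])  # 4000: no lost updates despite concurrent threads
```

The point for the programmer is the shape of the code: identify the critical section, wrap it in a transaction, and launch threads; the runtime handles races.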
FREERIDE Processing Structure
(Framework for Rapid Implementation of Datamining Engines)
{* Outer sequential loop *}
While( ) {
  {* Reduction loop *}
  Foreach(element e) {
    (i, val) = compute(e)
    RObj(i) = Reduc(RObj(i), val)
  }
}
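The generalized reduction loop above can be made concrete with a small sketch. The `compute` and `reduc` functions below are hypothetical placeholders (a trivial histogram), chosen only to show the structure: each element maps to an index into the reduction object, and an associative, commutative operation folds its value in.

```python
def compute(e):
    # Hypothetical per-element computation: map each element to a
    # reduction-object index and a value (here, a simple histogram).
    return e % 4, 1

def reduc(current, val):
    # Associative and commutative reduction operation.
    return current + val

robj = [0, 0, 0, 0]        # the reduction object
for e in range(100):       # the reduction loop over all elements
    i, val = compute(e)
    robj[i] = reduc(robj[i], val)
print(robj)  # [25, 25, 25, 25]
```

Because updates to `robj` are the only shared state, parallelizing this loop reduces to choosing how concurrent updates to the reduction object are mediated, which is exactly what the schemes below vary.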
• Map-reduce: two-stage
• FREERIDE: one-stage
  – Intermediate reduction structure exposed
  – Better performance than map-reduce [Cluster '09]
Middleware API
• Process each data instance
• Reduce the result into the reduction object
• Combine local results from all threads if needed
Reduction Object
Shared-Memory Parallelization Techniques
Context: FREERIDE (Framework for Rapid Implementation of Datamining Engines)
• Replication-based (lock-free)
  – Full replication (f-r)
• Lock-based
  – Full locking
  – Cache-sensitive locking (cs-l)
[Figure: full-locking vs. cache-sensitive locking layouts; legend: lock, reduction element]
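Cache-sensitive locking can be sketched as follows. The sizes are illustrative assumptions (64-byte cache lines, 8 reduction elements per line, not figures from the paper): instead of one lock per element, one lock guards all the elements that share a cache line, so a lock and its elements stay together and lock overhead shrinks.

```python
import threading

# Assumed sizes for illustration: 8 reduction elements share a cache line,
# so one lock guards each group of 8 elements.
ELEMS_PER_LINE = 8
NUM_ELEMS = 64

robj = [0] * NUM_ELEMS
locks = [threading.Lock() for _ in range(NUM_ELEMS // ELEMS_PER_LINE)]

def update(i, val):
    # Cache-sensitive locking: acquire the lock covering i's cache line.
    with locks[i // ELEMS_PER_LINE]:
        robj[i] += val

def worker():
    for i in range(NUM_ELEMS):
        update(i, 1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(robj))  # 256: every element updated once by each of 4 threads
```

The tuning to a specific line size is also the scheme's weakness noted below: porting to hardware with a different cache geometry means revisiting these constants.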
Motivation for STM Integration
Potential downsides of existing schemes [CCGRID '09]:
• Full replication
  – Very high memory requirements
• Cache-sensitive locking
  – Tuned for a specific cache architecture
  – Risk of introducing bugs and deadlocks when porting

Advantages of STM:
• Leverages a large body of existing STM work
  – Easier programmability
  – No deadlocks!
• Provides transparent integration
  – The programmer need not bother with STM details

What do we need?
• The easy programmability of STM
• Competitive performance
Transactional Locking II (TL2)
• Word-based, lock-based algorithm
• Faster than non-blocking STM techniques
• API:
  – STMBeginTransaction()
  – STMWrite()
  – STMRead()
  – STMCommit()
• We used the Rochester STM implementation (RSTM-TL2)
• Downside of STM:
  – A large number of conflicts leads to a large number of aborts
Optimization – Hybrid Replicated STM (rep-stm)
• Best of both worlds: replication and STM
• Replicated STM
  – Group 'n' threads into 'm' groups
  – Keep 'm' copies of the reduction object
  – Each group of threads has a private copy
  – The n/m threads within a group share their copy using STM
• Advantages of replicated STM
  – Reduces the number of reduction-object copies
  – Reduces merge overhead
  – Also reduces conflicts compared with plain STM
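The grouping described above can be sketched as follows, with a per-group lock standing in for the STM that mediates sharing within a group (the thread counts and sizes are illustrative, not from the paper). Each group updates its private copy; a final merge folds the m copies into one global reduction object.

```python
import threading

N_THREADS, M_GROUPS = 8, 2    # group n threads into m groups
SIZE = 4                      # reduction-object size (illustrative)

# m private copies of the reduction object, one per group; within a group,
# a lock is a stand-in for the STM mediating shared updates.
copies = [[0] * SIZE for _ in range(M_GROUPS)]
group_locks = [threading.Lock() for _ in range(M_GROUPS)]

def worker(tid):
    g = tid % M_GROUPS                 # assign this thread to a group
    for e in range(100):
        i, val = e % SIZE, 1
        with group_locks[g]:           # STM stand-in: only n/m threads contend
            copies[g][i] += val

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Final merge of the m copies into one global reduction object.
robj = [sum(c[i] for c in copies) for i in range(SIZE)]
print(robj)  # [200, 200, 200, 200]
```

The design knob is m: m = n degenerates to full replication (no conflicts, maximum memory and merge cost), while m = 1 is plain STM; intermediate values trade the two off.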
Experimental Goals
Setup:
• Intel Xeon E5345 processors
• Two quad-cores (8 cores total), each core at 2.33 GHz
• 6 GB main memory
• 8 MB L2 cache

Goals:
• Compare f-r, cs-l, stm-TL2, and rep-stm for three data mining algorithms
  – K-means, Expectation Maximization (EM), and Principal Component Analysis (PCA)
• Evaluate different read-write mixes
• Evaluate conflicts and aborts
Parallel Efficiency of PCA
Principal Component Analysis:
• 8.5 GB dataset
• Best result: rep-stm (6.1x speedup)
• Observation: all techniques are competitive
• PCA-specific:
  – Computation for finding the covariance matrix is high
  – Amortizes the cost of lock revalidation/acquire/release
  – STM overhead is only 2.3%
Parallel Efficiency of EM
Expectation-Maximization (EM):
• 6.4 GB dataset
• Best result: cs-l (~5x speedup)
• Observations:
  – STM schemes are competitive
  – STM schemes have better scalability
  – The difference between stm-TL2 and rep-stm is not observed with 8 cores
• EM-specific:
  – Computation between updates is high
  – Again, initial overhead is high
Canonical Loop – Parallel Efficiency for Read-write Mixes
Canonical loop:
• Synthetic computation that follows the generalized reduction structure
• Different workloads with varying read/write mixes
• All results with 8 threads
• Interesting: a different winner for each workload
Evaluation of Conflicts and Aborts
• Same canonical loop
• Compares the rate of aborts for stm-TL2 and rep-stm
• Demonstrates the advantage of rep-stm over stm-TL2 for a large number of threads
• In all cases, for rep-stm:
  – The rate of growth of aborts is much slower
  – Aborts are reduced by 40-55%
Conclusions
• Transparent use of STM schemes in the middleware
• Developed a hybrid replicated-STM scheme to reduce
  – Memory requirements
  – Conflicts/aborts
• TL2 and rep-stm are competitive with a highly tuned locking scheme
• rep-stm significantly reduces the number of aborts compared with stm-TL2
Thank You!
Questions?
Contacts: Vignesh Ravi - [email protected]
Gagan Agrawal - [email protected]