Integrating and Optimizing Transactional Memory in a Data Mining Middleware
Vignesh Ravi and Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University, Columbus, Ohio 43210
Outline
• Motivation
• Software Transactional Memory
• Shared-Memory Parallelization Schemes
• Transactional Locking II (TL2)
• Hybrid Replicated STM Scheme
• FREERIDE Processing Structure
• Experimental Results
• Conclusions
April 20, 2023 2
Motivation
• Availability of large data for analysis
  – On the scale of tera- and petabytes
• Advent of multi-core and many-core architectures
  – Intel's Polaris 80-core chip
  – Larrabee many-core architecture
• Programmability challenge
  – Coarse-grained: performance is not sufficient
  – Fine-grained: better left to experts
• Need for a transparent, scalable shared-memory parallelization technique
Software Transactional Memory (STM)
• Maps concurrent transactions in databases to concurrent thread operations
• Programmer:
  – Identifies critical sections
  – Tags them as transactions
  – Launches multiple threads
• Transactions run as atomic, isolated operations
• Data races are handled automatically
• Guarantees absence of deadlock!
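The programmer-facing pattern described above can be sketched as follows. This is a minimal illustration, not a real STM: the `Transaction` context manager and its single global lock are hypothetical stand-ins for an STM runtime, which would instead execute the tagged section optimistically and retry on conflict.

```python
import threading

# Hypothetical stand-in for an STM runtime: one global lock emulates the
# atomicity and isolation that a real STM provides optimistically.
_stm_lock = threading.Lock()

class Transaction:
    """Tag a critical section as a transaction (context manager)."""
    def __enter__(self):
        _stm_lock.acquire()
        return self

    def __exit__(self, *exc):
        _stm_lock.release()
        return False

counter = {"value": 0}

def worker(n):
    for _ in range(n):
        with Transaction():          # the tagged section runs atomically
            counter["value"] += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["value"])  # 4000: no lost updates despite concurrent threads
```

The point for the programmer is the shape of the code: identify the critical section, wrap it in a transaction, and launch threads; the runtime handles races.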
FREERIDE Processing Structure
(Framework for Rapid Implementation of Datamining Engines)
{* Outer sequential loop *}
While( ) {
  {* Reduction loop *}
  Foreach(element e) {
    (i, val) = compute(e)
    RObj(i) = Reduc(RObj(i), val)
  }
}
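The generalized reduction loop above can be made concrete with a small sketch. The `compute` and `reduc` functions below are hypothetical placeholders (a trivial histogram), chosen only to show the structure: each element maps to an index into the reduction object, and an associative, commutative operation folds its value in.

```python
def compute(e):
    # Hypothetical per-element computation: map each element to a
    # reduction-object index and a value (here, a simple histogram).
    return e % 4, 1

def reduc(current, val):
    # Associative and commutative reduction operation.
    return current + val

robj = [0, 0, 0, 0]        # the reduction object
for e in range(100):       # the reduction loop over all elements
    i, val = compute(e)
    robj[i] = reduc(robj[i], val)
print(robj)  # [25, 25, 25, 25]
```

Because updates to `robj` are the only shared state, parallelizing this loop reduces to choosing how concurrent updates to the reduction object are mediated, which is exactly what the schemes below vary.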
• Map-reduce: two-stage
• FREERIDE: one-stage
  – Intermediate reduction structure exposed
  – Better performance than map-reduce [Cluster '09]
Middleware API
• Process each data instance
• Reduce the result into the reduction object
• Combine local results from all threads if needed
Reduction Object
Shared-Memory Parallelization Techniques
Context: FREERIDE (Framework for Rapid Implementation of Datamining Engines)
• Replication-based (lock-free)
  – Full replication (f-r)
• Lock-based
  – Full locking
  – Cache-sensitive locking (cs-l)
[Figure: full-locking vs. cache-sensitive locking layouts; legend: lock, reduction element]
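Cache-sensitive locking can be sketched as follows. The sizes are illustrative assumptions (64-byte cache lines, 8 reduction elements per line, not figures from the paper): instead of one lock per element, one lock guards all the elements that share a cache line, so a lock and its elements stay together and lock overhead shrinks.

```python
import threading

# Assumed sizes for illustration: 8 reduction elements share a cache line,
# so one lock guards each group of 8 elements.
ELEMS_PER_LINE = 8
NUM_ELEMS = 64

robj = [0] * NUM_ELEMS
locks = [threading.Lock() for _ in range(NUM_ELEMS // ELEMS_PER_LINE)]

def update(i, val):
    # Cache-sensitive locking: acquire the lock covering i's cache line.
    with locks[i // ELEMS_PER_LINE]:
        robj[i] += val

def worker():
    for i in range(NUM_ELEMS):
        update(i, 1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(robj))  # 256: every element updated once by each of 4 threads
```

The tuning to a specific line size is also the scheme's weakness noted below: porting to hardware with a different cache geometry means revisiting these constants.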
Motivation for STM Integration
Potential downsides of existing schemes [CCGRID '09]:
• Full replication
  – Very high memory requirements
• Cache-sensitive locking
  – Tuned for a specific cache architecture
  – Risk of introducing bugs and deadlocks when porting

Advantages of STM:
• Leverages a large body of existing STM work
  – Easier programmability
  – No deadlocks!
• Provides transparent integration
  – The programmer need not bother with STM details

What do we need?
• The easy programmability of STM
• Competitive performance
Transactional Locking II (TL2)
• Word-based, lock-based algorithm
• Faster than non-blocking STM techniques
• API:
  – STMBeginTransaction()
  – STMWrite()
  – STMRead()
  – STMCommit()
• We used the Rochester STM implementation (RSTM-TL2)
• Downside of STM:
  – A large number of conflicts leads to a large number of aborts
Optimization – Hybrid Replicated STM (rep-stm)
• Best of both worlds: replication and STM
• Replicated STM
  – Group 'n' threads into 'm' groups
  – Keep 'm' copies of the reduction object
  – Each group of threads has a private copy
  – The n/m threads within a group share their copy using STM
• Advantages of replicated STM
  – Reduces the number of reduction-object copies
  – Reduces merge overhead
  – Also reduces conflicts compared with plain STM
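The grouping described above can be sketched as follows, with a per-group lock standing in for the STM that mediates sharing within a group (the thread counts and sizes are illustrative, not from the paper). Each group updates its private copy; a final merge folds the m copies into one global reduction object.

```python
import threading

N_THREADS, M_GROUPS = 8, 2    # group n threads into m groups
SIZE = 4                      # reduction-object size (illustrative)

# m private copies of the reduction object, one per group; within a group,
# a lock is a stand-in for the STM mediating shared updates.
copies = [[0] * SIZE for _ in range(M_GROUPS)]
group_locks = [threading.Lock() for _ in range(M_GROUPS)]

def worker(tid):
    g = tid % M_GROUPS                 # assign this thread to a group
    for e in range(100):
        i, val = e % SIZE, 1
        with group_locks[g]:           # STM stand-in: only n/m threads contend
            copies[g][i] += val

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Final merge of the m copies into one global reduction object.
robj = [sum(c[i] for c in copies) for i in range(SIZE)]
print(robj)  # [200, 200, 200, 200]
```

The design knob is m: m = n degenerates to full replication (no conflicts, maximum memory and merge cost), while m = 1 is plain STM; intermediate values trade the two off.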
Experimental Goals
Setup:
• Intel Xeon E5345 processors
• Two quad-cores (8 cores total), each core at 2.33 GHz
• 6 GB main memory
• 8 MB L2 cache

Goals:
• Compare f-r, cs-l, stm-TL2, and rep-stm for three data mining algorithms
  – K-means, Expectation Maximization (EM), and Principal Component Analysis (PCA)
• Evaluate different read-write mixes
• Evaluate conflicts and aborts
Parallel Efficiency of PCA
Principal Component Analysis:
• 8.5 GB dataset
• Best result: rep-stm (6.1x speedup)
• Observation: all techniques are competitive
• PCA-specific:
  – Computation for finding the covariance matrix is high
  – Amortizes the cost of lock revalidation/acquire/release
  – STM overhead is only 2.3%
Parallel Efficiency of EM
Expectation-Maximization (EM):
• 6.4 GB dataset
• Best result: cs-l (~5x speedup)
• Observations:
  – STM schemes are competitive
  – STM schemes have better scalability
  – The difference between stm-TL2 and rep-stm is not observed with 8 cores
• EM-specific:
  – Computation between updates is high
  – Again, initial overhead is high
Canonical Loop – Parallel Efficiency for Read-write Mixes
Canonical loop:
• Synthetic computation that follows the generalized reduction structure
• Different workloads with varying read/write mixes
• All results with 8 threads
• Interesting: a different winner for each workload
Evaluation of Conflicts and Aborts
• Same canonical loop
• Compares the rate of aborts for stm-TL2 and rep-stm
• Demonstrates the advantage of rep-stm over stm-TL2 for a large number of threads
• In all cases, for rep-stm:
  – The rate of growth of aborts is much slower
  – Aborts are reduced by 40-55%
Conclusions
• Transparent use of STM schemes in the middleware
• Developed a hybrid replicated-STM scheme to reduce
  – Memory requirements
  – Conflicts/aborts
• TL2 and rep-stm are competitive with a highly tuned locking scheme
• rep-stm significantly reduces the number of aborts compared with stm-TL2
Thank You!
Questions?
Contacts: Vignesh Ravi - [email protected]
Gagan Agrawal - [email protected]