
Radium: Race-free On-demand Integrity Measurement Architecture

Srujan Kotikela

Supported by the NCSS Industry/University Cooperative Research Center (IUCRC)

Introduction

› Integrity Measurement
– A measurement represents the state/behavior of an entity
– Integrity measurement acts as a basis for trust

› Trusted Computing
– An entity can be trusted if it always behaves in the expected manner for the intended purpose (Trusted Computing Group)
– Trusted computing is the basis for modern secure systems

› Radium (this talk)
– Eliminate the TOCTTOU (time-of-check-to-time-of-use) condition
– Allow concurrent untrusted/trusted services
– More semantic and efficient measurements


Background

› A hardware-based root of trust is required for trustworthy measurements

› To verify the trustworthiness of a computing platform:
– Start with immutable hardware and measure sequentially
– Measure each component with an already-measured component, establishing a chain of trust

› Existing solutions to establish a transitive chain of trust:
– SRTM: static root of trust for measurement
– DRTM: dynamic root of trust for measurement

› Both SRTM and DRTM operate under the axiom:
– Measured components do not change after measurement
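As a concrete illustration, such a chain of trust is built by hashing each component and folding the result into a register before handing control over. The sketch below uses a toy 64-bit FNV-1a hash as a stand-in for the TPM's SHA-1/SHA-256 PCR-extend operation; the component names and the hash itself are illustrative assumptions, not Radium's implementation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* toy FNV-1a hash: stand-in for the TPM's real SHA-based extend */
static uint64_t fnv1a(uint64_t h, const void *buf, size_t len) {
    const uint8_t *p = buf;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
}

/* extend: PCR_new = H(PCR_old || measurement(component)) */
static uint64_t extend(uint64_t pcr, const char *component) {
    uint64_t m = fnv1a(0xcbf29ce484222325ULL, component, strlen(component));
    return fnv1a(pcr, &m, sizeof m);
}

int main(void) {
    uint64_t pcr = 0;   /* root: a known initial value in immutable hardware */
    const char *chain[] = { "BIOS", "bootloader", "VMM", "OS", "application" };
    for (int i = 0; i < 5; i++)
        pcr = extend(pcr, chain[i]);   /* each stage is measured before launch */
    printf("final PCR value: %016llx\n", (unsigned long long)pcr);
    return 0;
}

A verifier that replays the same measurements and arrives at the same final value can trust every link in the chain; under the axiom above, any post-measurement change goes undetected, which is exactly the TOCTTOU gap Radium targets.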

Static Root of Trust for Measurement (SRTM)

• Immutable ROM measures and launches a verified BIOS
• Verified software measures, verifies, and launches the next software in the chain
• Trust in the application is guaranteed only at the time of launch
• Requires a restart to launch an MLE (Measured Launch Environment)

[Figure: SRTM timeline. Hardware (immutable ROM) measures the system software (BIOS + bootloader + VMM + OS) at T_boot; the system software measures the application at T_launch; no measurement occurs at T_use, even though the application runs in the measured launch environment. T = time of, M = measurement of.]

Dynamic Root of Trust for Measurement (DRTM)

[Figure: DRTM timeline. Hardware (immutable ROM) measures the Authenticated Code Module (ACM), which measures the application at T_launch, creating the measured launch environment while the untrusted world is suspended; no measurement occurs at T_use. T = time of, M = measurement of.]

• DRTM is dynamic and can be invoked at any time (including boot time)
• Suspends the current environment and creates an isolated environment
• The chain of trust may contain a vendor-specific ACM
• Trust in the application is guaranteed at the time of launch
• Requires a reset to launch the MLE


Radium: Goals

› Extend existing trust technology (DRTM) to provide on-demand measurements

› Use measuring services for TOCTTOU-free measurements

› Provide efficient and semantically rich measurements

› Allow more than one measured environment to coexist and cooperate

› Contain an Access Control Policy that controls accesses between all trusted and untrusted environments


Radium: Architecture

[Figure: Radium architecture. Trusted hardware (CPU + TPM) performs a verified launch of a trusted hypervisor, which acts as the Asynchronous Root of Trust for Measurement. Under the hypervisor, a measuring service measures the target VM, with all interactions mediated by an Access Control Policy Module.]

Radium: Architecture

[Figure: the same architecture with a user/client. The user/client requests a measurement of the target VM through the trusted hypervisor and receives a "Verified" result.]

Radium: Implementation

[Figure: Radium prototype. Intel TXT hardware (CPU + TPM) performs a DRTM boot and verified launch of Xen, which acts as the Asynchronous Root of Trust for Measurement. Xen hosts a target VM (Ubuntu 10.04 with the kbeast rootkit) and a measuring VM (Ubuntu 12.04 with libVMI and Volatility), with accesses mediated by the Xen Security Module; measurements are extended into the TPM.]

Radium: Threat Model

Radium: Security

› The VMM is verified and trusted using DRTM

› The trusted VMM acts as the Asynchronous Root of Trust for Measurement (ARTM)

› All environments (trusted and untrusted) are isolated from one another by the trusted VMM's hardware isolation

› The trustworthiness of all measurements is ensured by the ARTM

› All accesses are protected by the trusted Mandatory Access Control policy within the VMM

Radium provides security against all of the considered attack types except offline attacks on hardware

Related Work

› Concurrent MLEs
– Intel SGX (McKeen, Frank et al. 2013 [15]), Concurrent Secure Worlds (Ramya Masti et al. 2013 [7])

› Trusted VMM
– TrustVisor (McCune et al. 2010 [6]), Terra (Garfinkel et al. 2003 [9])

› VMM security
– Xoar (Colp, Patrick et al. 2011 [14]), NoHype (E. Keller et al. 2010 [10]), ELI (Abel Gordon et al. 2013 [13])

› Integrity Measurement
– ReDAS [16], Automated security debugging [17] (Chongkyung Kil et al. 2009)

Future Directions

› VMware
– Porting Radium to VMware using the vProbes interface

› Intel SGX
– Hardware-only asynchronous root of trust for measurements

› Invariants
– Using an application's properties (invariants) to determine the security state of the application


Application Behavior: Invariants

› Properties of an application that must hold at a certain point during its execution

› An invariant can be a fixed value or an enumerated set

› Data invariants:
– Properties of individual variables or relations among a group of variables
– Examples: equality/inequality invariants; constant and original-value invariants

› Structural invariants:
– Program rules that have to be true at run time
– The return address on the stack should always point to the code section of memory
– The frame pointer of a stack frame shouldn't change during function execution
– Similar constraints apply to the heap and other sections of program memory

Invariants for Measurements

› Useful in debugging, but often undocumented; developers are frequently unaware these invariants exist

› Invariants are instance specific
– Need to collect multiple sets of invariants (a training phase)

› Security-sensitive invariants
– Which and how many invariants affect the security of the application?

› Extracting invariants (a minimal sketch follows)
– Instrument the application with canary values
– The canary value is used to monitor the return-address constraint
– Study the difference in invariants between normal execution and exploited execution
– Daikon is used to produce possible invariants
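A minimal sketch of the canary idea under stated assumptions: a fixed sentinel of 1000 (matching the Prozilla invariants on the next slide) placed in a function's frame and checked on exit. Real instrumentation is compiler-inserted and positions the canary between buffers and the saved return address; this stand-alone version is only illustrative.

#include <stdio.h>
#include <stdlib.h>

#define CANARY 1000   /* ::canary == 1000 (entry) */

static int process(const char *input) {
    volatile int canary = CANARY;   /* data invariant established on entry */
    char buf[16];
    (void)input; (void)buf;         /* ... real work on buf would go here ... */
    if (canary != CANARY) {         /* ::canary == orig(::canary) must hold on exit */
        fprintf(stderr, "invariant violation: canary clobbered\n");
        abort();                    /* a violation signals an exploited execution */
    }
    return 0;
}

int main(void) { return process("hello"); }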

Daikon

› Ghttpd 1.4

.count_vhosts():::ENTER                          // function entry
::SERVERPORT == 80
::SERVERROOT == "/usr/local/ghttpd"
::SERVERTYPE == "Standalone"
::no_vhosts == 0
::vhosts == null
==================================
..count_vhosts():::EXIT                          // function exit
::SERVERPORT == orig(::SERVERPORT)
::no_vhosts == return
::no_vhosts == orig(::no_vhosts)
::vhosts == orig(::vhosts)
::defaulthost == orig(::defaulthost)
::SERVERPORT == 80

› Prozilla 1.3.7

::canary == 1000                                 (entry)
::canary == 1000                                 (exit)
::canary == orig(::canary)                       (exit)
========================
::connections[].http_sock elements < ::canary    (exit)
::rt has only one value                          (entry)
::rt == orig(::rt)                               (exit)
==================================
::rt.num_connections == 4                        (entry)
::rt.ftps_mirror_req_n < orig(::canary)          (exit)

Application Integrity Verification

› The target VM's memory is parsed for (canary) variable values and stack constraints

› Look for suspicious write behavior on the stack

› Details of observed violations are saved to TPM PCRs

› Determine the security state of the application by measuring security invariants

› Detect code-reuse attacks, which are hard to detect with conventional techniques

Conclusions

› On-demand measurements are necessary to overcome the TOCTTOU condition in integrity measurements

› Using a minimal-TCB hypervisor as a root of trust for measurement (ARTM) is a viable replacement for hardware DRTM

› Semantically rich, fine-grained measurements are possible with ARTM

› Zero downtime for environments and powerful security applications can be achieved with the Radium architecture

[email protected]


References

1. RADIUM: Race-free On-demand Integrity Measurement. Srujan Kotikela, Mahadevan Gomathisankaran, Tawfiq Shah, Gelareh Taban.
2. Trusted computing using AMD "Pacifica" and "Presidio" secure virtual machine technology; Geoffrey Strongin, Advanced Micro Devices, Inc.
3. BIOS chronomancy: fixing the core root of trust for measurement; John Butterworth, Corey Kallenberg, Xeno Kovah, Amy Herzog.
4. Trusted Boot: Verifying the Xen Launch; Joseph Cihula.
5. Flicker: an execution infrastructure for TCB minimization; Jonathan M. McCune, Bryan J. Parno, Adrian Perrig, Michael K. Reiter, Hiroshi Isozaki.
6. TrustVisor: Efficient TCB Reduction and Attestation; Jonathan M. McCune, Yanlin Li, Ning Qu, Zongwei Zhou, Anupam Datta, Virgil Gligor, Adrian Perrig.
7. An architecture for concurrent execution of secure environments in clouds; Ramya Jayaram Masti, Claudio Marforio, Srdjan Capkun.
8. Copilot - a coprocessor-based kernel runtime integrity monitor; Nick L. Petroni, Jr., Timothy Fraser, Jesus Molina, William A. Arbaugh.
9. Terra: a virtual machine-based platform for trusted computing; Tal Garfinkel, Ben Pfaff, Jim Chow, Mendel Rosenblum, Dan Boneh.
10. NoHype: virtualized cloud infrastructure without the virtualization; E. Keller, J. Szefer, J. Rexford, R. B. Lee.
11. Building a MAC-based security architecture for the Xen open-source hypervisor; Sailer, R.; Jaeger, T.; Valdez, E.; Caceres, R.; Perez, R.; Berger, S.; Griffin, J. L.; van Doorn, L.
12. KVM: Hypervisor Security You Can Depend On; George Wilson, Michael Day, Beth Taylor.
13. ELI: Bare-Metal Performance for I/O Virtualization; Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, Dan Tsafrir.
14. Breaking up is hard to do: security and functionality in a commodity hypervisor; Colp, Patrick; Nanavati, Mihir; Zhu, Jun; Aiello, William; Coker, George; Deegan, Tim; Loscocco, Peter; Warfield, Andrew.
15. Innovative Instructions and Software Model for Isolated Execution; McKeen, Frank; Alexandrovich, Ilya; Berenzon, Alex; Rozas, Carlos V.; Shafi, Hisham; Shanbhogue, Vedvyas; Savagaonkar, Uday R.
16. ReDAS: Remote Attestation to Dynamic System Properties. Chongkyung Kil, Emre C. Sezer, Ahmed M. Azab, Peng Ning, Xiaolan Zhang.
17. Automated security debugging using program structural constraints. Chongkyung Kil, Emre C. Sezer, Peng Ning, Xiaolan Zhang.
18. The Daikon system for dynamic detection of likely program invariants. Michael D. Ernst, J. H. Perkins, P. J. Guo, C. Xiao.


NEMESIS - Automated Architecture for Threat Modeling and Risk Assessment for Cloud Computing (UNT-15-4-1)

Project Lead: Krishna Kavi, UNT; Mahadevan Gomathisankaran (Microsoft)
Date: April 8, 2015

Problem Statement

› Why is this research needed?
– To address the need for a comprehensive solution for cloud security threat modeling which incorporates the vulnerability assessment process, and then offers an actionable risk analysis tool
– A quantitative assessment can be used to negotiate security SLAs

› What are the specific problems to be solved?
– What are the types of threats facing cloud assets?
– Is there any scale to indicate the threat level?
– Is there any metric to characterize critical vulnerabilities facing the cloud's assets?
– Is it possible to predict the number of latent vulnerabilities that are not yet revealed?
– Is it possible to recommend an alternative configuration of the cloud's assets to reduce the currently perceived risk?


Project Description

› How will this project approach the problem?
– Use the STRIDE model to identify threat types
– Create ontologies of vulnerabilities, attacks, and defenses
– Use a Bayesian probability model to estimate risk for threats using the ontologies
– Use the ontologies to suggest alternate configurations that can minimize risk
– Explore ideas similar to software maturity for predicting latent vulnerabilities

› Preliminary results:
– The following papers describe our preliminary work and the feasibility of our approach:

› P. Kamongi, S. Kotikela, K. Kavi, M. Gomathisankaran and A. Singhal. "VULCAN: Vulnerability assessment framework for Cloud computing", Proceedings of the IEEE 7th International Conference on Software Security and Reliability, June 18-20, 2013, Washington, DC.

› P. Kamongi, M. Gomathisankaran, K. Kavi. "Nemesis: Automated architecture for threat modeling and risk assessment for cloud computing", The 6th ASE International Conference on Privacy, Security, Risk and Trust (PASSAT-2014), Dec. 13-16, 2014, Cambridge, MA, USA.

– We have developed a limited prototype of our Nemesis Architecture and it is used to assess the risk of any type of Software as a Service (SaaS) application running on top of an OpenStack Infrastructure as a Service (IaaS).


Assessing security of Cloud computing environments

§ Motivation
– Can we quantify security risks?
– Can such a measure be used to negotiate different security SLAs?
– Can such a measure be used to implement different types of security solutions?

§ How we approach this
– Classify the types of threats facing cloud assets
– Classify known vulnerabilities based on the types of threats possible
– Develop models for assigning risk probabilities to vulnerabilities, based on:
  § Existence of actual attacks
  § Existence of mitigations (or patches)
  § Significance of vulnerabilities
– Use Bayesian probability models to compute overall risk (a minimal sketch follows this list)
– Use our previous work on ontologies for vulnerabilities, attacks, and defenses
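A hedged sketch of the weighted Bayesian aggregation idea; every prior, likelihood, and weight below is a hypothetical placeholder, not a value from the Nemesis ontologies.

#include <stdio.h>

/* P(threat | evidence) = P(e|t)P(t) / (P(e|t)P(t) + P(e|~t)P(~t)) */
static double posterior(double p_t, double p_e_t, double p_e_not_t) {
    return (p_e_t * p_t) / (p_e_t * p_t + p_e_not_t * (1.0 - p_t));
}

int main(void) {
    /* one entry per STRIDE threat type (all numbers hypothetical):
       prior, P(evidence|threat), P(evidence|no threat), weight */
    struct { const char *name; double p, e_t, e_nt, w; } stride[] = {
        { "Spoofing",               0.10, 0.80, 0.20, 0.15 },
        { "Tampering",              0.15, 0.70, 0.25, 0.20 },
        { "Repudiation",            0.05, 0.60, 0.30, 0.10 },
        { "Information disclosure", 0.20, 0.85, 0.15, 0.25 },
        { "Denial of service",      0.25, 0.75, 0.20, 0.15 },
        { "Elevation of privilege", 0.10, 0.90, 0.10, 0.15 },
    };
    double risk = 0.0;
    for (int i = 0; i < 6; i++) {
        double post = posterior(stride[i].p, stride[i].e_t, stride[i].e_nt);
        risk += stride[i].w * post;   /* weighted aggregate severity */
        printf("%-22s posterior = %.3f\n", stride[i].name, post);
    }
    printf("aggregated risk = %.2f%%\n", 100.0 * risk);
    return 0;
}

With real ontology-derived evidence in place of the placeholders, this kind of aggregation yields a single severity percentage of the kind reported on the Nemesis example slide below.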


Nemesis Architecture


Nemesis Example

An aggregated risk estimated at 31.93% severity


Nemesis – Suggested Configurations to Reduce Perceived Risk

› An aggregated new risk estimated at 25.88% severity


Predicting Hidden Vulnerabilities

Current and Future Research

Predicting "hidden" vulnerabilities, including 'zero-day':
– Use software complexity models to predict hidden vulnerabilities
– Use the rate of patches and major new releases

OpenSSL Release                  #Known Vulnerabilities   #Predicted Vulnerabilities
cpe:/a:openssl:openssl:0.9.8h    47                       51.92
cpe:/a:openssl:openssl:0.9.7h    33                       32.73
cpe:/a:openssl:openssl:1.0.1g    25                       26.06
cpe:/a:openssl:openssl:0.9.6e    38                       35.95
cpe:/a:openssl:openssl:0.9.8b    51                       50.15

– Extend the vulnerability ontology database
– Use security threat intelligence reports

Connection to NCSS Competencies/Capabilities

[Table: mapping of the project to NCSS competencies/capabilities, marked primary, secondary, or tertiary.]

Deliverables

Summary of the 3 most significant deliverables expected at the end of Year 1.


Deliverable 1: Detailed report on our Ontologies and Vulcan framework
Deliverable 2: Bayesian model used to define threat probabilities
Deliverable 3: Demonstrations to show the capabilities of NEMESIS

Project Differentiators

› What results does this project seek that are different from (better than) others?
– To the best of our knowledge, we are the first group to propose an automated risk assessment architecture for cloud computing, which in turn enables us to deliver actionable intelligence regarding the threats and risks facing any cloud's assets.
– We are also among the first to use a software-maturity-style approach to predict the number of vulnerabilities in software products.

› What specific innovations or insights are sought by this research that distinguish it from related work?
– Representing the existing knowledge – vulnerabilities, attacks, defenses, and configurations – in a meaningful and efficient manner
– Automated use of the knowledge to assess the risks
– Automated suggestion of risk mitigation strategies
– Estimation of hidden vulnerabilities


Potential Member Company Benefits

› What specific benefits are sought for the industry members?
– Our framework can be utilized by small, medium, and large corporations with an interest in creating private or hybrid cloud systems, or migrating to public cloud systems, to assess the potential security threats and risk levels.

› What leverage does the research provide to industry member R&D plans?
– The framework can be expanded into a web service, leading to commercialization of the service.


Sponsorship and Collaboration

› Efforts to involve multiple companies in project sponsorship:
– Boeing and Firehost have expressed interest in this project

› Efforts to involve multiple university collaborators in the project:
– Exploring the possibility of collaborating with UTD



Title of New Project: Processing in Memory for Big Data Application (UNT-14-10-1)

Project Lead: Krishna Kavi, UNT
Date: August 2015

Problem Statement

› Why is this research needed?

– 3D stacked DRAMs contain a logic layer

– Can we embed simple processing elements in that layer?

– What computations should we move to the Processing-In-Memory cores?

› What is the specific problem to be solved?

– What applications benefit from PIMs?

– What should be the architecture of PIMs?

– What are the performance and energy advantages of using PIMs?

– Do we need new programming models?

– New memory system organization?

– Interconnect networks?


Project Overview

Tasks¹ (see the task table below):

Research Goals:

1. Develop energy efficient PIM cores

2. Investigate the impact of parallel overhead on number of cores and frequency

3. Understand heterogeneous memory systems

Benefits to Industry Partners:

1. Primary benefits are for processor and memory system designers like AMD, Intel, TI

2. Cloud applications may execute more efficiently on proposed systems

Project Milestones²:

Task#   Planned Completion   Milestone (Deliverable)
1       01/15                Analyze emerging Scale Out applications for common functionalities
2       04/15                Develop models for estimating number and nature of PIM cores
3       09/15                Preliminary designs and simulations of PIM
4       10/15                Flat Address Memory Architecture data collection


¹ Legend: the task has been approved by IAB sponsor(s), or the task is a deviation from the original sponsor-approved task (why?). See the notes section of this slide for more information.

Task#   Task Description
1       Analyze emerging Scale Out applications for common functionalities
2       Develop models for estimating number and nature of PIM cores
3       Preliminary designs and simulations of PIM

² Legend: the milestone is complete or on track for its planned completion date, or the milestone has changed from the original sponsor-approved date (why?).


Progress to Date and Accomplishments

1. Analyze emerging Scale Out applications for common functionalities
   Status: some analysis is complete, but inconclusive; needs further investigation.

2. Develop models for estimating number and nature of PIM cores
   Status: completed models for ARM cores.

3. Preliminary designs and simulations of PIM
   Status: completed for ARM cores and 4 MapReduce benchmarks; continuing on dataflow simulations.

4. Data collection for Flat Address Memory Architecture
   Status: developed a heterogeneous memory system simulator; collected traces for SPEC2006, Graph500, and MapReduce benchmarks; collected data on HMA system tradeoffs for 2- and 3-level memory systems. Targeting HPCA-2016.

(Status legend: significant finding/accomplishment; task complete; task partially complete; task not started.)

Project Pictorial

[Figure: the host issues requests over an abstract load/store interface to PIM and DRAM controllers, which drive the memory dies (3D-DRAM + DDR + PCM) over timing-specific DRAM interfaces.]

Flat Address Memory Architecture - FLAME

› Large-scale workloads: big data analysis, graph analysis, in-memory databases, HPC, etc.
– Easily exceed memory capacity
– Single-node performance is limited by the memory wall
  › Specifically disk latency (HDD or SSD)
  › Limited bandwidth -> less concurrency -> lower throughput
– The memory system consumes most of the system's power


Flat Address Memory Architecture - FLAME

› Motivation

                    3D-DRAM    DDR4        PCM                                     Flash
Latency             40 ns      60 ns       read comparable to DDR4; write 4x-8x    ~25 us
Bandwidth           160 GB/s   25.6 GB/s   25.6 GB/s (?)                           500 MB/s
Read access energy  8 pJ/bit   30 pJ/bit   write = 4x DDR4                         -

• Use as much 3D-DRAM as possible?
  • High bandwidth and low latency
  • Problem is limited capacity (a couple of GBs)
• Use PCM as secondary memory?
  • Read latency comparable to DDR
  • Problems are high write energy and limited write endurance
• The page swapping process still has some overhead

Goal

› Provide sufficient memory capacity to eliminate hard page faults

› Provide high bandwidth and low latency for critical data

› Use 3D-DRAM, DDR, and PCM as part of main memory
– Heterogeneous memory (flat memory)

› Page placement/replacement is critical
– Mark pages which are frequently accessed as "HOT"
– The page migration policy is activated every 0.1 s (an EPOCH)
– The hotness threshold is set to 32 accesses in 1 epoch

› We explore page migration policies, energy consumption, overhead, and application behavior for different memory system organizations (a minimal sketch of the hotness policy follows)
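A minimal sketch of the epoch-based hot-page policy just described; the page count, counters, and migration action are illustrative simulator scaffolding, not the actual FLAME implementation.

#include <stdio.h>

#define NPAGES        1024
#define HOT_THRESHOLD 32      /* accesses per epoch, per the policy above */

static unsigned access_count[NPAGES];
static int is_hot[NPAGES];

/* called on every memory access (e.g., from a simulator trace) */
static void touch(int page) { access_count[page]++; }

/* called once per 0.1 s epoch */
static void end_of_epoch(void) {
    for (int p = 0; p < NPAGES; p++) {
        is_hot[p] = access_count[p] >= HOT_THRESHOLD;
        if (is_hot[p])
            printf("page %d is HOT -> migrate to 3D-DRAM\n", p);
        /* pages that cooled off would be demoted back to DDR/PCM here */
        access_count[p] = 0;   /* restart counting for the next epoch */
    }
}

int main(void) {
    for (int i = 0; i < 40; i++) touch(7);   /* page 7 crosses the threshold */
    for (int i = 0; i < 5; i++)  touch(3);   /* page 3 stays cold */
    end_of_epoch();
    return 0;
}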


Experiments and Results

› 3D-DRAM + DDR -> what should be the size of the 3D-DRAM (compared to the total memory footprint) in order for transferring to make sense?

› 3D-DRAM + DDR + PCM -> how can we refine the transfer policy to take advantage of locality and access frequency, to minimize overhead and maximize performance and energy efficiency?

› What is the overhead associated with page transfer?
– Cache flushing, TLB shootdown, DMA transfer/software copy

› LLC line locking -> can we lock the most heavily used data in the LLC in order to minimize accesses to main memory?


Results – 3D-DRAM size to memory footprint ratio

[Chart: CPI improvement in %, normalized to the respective no_transfer policy, for benchmark mixes small_gc_lq_gc_lq, medium_om_xl_lq_gc, large_lb_ml_so_zs, and very_large_mc_bw_gm_ca at 3D-DRAM-to-footprint ratios 2:3, 1:2, 1:4, 1:8, and 1:16.]

Results

[Charts: total execution time improvement in % and improvement in total energy consumption in %, both compared to the respective no_transfer policy, at ratios 2:3, 1:2, 1:4, 1:8, and 1:16.]

Results

[Charts: overhead time as a percentage of the respective total execution time, and total pages transferred, at ratios 2:3, 1:2, 1:4, 1:8, and 1:16.]

Results – Page Transfer Overhead

› TLB shootdown, cache flush and invalidation

› Do we want to transfer using DMA or use software copying?

› DMA is up to 12% faster but can consume up to 30% more energy

› Need to verify results for multi-programmed workloads


Results – Cache locking

USER PROGRAM L3 MISSES (MPKI)

Config           bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  7.27    3.71       17.14     23.00       23.78  14.43  1.91    1.29       5.81
1w_1h_1l_data    7.24    3.89       17.12     22.38       21.97  14.41  1.90    1.22       5.80
1w_2h_1l_data    7.25    3.76       17.13     22.37       22.06  14.43  1.89    1.27       5.80
1w_2h_2l_data    7.25    3.83       17.13     22.37       21.99  14.42  1.89    1.28       5.80
2w_1h_1l_data    7.23    4.03       17.11     21.77       21.19  14.41  1.91    1.16       5.80
2w_2h_2l_data    7.25    3.85       17.12     21.80       21.48  14.42  1.89    1.24       5.78
4w_1h_1l_data    7.23    4.08       17.10     20.62       20.54  14.41  1.94    1.16       5.78
lock_preemp_2D   7.25    3.85       17.12     21.80       21.48  14.42  1.89    1.24       5.78

Results – Cache locking

TOTAL EXECUTION TIME [s]

Config           bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  26.46   20.67      54.10     49.31       65.81  39.35  19.49   24.57      24.48
1w_1h_1l_data    26.43   20.63      54.08     48.69       63.97  39.34  19.48   24.50      24.48
1w_2h_1l_data    26.44   20.65      54.09     48.69       64.07  39.35  19.48   24.55      24.47
1w_2h_2l_data    26.44   20.61      54.09     48.69       63.99  39.34  19.48   24.56      24.47
2w_1h_1l_data    26.43   20.61      54.06     48.08       63.16  39.34  19.49   24.44      24.48
2w_2h_2l_data    26.44   20.59      54.08     48.11       63.57  39.34  19.48   24.52      24.47
4w_1h_1l_data    26.42   20.61      54.05     46.93       62.79  39.34  19.51   24.44      24.48
lock_preemp_2D   26.44   20.59      54.08     48.11       63.57  39.34  19.48   24.52      24.47

TOTAL DYNAMIC ENERGY [J]

Config           bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  9.319   3.774      27.84     3.723       23.35  24.97  0.858   0.321      4.514
1w_1h_1l_data    9.315   3.769      27.88     3.636       22.92  25.01  0.862   0.313      4.517
1w_2h_1l_data    9.312   3.767      27.84     3.634       22.95  24.97  0.867   0.319      4.519
1w_2h_2l_data    9.312   3.765      27.85     3.635       22.91  24.97  0.867   0.320      4.519
2w_1h_1l_data    9.316   3.769      27.86     3.549       22.70  25.03  0.860   0.306      4.530
2w_2h_2l_data    9.312   3.760      27.85     3.555       22.96  24.97  0.867   0.315      4.529
4w_1h_1l_data    9.316   3.768      27.89     3.389       22.98  25.05  0.850   0.307      4.546
lock_preemp_2D   9.312   3.760      27.84     3.555       22.95  24.97  0.867   0.315      4.529

Connection to NCSS Competencies/Capabilities

[Table: mapping of the project to NCSS competencies/capabilities, marked primary, secondary, or tertiary.]

Efforts to Seek Additional Sponsorships and Collaborations

Were collaborations sought with researchers at other institutions to broaden research?

Were attempts made to leverage the research to obtain additional funding from companies or government agencies?

Were student researchers subsequently employed or given internships with a sponsor as a result of their work on the project?

› Exploring collaborations with AMD and IBM

› Seeking additional support from ARL

Objective Evidence Supporting NCSS Value Proposition

Papers, Publications, Presentations/Venue:
1. Mahzabeen Islam, Marko Scrbak, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. "Improving Node-Level MapReduce Performance Using Processing-in-Memory Technologies." In Euro-Par 2014: Parallel Processing Workshops, pp. 425-437. Springer International Publishing, 2014.
2. Marko Scrbak, Mahzabeen Islam, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. "Processing-in-Memory: Exploring the Design Space." In the 28th International Conference on the Architecture of Computer Systems (ARCS-2015), March 24-27, 2015, Porto, Portugal.

Products (Software, Hardware, Data, Designs, etc.):
1. Gem5 implementation of ARM cores as PIMs
2. McPAT models of energy for PIM and 3D DRAM

Student Placements:

Other:
1. Exploring collaborations with ARL

Dataflow Processing in Memory (DFPIM) Using Coarse Grain Reconfigurable Logic (CGRL)

Charles F. Shelor
August 10, 2015

Outline

ª What is dataflow?
ª What is processing in memory?
ª What is Coarse Grain Reconfigurable Logic?
ª What is DFPIM?
ª Examples of DFPIM
ª Performance and energy benefits of DFPIM
ª Trace-based DFG generation
ª Research areas
ª Conclusions


Dataflow

ª Style of computation
  ª Data flows from operation to operation
  ª An operation is performed when all of its data values arrive
  ª Highly parallel, self-synchronizing
ª Has been studied since the 1960s
ª The overhead of tracking data availability is the major obstacle to mainstream usage
ª Q = (X+Y)*(A+B)
ª R = (X-Y)/(A+B)
ª 5 operations, 2 'cycles'
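A trivial C rendering of that two-cycle schedule (input values arbitrary): the two additions and the subtraction fire as soon as X, Y, A, B are available, and the multiply and divide fire once those results arrive.

#include <stdio.h>

int main(void) {
    double X = 8, Y = 2, A = 3, B = 1;

    /* 'cycle' 1: +, -, + all fire in parallel (all inputs ready) */
    double s1 = X + Y, s2 = X - Y, s3 = A + B;

    /* 'cycle' 2: * and / fire once s1, s2, s3 have arrived */
    double Q = s1 * s3, R = s2 / s3;

    printf("Q = %g, R = %g  (5 operations, 2 'cycles')\n", Q, R);
    return 0;
}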


[Figure: dataflow graph for the example. Inputs X, Y, A, B feed +, -, and + nodes, whose results feed * and / to produce (X+Y)*(A+B) and (X-Y)/(A+B).]

Processing in Memory

ª Applications not well suited to caches
  ª Data fetched from memory, processed once, not accessed again
  ª Memory-bound, rather than CPU-bound, jobs
  ª Examples include streaming tasks, big data, etc.
    ª Text document word counting, image histogram, mp3
  ª Potentially suitable for non-uniform access patterns
    ª FFT, graph processing
ª Move processing closer to memory
  ª Higher bandwidth to memory, bypass caches
ª 3D stacked memory with logic layer interposer
  ª Requires low power, small size, simple algorithms


Coarse Grain Reconfigurable Logic

ª CGRL is a set of high-level functional blocks with run-time programmable connections
ª ALU, LD/ST, memory, multiplier, divider, sequencer, floating point add/sub, FP multiplier, FP divider
ª Similar to hard macros in FPGAs
  ª Processor, DSP elements, block memory
ª Each functional block is faster and lower power than the same function built from programmable gates in a standard FPGA implementation
ª Overall connection routing is much simpler, as there are fewer elements to interconnect and a more regular pattern


Coarse Grain Reconfigurable Logic

[Figure: CGRL fabric. Functional units (ALUs, sequencer, memory, LD/ST units, multiplier, divider), each with inputs I1/I2 and output O, connected through a programmable interconnect network.]

Dataflow PIM

ª Uses CGRL on the 3D RAM logic layer
ª Configures functional blocks into a dataflow graph to implement the PIM application
ª Parallel, pipelined blocks provide multiple operations per clock and typically require only 1 clock per item processed after the pipeline fills
ª No instruction fetch or decode, no instruction window or reorder buffers, no cache hierarchy, slower clock -> lower power with respect to an out-of-order processor


PIM Image Histogram Kernel

ª Code (below):
ª Compiles to 40 x86 instructions
ª Requires 23 clocks per pixel
ª 2.17 IPC (micro-ops per clock)


/* 'pix' is not defined on the slide; a 3-byte RGB struct is assumed */
typedef struct { unsigned char r, g, b; } pix;

void histogram(pix image[], int size, int red[], int grn[], int blu[])
{
    int pxl;
    int rd, gr, bl;

    for (pxl = 0; pxl < size; pxl++) {
        rd = (int) image[pxl].r;
        gr = (int) image[pxl].g;
        bl = (int) image[pxl].b;
        red[rd]++;
        grn[gr]++;
        blu[bl]++;
    }
}

DFPIM Image Histogram

[Figure: DFPIM histogram dataflow. A 24-bit load streams pixels; shifts by 16 and 8 plus AND masks with 0xFF extract the red, green, and blue bytes; each byte addresses a read-modify-write memory block (rd_adr/rd_dat, wr_adr/wr_dat) whose current count is incremented by 1 and stored back into the red, grn, or blu histogram.]

DFPIM Code


<!-- Histogram map DF implementation -->

<LDST instance="LDST0", size="24">

<!-- clock 1: shift r, g data; delay b data -->
<IALU instance="red", in_0="LDST0.data", in_1="immed", immed="16", funct="srl" >
<IALU instance="grn", in_0="LDST0.data", in_1="immed", immed="8", funct="srl" >
<DLY instance="blu", in="LDST0.data", latency="1" >

<!-- clock 2: mask values to 8 bits each -->
<IALU instance="red2", in_0="red.data", in_1="immed", immed="0xff", funct="and", size="8" >
<IALU instance="grn2", in_0="grn.data", in_1="immed", immed="0xff", funct="and", size="8" >
<IALU instance="blu2", in_0="blu.data", in_1="immed", immed="0xff", funct="and", size="8" >

<!-- clock 3: read current histogram counts, increment, store -->
<MEM512b instance="red_hist", rd_adrs="red2.data", rd_enable="1", rd_mode="async",
         wr_adrs="red2.data", wr_data="red_incr.data", wr_mode="sync" >
<MEM512b instance="grn_hist", rd_adrs="grn2.data", rd_enable="1", rd_mode="async",
         wr_adrs="grn2.data", wr_data="grn_incr.data", wr_mode="sync" >
<MEM512b instance="blu_hist", rd_adrs="blu2.data", rd_enable="1", rd_mode="async",
         wr_adrs="blu2.data", wr_data="blu_incr.data", wr_mode="sync" >
<IALU instance="red_incr", in_0="red_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >
<IALU instance="grn_incr", in_0="grn_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >
<IALU instance="blu_incr", in_0="blu_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >

Word Count Kernel


for (i = 0; i < buffsize; i++)                 // make upper case
    if (fdata[i] >= 'a' && fdata[i] <= 'z')
        fdata[i] = fdata[i] & 0xdf;

i = 0;
while (i < buffsize) {
    while (i < buffsize && (fdata[i] < 'A' || fdata[i] > 'Z'))
        i++;                                   // skip non-alpha characters
    uint64_t start = i;
    while (i < buffsize && ((fdata[i] >= 'A' && fdata[i] <= 'Z')
                            || fdata[i] == '\''))
        i++;                                   // find next non-alpha
    if (i > start) {                           // isolate word
        fdata[i] = '\0';
        char* word = (char*) malloc((i - start + 1) * sizeof(char));
        int x = 0;
        while (x < (i - start)) {
            word[x] = fdata[start + x];
            x++;
        }
        word[x] = '\0';
        emit(word);                            // compute hash, incr count
    }
}
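emit() is left undefined on the slide; here is a hedged sketch of what its hashing step might look like, using the rotate-left-by-5 ('rol 5') block shown in the DFPIM diagrams that follow. The exact hash and the 64K table size are assumptions taken from those diagrams.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* rol-5 word hash: rotate the running hash left by 5, add the next char */
static uint32_t rol5_hash(const char *word) {
    uint32_t h = 0;
    for (size_t i = 0; i < strlen(word); i++)
        h = ((h << 5) | (h >> 27)) + (uint8_t)word[i];
    return h & 0xFFFF;   /* index into a 64K-entry hash table */
}

int main(void) {
    printf("hash(\"HELLO\") = 0x%04x\n", rol5_hash("HELLO"));
    return 0;
}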

DFPIM Word Count


[Figure: DFPIM word-count dataflow, stage 1. 8-bit characters are compared against 'a' and 'z' to select the 0xDF upper-casing AND mask, then compared against 'A', 'Z', and the apostrophe to classify word characters; boolean blocks (a & b, a | (b & c), a & ~b, ~a & b) detect word starts and ends, characters are buffered through 1x32 and 2x2 FIFOs, and a rotate-left-by-5 (rol 5) with add accumulates the word hash.]

DFPIM Word Count


[Figure: DFPIM word-count dataflow, stage 2. The word hash, shifted left by 2, addresses a 64Kx8 word store (*8) and a 64Kx32 count store (*32) through rd_adr/rd_dat and wr_adr/wr_dat ports; a sequencer compares the incoming word against the stored word, detects empty slots, increments the word count by 1, and increments the address on a collision.]

DFPIM Benefits

ª Performance/Energy Comparison
  ª Processor: 4 GHz, quad core, 80 Watts (20 per core)
  ª DFPIM: 0.8 GHz, 0.05 Watt per ALU equivalent
  ª x86 clocks measured using embedded performance counters through the PAPI library


Benchmark      x86 Time (us)   DFPIM Time (us)   Speedup   x86 Energy (uJ)   DFPIM Energy (uJ)   Savings
Histogram      1506            328               4.59      30115             197                 99.3%
Word Count     2054            141               14.57     41072             253                 99.4%
FFT (4096) 1   698             82                5.25      13953             164                 98.8%
FFT (4096) 2   698             164               4.26      13953             295                 97.9%
FFT (4096) 3   698             246               2.84      13953             393                 97.2%
FFT (4096) 4   698             328               2.13      13953             524                 96.2%

Trace Based DFG Generation

ª Defining code kernels for acceleration using GPUs or other techniques is often the responsibility of the programmer.
ª Compilers may be used to identify a limited set of kernels for acceleration.
ª In this project, we propose to identify kernels by analyzing execution traces.
ª We use a simple data-mining technique, building a hash table to count how many times a given instruction address is repeated (a minimal sketch follows).
ª Then, we identify kernels based on clusters of instructions with equal, high counts.
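A minimal sketch of that counting step, assuming the trace is simply a sequence of instruction addresses; the table size, hash, probe scheme, and threshold are illustrative.

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 4096   /* power of two for cheap masking */

struct bucket { uint64_t addr; uint64_t count; };
static struct bucket table[TABLE_SIZE];

static void count_addr(uint64_t addr) {
    size_t i = (addr >> 2) & (TABLE_SIZE - 1);   /* simple hash of the address */
    while (table[i].count && table[i].addr != addr)
        i = (i + 1) & (TABLE_SIZE - 1);          /* linear probing on collision */
    table[i].addr = addr;
    table[i].count++;
}

int main(void) {
    /* toy "trace": a 4-instruction loop body executed 1000 times */
    uint64_t loop[] = { 0x400100, 0x400104, 0x400108, 0x40010c };
    for (int it = 0; it < 1000; it++)
        for (int j = 0; j < 4; j++)
            count_addr(loop[j]);

    /* a cluster of addresses with equal, high counts marks a kernel */
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].count >= 1000)
            printf("kernel candidate: 0x%llx (count %llu)\n",
                   (unsigned long long)table[i].addr,
                   (unsigned long long)table[i].count);
    return 0;
}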


Trace Based DFG Generation

ª Execution Trace Example
ª Instruction Cluster Example


ª Dataflow Graph for previous example


[Figure: dataflow graph for the previous example. Input registers edx, ecx, eax, edi, and esi feed ADD, AND, OR, SRLI 1, and SUB nodes with constants 0x1, producing updated registers eax', edi', edx' and temporary t0d; the graph covers the loop iteration and bit-reversal logic.]

Research Areas

ª Addition of more PIM benchmarks
  ª Graph processing, more map-reduce configurations
ª Develop energy, timing, and size models for each DFPIM functional block and interconnect
  ª Work with synthesis and silicon vendors for values
ª Develop a DFPIM simulator
  ª Verify accuracy of DFPIM configurations, calculate timing, compute energy estimates
ª Continue the trace-based DFG generation effort
  ª Improve recognition of kernels
  ª Automate generation of DFGs from instructions


Conclusions

ª The DFPIM concept has been defined and its potential has been evaluated
ª The benefits of DFPIM are dramatic, especially in energy
  ª Speedups of 4.6, 5.2, and 14.5
  ª Energy savings of 96.2% to 99.4%
ª Significant detailed research remains
