
Radium: Race-free On-demand Integrity Measurement Architecture

Srujan Kotikela

Supported by the NCSS Industry/University Cooperative Research Center (IUCRC)

Introduction

› Integrity Measurement
– A measurement represents the state/behavior of an entity
– Integrity measurement acts as a basis for trust

› Trusted Computing
– An entity can be trusted if it always behaves in the expected manner for the intended purpose (Trusted Computing Group)
– Trusted computing is the basis for modern secure systems

› Radium (this talk)
– Eliminate the TOCTTOU (time-of-check-to-time-of-use) condition
– Allow concurrent untrusted/trusted services
– More semantic and efficient measurements


Background

› A hardware-based root of trust is required for trustworthy measurements

› To verify the trustworthiness of a computing platform:
– Start with immutable hardware and measure sequentially
– Measure each component with an already-measured component, establishing a chain of trust

› Existing solutions to establish a transitive chain of trust:
– SRTM: static root of trust for measurement
– DRTM: dynamic root of trust for measurement

› Both SRTM and DRTM operate under the axiom:
– Measured components do not change after measurement
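As a concrete illustration, such a chain of trust is built by hashing each component and folding the result into a register before handing control over. The sketch below uses a toy 64-bit FNV-1a hash as a stand-in for the TPM's SHA-1/SHA-256 PCR-extend operation; the component names and the hash itself are illustrative assumptions, not Radium's implementation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* toy FNV-1a hash: stand-in for the TPM's real SHA-based extend */
static uint64_t fnv1a(uint64_t h, const void *buf, size_t len) {
    const uint8_t *p = buf;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
}

/* extend: PCR_new = H(PCR_old || measurement(component)) */
static uint64_t extend(uint64_t pcr, const char *component) {
    uint64_t m = fnv1a(0xcbf29ce484222325ULL, component, strlen(component));
    return fnv1a(pcr, &m, sizeof m);
}

int main(void) {
    uint64_t pcr = 0;   /* root: a known initial value in immutable hardware */
    const char *chain[] = { "BIOS", "bootloader", "VMM", "OS", "application" };
    for (int i = 0; i < 5; i++)
        pcr = extend(pcr, chain[i]);   /* each stage is measured before launch */
    printf("final PCR value: %016llx\n", (unsigned long long)pcr);
    return 0;
}

A verifier that replays the same measurements and arrives at the same final value can trust every link in the chain; under the axiom above, any post-measurement change goes undetected, which is exactly the TOCTTOU gap Radium targets.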

Static Root of Trust for Measurement (SRTM)

• Immutable ROM measures and launches a verified BIOS
• Verified software measures, verifies, and launches the next software in the chain
• Trust in the application is guaranteed only at the time of launch
• Requires a restart to launch an MLE (Measured Launch Environment)

[Figure: SRTM timeline. Hardware (immutable ROM) measures the system software (BIOS + bootloader + VMM + OS) at T_boot; the system software measures the application at T_launch; no measurement occurs at T_use, even though the application runs in the measured launch environment. T = time of, M = measurement of.]

Dynamic Root of Trust for Measurement (DRTM)

[Figure: DRTM timeline. Hardware (immutable ROM) measures the Authenticated Code Module (ACM), which measures the application at T_launch, creating the measured launch environment while the untrusted world is suspended; no measurement occurs at T_use. T = time of, M = measurement of.]

• DRTM is dynamic and can be invoked at any time (including boot time)
• Suspends the current environment and creates an isolated environment
• The chain of trust may contain a vendor-specific ACM
• Trust in the application is guaranteed at the time of launch
• Requires a reset to launch the MLE


Radium: Goals

› Extend existing trust technology (DRTM) to provide on-demand measurements

› Use measuring services for TOCTTOU-free measurements

› Provide efficient and semantically rich measurements

› Allow more than one measured environment to coexist and cooperate

› Contain an Access Control Policy that controls accesses between all trusted and untrusted environments


Radium: Architecture

[Figure: Radium architecture. Trusted hardware (CPU + TPM) performs a verified launch of a trusted hypervisor, which acts as the Asynchronous Root of Trust for Measurement. Under the hypervisor, a measuring service measures the target VM, with all interactions mediated by an Access Control Policy Module.]

Radium: Architecture

[Figure: the same architecture with a user/client. The user/client requests a measurement of the target VM through the trusted hypervisor and receives a "Verified" result.]

Radium: Implementation

[Figure: Radium prototype. Intel TXT hardware (CPU + TPM) performs a DRTM boot and verified launch of Xen, which acts as the Asynchronous Root of Trust for Measurement. Xen hosts a target VM (Ubuntu 10.04 with the kbeast rootkit) and a measuring VM (Ubuntu 12.04 with libVMI and Volatility), with accesses mediated by the Xen Security Module; measurements are extended into the TPM.]

Radium: Threat Model

Radium: Security

› The VMM is verified and trusted using DRTM

› The trusted VMM acts as the Asynchronous Root of Trust for Measurement (ARTM)

› All environments (trusted and untrusted) are isolated from one another by the trusted VMM's hardware isolation

› The trustworthiness of all measurements is ensured by the ARTM

› All accesses are protected by the trusted Mandatory Access Control policy within the VMM

Radium provides security against all of the considered attack types except offline attacks on hardware

Related Work

› Concurrent MLEs
– Intel SGX (McKeen, Frank et al. 2013 [15]), Concurrent Secure Worlds (Ramya Masti et al. 2013 [7])

› Trusted VMM
– TrustVisor (McCune et al. 2010 [6]), Terra (Garfinkel et al. 2003 [9])

› VMM security
– Xoar (Colp, Patrick et al. 2011 [14]), NoHype (E. Keller et al. 2010 [10]), ELI (Abel Gordon et al. 2013 [13])

› Integrity Measurement
– ReDAS [16], Automated security debugging [17] (Chongkyung Kil et al. 2009)

Future Directions

› VMware
– Porting Radium to VMware using the vProbes interface

› Intel SGX
– Hardware-only asynchronous root of trust for measurements

› Invariants
– Using an application's properties (invariants) to determine the security state of the application


Application Behavior: Invariants

› Properties of an application that must hold at a certain point during its execution

› An invariant can be a fixed value or an enumerated set

› Data invariants:
– Properties of individual variables or relations among a group of variables
– Examples: equality/inequality invariants; constant and original-value invariants

› Structural invariants:
– Program rules that have to be true at run time
– The return address on the stack should always point to the code section of memory
– The frame pointer of a stack frame shouldn't change during function execution
– Similar constraints apply to the heap and other sections of program memory

Invariants for Measurements

› Useful in debugging, but often undocumented; developers are frequently unaware these invariants exist

› Invariants are instance specific
– Need to collect multiple sets of invariants (a training phase)

› Security-sensitive invariants
– Which and how many invariants affect the security of the application?

› Extracting invariants (a minimal sketch follows)
– Instrument the application with canary values
– The canary value is used to monitor the return-address constraint
– Study the difference in invariants between normal execution and exploited execution
– Daikon is used to produce possible invariants
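A minimal sketch of the canary idea under stated assumptions: a fixed sentinel of 1000 (matching the Prozilla invariants on the next slide) placed in a function's frame and checked on exit. Real instrumentation is compiler-inserted and positions the canary between buffers and the saved return address; this stand-alone version is only illustrative.

#include <stdio.h>
#include <stdlib.h>

#define CANARY 1000   /* ::canary == 1000 (entry) */

static int process(const char *input) {
    volatile int canary = CANARY;   /* data invariant established on entry */
    char buf[16];
    (void)input; (void)buf;         /* ... real work on buf would go here ... */
    if (canary != CANARY) {         /* ::canary == orig(::canary) must hold on exit */
        fprintf(stderr, "invariant violation: canary clobbered\n");
        abort();                    /* a violation signals an exploited execution */
    }
    return 0;
}

int main(void) { return process("hello"); }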

Daikon

› Ghttpd 1.4

.count_vhosts():::ENTER                          // function entry
::SERVERPORT == 80
::SERVERROOT == "/usr/local/ghttpd"
::SERVERTYPE == "Standalone"
::no_vhosts == 0
::vhosts == null
==================================
..count_vhosts():::EXIT                          // function exit
::SERVERPORT == orig(::SERVERPORT)
::no_vhosts == return
::no_vhosts == orig(::no_vhosts)
::vhosts == orig(::vhosts)
::defaulthost == orig(::defaulthost)
::SERVERPORT == 80

› Prozilla 1.3.7

::canary == 1000                                 (entry)
::canary == 1000                                 (exit)
::canary == orig(::canary)                       (exit)
========================
::connections[].http_sock elements < ::canary    (exit)
::rt has only one value                          (entry)
::rt == orig(::rt)                               (exit)
==================================
::rt.num_connections == 4                        (entry)
::rt.ftps_mirror_req_n < orig(::canary)          (exit)

Application Integrity Verification

› The target VM's memory is parsed for (canary) variable values and stack constraints

› Look for suspicious write behavior on the stack

› Details of observed violations are saved to TPM PCRs

› Determine the security state of the application by measuring security invariants

› Detect code-reuse attacks, which are hard to detect with conventional techniques

Conclusions

› On-demand measurements are necessary to overcome the TOCTTOU condition in integrity measurements

› Using a minimal-TCB hypervisor as a root of trust for measurement (ARTM) is a viable replacement for hardware DRTM

› Semantically rich, fine-grained measurements are possible with ARTM

› Zero downtime for environments and powerful security applications can be achieved with the Radium architecture

[email protected]


References

1. RADIUM: Race-free On-demand Integrity Measurement. Srujan Kotikela, Mahadevan Gomathisankaran, Tawfiq Shah, Gelareh Taban.
2. Trusted computing using AMD "Pacifica" and "Presidio" secure virtual machine technology; Geoffrey Strongin, Advanced Micro Devices, Inc.
3. BIOS chronomancy: fixing the core root of trust for measurement; John Butterworth, Corey Kallenberg, Xeno Kovah, Amy Herzog.
4. Trusted Boot: Verifying the Xen Launch; Joseph Cihula.
5. Flicker: an execution infrastructure for TCB minimization; Jonathan M. McCune, Bryan J. Parno, Adrian Perrig, Michael K. Reiter, Hiroshi Isozaki.
6. TrustVisor: Efficient TCB Reduction and Attestation; Jonathan M. McCune, Yanlin Li, Ning Qu, Zongwei Zhou, Anupam Datta, Virgil Gligor, Adrian Perrig.
7. An architecture for concurrent execution of secure environments in clouds; Ramya Jayaram Masti, Claudio Marforio, Srdjan Capkun.
8. Copilot - a coprocessor-based kernel runtime integrity monitor; Nick L. Petroni, Jr., Timothy Fraser, Jesus Molina, William A. Arbaugh.
9. Terra: a virtual machine-based platform for trusted computing; Tal Garfinkel, Ben Pfaff, Jim Chow, Mendel Rosenblum, Dan Boneh.
10. NoHype: virtualized cloud infrastructure without the virtualization; E. Keller, J. Szefer, J. Rexford, R. B. Lee.
11. Building a MAC-based security architecture for the Xen open-source hypervisor; Sailer, R.; Jaeger, T.; Valdez, E.; Caceres, R.; Perez, R.; Berger, S.; Griffin, J. L.; van Doorn, L.
12. KVM: Hypervisor Security You Can Depend On; George Wilson, Michael Day, Beth Taylor.
13. ELI: Bare-Metal Performance for I/O Virtualization; Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, Dan Tsafrir.
14. Breaking up is hard to do: security and functionality in a commodity hypervisor; Colp, Patrick; Nanavati, Mihir; Zhu, Jun; Aiello, William; Coker, George; Deegan, Tim; Loscocco, Peter; Warfield, Andrew.
15. Innovative Instructions and Software Model for Isolated Execution; McKeen, Frank; Alexandrovich, Ilya; Berenzon, Alex; Rozas, Carlos V.; Shafi, Hisham; Shanbhogue, Vedvyas; Savagaonkar, Uday R.
16. ReDAS: Remote Attestation to Dynamic System Properties. Chongkyung Kil, Emre C. Sezer, Ahmed M. Azab, Peng Ning, Xiaolan Zhang.
17. Automated security debugging using program structural constraints. Chongkyung Kil, Emre C. Sezer, Peng Ning, Xiaolan Zhang.
18. The Daikon system for dynamic detection of likely program invariants. Michael D. Ernst, J. H. Perkins, P. J. Guo, C. Xiao.


NEMESIS - Automated Architecture for Threat Modeling and Risk Assessment for Cloud Computing (UNT-15-4-1)

Project Lead: Krishna Kavi, UNT; Mahadevan Gomathisankaran (Microsoft)
Date: April 8, 2015

Problem Statement

› Why is this research needed?
– To address the need for a comprehensive solution for cloud security threat modeling which incorporates the vulnerability assessment process, and then offers an actionable risk analysis tool
– A quantitative assessment can be used to negotiate security SLAs

› What are the specific problems to be solved?
– What are the types of threats facing cloud assets?
– Is there any scale to indicate the threat level?
– Is there any metric to characterize critical vulnerabilities facing the cloud's assets?
– Is it possible to predict the number of latent vulnerabilities that are not yet revealed?
– Is it possible to recommend an alternative configuration of the cloud's assets to reduce the currently perceived risk?


Project Description

› How will this project approach the problem?
– Use the STRIDE model to identify threat types
– Create ontologies of vulnerabilities, attacks, and defenses
– Use a Bayesian probability model to estimate risk for threats using the ontologies
– Use the ontologies to suggest alternate configurations that can minimize risk
– Explore ideas similar to software maturity for predicting latent vulnerabilities

› Preliminary results:
– The following papers describe our preliminary work and the feasibility of our approach:

› P. Kamongi, S. Kotikela, K. Kavi, M. Gomathisankaran and A. Singhal. "VULCAN: Vulnerability assessment framework for Cloud computing", Proceedings of the IEEE 7th International Conference on Software Security and Reliability, June 18-20, 2013, Washington, DC.

› P. Kamongi, M. Gomathisankaran, K. Kavi. "Nemesis: Automated architecture for threat modeling and risk assessment for cloud computing", The 6th ASE International Conference on Privacy, Security, Risk and Trust (PASSAT-2014), Dec. 13-16, 2014, Cambridge, MA, USA.

– We have developed a limited prototype of our Nemesis Architecture and it is used to assess the risk of any type of Software as a Service (SaaS) application running on top of an OpenStack Infrastructure as a Service (IaaS).


Assessing security of Cloud computing environments

§ Motivation
– Can we quantify security risks?
– Can such a measure be used to negotiate different security SLAs?
– Can such a measure be used to implement different types of security solutions?

§ How we approach this
– Classify the types of threats facing cloud assets
– Classify known vulnerabilities based on the types of threats possible
– Develop models for assigning risk probabilities to vulnerabilities, based on:
  § Existence of actual attacks
  § Existence of mitigations (or patches)
  § Significance of vulnerabilities
– Use Bayesian probability models to compute overall risk (a minimal sketch follows this list)
– Use our previous work on ontologies for vulnerabilities, attacks, and defenses
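A hedged sketch of the weighted Bayesian aggregation idea; every prior, likelihood, and weight below is a hypothetical placeholder, not a value from the Nemesis ontologies.

#include <stdio.h>

/* P(threat | evidence) = P(e|t)P(t) / (P(e|t)P(t) + P(e|~t)P(~t)) */
static double posterior(double p_t, double p_e_t, double p_e_not_t) {
    return (p_e_t * p_t) / (p_e_t * p_t + p_e_not_t * (1.0 - p_t));
}

int main(void) {
    /* one entry per STRIDE threat type (all numbers hypothetical):
       prior, P(evidence|threat), P(evidence|no threat), weight */
    struct { const char *name; double p, e_t, e_nt, w; } stride[] = {
        { "Spoofing",               0.10, 0.80, 0.20, 0.15 },
        { "Tampering",              0.15, 0.70, 0.25, 0.20 },
        { "Repudiation",            0.05, 0.60, 0.30, 0.10 },
        { "Information disclosure", 0.20, 0.85, 0.15, 0.25 },
        { "Denial of service",      0.25, 0.75, 0.20, 0.15 },
        { "Elevation of privilege", 0.10, 0.90, 0.10, 0.15 },
    };
    double risk = 0.0;
    for (int i = 0; i < 6; i++) {
        double post = posterior(stride[i].p, stride[i].e_t, stride[i].e_nt);
        risk += stride[i].w * post;   /* weighted aggregate severity */
        printf("%-22s posterior = %.3f\n", stride[i].name, post);
    }
    printf("aggregated risk = %.2f%%\n", 100.0 * risk);
    return 0;
}

With real ontology-derived evidence in place of the placeholders, this kind of aggregation yields a single severity percentage of the kind reported on the Nemesis example slide below.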


Nemesis Architecture


Nemesis Example

An aggregated risk estimated at 31.93% severity


Nemesis – Suggested Configurations to Reduce Perceived Risk

› An aggregated new risk estimated at 25.88% severity


Predicting Hidden Vulnerabilities

Current and Future Research

Predicting "hidden" vulnerabilities, including 'zero-day':
– Use software complexity models to predict hidden vulnerabilities
– Use the rate of patches and major new releases

OpenSSL Release                  #Known Vulnerabilities   #Predicted Vulnerabilities
cpe:/a:openssl:openssl:0.9.8h    47                       51.92
cpe:/a:openssl:openssl:0.9.7h    33                       32.73
cpe:/a:openssl:openssl:1.0.1g    25                       26.06
cpe:/a:openssl:openssl:0.9.6e    38                       35.95
cpe:/a:openssl:openssl:0.9.8b    51                       50.15

– Extend the vulnerability ontology database
– Use security threat intelligence reports

Connection to NCSS Competencies/Capabilities

[Table: mapping of the project to NCSS competencies/capabilities, marked primary, secondary, or tertiary.]

Deliverables

Summary of the 3 most significant deliverables expected at the end of Year 1.


Deliverable 1: Detailed report on our Ontologies and Vulcan framework
Deliverable 2: Bayesian model used to define threat probabilities
Deliverable 3: Demonstrations to show the capabilities of NEMESIS

Project Differentiators

› What results does this project seek that are different from (better than) others?
– To the best of our knowledge, we are the first group to propose an automated risk assessment architecture for cloud computing, which in turn enables us to deliver actionable intelligence regarding the threats and risks facing any cloud's assets.
– We are also among the first to use a software-maturity-style approach to predict the number of vulnerabilities in software products.

› What specific innovations or insights are sought by this research that distinguish it from related work?
– Representing the existing knowledge – vulnerabilities, attacks, defenses, and configurations – in a meaningful and efficient manner
– Automated use of the knowledge to assess the risks
– Automated suggestion of risk mitigation strategies
– Estimation of hidden vulnerabilities


Potential Member Company Benefits

› What specific benefits are sought for the industry members?
– Our framework can be utilized by small, medium, and large corporations with an interest in creating private or hybrid cloud systems, or migrating to public cloud systems, to assess the potential security threats and risk levels.

› What leverage does the research provide to industry member R&D plans?
– The framework can be expanded into a web service, leading to commercialization of the service.


Sponsorship and Collaboration

› Efforts to involve multiple companies in project sponsorship:
– Boeing and Firehost have expressed interest in this project

› Efforts to involve multiple university collaborators in the project:
– Exploring the possibility of collaborating with UTD



Title of New Project: Processing in Memory for Big Data Application (UNT-14-10-1)

Project Lead: Krishna Kavi, UNT
Date: August 2015

Problem Statement

› Why is this research needed?

– 3D stacked DRAMs contain a logic layer

– Can we embed simple processing elements in that layer?

– What computations should we move to the Processing-In-Memory cores?

› What is the specific problem to be solved?

– What applications benefit from PIMs?

– What should be the architecture of PIMs?

– What are the performance and energy advantages of using PIMs?

– Do we need new programming models?

– New memory system organization?

– Interconnect networks?


Project Overview

Tasks¹ (see the task table below):

Research Goals:

1. Develop energy efficient PIM cores

2. Investigate the impact of parallel overhead on number of cores and frequency

3. Understand heterogeneous memory systems

Benefits to Industry Partners:

1. Primary benefits are for processor and memory system designers like AMD, Intel, TI

2. Cloud applications may execute more efficiently on proposed systems

Project Milestones²:

Task#   Planned Completion   Milestone (Deliverable)
1       01/15                Analyze emerging Scale Out applications for common functionalities
2       04/15                Develop models for estimating number and nature of PIM cores
3       09/15                Preliminary designs and simulations of PIM
4       10/15                Flat Address Memory Architecture data collection


¹ Legend: the task has been approved by IAB sponsor(s), or the task is a deviation from the original sponsor-approved task (why?). See the notes section of this slide for more information.

Task#   Task Description
1       Analyze emerging Scale Out applications for common functionalities
2       Develop models for estimating number and nature of PIM cores
3       Preliminary designs and simulations of PIM

² Legend: the milestone is complete or on track for its planned completion date, or the milestone has changed from the original sponsor-approved date (why?).


Progress to Date and Accomplishments

1. Analyze emerging Scale Out applications for common functionalities
   Status: some analysis is complete, but inconclusive; needs further investigation.

2. Develop models for estimating number and nature of PIM cores
   Status: completed models for ARM cores.

3. Preliminary designs and simulations of PIM
   Status: completed for ARM cores and 4 MapReduce benchmarks; continuing on dataflow simulations.

4. Data collection for Flat Address Memory Architecture
   Status: developed a heterogeneous memory system simulator; collected traces for SPEC2006, Graph500, and MapReduce benchmarks; collected data on HMA system tradeoffs for 2- and 3-level memory systems. Targeting HPCA-2016.

(Status legend: significant finding/accomplishment; task complete; task partially complete; task not started.)

Project Pictorial

[Figure: the host issues requests over an abstract load/store interface to PIM and DRAM controllers, which drive the memory dies (3D-DRAM + DDR + PCM) over timing-specific DRAM interfaces.]

Flat Address Memory Architecture - FLAME

› Large-scale workloads: big data analysis, graph analysis, in-memory databases, HPC, etc.
– Easily exceed memory capacity
– Single-node performance is limited by the memory wall
  › Specifically disk latency (HDD or SSD)
  › Limited bandwidth -> less concurrency -> lower throughput
– The memory system consumes most of the system's power


Flat Address Memory Architecture - FLAME

› Motivation

                    3D-DRAM    DDR4        PCM                                     Flash
Latency             40 ns      60 ns       read comparable to DDR4; write 4x-8x    ~25 us
Bandwidth           160 GB/s   25.6 GB/s   25.6 GB/s (?)                           500 MB/s
Read access energy  8 pJ/bit   30 pJ/bit   write = 4x DDR4                         -

• Use as much 3D-DRAM as possible?
  • High bandwidth and low latency
  • Problem is limited capacity (a couple of GBs)
• Use PCM as secondary memory?
  • Read latency comparable to DDR
  • Problems are high write energy and limited write endurance
• The page swapping process still has some overhead

Goal

› Provide sufficient memory capacity to eliminate hard page faults

› Provide high bandwidth and low latency for critical data

› Use 3D-DRAM, DDR, and PCM as part of main memory
– Heterogeneous memory (flat memory)

› Page placement/replacement is critical
– Mark pages which are frequently accessed as "HOT"
– The page migration policy is activated every 0.1 s (an EPOCH)
– The hotness threshold is set to 32 accesses in 1 epoch

› We explore page migration policies, energy consumption, overhead, and application behavior for different memory system organizations (a minimal sketch of the hotness policy follows)
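A minimal sketch of the epoch-based hot-page policy just described; the page count, counters, and migration action are illustrative simulator scaffolding, not the actual FLAME implementation.

#include <stdio.h>

#define NPAGES        1024
#define HOT_THRESHOLD 32      /* accesses per epoch, per the policy above */

static unsigned access_count[NPAGES];
static int is_hot[NPAGES];

/* called on every memory access (e.g., from a simulator trace) */
static void touch(int page) { access_count[page]++; }

/* called once per 0.1 s epoch */
static void end_of_epoch(void) {
    for (int p = 0; p < NPAGES; p++) {
        is_hot[p] = access_count[p] >= HOT_THRESHOLD;
        if (is_hot[p])
            printf("page %d is HOT -> migrate to 3D-DRAM\n", p);
        /* pages that cooled off would be demoted back to DDR/PCM here */
        access_count[p] = 0;   /* restart counting for the next epoch */
    }
}

int main(void) {
    for (int i = 0; i < 40; i++) touch(7);   /* page 7 crosses the threshold */
    for (int i = 0; i < 5; i++)  touch(3);   /* page 3 stays cold */
    end_of_epoch();
    return 0;
}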


Experiments and Results

› 3D-DRAM + DDR -> what should be the size of the 3D-DRAM (compared to the total memory footprint) in order for transferring to make sense?

› 3D-DRAM + DDR + PCM -> how can we refine the transfer policy to take advantage of locality and access frequency, to minimize overhead and maximize performance and energy efficiency?

› What is the overhead associated with page transfer?
– Cache flushing, TLB shootdown, DMA transfer/software copy

› LLC line locking -> can we lock the most heavily used data in the LLC in order to minimize accesses to main memory?


Results – 3D-DRAM size to memory footprint ratio

[Chart: CPI improvement in %, normalized to the respective no_transfer policy, for benchmark mixes small_gc_lq_gc_lq, medium_om_xl_lq_gc, large_lb_ml_so_zs, and very_large_mc_bw_gm_ca at 3D-DRAM-to-footprint ratios 2:3, 1:2, 1:4, 1:8, and 1:16.]

Results

[Charts: total execution time improvement in % and improvement in total energy consumption in %, both compared to the respective no_transfer policy, at ratios 2:3, 1:2, 1:4, 1:8, and 1:16.]

Results

[Charts: overhead time as a percentage of the respective total execution time, and total pages transferred, at ratios 2:3, 1:2, 1:4, 1:8, and 1:16.]

Results – Page Transfer Overhead

› TLB shootdown, cache flush and invalidation

› Do we want to transfer using DMA or use software copying?

› DMA is up to 12% faster but can consume up to 30% more energy

› Need to verify results for multi-programmed workloads


Results – Cache locking

USER PROGRAM L3 MISSES (MPKI)

Config           bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  7.27    3.71       17.14     23.00       23.78  14.43  1.91    1.29       5.81
1w_1h_1l_data    7.24    3.89       17.12     22.38       21.97  14.41  1.90    1.22       5.80
1w_2h_1l_data    7.25    3.76       17.13     22.37       22.06  14.43  1.89    1.27       5.80
1w_2h_2l_data    7.25    3.83       17.13     22.37       21.99  14.42  1.89    1.28       5.80
2w_1h_1l_data    7.23    4.03       17.11     21.77       21.19  14.41  1.91    1.16       5.80
2w_2h_2l_data    7.25    3.85       17.12     21.80       21.48  14.42  1.89    1.24       5.78
4w_1h_1l_data    7.23    4.08       17.10     20.62       20.54  14.41  1.94    1.16       5.78
lock_preemp_2D   7.25    3.85       17.12     21.80       21.48  14.42  1.89    1.24       5.78

Results – Cache locking

TOTAL EXECUTION TIME [s]

Config           bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  26.46   20.67      54.10     49.31       65.81  39.35  19.49   24.57      24.48
1w_1h_1l_data    26.43   20.63      54.08     48.69       63.97  39.34  19.48   24.50      24.48
1w_2h_1l_data    26.44   20.65      54.09     48.69       64.07  39.35  19.48   24.55      24.47
1w_2h_2l_data    26.44   20.61      54.09     48.69       63.99  39.34  19.48   24.56      24.47
2w_1h_1l_data    26.43   20.61      54.06     48.08       63.16  39.34  19.49   24.44      24.48
2w_2h_2l_data    26.44   20.59      54.08     48.11       63.57  39.34  19.48   24.52      24.47
4w_1h_1l_data    26.42   20.61      54.05     46.93       62.79  39.34  19.51   24.44      24.48
lock_preemp_2D   26.44   20.59      54.08     48.11       63.57  39.34  19.48   24.52      24.47

TOTAL DYNAMIC ENERGY [J]

Config           bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  9.319   3.774      27.84     3.723       23.35  24.97  0.858   0.321      4.514
1w_1h_1l_data    9.315   3.769      27.88     3.636       22.92  25.01  0.862   0.313      4.517
1w_2h_1l_data    9.312   3.767      27.84     3.634       22.95  24.97  0.867   0.319      4.519
1w_2h_2l_data    9.312   3.765      27.85     3.635       22.91  24.97  0.867   0.320      4.519
2w_1h_1l_data    9.316   3.769      27.86     3.549       22.70  25.03  0.860   0.306      4.530
2w_2h_2l_data    9.312   3.760      27.85     3.555       22.96  24.97  0.867   0.315      4.529
4w_1h_1l_data    9.316   3.768      27.89     3.389       22.98  25.05  0.850   0.307      4.546
lock_preemp_2D   9.312   3.760      27.84     3.555       22.95  24.97  0.867   0.315      4.529

Connection to NCSS Competencies/Capabilities

[Table: mapping of the project to NCSS competencies/capabilities, marked primary, secondary, or tertiary.]

Efforts to Seek Additional Sponsorships and Collaborations

Were collaborations sought with researchers at other institutions to broaden research?

Were attempts made to leverage the research to obtain additional funding from companies or government agencies?

Were student researchers subsequently employed or given internships with a sponsor as a result of their work on the project?

› Exploring collaborations with AMD and IBM

› Seeking additional support from ARL

Objective Evidence Supporting NCSS Value Proposition

Papers, Publications, Presentations/Venue:
1. Mahzabeen Islam, Marko Scrbak, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. "Improving Node-Level MapReduce Performance Using Processing-in-Memory Technologies." In Euro-Par 2014: Parallel Processing Workshops, pp. 425-437. Springer International Publishing, 2014.
2. Marko Scrbak, Mahzabeen Islam, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. "Processing-in-Memory: Exploring the Design Space." In the 28th International Conference on the Architecture of Computer Systems (ARCS-2015), March 24-27, 2015, Porto, Portugal.

Products (Software, Hardware, Data, Designs, etc.):
1. Gem5 implementation of ARM cores as PIMs
2. McPAT models of energy for PIM and 3D DRAM

Student Placements:

Other:
1. Exploring collaborations with ARL

Dataflow Processing in Memory (DFPIM) Using Coarse Grain Reconfigurable Logic (CGRL)

Charles F. Shelor
August 10, 2015

Outline

ª What is dataflow?
ª What is processing in memory?
ª What is Coarse Grain Reconfigurable Logic?
ª What is DFPIM?
ª Examples of DFPIM
ª Performance and energy benefits of DFPIM
ª Trace-based DFG generation
ª Research areas
ª Conclusions


Dataflow

ª Style of computation
  ª Data flows from operation to operation
  ª An operation is performed when all of its data values arrive
  ª Highly parallel, self-synchronizing
ª Has been studied since the 1960s
ª The overhead of tracking data availability is the major obstacle to mainstream usage
ª Q = (X+Y)*(A+B)
ª R = (X-Y)/(A+B)
ª 5 operations, 2 'cycles'
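A trivial C rendering of that two-cycle schedule (input values arbitrary): the two additions and the subtraction fire as soon as X, Y, A, B are available, and the multiply and divide fire once those results arrive.

#include <stdio.h>

int main(void) {
    double X = 8, Y = 2, A = 3, B = 1;

    /* 'cycle' 1: +, -, + all fire in parallel (all inputs ready) */
    double s1 = X + Y, s2 = X - Y, s3 = A + B;

    /* 'cycle' 2: * and / fire once s1, s2, s3 have arrived */
    double Q = s1 * s3, R = s2 / s3;

    printf("Q = %g, R = %g  (5 operations, 2 'cycles')\n", Q, R);
    return 0;
}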


[Figure: dataflow graph for the example. Inputs X, Y, A, B feed +, -, and + nodes, whose results feed * and / to produce (X+Y)*(A+B) and (X-Y)/(A+B).]

Processing in Memory

ª Applications not well suited to caches
  ª Data fetched from memory, processed once, not accessed again
  ª Memory-bound, rather than CPU-bound, jobs
  ª Examples include streaming tasks, big data, etc.
    ª Text document word counting, image histogram, mp3
  ª Potentially suitable for non-uniform access patterns
    ª FFT, graph processing
ª Move processing closer to memory
  ª Higher bandwidth to memory, bypass caches
ª 3D stacked memory with logic layer interposer
  ª Requires low power, small size, simple algorithms


Coarse Grain Reconfigurable Logic

ª CGRL is a set of high-level functional blocks with run-time programmable connections
ª ALU, LD/ST, memory, multiplier, divider, sequencer, floating point add/sub, FP multiplier, FP divider
ª Similar to hard macros in FPGAs
  ª Processor, DSP elements, block memory
ª Each functional block is faster and lower power than the same function built from programmable gates in a standard FPGA implementation
ª Overall connection routing is much simpler, as there are fewer elements to interconnect and a more regular pattern


Coarse Grain Reconfigurable Logic

[Figure: CGRL fabric. Functional units (ALUs, sequencer, memory, LD/ST units, multiplier, divider), each with inputs I1/I2 and output O, connected through a programmable interconnect network.]

Dataflow PIM

ª Uses CGRL on the 3D RAM logic layer
ª Configures functional blocks into a dataflow graph to implement the PIM application
ª Parallel, pipelined blocks provide multiple operations per clock and typically require only 1 clock per item processed after the pipeline fills
ª No instruction fetch or decode, no instruction window or reorder buffers, no cache hierarchy, slower clock -> lower power with respect to an out-of-order processor


PIM Image Histogram Kernel

ª Code (below):
ª Compiles to 40 x86 instructions
ª Requires 23 clocks per pixel
ª 2.17 IPC (micro-ops per clock)


/* 'pix' is not defined on the slide; a 3-byte RGB struct is assumed */
typedef struct { unsigned char r, g, b; } pix;

void histogram(pix image[], int size, int red[], int grn[], int blu[])
{
    int pxl;
    int rd, gr, bl;

    for (pxl = 0; pxl < size; pxl++) {
        rd = (int) image[pxl].r;
        gr = (int) image[pxl].g;
        bl = (int) image[pxl].b;
        red[rd]++;
        grn[gr]++;
        blu[bl]++;
    }
}

DFPIM Image Histogram

[Figure: DFPIM histogram dataflow. A 24-bit load streams pixels; shifts by 16 and 8 plus AND masks with 0xFF extract the red, green, and blue bytes; each byte addresses a read-modify-write memory block (rd_adr/rd_dat, wr_adr/wr_dat) whose current count is incremented by 1 and stored back into the red, grn, or blu histogram.]

DFPIM Code


<!-- Histogram map DF implementation -->

<LDST instance="LDST0", size="24">

<!-- clock 1: shift r, g data; delay b data -->
<IALU instance="red", in_0="LDST0.data", in_1="immed", immed="16", funct="srl" >
<IALU instance="grn", in_0="LDST0.data", in_1="immed", immed="8", funct="srl" >
<DLY instance="blu", in="LDST0.data", latency="1" >

<!-- clock 2: mask values to 8 bits each -->
<IALU instance="red2", in_0="red.data", in_1="immed", immed="0xff", funct="and", size="8" >
<IALU instance="grn2", in_0="grn.data", in_1="immed", immed="0xff", funct="and", size="8" >
<IALU instance="blu2", in_0="blu.data", in_1="immed", immed="0xff", funct="and", size="8" >

<!-- clock 3: read current histogram counts, increment, store -->
<MEM512b instance="red_hist", rd_adrs="red2.data", rd_enable="1", rd_mode="async",
         wr_adrs="red2.data", wr_data="red_incr.data", wr_mode="sync" >
<MEM512b instance="grn_hist", rd_adrs="grn2.data", rd_enable="1", rd_mode="async",
         wr_adrs="grn2.data", wr_data="grn_incr.data", wr_mode="sync" >
<MEM512b instance="blu_hist", rd_adrs="blu2.data", rd_enable="1", rd_mode="async",
         wr_adrs="blu2.data", wr_data="blu_incr.data", wr_mode="sync" >
<IALU instance="red_incr", in_0="red_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >
<IALU instance="grn_incr", in_0="grn_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >
<IALU instance="blu_incr", in_0="blu_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >

Word Count Kernel


for (i = 0; i < buffsize; i++)                 // make upper case
    if (fdata[i] >= 'a' && fdata[i] <= 'z')
        fdata[i] = fdata[i] & 0xdf;

i = 0;
while (i < buffsize) {
    while (i < buffsize && (fdata[i] < 'A' || fdata[i] > 'Z'))
        i++;                                   // skip non-alpha characters
    uint64_t start = i;
    while (i < buffsize && ((fdata[i] >= 'A' && fdata[i] <= 'Z')
                            || fdata[i] == '\''))
        i++;                                   // find next non-alpha
    if (i > start) {                           // isolate word
        fdata[i] = '\0';
        char* word = (char*) malloc((i - start + 1) * sizeof(char));
        int x = 0;
        while (x < (i - start)) {
            word[x] = fdata[start + x];
            x++;
        }
        word[x] = '\0';
        emit(word);                            // compute hash, incr count
    }
}
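emit() is left undefined on the slide; here is a hedged sketch of what its hashing step might look like, using the rotate-left-by-5 ('rol 5') block shown in the DFPIM diagrams that follow. The exact hash and the 64K table size are assumptions taken from those diagrams.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* rol-5 word hash: rotate the running hash left by 5, add the next char */
static uint32_t rol5_hash(const char *word) {
    uint32_t h = 0;
    for (size_t i = 0; i < strlen(word); i++)
        h = ((h << 5) | (h >> 27)) + (uint8_t)word[i];
    return h & 0xFFFF;   /* index into a 64K-entry hash table */
}

int main(void) {
    printf("hash(\"HELLO\") = 0x%04x\n", rol5_hash("HELLO"));
    return 0;
}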

DFPIM Word Count


[Figure: DFPIM word-count dataflow, stage 1. 8-bit characters are compared against 'a' and 'z' to select the 0xDF upper-casing AND mask, then compared against 'A', 'Z', and the apostrophe to classify word characters; boolean blocks (a & b, a | (b & c), a & ~b, ~a & b) detect word starts and ends, characters are buffered through 1x32 and 2x2 FIFOs, and a rotate-left-by-5 (rol 5) with add accumulates the word hash.]

DFPIM Word Count


[Figure: DFPIM word-count dataflow, stage 2. The word hash, shifted left by 2, addresses a 64Kx8 word store (*8) and a 64Kx32 count store (*32) through rd_adr/rd_dat and wr_adr/wr_dat ports; a sequencer compares the incoming word against the stored word, detects empty slots, increments the word count by 1, and increments the address on a collision.]

DFPIM Benefits

ª Performance/Energy Comparison
  ª Processor: 4 GHz, quad core, 80 Watts (20 per core)
  ª DFPIM: 0.8 GHz, 0.05 Watt per ALU equivalent
  ª x86 clocks measured using embedded performance counters through the PAPI library


Benchmark      x86 Time (us)   DFPIM Time (us)   Speedup   x86 Energy (uJ)   DFPIM Energy (uJ)   Savings
Histogram      1506            328               4.59      30115             197                 99.3%
Word Count     2054            141               14.57     41072             253                 99.4%
FFT (4096) 1   698             82                5.25      13953             164                 98.8%
FFT (4096) 2   698             164               4.26      13953             295                 97.9%
FFT (4096) 3   698             246               2.84      13953             393                 97.2%
FFT (4096) 4   698             328               2.13      13953             524                 96.2%

Trace Based DFG Generation

ª Defining code kernels for acceleration using GPUs or other techniques is often the responsibility of the programmer.
ª Compilers may be used to identify a limited set of kernels for acceleration.
ª In this project, we propose to identify kernels by analyzing execution traces.
ª We use a simple data-mining technique, building a hash table to count how many times a given instruction address is repeated (a minimal sketch follows).
ª Then, we identify kernels based on clusters of instructions with equal, high counts.
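A minimal sketch of that counting step, assuming the trace is simply a sequence of instruction addresses; the table size, hash, probe scheme, and threshold are illustrative.

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 4096   /* power of two for cheap masking */

struct bucket { uint64_t addr; uint64_t count; };
static struct bucket table[TABLE_SIZE];

static void count_addr(uint64_t addr) {
    size_t i = (addr >> 2) & (TABLE_SIZE - 1);   /* simple hash of the address */
    while (table[i].count && table[i].addr != addr)
        i = (i + 1) & (TABLE_SIZE - 1);          /* linear probing on collision */
    table[i].addr = addr;
    table[i].count++;
}

int main(void) {
    /* toy "trace": a 4-instruction loop body executed 1000 times */
    uint64_t loop[] = { 0x400100, 0x400104, 0x400108, 0x40010c };
    for (int it = 0; it < 1000; it++)
        for (int j = 0; j < 4; j++)
            count_addr(loop[j]);

    /* a cluster of addresses with equal, high counts marks a kernel */
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].count >= 1000)
            printf("kernel candidate: 0x%llx (count %llu)\n",
                   (unsigned long long)table[i].addr,
                   (unsigned long long)table[i].count);
    return 0;
}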


Trace Based DFG Generation

ª Execution Trace Example
ª Instruction Cluster Example


ª Dataflow Graph for previous example


[Figure: dataflow graph for the previous example. Input registers edx, ecx, eax, edi, and esi feed ADD, AND, OR, SRLI 1, and SUB nodes with constants 0x1, producing updated registers eax', edi', edx' and temporary t0d; the graph covers the loop iteration and bit-reversal logic.]

Research Areas

ª Addition of more PIM benchmarks
  ª Graph processing, more map-reduce configurations
ª Develop energy, timing, and size models for each DFPIM functional block and interconnect
  ª Work with synthesis and silicon vendors for values
ª Develop a DFPIM simulator
  ª Verify accuracy of DFPIM configurations, calculate timing, compute energy estimates
ª Continue the trace-based DFG generation effort
  ª Improve recognition of kernels
  ª Automate generation of DFGs from instructions


Conclusions

ª The DFPIM concept has been defined and its potential has been evaluated
ª The benefits of DFPIM are dramatic, especially in energy
  ª Speedups of 4.6, 5.2, and 14.5
  ª Energy savings of 96.2% to 99.4%
ª Significant detailed research remains
