Radium: Race-free On-demand Integrity Measurement Architecture
Srujan Kotikela
Supported by NCSS Industry/University Cooperative Research Center (IUCRC)
Introduction
› Integrity Measurement
– A measurement represents the state/behavior of an entity
– Integrity measurement acts as a basis for trust
› Trusted Computing
– An entity can be trusted if it always behaves in the expected manner for the intended purpose (Trusted Computing Group)
– Trusted computing is the basis for modern secure systems
› Radium (this talk)
– Eliminate the TOCTTOU condition
– Allow concurrent untrusted/trusted services
– More semantic and efficient measurements
Background
› A hardware-based root of trust is required for trustworthy measurements
› To verify trustworthiness of a computing platform
– Start with immutable hardware and measure sequentially
– Measure each component with an already-measured component, establishing a chain of trust
› Existing solutions to establish a transitive chain of trust
– SRTM: static root of trust for measurement
– DRTM: dynamic root of trust for measurement
› Both SRTM and DRTM operate under the axiom
– Measured components do not change after measurement
Static Root of Trust for Measurement (SRTM)
• Immutable ROM measures and launches a verified BIOS
• Each verified software component measures, verifies, and launches the next component in the chain
• Trust in the application is guaranteed only at the time of launch
• Requires a restart to launch an MLE (Measured Launch Environment)
[Figure: Hardware (immutable ROM) measures the system software (BIOS + bootloader + VMM + OS) at boot (T boot, M system software); the system software measures the application at launch (T launch, M application); no measurement occurs at time of use (T use). T = time of, M = measurement of.]
Dynamic Root of Trust for Measurement (DRTM)
• DRTM is dynamic and can be invoked at any time (including boot time)
• Suspends the current environment and creates an isolated environment
• The chain of trust may contain a vendor-specific ACM
• Trust in the application is guaranteed at the time of launch
• Requires a reset to launch the MLE
[Figure: Hardware (immutable ROM) measures the Authenticated Code Module (M ACM), which measures the application (M application) at launch (T launch); no measurement occurs at time of use (T use). The untrusted world is suspended while the Measured Launch Environment runs. T = time of, M = measurement of.]
Radium: Goals
› Extend existing trust technology (DRTM) to provide on-demand measurements
› Use measuring services for TOCTTOU-free measurements
› Provide efficient and semantically rich measurements
› Allow more than one measured environment to co-exist and co-operate
› Contain an Access Control Policy that controls accesses between all trusted and untrusted environments
Radium: Architecture
[Figure: Trusted hardware (CPU + TPM) hosts a trusted hypervisor that acts as the Asynchronous Root of Trust for Measurement. The hypervisor performs a verified launch of both the Target VM and the Measuring Service; the Measuring Service measures the Target VM, and the Access Control Policy Module mediates accesses between environments.]
Radium: Architecture
[Figure: The same architecture from the verifier's perspective: a User/Client requests a measurement of the Target VM and receives a "Verified!!" result, backed by the Asynchronous Root of Trust for Measurement and mediated by the Access Control Policy Module.]
Radium: Implementation
[Figure: Intel TXT hardware (CPU + TPM) performs a DRTM boot of Xen. Xen acts as the Asynchronous Root of Trust for Measurement and performs a verified launch of the Target VM (Ubuntu 10.04 with the kbeast rootkit) and the Measuring Service VM (Ubuntu 12.04 with libVMI and Volatility). The Xen Security Module enforces access control, and measurements are extended into the TPM.]
Radium: Security
› The VMM is verified and trusted using DRTM
› The trusted VMM acts as the Asynchronous Root of Trust for Measurement (ARTM)
› All environments (trusted and untrusted) are isolated by the trusted VMM's hardware-backed isolation
› The trustworthiness of all measurements is ensured by the ARTM
› All accesses are protected by the trusted Mandatory Access Control policy within the VMM
Radium protects against all of these attack classes except offline attacks on hardware
Related Work
› Concurrent MLEs
– Intel SGX (McKeen et al. 2013 [15]), concurrent secure worlds (Ramya Masti et al. 2013 [7])
› Trusted VMM
– TrustVisor (McCune et al. 2010 [6]), Terra (Garfinkel et al. 2003 [9])
› VMM security
– Xoar (Colp et al. 2011 [14]), NoHype (E. Keller et al. 2010 [10]), ELI (Abel Gordon et al. 2013 [13])
› Integrity Measurement
– ReDAS, automated security debugging (Chongkyung Kil et al. 2009 [16, 17])
Future Directions
› VMware
– Porting Radium to VMware using the vProbes interface
› Intel SGX
– Hardware-only asynchronous root of trust for measurements
› Invariants
– Using an application's properties (invariants) to determine the security state of the application
Application behavior - Invariants
› Properties of an application that must hold at certain points during execution
› An invariant can be a fixed value or an enumerated set
› Data invariants:
– Properties of individual variables, or relations among a group of variables
– Ex: equality/inequality invariants; constant and original-value invariants
› Structural invariants:
– Program rules that have to be true at run time
– The return address on the stack should always point into the code section of memory
– The frame pointer shouldn't change during function execution
– Similar constraints for the heap and other sections of program memory
Invariants for measurements
› Useful in debugging, but often undocumented, and developers are unaware they exist
› Invariants are instance-specific
– Need to collect multiple sets of invariants (training phase)
› Security-sensitive invariants
– Which (and how many) invariants affect the security of the application?
› Extracting invariants
– Instrument the application with canary values
– The canary value is used to monitor the return-address constraint
– Study the difference in invariants between normal execution and exploited execution
– Daikon is used to produce candidate invariants
Daikon
› Ghttpd 1.4

..count_vhosts():::ENTER            // function entry
::SERVERPORT == 80
::SERVERROOT == "/usr/local/ghttpd"
::SERVERTYPE == "Standalone"
::no_vhosts == 0
::vhosts == null
==================================
..count_vhosts():::EXIT             // function exit
::SERVERPORT == orig(::SERVERPORT)
::no_vhosts == return
::no_vhosts == orig(::no_vhosts)
::vhosts == orig(::vhosts)
::defaulthost == orig(::defaulthost)
::SERVERPORT == 80

› Prozilla 1.3.7

::canary == 1000                                  (entry)
::canary == 1000                                  (exit)
::canary == orig(::canary)                        (exit)
========================
::connections[].http_sock elements < ::canary     (exit)
::rt has only one value                           (entry)
::rt == orig(::rt)                                (exit)
==================================
::rt.num_connections == 4                         (entry)
::rt.ftps_mirror_req_n < orig(::canary)           (exit)
Application Integrity Verification
› The Target VM's memory is parsed for (canary) variable values and stack constraints
› Look for suspicious writes to the stack
› Details of observed violations are saved to TPM PCRs
› Determine the security state of the application by measuring security invariants
› Detect code-reuse attacks, which are hard to detect with conventional techniques
Conclusions
› On-demand measurements are necessary to overcome the TOCTTOU condition in integrity measurements
› Using a minimal-TCB hypervisor as a root of trust for measurement (ARTM) is a viable replacement for hardware DRTM
› Semantically rich, fine-grained measurements are possible with ARTM
› Zero downtime for environments and powerful security applications can be achieved with the Radium architecture
References
1. RADIUM: Race-free On-demand Integrity Measurement. Srujan Kotikela, Mahadevan Gomathisankaran, Tawfiq Shah, Gelareh Taban
2. Trusted computing using AMD "Pacifica" and "Presidio" secure virtual machine technology. Geoffrey Strongin, Advanced Micro Devices, Inc.
3. BIOS chronomancy: fixing the core root of trust for measurement. John Butterworth, Corey Kallenberg, Xeno Kovah, Amy Herzog
4. Trusted Boot: Verifying the Xen Launch. Joseph Cihula
5. Flicker: an execution infrastructure for TCB minimization. Jonathan M. McCune, Bryan J. Parno, Adrian Perrig, Michael K. Reiter, Hiroshi Isozaki
6. TrustVisor: Efficient TCB Reduction and Attestation. Jonathan M. McCune, Yanlin Li, Ning Qu, Zongwei Zhou, Anupam Datta, Virgil Gligor, Adrian Perrig
7. An architecture for concurrent execution of secure environments in clouds. Ramya Jayaram Masti, Claudio Marforio, Srdjan Capkun
8. Copilot - a coprocessor-based kernel runtime integrity monitor. Nick L. Petroni, Jr., Timothy Fraser, Jesus Molina, William A. Arbaugh
9. Terra: a virtual machine-based platform for trusted computing. Tal Garfinkel, Ben Pfaff, Jim Chow, Mendel Rosenblum, Dan Boneh
10. NoHype: virtualized cloud infrastructure without the virtualization. E. Keller, J. Szefer, J. Rexford, R. B. Lee
11. Building a MAC-based security architecture for the Xen open-source hypervisor. R. Sailer, T. Jaeger, E. Valdez, R. Caceres, R. Perez, S. Berger, J. L. Griffin, L. van Doorn
12. KVM: Hypervisor Security You Can Depend On. George Wilson, Michael Day, Beth Taylor
13. ELI: Bare-Metal Performance for I/O Virtualization. Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, Dan Tsafrir
14. Breaking up is hard to do: security and functionality in a commodity hypervisor. Patrick Colp, Mihir Nanavati, Jun Zhu, William Aiello, George Coker, Tim Deegan, Peter Loscocco, Andrew Warfield
15. Innovative Instructions and Software Model for Isolated Execution. Frank McKeen, Ilya Alexandrovich, Alex Berenzon, Carlos V. Rozas, Hisham Shafi, Vedvyas Shanbhogue, Uday R. Savagaonkar
16. ReDAS: Remote Attestation to Dynamic System Properties. Chongkyung Kil, Emre C. Sezer, Ahmed M. Azab, Peng Ning, Xiaolan Zhang
17. Automated security debugging using program structural constraints. Chongkyung Kil, Emre C. Sezer, Peng Ning, Xiaolan Zhang
18. The Daikon system for dynamic detection of likely program invariants. Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, et al.
Copyright © 2014 NSF Net-Centric I/UCRC. All Rights Reserved.
Net-Centric and Cloud Software and Systems I/UCRC
NEMESIS - Automated Architecture for Threat Modeling and Risk Assessment for Cloud Computing (UNT-15-4-1)
Project Lead: Krishna Kavi, UNT; Mahadevan Gomathisankaran (Microsoft)
Date: April 8, 2015
Problem Statement
› Why is this research needed?
– To address the need for a comprehensive solution for cloud-security threat modeling that incorporates the vulnerability-assessment process and then offers an actionable risk-analysis tool
– A quantitative assessment can be used to negotiate security SLAs
› What are the specific problems to be solved?
– What are the types of threats facing cloud assets?
– Is there any scale to indicate the threat level?
– Is there any metric to characterize critical vulnerabilities facing the cloud's assets?
– Is it possible to predict the number of latent vulnerabilities that have not yet been revealed?
– Is it possible to recommend an alternative configuration of the cloud's assets to reduce the perceived risk of the current configuration?
Project Description
› How will this project approach the problem?
– Use the STRIDE model to identify threat types
– Create ontologies of vulnerabilities, attacks, and defenses
– Use a Bayesian probability model to estimate risk for threats using the ontologies
– Use the ontologies to suggest alternate configurations that can minimize risk
– Explore ideas similar to software maturity for predicting latent vulnerabilities
› Preliminary results:
– The following papers describe our preliminary work and the feasibility of our approach:
› P. Kamongi, S. Kotikela, K. Kavi, M. Gomathisankaran and A. Singhal. "VULCAN: Vulnerability assessment framework for Cloud computing", Proceedings of the IEEE 7th International Conference on Software Security and Reliability, June 18-20, 2013, Washington, DC.
› P. Kamongi, M. Gomathisankaran, K. Kavi. "Nemesis: Automated architecture for threat modeling and risk assessment for cloud computing", The 6th ASE International Conference on Privacy, Security, Risk and Trust (PASSAT-2014), Dec. 13-16, 2014, Cambridge, MA, USA.
– We have developed a limited prototype of the Nemesis architecture; it is used to assess the risk of any Software as a Service (SaaS) application running on top of an OpenStack Infrastructure as a Service (IaaS).
Assessing security of Cloud computing environments
§ Motivation
– Can we quantify security risks?
– Can such a measure be used to negotiate different security SLAs?
– Can such a measure be used to implement different types of security solutions?
§ How do we approach this?
– Classify the types of threats facing cloud assets
– Classify known vulnerabilities based on the types of threats possible
– Develop models for assigning risk probabilities to vulnerabilities
§ Existence of actual attacks
§ Existence of mitigations (or patches)
§ Significance of vulnerabilities
– Use Bayesian probability models to compute overall risk
– Use our previous work on ontologies for vulnerabilities, attacks, and defenses
Nemesis - Suggested Configurations to Reduce Perceived Risk
› The aggregated risk of the suggested configuration is estimated at 25.88% severity
Predicting Hidden Vulnerabilities
Current and Future Research
› Predicting "hidden" vulnerabilities, including 'zero-day'
– Use software complexity models to predict hidden vulnerabilities
– Use the rate of patches and major new releases

OpenSSL Release                  #Known Vulnerabilities   #Predicted Vulnerabilities
cpe:/a:openssl:openssl:0.9.8h    47                       51.92
cpe:/a:openssl:openssl:0.9.7h    33                       32.73
cpe:/a:openssl:openssl:1.0.1g    25                       26.06
cpe:/a:openssl:openssl:0.9.6e    38                       35.95
cpe:/a:openssl:openssl:0.9.8b    51                       50.15

› Extend the vulnerability ontology database
– Use security threat intelligence reports
Deliverables
Summary of the 3 most significant deliverables expected at the end of Year 1:

Deliverable   Description
1             Detailed report on our ontologies and Vulcan framework
2             Bayesian model used to define threat probabilities
3             Demonstrations to show the capabilities of NEMESIS
Project Differentiators
› What results does this project seek that are different (better) than others?
– To the best of our knowledge, we are the first group to propose an automated risk-assessment architecture for cloud computing, which in turn enables us to deliver actionable intelligence about the threats and risks facing any cloud's assets.
– We are also among the first to use a software-maturity-style approach to predict the number of vulnerabilities in software products
› What specific innovations or insights are sought by this research that distinguish it from related work?
– Representing the existing knowledge (vulnerabilities, attacks, defenses, and configurations) in a meaningful and efficient manner
– Automated use of that knowledge to assess risks
– Automated suggestion of risk-mitigation strategies
– Estimation of hidden vulnerabilities
Potential Member Company Benefits
› What specific benefits are sought for the industry members?
– Our framework can be utilized by small, medium, and large corporations with an interest in creating private or hybrid cloud systems, or migrating to public cloud systems, to assess the potential security threats and risk levels.
› What leverage does the research provide to industry member R&D plans?
– The framework can be expanded into a web service, leading to commercialization of the service.
Sponsorship and Collaboration
› Efforts to involve multiple companies in project sponsorship:
– Boeing and Firehost have expressed interest in this project
› Efforts to involve multiple university collaborators in the project:
– Exploring the possibility of collaborating with UTD
Title of New Project: Processing in Memory for Big Data Applications (UNT-14-10-1)
Project Lead: Krishna Kavi, UNT
Date: August 2015
Problem Statement
› Why is this research needed?
– 3D-stacked DRAMs contain a logic layer
– Can we embed simple processing elements in that layer?
– What computations should we move to the Processing-in-Memory cores?
› What is the specific problem to be solved?
– What applications benefit from PIMs?
– What should the architecture of PIMs be?
– What are the performance and energy advantages of using PIMs?
– Do we need new programming models?
– New memory-system organizations?
– Interconnect networks?
Project Overview

Research Goals:
1. Develop energy-efficient PIM cores
2. Investigate the impact of parallel overhead on the number of cores and their frequency
3. Understand heterogeneous memory systems

Benefits to Industry Partners:
1. Primary benefits are for processor and memory-system designers such as AMD, Intel, TI
2. Cloud applications may execute more efficiently on the proposed systems

Tasks (1):
Task#   Task Description
1       Analyze emerging Scale Out applications for common functionalities
2       Develop models for estimating the number and nature of PIM cores
3       Preliminary designs and simulations of PIM

Project Milestones (2):
Task#   Planned Completion   Milestone (Deliverable)
1       01/15                Analyze emerging Scale Out applications for common functionalities
2       04/15                Develop models for estimating the number and nature of PIM cores
3       09/15                Preliminary designs and simulations of PIM
4       10/15                Flat Address Memory Architecture data collection

(1) Task has been approved by IAB sponsor(s) / Task is a deviation from the original sponsor-approved task (see the notes section of this slide for more information).
(2) Milestone complete or on track for planned completion date / Milestone has changed from original sponsor-approved date.
Progress to Date and Accomplishments

Task 1: Analyze emerging Scale Out applications for common functionalities
Progress: Some analysis is complete, but inconclusive; needs further investigation.

Task 2: Develop models for estimating the number and nature of PIM cores
Progress: Completed models for ARM cores.

Task 3: Preliminary designs and simulations of PIM
Progress: Completed for ARM cores and 4 MapReduce benchmarks; continuing on dataflow simulations.

Task 4: Data collection for Flat Address Memory Architecture
Progress: Developed a heterogeneous memory-system simulator. Collected traces for SPEC2006, Graph500, and MapReduce benchmarks. Collected data on HMA system tradeoffs for 2- and 3-level memory systems. Targeting HPCA-2016.

(Status legend: Significant Finding/Accomplishment | Task Complete | Task Partially Complete | Task Not Started)
Project Pictorial
[Figure: A host processor connects to PIM and DRAM controllers. 3D-DRAM is accessed through a timing-specific DRAM interface, while DDR and PCM sit behind an abstract load/store interface; memory dies are stacked above the logic layer.]
Flat Address Memory Architecture - FLAME
› Large-scale workloads: big-data analysis, graph analysis, in-memory databases, HPC, etc.
– Easily exceed memory capacity
– Single-node performance is limited by the memory wall
› Specifically, disk latency (HDD or SSD) when paging
› Limited bandwidth -> less concurrency -> lower throughput
– The memory system consumes most of the system's power
Flat Address Memory Architecture - FLAME
› Motivation

                            3D-DRAM    DDR4        PCM                          Flash
Latency [ns]                40 ns      60 ns       READ comparable to DDR4;    ~25 us
                                                   WRITE 4x-8x
Bandwidth [GB/s]            160 GB/s   25.6 GB/s   25.6 GB/s ?                  500 MB/s
Read access energy [pJ/bit] 8 pJ/bit   30 pJ/bit   WRITE = 4x DDR4

• Use as much 3D-DRAM as possible?
  • High bandwidth and low latency
  • Problem: limited capacity (a couple of GBs)
• Use PCM as secondary memory?
  • Read latency comparable to DDR
  • Problem: high write energy and limited write endurance
• The page-swapping process still has some overhead
Goal
› Provide sufficient memory capacity to eliminate hard page faults
› Provide high bandwidth and low latency for critical data
› Use 3D-DRAM, DDR, and PCM as part of main memory
– Heterogeneous memory (flat memory)
› Page placement/replacement is critical
– Mark pages which are frequently accessed as "HOT"
– The page-migration policy is activated every 0.1 s (one EPOCH)
– The hotness threshold is set to 32 accesses per epoch
› We explore page-migration policies, energy consumption, overhead, and application behavior for different memory-system organizations
Experiments and Results
› 3D-DRAM + DDR -> what should the size of 3D-DRAM be (relative to the total memory footprint) for page transfers to make sense?
› 3D-DRAM + DDR + PCM -> how can we refine the transfer policy to take advantage of locality and access frequency, minimizing overhead and maximizing performance and energy efficiency?
› What is the overhead associated with page transfer?
– Cache flushing, TLB shootdown, DMA transfer/software copy
› LLC line locking -> can we lock the most heavily used data in the LLC to minimize accesses to main memory?
Results - 3D-DRAM size to memory footprint ratio
[Chart: CPI improvement in % (normalized to the respective no_transfer policy) for the workload mixes small_gc_lq_gc_lq, medium_om_xl_lq_gc, large_lb_ml_so_zs, and very_large_mc_bw_gm_ca at 3D-DRAM:footprint ratios 2:3, 1:2, 1:4, 1:8, and 1:16; improvements range from about -4% to 26%.]
Results
[Chart: Total execution time improvement in % (compared to the respective no-transfer policy) for ratios 2:3, 1:2, 1:4, 1:8, and 1:16; improvements up to about 26%.]
[Chart: Improvement in total energy consumption in % (compared to the respective no_transfer policy) for the same ratios; values range from about -120% to 80%.]
Results
[Chart: Overhead time as a percentage of the respective total execution time, up to about 4.5%, for ratios 2:3, 1:2, 1:4, 1:8, and 1:16.]
[Chart: Total pages transferred, up to about 30 million, for the same ratios.]
Results - Page Transfer Overhead
› TLB shootdown, cache flush and invalidation
› Do we want to transfer using DMA or use software copying?
› DMA is up to 12% faster but can consume more energy (up to 30% more)
› Need to verify results for multi-programmed workloads
Results - Cache locking

USER PROGRAM L3 MISSES (MPKI)
                 bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  7.27    3.71       17.14     23.00       23.78  14.43  1.91    1.29       5.81
1w_1h_1l_data    7.24    3.89       17.12     22.38       21.97  14.41  1.90    1.22       5.80
1w_2h_1l_data    7.25    3.76       17.13     22.37       22.06  14.43  1.89    1.27       5.80
1w_2h_2l_data    7.25    3.83       17.13     22.37       21.99  14.42  1.89    1.28       5.80
2w_1h_1l_data    7.23    4.03       17.11     21.77       21.19  14.41  1.91    1.16       5.80
2w_2h_2l_data    7.25    3.85       17.12     21.80       21.48  14.42  1.89    1.24       5.78
4w_1h_1l_data    7.23    4.08       17.10     20.62       20.54  14.41  1.94    1.16       5.78
lock_preemp_2D   7.25    3.85       17.12     21.80       21.48  14.42  1.89    1.24       5.78
Results - Cache locking

TOTAL EXECUTION TIME [s]
                 bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  26.46   20.67      54.10     49.31       65.81  39.35  19.49   24.57      24.48
1w_1h_1l_data    26.43   20.63      54.08     48.69       63.97  39.34  19.48   24.50      24.48
1w_2h_1l_data    26.44   20.65      54.09     48.69       64.07  39.35  19.48   24.55      24.47
1w_2h_2l_data    26.44   20.61      54.09     48.69       63.99  39.34  19.48   24.56      24.47
2w_1h_1l_data    26.43   20.61      54.06     48.08       63.16  39.34  19.49   24.44      24.48
2w_2h_2l_data    26.44   20.59      54.08     48.11       63.57  39.34  19.48   24.52      24.47
4w_1h_1l_data    26.42   20.61      54.05     46.93       62.79  39.34  19.51   24.44      24.48
lock_preemp_2D   26.44   20.59      54.08     48.11       63.57  39.34  19.48   24.52      24.47

TOTAL DYNAMIC ENERGY [J]
                 bwaves  cactusADM  GemsFDTD  libquantum  mcf    milc   soplex  xalancbmk  zeusmp
no_locking_data  9.319   3.774      27.84     3.723       23.35  24.97  0.858   0.321      4.514
1w_1h_1l_data    9.315   3.769      27.88     3.636       22.92  25.01  0.862   0.313      4.517
1w_2h_1l_data    9.312   3.767      27.84     3.634       22.95  24.97  0.867   0.319      4.519
1w_2h_2l_data    9.312   3.765      27.85     3.635       22.91  24.97  0.867   0.320      4.519
2w_1h_1l_data    9.316   3.769      27.86     3.549       22.70  25.03  0.860   0.306      4.530
2w_2h_2l_data    9.312   3.760      27.85     3.555       22.96  24.97  0.867   0.315      4.529
4w_1h_1l_data    9.316   3.768      27.89     3.389       22.98  25.05  0.850   0.307      4.546
lock_preemp_2D   9.312   3.760      27.84     3.555       22.95  24.97  0.867   0.315      4.529
Efforts to Seek Additional Sponsorships and Collaborations
› Were collaborations sought with researchers at other institutions to broaden the research?
– Exploring collaborations with AMD and IBM
› Were attempts made to leverage the research to obtain additional funding from companies or government agencies?
– Seeking additional support from ARL
› Were student researchers subsequently employed or given internships with a sponsor as a result of their work on the project?
Objective Evidence Supporting NCSS Value Proposition

Papers, Publications, Presentations/Venue:
1. Mahzabeen Islam, Marko Scrbak, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. "Improving Node-Level MapReduce Performance Using Processing-in-Memory Technologies." In Euro-Par 2014: Parallel Processing Workshops, pp. 425-437. Springer International Publishing, 2014.
2. Marko Scrbak, Mahzabeen Islam, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. "Processing-in-Memory: Exploring the Design Space." In the 28th International Conference on the Architecture of Computer Systems (ARCS-2015), March 24-27, 2015, Porto, Portugal.

Products (Software, Hardware, Data, Designs, etc.):
1. Gem5 implementation of ARM cores as PIMs
2. McPAT models of energy for PIM and 3D DRAM

Student Placements:

Other:
1. Exploring collaborations with ARL
Dataflow Processing in Memory (DFPIM) Using Coarse Grain Reconfigurable Logic (CGRL)
Charles F. Shelor
August 10, 2015
Outline
› What is dataflow?
› What is processing in memory?
› What is Coarse Grain Reconfigurable Logic?
› What is DFPIM?
› Examples of DFPIM
› Performance and energy benefits of DFPIM
› Trace-based DFG generation
› Research areas
› Conclusions
8/10/2015 DFPIM - Charles F. Shelor
Dataflow
› Style of computation
– Data flows from operation to operation
– An operation is performed when all its data values arrive
– Highly parallel, self-synchronizing
› Has been studied since the 1960s
– The overhead of tracking data availability is the major barrier to mainstream usage
› Q = (X+Y)*(A+B)
› R = (X-Y)/(A+B)
– 5 operations, 2 'cycles'
[Figure: dataflow graph with inputs X, Y, A, B feeding two adders and a subtractor; their outputs feed a multiplier and a divider, producing (X+Y)*(A+B) and (X-Y)/(A+B).]
Processing in Memory
› Applications not well suited to caches
– Data fetched from memory, processed once, not accessed again
– Memory-bound, rather than CPU-bound, jobs
– Examples include streaming tasks, big data, etc.
› Text document word counting, image histograms, mp3
– Potentially suitable for non-uniform access patterns
› FFT, graph processing
› Move processing closer to memory
– Higher bandwidth to memory, bypass caches
› 3D-stacked memory with a logic-layer interposer
– Requires low-power, small, simple algorithms
Coarse Grain Reconfigurable Logic
› CGRL is a set of high-level functional blocks with run-time programmable connections
– ALU, LD/ST, memory, multiplier, divider, sequencer, floating-point add/sub, FP multiplier, FP divider
› Similar to hard macros in FPGAs
– Processor, DSP elements, block memory
› Each functional block is faster and lower power than one built from programmable gates in a standard FPGA implementation
› Overall connection routing is much simpler, as there are fewer elements to interconnect and a more regular pattern
Coarse Grain Reconfigurable Logic
[Figure: An array of functional units (ALUs, LD/ST units, memory, multiplier, divider, sequencer), each with two inputs (I1, I2) and an output (O), joined by a programmable interconnect network.]
Dataflow PIM
› Uses CGRL on the 3D RAM logic layer
› Configures functional blocks into a dataflow graph to implement the PIM application
› Parallel, pipelined blocks provide multiple operations per clock and typically require only 1 clock per item processed once the pipeline fills
› No instruction fetch or decode, no instruction window or reorder buffers, no cache hierarchy, slower clock -> lower power with respect to an out-of-order processor
PIM Image Histogram Kernel
› Code compiles to 40 x86 instructions
› Requires 23 clocks per pixel
› 2.17 IPC (micro-ops per clock)

void histogram(pix image[], int size, int red[], int grn[], int blu[]) {
    int pxl;
    int rd, gr, bl;

    for (pxl = 0; pxl < size; pxl++) {
        rd = (int) image[pxl].r;
        gr = (int) image[pxl].g;
        bl = (int) image[pxl].b;
        red[rd]++;
        grn[gr]++;
        blu[bl]++;
    }
}
DFPIM Image Histogram
[Figure: dataflow implementation of the histogram kernel. A 24-bit pixel stream is split by shift (>> 16, >> 8) and mask (and 0xFF) stages into red, green, and blue indices; each index drives a read-increment-write (read count, add 1, write back) on its own histogram memory (red, grn, blu).]
DFPIM Code

<!-- Histogram map DF implementation -->
<LDST instance="LDST0", size="24">

<!-- clock 1: shift r, g data; delay b data -->
<IALU instance="red", in_0="LDST0.data", in_1="immed", immed="16", funct="srl" >
<IALU instance="grn", in_0="LDST0.data", in_1="immed", immed="8", funct="srl" >
<DLY instance="blu", in="LDST0.data", latency="1" >

<!-- clock 2: mask values to 8 bits each -->
<IALU instance="red2", in_0="red.data", in_1="immed", immed="0xff", funct="and", size="8" >
<IALU instance="grn2", in_0="grn.data", in_1="immed", immed="0xff", funct="and", size="8" >
<IALU instance="blu2", in_0="blu.data", in_1="immed", immed="0xff", funct="and", size="8" >

<!-- clock 3: read current histogram counts, increment, store -->
<MEM512b instance="red_hist", rd_adrs="red2.data", rd_enable="1", rd_mode="async",
         wr_adrs="red2.data", wr_data="red_incr.data", wr_mode="sync" >
<MEM512b instance="grn_hist", rd_adrs="grn2.data", rd_enable="1", rd_mode="async",
         wr_adrs="grn2.data", wr_data="grn_incr.data", wr_mode="sync" >
<MEM512b instance="blu_hist", rd_adrs="blu2.data", rd_enable="1", rd_mode="async",
         wr_adrs="blu2.data", wr_data="blu_incr.data", wr_mode="sync" >
<IALU instance="red_incr", in_0="red_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >
<IALU instance="grn_incr", in_0="grn_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >
<IALU instance="blu_incr", in_0="blu_hist.data", in_1="immed", immed="1", funct="add", size="8", latency="0" >
Word Count Kernel

while (i < buffsize)
    if (fdata[i] >= 'a' && fdata[i] <= 'z')
        fdata[i] = fdata[i] & 0xdf;            // make upper case
while (i < buffsize) {
    while (i < buffsize && (fdata[i] < 'A' || fdata[i] > 'Z'))
        i++;                                   // skip non-alpha characters
    uint64_t start = i;
    while (((fdata[i] >= 'A' && fdata[i] <= 'Z') || fdata[i] == '\'') && i < buffsize)
        i++;                                   // find next non-alpha
    if (i > start) {                           // isolate word
        fdata[i] = '\0';
        char* word = (char*)malloc((i - start + 1) * sizeof(char));
        int x = 0;
        while (x < (i - start)) {
            word[x] = fdata[start + x];
            x++;
        }
        word[x] = '\0';
        emit(word);                            // compute hash, incr count
    }
}
DFPIM Word Count
[Figure: dataflow front end of the word-count kernel: 8-bit select and compare units ('a'..'z' range check and 0xDF upper-casing; 'A'..'Z' and '\'' classification), boundary-detect logic (a & ~b, ~a & b), FIFOs (1x32 and 2x2), and a rotate-left-by-5 stage producing the word hash.]
DFPIM Word Count
[Figure: dataflow back end of the word-count kernel: the buffered word and hash from the FIFOs index a 64Kx8*32 word-compare memory and a 64Kx32 word-count memory; comparators and adders handle empty detection, count increment, and collision increment, with a sequencer coordinating the read-modify-write.]
DFPIM Benefits
› Performance/Energy Comparison
– Processor: 4 GHz, quad core, 80 Watts (20 per core)
– DFPIM: 0.8 GHz, 0.05 Watt per ALU equivalent
– x86 clocks measured using embedded performance counters through the PAPI library

Benchmark      x86 Time (uS)  DFPIM Time (uS)  Speedup  x86 Energy (uJ)  DFPIM Energy (uJ)  Savings
Histogram      1506           328              4.59     30115            197                99.3%
Word Count     2054           141              14.57    41072            253                99.4%
FFT (4096) 1   698            82               5.25     13953            164                98.8%
FFT (4096) 2   698            164              4.26     13953            295                97.9%
FFT (4096) 3   698            246              2.84     13953            393                97.2%
FFT (4096) 4   698            328              2.13     13953            524                96.2%
Trace-Based DFG Generation
› Defining code kernels for acceleration using GPUs or other techniques is often the responsibility of the programmer.
› Compilers may be used to identify a limited set of kernels for acceleration.
› In this project, we propose to identify kernels by analyzing execution traces.
› We use a simple data-mining technique: build a hash table counting how many times each instruction address is executed.
› Then we identify kernels as clusters of instructions with equal, high counts.
Trace-Based DFG Generation
› Execution trace example
› Instruction cluster example
› Dataflow graph for the previous example:
[Figure: generated dataflow graph with ADD, AND, OR, SRLI 1, and SUB nodes operating on registers edx, ecx, eax, edi, esi, temporary t0d, and immediates 0x1, producing updated eax', edi', and edx' for the loop-iteration and bit-reversal computation.]
Research Areas
› Addition of more PIM benchmarks
– Graph processing, more map-reduce configurations
› Develop energy, timing, and size models for each DFPIM functional block and the interconnect
– Work with synthesis and silicon vendors for values
› Develop a DFPIM simulator
– Verify accuracy of DFPIM configurations, calculate timing, compute energy estimates
› Continue the trace-based DFG generation effort
– Improve recognition of kernels
– Automate generation of DFGs from instructions