Real World Multicore Embedded Systems
A Practical Approach
Expert Guide
Bryon Moyer
AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Newnes is an imprint of Elsevier
Contents
About the Editor xvii About the Authors xix
Chapter 1: Introduction and Roadmap 1 Multicore is here 1 Scope 3 Who should read this book? 3 Organization and roadmap 4
Concurrency 4 Architecture 4 Infrastructure 5 Virtualization 6 Application software 6 Hardware assistance 7 System-level considerations 8
A roadmap of this book 9
Chapter 2: The Promise and Challenges of Concurrency 11 Concurrency fundamentals 12 Two kinds of concurrency 14
Data parallelism 15 Functional parallelism 16
Dependencies 18 Producers and consumers of data 19
Loops and dependencies 23 Shared resources 30 Summary 31
Chapter 3: Multicore Architectures 33 The need for multicore architectures 34
Multicore architecture drivers 36 Traditional sequential software paradigms break 38 Scope of multicore hardware architectures 41 Basic multicore hardware architecture overview 43 Specific multicore architecture characteristics 45 Processing architectures 45 ALU processing architectures 46 Lightweight processing architectures 46 Mediumweight processing architectures 48 Heavyweight processing architectures 48 Communication Architectures 49 Memory architectures 54 Application specificity 56 Application-specific platform topologies 59 Integration of multicore systems, MPSoCs and sub-systems 63 Programming challenges 66 Application characteristics 67 MPSoC analysis, debug and verification 68 Shortcomings and solutions 69 MPSoC parallel programming 69 Parallel software and MPSoCs 70 Summary 71 References 73
Chapter 4: Memory Models for Embedded Multicore Architecture 75 Introduction 76 Memory types 77 Memory architecture 79
Cache 79 Cache customization 86 Virtual memory 87 Scratchpad 88 Software overlays 89 DMA 90 DRAM 92
Special-purpose memory 93 Memory structure of multicore architecture 94
Shared memory architecture 94 Distributed memory architecture 95 Cache memory in multicore chips 96
Cache coherency 97 Directory-based cache coherence protocol 99 Snoopy cache coherence protocol 102 MESI cache coherence protocol 103 Cache-related performance issues 107
Transactional memory 109 Software transactional memory 112 Hardware transactional memory 115 Hybrid transactional memory 115
Summary 115 References 116
Chapter 5: Design Considerations for Multicore SoC Interconnections 117
Introduction 119 Importance of interconnections in an SoC 120 Terminology 121 Organization of the chapter 121
Communication activity in multicore SoCs 124 Transaction-Based communication 124 Storage-Oriented transactions 124 Concurrency of communication and segregation of traffic 126 Recent trends in SoCs 127
Functional requirements and topologies of SoC traffic 129 Memory organization 131 Implications of inter-device communication paradigms 134
Performance considerations 142 Transaction latency 142 Queuing delays 145 Bandwidth 149
Interconnection networks: representation and terminology 150 Representation of interconnection networks 150 Direct versus indirect networks 153 Circuit-Switched versus packet-switched communication and blocking versus non-blocking networks 155 Base-Form vs. encoded signaling 156 Transaction routing 157 Bus as an SoC interconnection 158 Limitations of the bus architecture 160
Fabric-oriented interconnects for larger SoCs 161
Transaction formats 163 Transaction routing 166
Building blocks of scalable interconnections 167 Links 168 Clocking considerations 169 Switches 169
Evaluating and comparing interconnection topologies for future SoCs 185
Metrics for comparing topologies 186 A Survey of interconnection networks suitable for future SoCs 188
A Pipelined bus 188 Multiple buses 189 A ring 189 A crossbar 190 Mesh topology 191
Some practical considerations in designing interconnections 192 Hierarchies of interconnections 192 Scalability in implementations 192
Summary 193 References 196 Further reading 197
Chapter 6: Operating Systems in Multicore Platforms 199 Introduction 199 Symmetric multiprocessing systems and scheduling 202 Asymmetric multiprocessor systems 207
OS-per-core 207 Multiple SMP 211 SMP + RTOS 212 SMP + bare-metal 212
Virtualization 214 Controlling OS behavior 214
Controlling the assignment of threads in an SMP system 214 Controlling where interrupt handlers run 215 Partitions, containers, and zones 216 Priority 217 Kernel modifications, drivers, and thread safety 218 System start-up 220
Debugging a multicore system 221 The information gathered 222
Uploading the information 223 Painting the picture 224
Summary 225 Reference 226
Chapter 7: System Virtualization in Multicore Systems 227 What is virtualization? 228 A brief retrospective 230 Applications of system virtualization 231
Environment sandboxing 231 Virtual appliances 232 Workload consolidation 232 Operating system portability 233 Mixed-criticality systems 233 Maximizing multicore processing resources 233 Improved user experience 233
Hypervisor architectures 234 Type 2 234 Type 1 235 Paravirtualization 236 Monolithic hypervisor 237 Console guest hypervisor 237 Microkernel-based hypervisor 238 Core partitioning architectures 239
Leveraging hardware assists for virtualization 241 Mode hierarchy 241 Intel VT 242 Power Architecture ISA 2.06 embedded hypervisor extensions 243 ARM TrustZone 244 ARM Virtualization Extensions 246
Hypervisor robustness 246 SubVirt 247 Blue pill 247 Ormandy 247 Xen owning trilogy 247 VMware's security certification 248
I/O Virtualization 249 Peripheral virtualization architectures 249 Peripheral sharing architectures 253 Combinations of I/O virtualization approaches 255
I/O virtualization within microkernels 255 Case study: Power Architecture virtualization and the Freescale P4080 257
Power Architecture hardware hypervisor features 257 Power Architecture virtualized memory management 260 Freescale P4080 IOMMU 261 Hardware support for I/O sharing in the P4080 262 Virtual machine configuration in Power Architecture 263
Example use cases for system virtualization 263 Telecom blade consolidation 264 Electronic flight bag 264 Intelligent Munitions System 264 Automotive infotainment 265 Medical imaging system 266
Conclusion 266 References 267
Chapter 8: Communication and Synchronization Libraries 269 Introduction 269 Library overview and basics 270
Thread APIs 270 Message-passing APIs 270
Explicit threading libraries 270 Windows Threads 271 POSIX Threads 272 C11 and C++11 Threads 275
OpenMP 277 Threading Building Blocks 282 Boost Threads 285 MCAPI 286 Conclusion 288 References 288
Chapter 9: Programming Languages 289 Programming languages for multicore embedded systems 289 C 290
Multi-threading support in C 295 Assembly language 295
Multi-threading and assembly 296
C++ 297 Features of C++ that work well for embedded systems 297 Features of C++ that do not work well for embedded systems 300 Multi-threading support in C++ 303
Java 304 Multi-threading support in Java 305
Python 307 Multi-threading support in Python 309
Ada 310 Concurrency support in Ada 311
Summary 311 References 312
Chapter 10: Tools 313 Introduction 314 Real-Time operating systems (RTOS) 315
DEOS by DDC-I 315 Enea OSE 316 Express Logic ThreadX 316 Green Hills Integrity 317 LynuxWorks 318 Mentor Graphics Nucleus 318 MontaVista 319 QNX 319 Wind River VxWorks 320
Communication tools 321 PolyCore Software 321 Enea Linx 322
Parallelizing serial software tools 323 CriticalBlue Prism 323 Vector Fabrics 323 Open multiprocessing (MP) 324 Clean C 325
Software development and debug tools 325 Intel Parallel Studio 325 Benchmarking tools 326
Embedded Microprocessor Benchmark Consortium (EEMBC) 326 Standard Performance Evaluation Corporation (SPEC) CPU2006 328
Conclusion 328 Acknowledgments 329
Chapter 11: Partitioning Programs for Multicore Systems 331 Introduction 332 What level of parallelism? 334
Threads of control 334 Solutions, algorithms, and implementations 336
The basic cost of partitioning 338 A high-level partitioning algorithm 340 The central role of dependencies 341
Breaking dependencies 341 Types of dependencies 343 Locating dependencies 348 Handling broken dependencies 351
Critical sections 360 Synchronizing data 361
Using counting semaphores 361 Using FIFOs 362
Implementing a partitioning strategy 367 Using tools to simplify partitioning 368
Vector Fabrics's Pareon 369 CriticalBlue's Prism 375
Summary 384 References 384
Chapter 12: Software Synchronization 385 Introduction 387 Why is synchronization required? 388
Data integrity 388 Atomicity 390 Sequence of processing 393 Access to limited resources 395 Critical timing for real-time 395
Problems with not synchronizing (or synchronizing badly) 395 Slower throughput 396 Errors in synchronization logic 397 Consumes more power 397
Testing for proper synchronization 397 How is synchronization achieved? 398
Exclusion 399
Test and set; compare and swap (CAS) 405 Barrier 406 Architectural design 407
Specific conditions requiring synchronization 412 Data races 413 Deadlocks 416 Livelocks 416 Non-atomic operations 417 Data caching 418 Conversion for endianness 419 How to implement synchronization 419
Language support for implementation 424 Intro 424 Language features and extensions 425 Libraries 426
Patterns 426 Finding concurrency design patterns 427 Algorithm structure design patterns 428 Supporting structures design patterns 429 Implementation mechanisms 429
Side-effects of synchronization 430 Incorrect synchronization 430 Program execution 431 Priority inversion 432 Performance 432 Code complexity 433 Software tools 436
Hardware and OS effects on synchronization 437 Number of cores 438 Memory, caches, etc 438 Thread scheduling 439 Garbage collection (and other system-level globally synchronized operations) 439
Problems when trying to implement synchronization 440 Inconsistent synchronization: not synchronizing all access methods 440 Data escapes 441 Using a mutable shared object with two different access steps (i.e., init() and parse()) 441
Cached "Scratch-Pad" data 441 Multiple lock objects created 442 Trying to lock on a null-pointer 442 Double-check locking errors 442 Simple statements not atomic (i.e., increments, 64-bit assignments) 443 Check/act logic not synchronized 443 Synchronization object used for many unrelated things 444 Summary — synchronization problems 444
References 445
Chapter 13: Hardware Accelerators 447 Introduction 447 Architectural considerations 449
Blocking vs. non-blocking 449 Shared or dedicated 451 SMP vs. AMP 452 Copying data — or not 452 Signaling completion 453
The interface: registers, drivers, APIs, and ISRs 454 Hardware interface 454 Drivers 456 Software API 456 ISRs 457
Initialization 461 Operating system considerations 462 Coherency 462 Making the architectural decisions 466 Video example 467
The interface 470 The application 471 The driver 474 Real-world refinements 477
Summary 480
Chapter 14: Multicore Synchronization Hardware 481 Chapter overview 481 Instruction set support for synchronization 483
Test-and-set 484 Compare-and-swap 487
Load-reserved/store-conditional 489 Creating new primitives 491 Compiler intrinsics 493
Hardware support for synchronization 494 Bus locking 494 Load-reserved and store-conditional 495
Hardware support for lock-free programming 496 Lock-free synchronization with hardware queues 497 Decorated storage operations 499 Messaging 500 Hardware transactional memory 507
Memory subsystem considerations 509 Memory ordering rules 510 Using memory barriers and synchronizations 512
Conclusions 514 References 514
Chapter 15: Bare-Metal Systems 517 Introduction 517
What is a bare-metal setup? 519 Who should use bare metal? 521
Architectural arrangements 522 Data parallelism: SIMD 523 Functional parallelism: pipelines 526
Software architecture 535 Building the executable image(s) 536 Example: IPv4 forwarding 541
Packet forwarding 542 Next-hop lookup: longest prefix match 545 The DIR-24-8-BASIC algorithm 547 Example target architecture: Cavium OCTEON CN3020 552 Select code examples 555
Conclusion 560 Reference 560
Chapter 16: Multicore Debug 561 Introduction — why debug instrumentation 561
How does multicore differ from single-core debug? 566 Background — silicon debug and capabilities 568 Trace methods for multicore debug analysis 569
Types of instrumentation logic blocks 571 JTAG interfaces for multicore debug 580 External interfaces for on-chip instrumentation 581
Debug flows and subsystems 582 Commercial approaches 586
The OCP debug interface 586 Nexus/IEEE 5001 587 ARM CoreSight 593 Example: MIPS PDTrace and RRT analysis 596
The future of multicore debug 600 References 602 Further reading 602
Index 603