ARM Cortex Coding

Upload: vivek-anand

Post on 05-Apr-2018


  • 7/31/2019 ARM Cortex Coding

    1/24

    Getting started on Cortex A8

    Instruction Set

    2/24

    Instruction Sets

    32-bit ARM instruction set

    16-bit Thumb instruction set

    Thumb-2 instruction set: adds 32-bit instructions to Thumb

    (A trade-off between the two above.) Most of its 32-bit instructions are unconditional, whereas most ARM instructions can be conditional.

    Advanced SIMD architecture (NEON)

    Performs the same operation on multiple items in parallel.

    Instructions operate on vectors held in 64-bit or 128-bit registers.

    Other instruction sets

    ThumbEE instruction set

    Jazelle Extension

    3/24

    Register Set (ARM and Neon)

    33 general-purpose 32-bit registers

    In User mode only R0 to R15 are available

    R14 -> Link register: holds the return address when a branch with link (BL) is executed

    R15 -> Program counter

    Seven 32-bit status registers: status flags and processor mode

    Neon Register Bank

    View 1: 32 x 64-bit (doubleword) registers, D0-D31

    View 2: 16 x 128-bit (quadword) registers, Q0-Q15

    Or a combination of these 128-bit and 64-bit registers, Q0-Q15 and D0-D31
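The two register views overlap in the same storage: Qn is the pair D(2n) (low half) and D(2n+1) (high half). A minimal C sketch of that aliasing follows. The names `neon_bank`, `write_d`, `read_d` and `read_q` are hypothetical and only model the layout; they are not part of any ARM header.

```c
#include <stdint.h>
#include <string.h>

/* Model of the NEON register bank: 256 bytes viewed as D0-D31 or Q0-Q15. */
typedef struct {
    uint8_t bytes[32 * 8];
} neon_bank;

/* Dm occupies bytes [8m, 8m + 8). */
void write_d(neon_bank *bank, unsigned m, uint64_t value)
{
    memcpy(&bank->bytes[8 * m], &value, sizeof value);
}

uint64_t read_d(const neon_bank *bank, unsigned m)
{
    uint64_t value;
    memcpy(&value, &bank->bytes[8 * m], sizeof value);
    return value;
}

/* Qn occupies bytes [16n, 16n + 16), i.e. D(2n) followed by D(2n+1). */
void read_q(const neon_bank *bank, unsigned n, uint64_t out[2])
{
    memcpy(out, &bank->bytes[16 * n], 16);
}
```

Writing D2 and D3 and then reading Q1 returns the same two doublewords, which is why a routine that uses Q1 must treat D2 and D3 as occupied.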

    4/24

    ARM Instruction set

    All ARM instructions are 32 bits long

    Branch instructions

    Data processing instructions

    Register load and store instructions

    Multiple register load and store instructions

    Status register access instructions (OOS)

    Coprocessor instructions (OOS)

    5/24

    ARM Instruction set

    Branch Instructions

    branch backwards to form loops

    branch forward in conditional structures

    branch to subroutines

    e.g.

    B label1

    BL label1 ; branch with link

    BEQ {pc}+4

    6/24

    ARM Instruction set

    Data processing instructions

    Add or multiply two registers

    Add register with constant

    Bitwise operations

    operate on 8 bit, 16 bit and 32 bit data

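    Long multiply instructions give a 64-bit result in two registers

    e.g.

    ADD r2, r1, r3 ; r2 = r1 + r3

    SUBS r8, r6, #240 ; r8 = r6 - 240, sets the flags on the result

    RSB r4, r4, #1280 ; subtracts contents of r4 from 1280

    AND r9, r2, #0xFF00 ; r9 = r2 AND 0xFF00

    ORREQ r2, r0, r5 ; if the Z flag is set, r2 = r0 OR r5

    MOVS r3, r2, LSR #3 ; r3 = r2 >> 3 (logical shift right), sets the flags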
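For readers more at home in C, the arithmetic of three of the examples above can be sketched as follows. The function names `rsb`, `lsr` and `smull` and the type `reg_pair` are hypothetical helpers that only model the results; flag updates are left out.

```c
#include <stdint.h>

/* RSB rd, rn, #imm : reverse subtract, rd = imm - rn (wraps modulo 2^32). */
uint32_t rsb(uint32_t rn, uint32_t imm)
{
    return imm - rn;
}

/* MOV rd, rm, LSR #n : logical shift right, zero filled (1 <= n <= 31 here). */
uint32_t lsr(uint32_t rm, unsigned n)
{
    return rm >> n;
}

/* Signed long multiply: the 64-bit product is split across two registers. */
typedef struct {
    uint32_t lo;   /* low word of the product  */
    uint32_t hi;   /* high word of the product */
} reg_pair;

reg_pair smull(int32_t rm, int32_t rs)
{
    int64_t product = (int64_t)rm * (int64_t)rs;
    reg_pair result;
    result.lo = (uint32_t)product;
    result.hi = (uint32_t)((uint64_t)product >> 32);
    return result;
}
```

For instance, `rsb(80, 1280)` gives 1200, matching the RSB comment, and `smull(-2, 3)` fills the high word with the sign bits of -6.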

    7/24

    ARM Instruction set

    Register load and store instructions

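    Load or store a single register: 8, 16, or 32 bits

    Load double words

    Byte and halfword loads can be zero filled or sign extended

    e.g.

    STMFD r13!, {r0-r5} ; push r0-r5 onto a full descending stack

    LDMFD r13!, {r0-r5} ; pop r0-r5 back

    PUSH {r5-r7,lr}

    POP {r5-r7,pc} ; restores r5-r7 and returns by loading pc

    LDR r3, [r0], #4 ; load from [r0], then r0 is incremented by 4

    LDR r3, [r0], r4 ; load from [r0], then r0 is incremented by r4

    LDR r3, [r0, #0x2C] ; load with offset: address is r0 + 0x2C

    LDR r3, [r0, r4, LSL #2] ; load with scaled register offset: address is r0 + (r4 << 2)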
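The addressing modes above map closely onto ordinary C pointer expressions, which is roughly what a compiler would be expected to exploit. A small sketch, with hypothetical function names chosen for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* LDR r3, [r0], #4 : post-indexed load, the base then advances by one word. */
uint32_t load_post_inc(const uint32_t **base)
{
    uint32_t value = **base;
    *base += 1;                 /* one uint32_t is 4 bytes */
    return value;
}

/* LDR r3, [r0, #0x2C] : offset load, base unchanged (0x2C bytes = word 11). */
uint32_t load_offset_2c(const uint32_t *base)
{
    return base[0x2C / 4];
}

/* LDR r3, [r0, r4, LSL #2] : the index counts words, the shift scales it to bytes. */
uint32_t load_indexed(const uint32_t *base, size_t index)
{
    return base[index];
}

/* Byte loads: sign extended (LDRSB) versus zero filled (LDRB). */
int32_t load_s8(const int8_t *p)   { return *p; }
uint32_t load_u8(const uint8_t *p) { return *p; }
```

The same byte 0x80 reads as -128 through `load_s8` and as 128 through `load_u8`, which is the zero-fill versus sign-extend distinction from the slide.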

    8/24

    ARM Instruction set

    Conditional Execution

    Flags

    N Set when the result of the operation was Negative.

    Z Set when the result of the operation was Zero.

    C Set when the operation resulted in a Carry.

    V Set when the operation caused oVerflow.

    Most of the ARM instructions can be conditional

    E.g.

    ADD r0, r1, r2 ; r0 = r1 + r2, don't update flags

    ADDS r0, r1, r2 ; r0 = r1 + r2, and update flags

    ADDSCS r0, r1, r2 ; If C flag set then r0 = r1 + r2, and update flags

    CMP r0, r1 ; update flags based on r0-r1.

    Why are conditional instructions required if branch instructions are available?
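The flag rules can be made concrete with a small C model of ADDS and of a conditionally executed add. `flags_t`, `adds` and `addcs` are hypothetical names; the sketch only mirrors the N/Z/C/V definitions listed above. One common answer to the closing question is that a conditional instruction lets a short if-body run without any branch at all, so there is nothing for the branch predictor to get wrong.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool n;   /* result negative (bit 31 set) */
    bool z;   /* result zero                  */
    bool c;   /* unsigned carry out           */
    bool v;   /* signed overflow              */
} flags_t;

/* ADDS rd, rn, rm : add and update all four flags. */
uint32_t adds(uint32_t rn, uint32_t rm, flags_t *f)
{
    uint32_t rd = rn + rm;
    f->n = (rd >> 31) != 0;
    f->z = (rd == 0);
    f->c = (rd < rn);
    /* Overflow: operands share a sign and the result's sign differs. */
    f->v = ((~(rn ^ rm) & (rn ^ rd)) >> 31) != 0;
    return rd;
}

/* ADDCS rd, rn, rm : executes only when C is set, otherwise rd is unchanged. */
uint32_t addcs(uint32_t rd_old, uint32_t rn, uint32_t rm, const flags_t *f)
{
    return f->c ? rn + rm : rd_old;
}
```

Adding 1 to 0xFFFFFFFF sets Z and C, while adding 1 to 0x7FFFFFFF sets N and V, which shows that carry and overflow are independent conditions.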

    9/24

    ARM Instruction set

    Suffix details

    10/24

    Neon Instruction set

    Vector Duplicate

    VDUP{cond}.size Qd, Dm[x]

    cond is an optional condition code

    size must be 8, 16, or 32

    Qd specifies the destination register for a quadword operation

    Dm[x] specifies the NEON scalar.

    VADD.datatype {Qd}, Qn, Qm

    VADD.datatype {Dd}, Dn, Dm

    Datatype -> I8, I16, I32, I64 or F32 for VADD and VSUB

    Datatype -> S8, S16, S32, S64, U8, U16, U32 or U64 for VQADD and VQSUB (depends on instruction, refer to the TRM)
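What these instructions compute per lane can be sketched in plain C. This is a scalar model of VDUP.16, VADD.I16 and VQADD.S16 on one Q register (eight 16-bit lanes), not NEON code; the function names are hypothetical and the loops stand in for what the hardware does in a single instruction.

```c
#include <stdint.h>

#define LANES 8   /* a 128-bit Q register holds eight 16-bit lanes */

/* VDUP.16 Qd, Dm[x] : copy one scalar into every lane. */
void vdup16(int16_t qd[LANES], int16_t scalar)
{
    for (int i = 0; i < LANES; i++)
        qd[i] = scalar;
}

/* VADD.I16 Qd, Qn, Qm : lane-wise add that wraps modulo 2^16.
   The final cast to int16_t is implementation-defined in C; mainstream
   compilers wrap, which matches the instruction. */
void vadd_i16(int16_t qd[LANES], const int16_t qn[LANES], const int16_t qm[LANES])
{
    for (int i = 0; i < LANES; i++)
        qd[i] = (int16_t)(uint16_t)((uint16_t)qn[i] + (uint16_t)qm[i]);
}

/* VQADD.S16 Qd, Qn, Qm : lane-wise add that saturates instead of wrapping. */
void vqadd_s16(int16_t qd[LANES], const int16_t qn[LANES], const int16_t qm[LANES])
{
    for (int i = 0; i < LANES; i++) {
        int32_t sum = (int32_t)qn[i] + (int32_t)qm[i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        qd[i] = (int16_t)sum;
    }
}
```

With every lane holding 30000 and 10000, the wrapping add yields -25536 per lane while the saturating add clamps to 32767, which is the practical difference between VADD and VQADD.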

    11/24

    Neon Instruction set (e.g.)

    12/24

    Effective Assembly coding

    Branch prediction

    Maximize usage of conditional instructions instead of branches

    a 512-entry 2-way set associative Branch Target Buffer (BTB)

    a 4096-entry Global History Buffer (GHB)

    an 8-entry return stack

    Pipeline model: instruction cycle timing

    Fetch, decode, execute >> 13 stages

    Load Store

    MAC

    ALU

    Neon pipeline >> 10 stages

    Removing interlocks/stalls

    Maximize usage of SIMD/Neon Instructions

    Maximize Dual Issue

    13/24

    Effective Assembly coding

    How to read ARM instruction tables

    ADDEQ R0, R1, R2, LSL #10

    14/24

    Effective Assembly coding

    Interlock e.g. (refer to the table in the next slide)

    SMLAL R0, R1, R2, R3

    ADD R7, R8, R0 >> four cycles wasted waiting for R0

    Alternate approach: fill the wait with independent instructions

    SMLAL R0, R1, R2, R3

    MOV r4, #0x6

    ADD r5, r4, r5

    MOV r6, #0x6

    LDR r5, [r6, #0x2C]

    ADD R7, R8, R0
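The effect of the reordering can be illustrated with a toy single-issue, in-order scoreboard in C. Everything here is an assumption made for illustration: the model ignores dual issue and the real pipeline stages, and the latency values are chosen so that the back-to-back case reproduces the four wasted cycles quoted above; they are not taken from the TRM timing tables. `insn_t`, `count_stalls`, `back_to_back` and `scheduled` are hypothetical names.

```c
#include <stddef.h>

#define NO_REG (-1)

/* One instruction in the toy model. */
typedef struct {
    int dst;       /* destination register number, or NO_REG             */
    int src[2];    /* source register numbers, or NO_REG                 */
    int latency;   /* cycles from issue until dst is readable (assumed)  */
} insn_t;

/* Count the cycles lost waiting for source registers to become ready. */
int count_stalls(const insn_t *prog, size_t n)
{
    int ready[16] = {0};   /* cycle at which each register becomes readable */
    int cycle = 0;         /* next free issue slot */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        int issue = cycle;
        for (int s = 0; s < 2; s++) {
            int r = prog[i].src[s];
            if (r != NO_REG && ready[r] > issue)
                issue = ready[r];
        }
        stalls += issue - cycle;
        if (prog[i].dst != NO_REG)
            ready[prog[i].dst] = issue + prog[i].latency;
        cycle = issue + 1;
    }
    return stalls;
}

/* SMLAL R0, R1, R2, R3 followed at once by ADD R7, R8, R0. */
const insn_t back_to_back[] = {
    { 0, { 2, 3 }, 5 },            /* SMLAL: assumed latency of 5 on R0 */
    { 7, { 8, 0 }, 1 },            /* ADD R7, R8, R0 */
};

/* Same pair with four independent instructions in between. */
const insn_t scheduled[] = {
    { 0, { 2, 3 }, 5 },            /* SMLAL R0, R1, R2, R3 */
    { 4, { NO_REG, NO_REG }, 1 },  /* MOV r4, #0x6 */
    { 5, { 4, 5 }, 1 },            /* ADD r5, r4, r5 */
    { 6, { NO_REG, NO_REG }, 1 },  /* MOV r6, #0x6 */
    { 5, { 6, NO_REG }, 3 },       /* LDR r5, [r6, #0x2C] */
    { 7, { 8, 0 }, 1 },            /* ADD R7, R8, R0 */
};
```

Under these assumed latencies `count_stalls(back_to_back, 2)` reports four stall cycles and `count_stalls(scheduled, 6)` reports none: the total cycle count is the same, but the second sequence gets four extra instructions done in it.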

    15/24

    Effective Assembly coding

    dummy

    16/24

    Effective Assembly coding

    Dual Issue

    Two basic pipelines -> Pipeline 0 and Pipeline 1

    LS pipeline, Multiply pipeline, ALU pipeline

    The multiply pipeline is always tied to Pipeline 0

    The first instruction always issues in pipeline 0 and the second instruction, if present, issues in pipeline 1

    Instructions with the same destination cannot be issued in the same cycle.

    Refer to the next slide for more examples.

    17/24

    Dual issue (contd..)

    18/24

    General ARM optimization Techniques

    Loop unrolling

    Use fixed point arithmetic

    Use shifts instead of multiplications and divisions

    See if complex calculations can be avoided using table lookup

    Minimize the number of arguments of a function

    Avoid branches in low level functions
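Three of the techniques above (fixed point arithmetic, shifts, table lookup) can be sketched in C as follows. The Q15 format, the function names and the nibble table are illustrative choices, not something the slides prescribe.

```c
#include <stdint.h>

/* Q15 fixed point: the real value is raw / 32768. */
typedef int16_t q15_t;

/* Multiply two Q15 values with rounding. No saturation is applied, so
   -1.0 * -1.0 overflows; right-shifting a negative product relies on the
   arithmetic shift that mainstream compilers provide. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;       /* Q30 */
    return (q15_t)((product + (1 << 14)) >> 15);      /* back to Q15 */
}

/* Shifts in place of multiply and divide by powers of two (unsigned only). */
uint32_t times8(uint32_t x) { return x << 3; }
uint32_t div16(uint32_t x)  { return x >> 4; }

/* Table lookup in place of a calculation: bits set in a byte,
   from a 16-entry table indexed by each nibble. */
static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

unsigned popcount8(uint8_t b)
{
    return nibble_bits[b & 0x0F] + nibble_bits[b >> 4];
}
```

As a usage check, `q15_mul(16384, 16384)` multiplies 0.5 by 0.5 and returns 8192, the Q15 encoding of 0.25, using only an integer multiply and a shift.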

    19/24

    Assembly functions/files e.g.

    The first four arguments go in r0, r1, r2, r3

    e.g. of assembly function
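Seen from the C side, the register rule looks like this under the ARM procedure call standard: the first four word-sized arguments travel in r0-r3, further ones go on the stack, and the result comes back in r0. The functions `sum5` and `sum5_packed` and the struct `sum_args` are hypothetical; the second form is one possible way to keep the argument count at four or fewer.

```c
#include <stdint.h>

/* a -> r0, b -> r1, c -> r2, d -> r3, e -> stack; result returned in r0. */
int32_t sum5(int32_t a, int32_t b, int32_t c, int32_t d, int32_t e)
{
    return a + b + c + d + e;
}

/* Grouping the values lets a single pointer (in r0) carry all of them. */
typedef struct {
    int32_t a, b, c, d, e;
} sum_args;

int32_t sum5_packed(const sum_args *args)   /* args -> r0 */
{
    return args->a + args->b + args->c + args->d + args->e;
}
```

Whether the packed form is actually faster depends on how the caller already holds the data, so it is worth measuring rather than assuming.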

    20/24

    General/Neon optimization techniques

    Code vectorization in C itself

    Use word arrays instead of halfword or byte arrays

    Cache friendly coding

    Put code belonging to the same module in the same code section
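A loop written with vectorization in mind might look like the sketch below: word-sized elements, `restrict` pointers so the compiler knows the arrays do not overlap, and a main loop whose trip count is visibly a multiple of four. The function name `add_arrays` is hypothetical, and whether the loop is actually turned into NEON instructions depends on the compiler and its options.

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = a[i] + b[i] for n words; the arrays must not overlap. */
void add_arrays(uint32_t *restrict dst,
                const uint32_t *restrict a,
                const uint32_t *restrict b,
                size_t n)
{
    size_t n4 = n & ~(size_t)3;   /* largest multiple of 4 not above n */

    /* Main loop: four 32-bit lanes fit exactly in one 128-bit register. */
    for (size_t i = 0; i < n4; i++)
        dst[i] = a[i] + b[i];

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (size_t i = n4; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Splitting off the tail keeps the main loop free of remainder handling, which is the part a vectorizing compiler most needs to see clearly.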

    21/24

    Code Vectorization

    22/24

    Code Vectorization

    23/24

    Code Vectorization

    24/24

    Code Warrior Demo/Hands on