LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

Thu 6 March, 10:05am, Santosh Shukla, Mike Holmes LCA14-401: BoF, Networking - Debug/tracing/counter

Upload: linaro

Post on 12-Jan-2015


DESCRIPTION

Resource: LCA14 Name: LCA14-401: BoF - Networking - Debug/tracing/counter Date: 06-03-2014 Speaker: Santosh Shukla, Mike Holmes

TRANSCRIPT

Page 1: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

Thu 6 March, 10:05am, Santosh Shukla, Mike Holmes

LCA14-401: BoF, Networking - Debug/tracing/counter

Page 2: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

• Introduction
• Use case
• libperf and in-kernel perf API
• Test analysis: direct user access vs syscall-based perf counter access
• Design issues and next step
• QA

Fast access to perf Counters

Page 3: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

• Access to perf counters is not fast enough in the embedded networking space.
• We think we need:
  - The fastest access from user space (see use case).
  - Shared when read-only (no locking overhead).
  - Stable API (based on libperf).
  - An easy way to access SoC-specific counters.

Introduction

Page 4: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

• In the fast path (which could be ODP in future), there will be a method to analyze an ODP crash dump based on statistics.

• Because crash dump statistics are based on the perf hw counters, really low-overhead counter access is needed. It should provide near-accurate CPU or bus clock cycle precision.

• For example, if the per-packet budget in the fast path is 1000 CPU cycles, then measuring cannot take 3000 CPU cycles, as it does today with syscall-based perf counter access in Linux.

Use Case

Page 5: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

Perf provides a syscall that opens a perf file descriptor, through which a user space application attaches events and accesses the counters.

sys_perf_event_open - the syscall. Arguments:
- event type attributes for monitoring/sampling
- target pid
- target cpu
- group_fd
- flags

Event type:
- PERF_TYPE_HARDWARE
- PERF_TYPE_SOFTWARE
- PERF_TYPE_TRACEPOINT
- PERF_TYPE_HW_CACHE
- PERF_TYPE_RAW (for raw, implementation-specific hardware events)
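The five syscall arguments listed above can be sketched in C as follows. This is a minimal illustration, not code from the talk: the helper names cycles_attr_init and open_cycle_counter are hypothetical, and the choice to exclude kernel and hypervisor cycles is an assumption made so the call is more likely to succeed for an unprivileged process.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: describe a "count CPU cycles" hardware event. */
static void cycles_attr_init(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);          /* lets the kernel check the ABI size */
    attr->type = PERF_TYPE_HARDWARE;     /* event type */
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->exclude_kernel = 1;            /* assumption: user-space cycles only */
    attr->exclude_hv = 1;
}

/* glibc provides no wrapper for this syscall, so call it directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: open a cycle counter for the calling thread.
 * pid = 0 (this thread), cpu = -1 (any CPU), group_fd = -1 (no group),
 * flags = 0. Returns a file descriptor, or -1 with errno set. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    cycles_attr_init(&attr);
    return (int)perf_event_open(&attr, 0, -1, -1, 0);
}
```

A caller would read the 64-bit count with read(fd, &count, sizeof(count)) and close the descriptor when done. The call can fail with -1 on systems where perf access is restricted, so the return value should always be checked.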

Perf

Page 6: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

attr.sample_type (bitmask):
- PERF_SAMPLE_IP
- PERF_SAMPLE_TID
- PERF_SAMPLE_TIME
- PERF_SAMPLE_CALLCHAIN
- PERF_SAMPLE_ID
- PERF_SAMPLE_CPU

attr config bitfield:
- disabled: off by default
- inherit: children inherit it
- exclude_{user,kernel,hv,idle}: don’t count these
- mmap: include mmap data
- comm: include comm data
- inherit_stat: per task counts
- enable_on_exec: next exec enables
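As a rough illustration of how the sample_type bitmask and the config bitfield combine in one struct perf_event_attr, the sketch below sets a few of the fields named above. The helper name sampling_attr_init and the sample period value are assumptions chosen for the example, not values from the talk.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Hypothetical helper: configure a sampling event on CPU cycles. */
static void sampling_attr_init(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;

    /* Take a sample every 100000 events (arbitrary illustrative value). */
    attr->sample_period = 100000;

    /* sample_type is a bitmask: OR together the fields wanted per sample. */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;

    /* Single-bit flags from the config bitfield. */
    attr->disabled = 1;        /* start disabled, enable explicitly later */
    attr->exclude_kernel = 1;  /* do not count kernel mode */
    attr->exclude_hv = 1;      /* do not count hypervisor mode */
}
```

Because the event starts disabled, a caller would typically enable it after opening with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0).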

perf continued..

Page 7: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

• Libperf creates a set of file descriptors for a group of perf events by calling the sys_perf_event_open() API, and performs enable/disable/read operations on them.

The current API has:
- libperf_initialize: sets up a set of fds for profiling code to read from
- libperf_finalize: reads from the fds, prints, and closes all perf fds
- libperf_readcounter: reads a perf counter
- libperf_enablecounter: enables a perf counter
- libperf_disablecounter: disables a perf counter
- libperf_close: closes the fds

Libperf

Page 8: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

• Raw proposal:
  - Mmapping hw counters to user space could be a way forward for fast access, removing the overhead of the current kernel implementation.
  - Adding a scalable framework in user space (possibly libperf) to read CPU-specific counters, counters on offload blocks, and other variants of counters.

• Current mmap-based perf support in the kernel:
  - In-kernel perf supports an mmap-based persistent ring-buffer implementation for user space.
  - This implementation is limited in performance: the hw counter is mappable and stored into the ring buffer, with a lot of synchronisation overhead for user space access, i.e. an rmb for every perf counter read, locking, and an async wake-up event for user space to read statistics.
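To make the per-read synchronisation cost concrete, here is a generic sequence-lock reader and writer in C11. It is not the kernel's ring-buffer code: the struct and function names are hypothetical, and a single writer is assumed. It only shows the kind of barrier-and-retry work a reader pays on every access to a shared, memory-mapped counter page.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared page: a sequence number guards two related values. */
struct shared_counter {
    atomic_uint seq;               /* odd while an update is in progress */
    _Atomic uint64_t count;
    _Atomic uint64_t time_enabled;
};

struct counter_snapshot {
    uint64_t count;
    uint64_t time_enabled;
};

/* Single writer: mark the update window with an odd sequence number. */
static void shared_counter_publish(struct shared_counter *c,
                                   uint64_t count, uint64_t time_enabled)
{
    unsigned s = atomic_load_explicit(&c->seq, memory_order_relaxed);

    atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&c->count, count, memory_order_relaxed);
    atomic_store_explicit(&c->time_enabled, time_enabled,
                          memory_order_relaxed);
    atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader: retry until both values were read within one stable window. */
static struct counter_snapshot shared_counter_read(struct shared_counter *c)
{
    struct counter_snapshot snap;
    unsigned s1, s2;

    do {
        s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
        snap.count = atomic_load_explicit(&c->count, memory_order_relaxed);
        snap.time_enabled = atomic_load_explicit(&c->time_enabled,
                                                 memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire); /* the read barrier */
        s2 = atomic_load_explicit(&c->seq, memory_order_relaxed);
    } while ((s1 & 1u) || s1 != s2);

    return snap;
}
```

Even in the uncontended case the reader executes two sequence loads and a barrier around the data loads, which is the overhead the proposal wants to move out of the kernel path and into a policy chosen by user space.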

design issues, next step investigation

Page 9: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

• But:
  - The current kernel mappable events are exclusive and not shareable, and they won't fall back to sysfs perf event mode. Therefore the approach is not scalable.
  - The current kernel counter overhead is still significant, so the current implementation won't achieve the 1000-cycle requirement of the fast path model, for example the ODP crash dump statistics requirement mentioned in the earlier slide [4].

Next Step continued..

Page 10: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

• Effort to investigate and evaluate these issues:
  - Focus on the exclusive fast access approach.
  - HW counter pinned to a specific core and a specific task.
  - Avoid sync primitives in kernel space while reading the hw counter; let the user space application handle this job.
  - Teach libperf to handle sync primitives and decide on a locking policy.
  - The design should be flexible enough to fall back to syscall-based perf mode.
  - Respect SMP policy as much as possible.

Next Step continued..

Page 11: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

[Figure: userspace fast access flow. Recoverable labels: Application, SoC, ARM processor core, event extensions.]

Page 12: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

Custom user space application detail:

• Ran a test application on Arndale to demonstrate the delta between user space and kernel space perf counter access. The result shows close to a 9x improvement.

• A tiny test kernel module enables and disables perf counter access for user mode.

    /* enable user-mode access to the PMU (write PMUSERENR) */
    asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));
    /* disable counter overflow interrupts (write PMINTENCLR) */
    asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));

• The user app uses an x86-style timer API to read the perf counter.

    #include <stdint.h>

    static inline uint32_t rdtsc32(void)
    {
    #if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
        uint32_t r = 0;
        /* read the cycle counter (PMCCNTR) */
        asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r));
        return r;
    #else
    #error Unsupported architecture/compiler!
    #endif
    }

Benchmarking current & proposed access

Page 13: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

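Libperf application using the perf syscall:

• Creates a perf event fd using the perf_event_open syscall.
• Reads the perf counter event from the file descriptor.

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int fddev = -1;

    static void init(void)
    {
        static struct perf_event_attr attr; /* static, so zero-initialized */

        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        /* pid 0, cpu -1: this thread on any CPU; no group, no flags */
        fddev = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

• Both applications run in a tight loop for some duration, and their delta is recorded for comparison.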
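The tight-loop comparison described above could be structured roughly as below. This is a sketch, not the benchmark used in the talk: the names counter_read_fn, avg_read_cost_ns and dummy_read are hypothetical, and wall-clock nanoseconds from clock_gettime stand in for the CPU cycle deltas the slides refer to, so the harness also runs on machines without PMU access.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Any counter access method under test: syscall read, direct PMU read, ... */
typedef uint64_t (*counter_read_fn)(void *ctx);

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call read_counter in a tight loop and return the average cost per read
 * in nanoseconds (0.0 if iterations is 0). */
static double avg_read_cost_ns(counter_read_fn read_counter, void *ctx,
                               uint64_t iterations)
{
    volatile uint64_t sink = 0; /* keeps the calls from being optimized out */
    uint64_t start, end;

    if (iterations == 0)
        return 0.0;

    start = now_ns();
    for (uint64_t i = 0; i < iterations; i++)
        sink += read_counter(ctx);
    end = now_ns();

    (void)sink;
    return (double)(end - start) / (double)iterations;
}

/* Placeholder reader used to exercise the harness without a PMU. */
static uint64_t dummy_read(void *ctx)
{
    return *(uint64_t *)ctx;
}
```

Running the same harness once with a reader that calls read() on the perf fd and once with a reader that calls rdtsc32() would give the two per-read costs whose ratio is being compared.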

Benchmarking cont..

Page 14: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

• Comparison: direct user space PMU access vs. the perf syscall-based application.

Benchmarking cont..

Page 16: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

QA

Page 17: LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

More about Linaro Connect: http://connect.linaro.org
More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
Linaro members: www.linaro.org/members