Understanding Android Benchmarks


Post on 17-Aug-2014







  • Understanding Android Benchmarks. freedom koan-sin tan, freedom@computer.org. OSDC.tw, Taipei, Apr 11th, 2014
  • Disclaimers: Many of the materials used in this slide deck are from the Internet and textbooks, e.g., many of the following materials are from Computer Architecture: A Quantitative Approach, 1st ~ 5th ed. Opinions expressed here are my personal ones and don't reflect my employer's view.
  • Who am I: Did some networking and security research before. Working for a SoC company, recently on big.LITTLE scheduling and related stuff, and on parallel construct evaluation. Run benchmarks from time to time to improve the performance of our products and to know how our colleagues are progressing.
  • Focusing on the CPU and memory parts of benchmarks; let's ignore graphics (2D, 3D), storage I/O, etc.
  • Black box! Do a Google image search for "benchmark" and you can find that many of the results are Android-related benchmarks. Similar to the recent Cross-Strait Trade in Services Agreement (TiSA), most benchmarks on the Android platform are kind of a black box.
  • Is Apple A7 good? When Apple released the new iPhone 5s, many technical blogs showed benchmarks in their reviews. The commonly used ones: GeekBench; JavaScript benchmarks; some graphics benchmarks. Why? Are they the right ones? E.g., http://www.anandtech.com/show/7335/the-iphone-5s-review
  • Open the black box
  • Android Benchmarks
  • No, not improvement in this way. http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks
  • Assuming there is no cheating, what can we do?
  • Outline: Performance benchmark review. Some Android benchmarks. What we did and what still can be done. Future.
  • To quote what Prof. Raj Jain quoted: "Benchmark v. trans. To subject (a system) to a series of tests in order to obtain prearranged results not available on competitive systems." From: The Devil's DP Dictionary, S. Kelly-Bootle
  • Why benchmarking? We did something good; let's check if we did it right. Compare with our own previous results to see if we broke anything. We want to know how good our colleagues in other places are.
  • What to report? Usually, what we mean by benchmarking is measuring performance. What to report? Intuitive answer: how many things we do in a certain period of time. Yes, time. E.g., MIPS, MFLOPS, MiB/s, bps
  • MIPS and MFLOPS: MIPS (Million Instructions per Second), MFLOPS (Million Floating-Point Operations per Second). All instructions are not created equal. CISC machine instructions usually accomplish a lot more than those of RISC machines; comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek.
  • MIPS and what's wrong with it: MIPS is instruction-set dependent, making it difficult to compare the MIPS of computers with different ISAs. MIPS varies between programs on the same computer. Most importantly, MIPS can vary inversely to performance: with hardware FP, MIPS is generally smaller, because one slower FP instruction replaces many fast integer instructions of a software FP routine, even though the program finishes sooner.
  • MFLOPS and what's wrong with it: Applies only to programs with floating-point operations. Counts operations instead of instructions, but floating-point instructions still differ between machines with different ISAs. There are fast and slow floating-point operations. Possible solution: weighted, source-code-level operation counts: ADD, SUB, COMPARE: 1; DIVIDE, SQRT: 2; EXP, SIN: 4
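The weighting idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the weights are the ones listed on the slide, while the enum, the array, and the function names are hypothetical and not taken from any benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* Operation classes with the weights listed on the slide:
 * ADD, SUB, COMPARE count as 1; DIVIDE, SQRT as 2; EXP, SIN as 4.
 * The enum and array names are made up for this sketch. */
enum fp_op { FP_ADD, FP_SUB, FP_COMPARE, FP_DIVIDE, FP_SQRT, FP_EXP, FP_SIN, FP_OP_COUNT };

static const unsigned fp_weight[FP_OP_COUNT] = { 1, 1, 1, 2, 2, 4, 4 };

/* Sum of source-level operation counts, each scaled by its weight. */
unsigned long normalized_fp_ops(const unsigned long counts[FP_OP_COUNT])
{
    unsigned long total = 0;
    for (size_t i = 0; i < FP_OP_COUNT; i++)
        total += counts[i] * fp_weight[i];
    return total;
}

/* Normalized MFLOPS = weighted operation count / (seconds * 10^6). */
double normalized_mflops(const unsigned long counts[FP_OP_COUNT], double seconds)
{
    assert(seconds > 0.0);
    return (double)normalized_fp_ops(counts) / (seconds * 1e6);
}
```

With 100 ADDs, 10 DIVIDEs, and 5 SINs, the weighted count is 100 + 20 + 20 = 140 operations, so a machine with a slow divider is not penalized relative to one that needs fewer but heavier instructions.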
  • The best choice of benchmarks to measure performance is real applications
  • Problematic benchmarks: Kernels: small, key pieces of real applications, e.g., Linpack. Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort. Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone
  • Why are they in disrepute? Small, fit in cache. Obsolete instruction mix. Uncontrolled source code. Prone to compiler tricks. Short runtimes on modern machines. Single-number performance characterization with a single benchmark. Difficult to reproduce results (short runtime and low-precision UNIX timer)
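One common way around the "short runtime and low-precision timer" problem is to repeat the kernel until the measured interval is long compared with the timer resolution. The sketch below uses only the standard C clock() call; seconds_per_call() and sample_kernel() are hypothetical names, and the doubling strategy is an assumption for illustration, not how any particular benchmark does it.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a tiny benchmark kernel. The volatile sink
 * keeps the compiler from removing the loop. */
static volatile unsigned long sink;

void sample_kernel(void)
{
    for (unsigned long i = 0; i < 1000; i++)
        sink += i;
}

/* Run the kernel in batches, doubling the batch size until one batch
 * takes at least min_seconds of CPU time, then report time per call. */
double seconds_per_call(void (*kernel)(void), double min_seconds)
{
    assert(kernel != NULL && min_seconds > 0.0);
    unsigned long reps = 1;
    for (;;) {
        clock_t start = clock();
        for (unsigned long i = 0; i < reps; i++)
            kernel();
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed >= min_seconds)
            return elapsed / (double)reps;
        reps *= 2;
    }
}
```

Calling seconds_per_call(sample_kernel, 0.5) spends at least half a second in the final batch, so a timer tick of a few milliseconds contributes only a small relative error.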
  • Dhrystone: Source: http://homepages.cwi.nl/~steven/dry.c, < 1000 LoC. Size of CA15 binary compiled with bionic: instructions ~ 14 KiB
      text    data    bss     dec
      13918   467     10266   24660
  • Whetstone: Dhrystone is a pun on Whetstone. Source code: http://www.netlib.org/benchmark/whetstone.c
      Test       MFLOPS    MOPS      ms
      N1 float   119.78              0.16
      N2 float   171.98              0.78
      N3 if                154.25    0.67
      N4 fixpt             397.48    0.79
      N5 cos                19.08    4.36
      N6 float    84.22              6.41
      N7 equal              86.84    2.13
      N8 exp                 5.95    6.26
      MWIPS: 463.97, total 21.55 ms
  • More on synthetic benchmarks: The best known examples of synthetic benchmarks are Whetstone and Dhrystone. Problems: Compiler and hardware optimizations can artificially inflate the performance of these benchmarks but not of real programs. The other side of the coin is that because these benchmarks are not natural programs, they don't reward optimizations of behaviors that occur in real programs. Examples: Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary. Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs. Some more discussion in the 1st edition of the textbook
  • LINPACK: a floating-point benchmark from the manual of the LINPACK library. Source: http://www.netlib.org/benchmark/linpackc, http://www.netlib.org/benchmark/linpackc.new, 883 LoC. Size of CA15 binary compiled with bionic: instructions ~ 13 KiB
      text    data    bss     dec
      12670   408     0       13086
  • CoreMark (1/2): CoreMark is a benchmark that aims to measure the performance of central processing units (CPU) used in embedded systems. It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark. The code is written in C and contains implementations of the following algorithms: linked list processing, matrix (mathematics) manipulation (common matrix operations), state machine (determine if an input stream contains valid numbers), and CRC. (From Wikipedia)
  • CoreMark (2/2): CoreMark vs. Dhrystone: Reporting rule. Use of library calls, e.g., malloc(), is avoided. CRC to make sure the data are correct. However, CoreMark is a kernel + synthetic benchmark, still with quite a small footprint
      name               LoC
      core_list_join.c   496
      core_matrix.c      308
      core_state.c       277
      core_util.c        210

      text    data    bss     dec
      18632   456     20      19108
  • So? To overcome the danger of placing all eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of processor performance across a variety of applications. Standard Performance Evaluation Corporation (SPEC)
  • Why CPU2000 in 2010s? Why ARM sticks with SPEC CPU2000 instead of CPU2006. 1999 Q4 results, the earliest available CPU2000 results (http://www.spec.org/cpu2000/results/res1999q4/): CINT2000 base: 133 to 424; CFP2000 base: 126 to 514. 2005 Opteron 144, 1.8 GHz: 1,440 (CA15 at 1.9 GHz, as reported by NVIDIA, is 1,168). CPU2006 requires much more DRAM; 1 GiB of DRAM is not enough. All numbers below are normalized to 1.0 GHz
      name           CA9   CA7   CA15   Krait
      SPECint 2000   356   320   537    326
      SPECfp 2000    298   236   567    350
  • SPEC numbers from Quantitative Approach, 5th Edition
  • How long does SPEC CPU2000 take? About 1 hr to compile. Runtime: sum of base runtimes multiplied by 3. E.g., 1.7 GHz CA15: (2256 + 3229) x 3 = 16,455 s ~= 4.57 hr. For 1.0 GHz: 4.57 x 1.7 = 7.77 hr. For CA7, assuming it is twice as slow: 7.77 x 2 = 15.54 hr
      Benchmark          Reference Time   Base Runtime   Base Ratio
      164.gzip           1400             215            652
      175.vpr            1400             198            707
      176.gcc            1100             94.8           1161
      181.mcf            1800             266            677
      186.crafty         1000             118            850
      197.parser         1800             291            619
      252.eon            1300             87.8           1480
      253.perlbmk        1800             172            1045
      254.gap            1100             107            1026
      255.vortex         1900             211            899
      256.bzip2          1500             203            740
      300.twolf          3000             399            752
      SPECint_base2000                    2256           854

      Benchmark          Reference Time   Base Runtime   Base Ratio
      168.wupwise        1600             162            991
      171.swim           3100             389            797
      172.mgrid          1800             339            532
      173.applu          2100             241            870
      177.mesa           1400             112            1254
      178.galgel         2900             201            1444
      179.art            2600             195            1332
      183.equake         1300             157            828
      187.facerec        1900             183            1036
      188.ammp           2200             353            623
      189.lucas          2000             134            1491
      191.fma3d          2100             212            988
      200.sixtrack       1100             241            456
      301.apsi           2600             310            839
      SPECfp_base2000    435              3229           909.6
  • Figure 1.16: SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.
  • EEMBC: Embedded Microprocessor Benchmark Consortium (EEMBC): 41 kernels used to predict the performance of different embedded applications: automotive/industrial, consumer, networking, office automation, telecommunication. The 3rd edition showed some EEMBC results; the 4th edition changed its mind. Unmodified performance and "full-fury" performance. Kernels, reporting options. Not a good predictor of the relative performance of different embedded computers
  • Report benchmark results: Reproducible: machine configuration (hardware, software (OS, compiler, etc.)). Summarizing results: You should not add different numbers. Some use a weighted average. Ratio: compare with a reference machine. Geometric ratio: The geometric mean of the ratios is the same as the ratio of the geometric means. The ratio of the geometric means is equal to the geometric mean of the performance ratios
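The two geometric-mean statements above can be written out as a short derivation. The notation is an assumption made for this sketch: A_i and B_i are the results of machines A and B on benchmark i, R_i is the reference machine's result, and n is the number of benchmarks.

```latex
% Assumed notation: A_i, B_i = results of machines A and B on benchmark i,
% R_i = result of the reference machine, n = number of benchmarks.
\[
\mathrm{GM}(x_1,\dots,x_n) = \Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n}
\]
\[
\frac{\mathrm{GM}(A_1/R_1,\dots,A_n/R_n)}{\mathrm{GM}(B_1/R_1,\dots,B_n/R_n)}
= \Bigl(\prod_{i=1}^{n}\frac{A_i/R_i}{B_i/R_i}\Bigr)^{1/n}
= \Bigl(\prod_{i=1}^{n}\frac{A_i}{B_i}\Bigr)^{1/n}
= \mathrm{GM}(A_1/B_1,\dots,A_n/B_n)
= \frac{\mathrm{GM}(A_1,\dots,A_n)}{\mathrm{GM}(B_1,\dots,B_n)}
\]
```

The R_i cancel term by term, so the comparison between A and B does not depend on which reference machine was chosen. An arithmetic mean of ratios does not have this property.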
  • Geometric mean
  • Fallacy: Benchmarks remain valid indefinitely. Ability to resist "benchmark engineering" or "benchmarketing". gcc is the only survivor from SPEC89. Almost 70% of all programs from SPEC2000 or earlier were dropped from the next release
  • Other benchmarks: Stream: To test memory bandwidth. It also tests floating-point performance. Options: floating-point (double, 8 bytes) array copy, scale, add, triad. lmbench: Micro-benchmark to measure software/hardware overhead from a software perspective. lmbench paper (1996), http://www.bitmover.com/lmbench/lmbench-usenix.pdf
      name    kernel                  bytes/iter   FLOPS/iter
      COPY    a(i) = b(i)             16           0
      SCALE   a(i) = q*b(i)           16           1
      SUM     a(i) = b(i) + c(i)      24           1
      TRIAD   a(i) = b(i) + q*c(i)    24           2
  • Stream 5.10 for (k=0; k
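For orientation, the TRIAD row of the table on the previous slide can be sketched as below. This is not the STREAM 5.10 source; the function names are hypothetical, and the byte count simply follows the table's 24 bytes per iteration (two 8-byte loads and one 8-byte store).

```c
#include <assert.h>
#include <stddef.h>

/* TRIAD kernel as given in the table: a(i) = b(i) + q*c(i). */
void stream_triad(double *a, const double *b, const double *c, double q, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Bytes counted per call: 24 bytes per element, as in the table. */
double triad_bytes(size_t n)
{
    return 24.0 * (double)n;
}

/* Bandwidth in MB/s (10^6 bytes per second) for a measured run time. */
double triad_mbytes_per_sec(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return triad_bytes(n) / seconds / 1e6;
}
```

For the result to say something about DRAM rather than cache, the arrays have to be much larger than the last-level cache.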
  • lmbench: lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking
  • Parallel? Let's look at other SPEC benchmarks: SPECapc for 3ds Max 2011, performance evaluation software for systems running Autodesk 3ds Max 2011. SPECapc for Lightwave 3D 9.6, performance evaluation software for systems running NewTek LightWave 3D v9.6 software. SPECjbb2005, evaluates the performance of server-side Java by emulating a three-tier client/server system (with emphasis on the middle tier). SPECjEnterprise2010, a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers. SPECjms2007, Java Message Service performance. SPECjvm2008, measuring basic Java performance of a Java Runtime Environment on a wide variety of both client and server systems. SPECapc, performance of several popular 3D-intensive applications on a given system. SPEC MPI2007, for evaluating the performance of parallel systems using MPI (Message Passing Interface) applications. SPEC OMP2001 V3.2, for evaluating the performance of parallel systems using OpenMP (http://www.openmp.org) applications. SPECpower_ssj2008, evaluates the energy efficiency of server systems. SPECsfs2008, file server throughput and response time, supporting both NFS and CIFS protocol access. SPECsip_Infrastructure2011, SIP server performance. SPECviewperf 11, performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications. SPECvirt_sc2010 ("SPECvirt"), evaluates the performance of datacenter servers used in virtualized server consolidation
  • PARSEC: The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors. Didn't really use it yet. http://parsec.cs.princeton.edu/
      Parallelization model per workload:
      Workload        Pthreads   OpenMP   Intel TBB
      blackscholes    Yes        Yes      Yes
      bodytrack       Yes        Yes      Yes
      canneal         Yes        No       No
      dedup           Yes        No       No
      facesim         Yes        No       No
      ferret          Yes        No       No
      fluidanimate    Yes        No       Yes
      freqmine        No         Yes      No
      raytrace        Yes        No       No
      streamcluster   Yes        No       Yes
      swaptions       Yes        No       Yes
      vips            Yes        No       No
      x264            Yes        No       No
  • Is Dhrystone useful? Yes, if you know its limitations. Don't market these benchmarks as if they meant real user-perceived performance
  • A7 Dhrystone (chart, DMIPS/MHz): iPhone 5s 7.47, iPhone 5s 32-bit 5.70, CA15 2.71, CA7 1.67, Krait 400 2.46
  • A7 linpack MFLOPS (chart, MFLOPS/GHz): iPhone 5s 722, iPhone 5s 32-bit 723, CA15 449, CA7 119, Krait 400 299
  • A7 CoreMark (chart, CoreMark/MHz): iPhone 5s 5.72, iPhone 5s 32-bit 4.45, CA15 3.67, CA7 2.46, Krait 400 3.30
  • Different items: Example: GeekBench 3. Arithmetic mean with different weights? How? Good properties of the geometric mean
  • Source code: So far, everything we talked about is software with source code available, either publicly/freely, e.g., Dhrystone, or for a small amount of money, e.g., SPEC CPU
  • Benchmark scores/results usually depend on the compiler, compiler flags, processors, and systems
  • Outline: Performance benchmark review. Some Android benchmarks. What we did and what still can be done. Future.
  • Back to Android: What kinds of benchmarks are available or used to compare performance? Apps with native benchmarks: Antutu, GeekBench. Java apps, e.g., Quadrant. Hybrid, with both native and Java, e.g., AndEBench and CF-Bench. We also use SPEC CPU2000 and other stuff internally
  • Ars Technica List: CPU and memory related: Quadrant, Antutu, linpack, GeekBench, AndEBench (coremark), CaffeineMark, Pi, PassMark, Samsung's benchmark
      arrayOfPackageInfo[0] = new PackageInfo("com.aurorasoftworks.quadrant.ui.standard", false);
      arrayOfPackageInfo[1] = new PackageInfo("com.aurorasoftworks.quadrant.ui.advanced", false);
      arrayOfPackageInfo[2] = new PackageInfo("com.aurorasoftworks.quadrant.ui.professional", false);
      arrayOfPackageInfo[3] = new PackageInfo("com.redlicense.benchmark.sqlite", false);
      arrayOfPackageInfo[4] = new PackageInfo("com.antutu.ABenchMark", false);
      arrayOfPackageInfo[5] = new PackageInfo("com.greenecomputing.linpack", false);
      arrayOfPackageInfo[6] = new PackageInfo("com.greenecomputing.linpackpro", false);
      arrayOfPackageInfo[7] = new PackageInfo("com.glbenchmark.glbenchmark27", false);
      arrayOfPackageInfo[8] = new PackageInfo("com.glbenchmark.glbenchmark25", false);
      arrayOfPackageInfo[9] = new PackageInfo("com.glbenchmark.glbenchmark21", false);
      arrayOfPackageInfo[10] = new PackageInfo("ca.primatelabs.geekbench2", false);
      arrayOfPackageInfo[11] = new PackageInfo("com.eembc.coremark", false);
      arrayOfPackageInfo[12] = new PackageInfo("com.flexycore.caffeinemark", false);
      arrayOfPackageInfo[13] = new PackageInfo("eu.chainfire.cfbench", false);
      arrayOfPackageInfo[14] = new PackageInfo("gr.androiddev.BenchmarkPi", false);
      arrayOfPackageInfo[15] = new PackageInfo("com.smartbench.twelve", false);
      arrayOfPackageInfo[16] = new PackageInfo("com.passmark.pt_mobile", false);
      arrayOfPackageInfo[17] = new PackageInfo("se.nena.nenamark2", false);
      arrayOfPackageInfo[18] = new PackageInfo("com.samsung.benchmarks", false);
      arrayOfPackageInfo[19] = new PackageInfo("com.samsung.benchmarks:db", false);
      arrayOfPackageInfo[20] = new PackageInfo("com.samsung.benchmarks:es1", false);
      arrayOfPackageInfo[21] = new PackageInfo("com.samsung.benchmarks:es2", false);
      arrayOfPackageInfo[22] = new PackageInfo("com.samsung.benchmarks:g2d", false);
      arrayOfPackageInfo[23] = new PackageInfo("com.samsung.benchmarks:fs", false);
      arrayOfPackageInfo[24] = new PackageInfo("com.samsung.benchmarks:ks", false);
      arrayOfPackageInfo[25] = new PackageInfo("com.samsung.benchmarks:cpu
  • Antutu 3.x: CPU: integer, floating point. Memory: RAM. Graphics: 2D, 3D. I/O: database, SD read, SD write. What are you benchmarking? What's your workload? How are the scores calculated?
  • What on earth are they doing? Actually, there is no publicly available information. But with good enough background knowledge and proper tools (we'll talk about these later), we can figure it out. It turns out most of them are from the BYTE nbench (http://en.wikipedia.org/wiki/NBench)
  • AnTuTu 3.x CPU and Memory Tests
      nbench item        Used by Antutu   Antutu part   Antutu percentage on progress bar   Order   nbench category
      NUMERIC SORT       yes              Integer       27%                                 4       integer
      STRING SORT        yes              RAM           1%                                  1       memory
      BITFIELD           yes              RAM           1%                                  2       memory
      FP EMULATION       no
      FOURIER            yes              floating      47%                                 7       floating point
      ASSIGNMENT         yes              RAM           8%                                  3       memory
      IDEA               yes              Integer       27%                                 5       integer
      HUFFMAN            yes              Integer       34%                                 6       integer
      NEURAL NET         no
      LU DECOMPOSITION   no
  • A closer look: RAM: String sort: string heap sort: StrHeapSort(), MoveMemory(), memmove(). Bit field: bit field test: DoBitops(). Assignment: task assignment test: DoAssignment(). Integer: Numeric sort: numeric heap sort: NumHeapSort(). IDEA: IDEA encryption and decryption: cipher_idea(). Huffman: Huffman encoding. Floating point: Fourier: Fourier transform: pow(), sin(), cos()
  • String Sort in NBench: Sorts an array of strings of arbitrary length. Tests memory movement performance: non-sequential performance of the cache, with the added burden that moves are byte-wide and can occur on odd address boundaries
      for(i=top; i>0; --i)
      {
              strsift(optrarray,strarray,numstrings,0,i);

              /* temp = string[0] */
              tlen=*strarray;
              MoveMemory((farvoid *)&temp[0], /* Perform exchange */
                      (farvoid *)strarray,
                      (unsigned long)(tlen+1));

              /* string[0]=string[i] */
              tlen=*(strarray+*(optrarray+i));
              stradjust(optrarray,strarray,numstrings,0,tlen);
              MoveMemory((farvoid *)strarray,
                      (farvoid *)(strarray+*(optrarray+i)),
                      (unsigned long)(tlen+1));

              /* string[i]=temp */
              tlen=temp[0];
              stradjust(optrarray,strarray,numstrings,i,tlen);
              MoveMemory((farvoid *)(strarray+*(optrarray+i)),
                      (farvoid *)&temp[0],
                      (unsigned long)(tlen+1));
      }
  • Bit field in NBench: Executes 3 bit manipulation functions. Exercises "bit twiddling" performance. Travels through memory bit-by-bit in a sequential fashion; different from sorts in that data is merely altered in place. Operations: Set: OR 1. Clear: AND 0. Toggle: XOR. Set, clear: ToggleBitRun(). Toggle: FlipBitRun()
      static void ToggleBitRun(farulong *bitmap, /* Bitmap */
              ulong bit_addr,         /* Address of bits to set */
              ulong nbits,            /* # of bits to set/clr */
              uint val)               /* 1 or 0 */
      {
      unsigned long bindex;           /* Index into array */
      unsigned long bitnumb;          /* Bit number */

      while(nbits--)
      {
      #ifdef LONG64
              bindex=bit_addr>>6;     /* Index is number /64 */
              bitnumb=bit_addr % 64;  /* Bit number in word */
      #else
              bindex=bit_addr>>5;     /* Index is number /32 */
              bitnumb=bit_addr % 32;  /* bit number in word */
      #endif
              if(val)
                      bitmap[bindex]|=(1L0; --i) NumSift(array,i,top);

      /*
      ** Repeatedly extract maximum from heap and place it at the
      ** end of the array. When we get done, we'll have a sorted
      ** array.
      */
      for(i=top; i>0; --i)
      {
              NumSift(array,bottom,i);
              temp=*array;            /* Perform exchange */
              *array=*(array+i);
              *(array+i)=temp;
      }
      return;
  • IDEA Encryption in NBench: IDEA was a new block cipher when nbench was in development. Moves through data sequentially in 16-bit chunks
      static void cipher_idea(u16 in[4],
                      u16 out[4],
                      register IDEAkey Z)
      {
      register u16 x1, x2, x3, x4, t1, t2;
      /* register u16 t16;
      register u16 t32; */
      int r=ROUNDS;

      x1=*in++;
      x2=*in++;
      x3=*in++;
      x4=*in;

      do {
              MUL(x1,*Z++);
              x2+=*Z++;
              x3+=*Z++;
              MUL(x4,*Z++);

              t2=x1^x3;
              MUL(t2,*Z++);
              t1=t2+(x2^x4);
              MUL(t1,*Z++);
              t2=t1+t2;

              x1^=t1;
              x4^=t2;

              t2^=x2;
              x2=x3^t1;
              x3=t2;
      } while(--r);
      MUL(x1,*Z++);
      *out++=x1;
      *out++=x3+*Z++;
      *out++=x2+*Z++;
      MUL(x4,*Z);
      *out=x4;
      return;
      }
  • Huffman in NBench Everybody knows Huffman code, right? A combination of byte operations, bit twiddling, and overall integer manipulation ..... /* ** Huffman tree built...compress the plaintext */ bitoffset=0L; /* Initialize bit offset */ for(i=0;i
  • Fourier in NBench: No, not an FFT. A good measure of the transcendental and trigonometric performance of the FPU. Little array activity, so this test should not depend on cache or memory architecture
      static double thefunction(double x,     /* Independent variable */
                      double omegan,          /* Omega * term */
                      int select)             /* Choose term */
      {
      /*
      ** Use select to pick which function we call.
      */
      switch(select)
      {
              case 0: return(pow(x+(double)1.0,x));
              case 1: return(pow(x+(double)1.0,x) * cos(omegan * x));
              case 2: return(pow(x+(double)1.0,x) * sin(omegan * x));
      }
  • Neural Net in NBench: A small but functional back-propagation network simulator. Small-array floating-point test heavily dependent on the exponential function; less dependent on overall FPU performance
  • LU Decomposition in NBench: Yes, the LU decomposition you learned in linear algebra: a robust algorithm for solving linear equations. A floating-point test that moves through arrays in both row-wise and column-wise fashion. Exercises only fundamental math operations (+, -, *, /)
  • GeekBench: A cross-platform one. The only publicly available one we could use to compare Android, iOS, and other platforms. Quite clearly described test items: http://support.primatelabs.com/kb/geekbench/geekbench-3-benchmarks. Explains how to interpret results: http://support.primatelabs.com/kb/geekbench/interpreting-geekbench-3-scores. Source code available if you pay
  • Vellamo: HTML5. Metal: Dhrystone, Linpack, Branch-K, Stream 5.9, RamJam, Storage. Some are well-known; some are written by QuIC? Anyway, all of them are described at http://www.quicinc.com/vellamo/test-descriptions/
  • CFBench: Used by some people, because it tests both Java and native versions and its author is quite active in the xda-developers forum. Some problems: no good description of the tests; some code is wrong, e.g., its Native Memory Read test is not testing memory read, because the malloc()ed array is not initialized
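A sketch of how a read test could avoid this problem is shown below. It is not CF-Bench code: alloc_touched() and read_sum() are hypothetical names. The underlying assumption is that on Linux, pages of a large malloc()ed block that were never written may not be backed by distinct physical memory, so a read loop over them does not necessarily exercise DRAM.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate a buffer and write every byte, so that
 * each page is populated before the timed read loop runs. */
uint32_t *alloc_touched(size_t words)
{
    assert(words > 0);
    uint32_t *buf = malloc(words * sizeof *buf);
    if (buf != NULL)
        memset(buf, 0x01, words * sizeof *buf);
    return buf;
}

/* Hypothetical read kernel: sum all words. Returning the sum keeps the
 * compiler from discarding the loads as dead code. */
uint64_t read_sum(const uint32_t *buf, size_t words)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}
```

Only read_sum() would be timed; the memset() in alloc_touched() runs before the clock starts.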
  • Outline: Performance benchmark review. Some Android benchmarks. What we did and what still can be done. Future.
  • How do we improve benchmark performance?
  • In the good old days, we had source code; we compiled and ran benchmark programs. In the current Android ecosystem, we usually don't have source. Profiling: oprofile, perf, DS-5. Profiling sometimes doesn't report the real bottleneck function, e.g., static functions are usually inlined and don't have symbols in shipped binaries. binutils: nm, readelf, objdump; gdb. Improving the libraries, e.g., libc and libm, and the runtime system, e.g., the JIT of Dalvik, used by those benchmarks
  • Antutu 3.x: memmove() in bionic --> bcopy() in C: rewrite with NEON assembly code. pow(), sin(), cos() in C: rewrite them in assembly
  • bcopy() in bionic: MoveMemory() in nbench -> memmove() in bionic -> bcopy() in bionic. memcpy() is assembly in bionic, and there are processor-specific ones (CA9, CA15, Krait); NEON (vector load/store) helps. Not so for bcopy()
      in bionic/libc/bionic/memmove.c

      void *memmove(void *dst, const void *src, size_t n)
      {
        const char *p = src;
        char *q = dst;
        /* We can use the optimized memcpy if the source and destination
         * don't overlap.
         */
        if (__builtin_expect(((q < p) && ((size_t)(p - q) >= n))
                          || ((p < q) && ((size_t)(q - p) >= n)), 1)) {
          return memcpy(dst, src, n);
        } else {
          bcopy(src, dst, n);
          return dst;
        }
      }

      in bionic/libc/string/bcopy.c

      /*
       * Copy a block of memory, handling overlap.
       * This is the routine that actually implements
       * (the portable versions of) bcopy, memcpy, and memmove.
       */
      #ifdef MEMCOPY
      void *
      memcpy(void *dst0, const void *src0, size_t length)
      #else
      #ifdef MEMMOVE
      void *
      memmove(void *dst0, const void *src0, size_t length)
      #else
      void
      bcopy(const void *src0, void *dst0, size_t length)
      #endif
      #endif
      {
      .....
  • Antutu 3.x: For people with source code: selection of toolchain and compiler options may cause a huge difference, e.g., bit field. Some version of the x86 binary for Antutu 3.x was compiled with Intel's compiler; bit-by-bit operations were turned into word-wide (32-bit) operations, and the speedup is about 70x
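To make the bit-by-bit versus word-wide difference concrete, here is a hand-written sketch of both ways to set a run of bits in a 32-bit bitmap. It only illustrates the kind of transformation described above; it is not the code any compiler emitted, and both function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit: one read-modify-write per bit, in the style of the
 * nbench bit field loop. */
void set_run_bitwise(uint32_t *bitmap, size_t bit_addr, size_t nbits)
{
    assert(bitmap != NULL);
    while (nbits--) {
        bitmap[bit_addr >> 5] |= (uint32_t)1 << (bit_addr & 31);
        bit_addr++;
    }
}

/* Word-wide: build one mask per 32-bit word and apply it in a single
 * OR, so a long run costs roughly nbits/32 memory operations. */
void set_run_wordwise(uint32_t *bitmap, size_t bit_addr, size_t nbits)
{
    assert(bitmap != NULL);
    while (nbits > 0) {
        size_t index  = bit_addr >> 5;
        size_t offset = bit_addr & 31;
        size_t chunk  = 32 - offset;      /* bits left in this word */
        if (chunk > nbits)
            chunk = nbits;
        /* 'chunk' one-bits starting at position 'offset'. */
        uint32_t mask = (chunk == 32)
            ? UINT32_MAX
            : ((((uint32_t)1 << chunk) - 1) << offset);
        bitmap[index] |= mask;
        bit_addr += chunk;
        nbits    -= chunk;
    }
}
```

Both functions leave the bitmap in the same state; the second one simply touches each word once instead of up to 32 times.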
  • Stream copy is usually turned into memcpy()
  • Remote gdb
      1. Get /system/bin/app_process and /system/bin/linker of the target system and the necessary shared libraries, e.g., /data/data/eu.chainfire.cfbench/lib/libCFBench.so
           adb pull /system/bin/app_process
           adb pull /system/bin/linker lib/armeabi-v7a/
           adb pull /data/data/eu.chainfire.cfbench/lib/libCFBench.so lib/armeabi-v7a/
      2. arm-linux-gnueabi-gdb ./app_process
      3. On the target device, attach gdbserver to the running process you want to debug
           ./gdbserver --attach :5039 3484
      4. Set the shared library search path
           (gdb) set solib-search-path /Users/freedom/tmp/cfbench/lib/armeabi-v7a
      5. adb forward tcp:5039 tcp:5039 and set the remote target
           (gdb) target remote :5039
      6. You can set breakpoints, print backtraces, disassemble, etc.
  • (gdb) b Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned
      (gdb) disassemble
      Dump of assembler code for function Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned:
         0x74b65848: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, lr}
      => 0x74b6584c: bl 0x74b654ac
         0x74b65850: mov.w r0, #1048576 ; 0x100000
         0x74b65854: blx 0x74b65358
         0x74b65858: movs r6, #0
         0x74b6585a: movw r9, #9999 ; 0x270f
         0x74b6585e: mov r8, r0
         0x74b65860: bl 0x74b6547c
         0x74b65864: add.w r5, r8, #1048576 ; 0x100000
         0x74b65868: mov r10, r0
         0x74b6586a: mov r3, r8
         0x74b6586c: ldr.w r2, [r3], #4
         0x74b65870: cmp r3, r5
         0x74b65872: add r4, r2
         0x74b65874: bne.n 0x74b6586c
         0x74b65876: bl 0x74b6547c
         0x74b6587a: adds r6, #1
         0x74b6587c: rsb r7, r10, r0
         0x74b65880: cmp r7, r9
  • Quadrant: Written in Java. CPU: not really testing the CPU. Memory: profiling shows that memcpy() is heavily used. What can we do? Optimize the JIT part of the DVM
  • What other possible ways? Binary translation, at installation time or at run time
  • Wrap-up: Popular CPU and memory benchmarks on Android mostly don't reflect real CPU performance. We know CPU performance != system performance != user-perceived performance. There is always room for improvement
  • So?
  • Recent progress: EEMBC's AndEBench 2.0 is under development (http://www.eembc.org/press/pressrelease/130128.html). Qualcomm asked BDTi to develop a new benchmark (http://www.qualcomm.com/media/blog/2013/08/16/mobile-benchmarking-turning-corner-user-experience). Samsung, with other vendors, launched the MobileBench Consortium last year. Antutu is still growing
  • Thanks!
  • MediaTek joined linaro.org last month. linaro.org is an NPO working on open source Linux/Android-related stuff for ARM-based SoCs. So MTK is getting more open recently, and it's looking for open source engineers. Talk to the guys at the MTK booth or to me. There are more non-open-source jobs
  • Backup
  • Some References to Understand Performance Benchmark: Raj Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, 1991. Computer Architecture: A Quantitative Approach. A good SPEC introduction article, http://mrob.com/pub/comp/benchmarks/spec.html. Kaivalya M. Dixit, Overview of the SPEC Benchmarks, http://people.cs.uchicago.edu/~chliu/doc/benchmark/chapter9.pdf
  • lmbench output (1/2)
      Basic system parameters
        Host: localhost, OS: Linux 3.4.5-g, Description: armv7l-linux-gnu
        Mhz 1696, tlb pages 7, cache line bytes 64, mem par 4.4700, scal load 1
      Processor, Processes - times in microseconds - smaller is better
        Host: localhost, OS: Linux 3.4.5-g, Mhz 1696
        null call 0.49, null I/O 0.67, stat 2.54, open clos 5.95, slct TCP 8.52, sig inst 0.67, sig hndl 5.05, fork proc 876., exec proc 1668, sh proc 4654
      Basic integer operations - times in nanoseconds - smaller is better
        Host: localhost, OS: Linux 3.4.5-g
        intgr bit 1.0700, intgr add 0.1100, intgr mul 3.4000, intgr div 90.5, intgr mod 14.8
      Basic float operations - times in nanoseconds - smaller is better
  • lmbench output (2/2)
      Context switching - times in microseconds - smaller is better
        Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
        localhost Linux 3.4.5-g 8.9700 4.9000 6.1400 12.3 7.68000 57.6
      *Local* Communication latencies in microseconds - smaller is better
        Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP ctxsw UNIX UDP TCP conn
        localhost Linux 3.4.5-g 8.970 17.6 23.9 47.5 71.3 357.
      File & VM system latencies in microseconds - smaller is better
        Host OS 0K File 10K File Mmap Prot Page 100fd Create Delete Create Delete Latency Fault Fault selct
        localhost Linux 3.4.5-g 700.0 1.259 2.55270 3.048
      *Local* Communication bandwidths in MB/s - bigger is better
        Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
  • PARSEC content: Blackscholes: This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically. Bodytrack: This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces. Canneal: This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance. Dedup: This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems. Facesim: This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects. Ferret: This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model.
  • PARSEC content: Fluidanimate: This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations. Freqmine: This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques. Raytrace: The Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene. Streamcluster: This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics. Swaptions: The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices. Vips: This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution. X264

