
Page 1: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Experience with 100Gbps Network Applications

Mehmet Balman, Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, and Brian L. Tierney

DIDC – Delft, the Netherlands, June 19, 2012

Page 2: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Experience with 100Gbps Network Applications

Mehmet Balman, Computational Research Division
Lawrence Berkeley National Laboratory

DIDC – Delft, the Netherlands, June 19, 2012

Page 3: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Outline

• A recent 100Gbps demo by ESnet and Internet2 at SC11
• Two applications:
  • Visualization of remotely located data (cosmology)
  • Data movement of large datasets with many files (climate analysis)

Our experience covers application design issues and host tuning strategies needed to scale to 100Gbps rates.

Page 4: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

The  Need  for  100Gbps  networks  

Modern science is data-driven and collaborative in nature.

• The largest collaborations are the most likely to depend on distributed architectures.
  • LHC (distributed architecture): data generation, distribution, and analysis
• The volume of data produced by genomic sequencers is rising exponentially.
• In climate science, researchers must analyze observational and simulation data located at facilities around the world.

Page 5: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

ESG (Earth Systems Grid)

• Over 2,700 sites
• 25,000 users

• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB

Page 6: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

100Gbps  networks  arrived  

• Increasing network bandwidth is an important step toward tackling ever-growing scientific datasets.
• 1Gbps to 10Gbps transition (10 years ago)
  • Applications did not run 10 times faster just because more bandwidth was available.

To take advantage of the higher network capacity, we need to pay close attention to application design and host tuning issues.

Page 7: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Applications’  Perspective  

• Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.

• Real-time streaming and visualization of cosmology data
  • How does high network capacity enable remotely located scientists to gain insight from large data volumes?

• Data distribution for climate science
  • How can scientific data movement and analysis between geographically disparate supercomputing facilities benefit from high-bandwidth networks?

Page 8: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

The  SC11  100Gbps  demo  

• 100Gbps connection between ANL (Argonne), NERSC at LBNL, ORNL (Oak Ridge), and the SC11 booth (Seattle)

Page 9: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Demo Configuration

RTT: Seattle – NERSC: 16ms; NERSC – ANL: 50ms; NERSC – ORNL: 64ms

Page 10: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Visualizing  the  Universe  at  100Gbps  

• VisaPult for streaming data
• Paraview for rendering
• For visualization purposes, occasional packet loss is acceptable (hence UDP)

• 90Gbps of the bandwidth is used for the full dataset
  • 4x10Gbps NICs (4 hosts)
• 10Gbps of the bandwidth is used for 1/8 of the same dataset
  • 10Gbps NIC (one host)

Page 11: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Demo Configuration

[Figure: demo configuration — 16 sender hosts read from a flash-based GPFS cluster at NERSC and feed the NERSC router, which drives the 100G pipe to the LBL booth router at SC11; four Receive/Render hosts (H1–H4) serve the high-bandwidth display, a separate Receive/Render/Vis host serves the low-bandwidth display, and a visualization server is reached over an InfiniBand cloud. Link legend: Gigabit Ethernet, InfiniBand, 10GigE, and 1GigE connections.]

The 1Gbps connection is used for synchronization and communication of the rendering application, not for transfer of the raw data.

Page 12: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

UDP shuffling

• UDP packets include position (x, y, z) information (1024^3 matrix)
• MTU is 9000: the largest possible payload is 8972 bytes (MTU minus the IP and UDP headers)
• 560 points in each packet, 8968 bytes
  • 3 integers (x, y, z) + a floating-point value per point (a packing sketch follows)

[Figure: UDP packet layout — a header (Batch#, n) followed by n records X1Y1Z1D1, X2Y2Z2D2, …, XnYnZnDn; in the final run n = 560 and the packet size is 8968 bytes.]
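A minimal packing sketch in Python based on the packet layout above; the two-integer (Batch#, n) header and the native byte order are assumptions, since the slides do not give the exact on-the-wire layout:

import socket
import struct

POINTS_PER_PACKET = 560            # as in the final run
HEADER_FMT = "=ii"                 # batch number, point count (assumed layout)
POINT_FMT = "=iiif"                # x, y, z as 32-bit ints + one float value

def pack_batch(batch_no, points):
    # 8-byte header + 560 * 16-byte points = 8968 bytes, which fits under the
    # 8972-byte limit (9000 MTU minus 20-byte IP and 8-byte UDP headers).
    payload = struct.pack(HEADER_FMT, batch_no, len(points))
    for x, y, z, d in points:
        payload += struct.pack(POINT_FMT, x, y, z, d)
    return payload

# Usage sketch: send one batch to a placeholder receiver address.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
pts = [(i % 1024, (i // 1024) % 1024, 0, 1.0) for i in range(POINTS_PER_PACKET)]
sock.sendto(pack_batch(0, pts), ("127.0.0.1", 9000))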

Page 13: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Data Flow

• For each time step, the input data is split into 32 streams along the z-direction; each stream contains a contiguous slice of size 1024 * 1024 * 32 (indexing sketched after the figure below).
• 32 streams for the 90Gbps demo (high bandwidth)
• 4 streams for the 10Gbps demo (low bandwidth, 1/8 of the data)

[Figure: flow of data — GPFS Flash → Shuffler (send server at NERSC) → Receiver → Stager (/dev/shm) → Render SW (receive server at the SC booth).]
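As a concrete illustration of the split (a sketch only; the demo's actual in-memory layout is not given in the slides), the z-range carried by each stream can be computed as:

# Cut the 1024^3 volume along z into 32 contiguous slices of 1024 x 1024 x 32,
# one slice per stream; the 10Gbps demo sent 4 of these slices (1/8 of the data).
NZ, NSTREAMS = 1024, 32
depth = NZ // NSTREAMS                           # 32 z-planes per stream
z_ranges = [(k * depth, (k + 1) * depth) for k in range(NSTREAMS)]
# stream k carries volume[:, :, z0:z1] where (z0, z1) = z_ranges[k]
print(z_ranges[0], z_ranges[-1])                 # (0, 32) ... (992, 1024)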

Page 14: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Performance  Optimization  

• Each 10Gbps NIC in the system is bound to a specific core.
• Receiver processes are also bound to the same core.
• The renderer runs on the same NUMA node but on a different core (accessing the same memory region); see the affinity sketch after the figure.

[Figure: NUMA layout — within one NUMA node, Receiver 1 and Receiver 2 are each bound to the core handling their 10G port, while Render 1 and Render 2 run on separate cores with access to the same local memory.]
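A minimal affinity sketch (Linux-only, using Python's os.sched_setaffinity); the specific core numbers are illustrative assumptions, not the demo's actual mapping:

import os

RECEIVER_CORES = {2}   # assumed: same core that handles the NIC's interrupts
RENDER_CORES = {3}     # assumed: a different core on the same NUMA node

def pin(pid, cores):
    # Restrict a process to the given set of CPU cores.
    os.sched_setaffinity(pid, cores)

if __name__ == "__main__":
    pin(0, RECEIVER_CORES)                    # 0 = the calling process
    print("running on cores:", os.sched_getaffinity(0))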

Page 15: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Network  Utilization  

• 2.3TB of data moved from NERSC to the SC11 booth in Seattle in ≈ 3.4 minutes
• For each timestep, corresponding to 16GB of data, it took ≈ 1.4 seconds to transfer and ≈ 2.5 seconds for rendering before the image was updated
• Peak ≈ 99Gbps
• Average ≈ 85Gbps

Page 16: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Demo  

Page 17: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Climate Data Distribution

• ESG data nodes
• Data replication in the ESG Federation
• Local copies
  • Data files are copied into temporary storage at HPC centers for post-processing and further climate analysis.

Page 18: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Climate  Data  over  100Gbps  

• Data volume in climate applications is increasing exponentially.
• An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
  • Climate simulation data consists of a mix of relatively small and large files with an irregular file-size distribution in each dataset.
  • Many small files

Page 19: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Keep  the  data  channel  full  

[Figure: file-centric request/response patterns — FTP: repeated "request a file" / "send file" exchanges; RPC: "request data" / "send data".]

• Concurrent transfers
• Parallel streams
(a minimal concurrency sketch follows)
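A minimal sketch of concurrent transfers, each on its own TCP connection so that several request/response exchanges are in flight at once; the host name, port, file names, and one-line request protocol are all hypothetical:

import socket
from concurrent.futures import ThreadPoolExecutor

HOST, PORT = "data-server.example.org", 2811      # placeholder endpoint

def fetch(filename):
    # One transfer on its own connection, so transfers can overlap.
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(f"GET {filename}\n".encode())
        received = 0
        while True:
            data = s.recv(1 << 20)
            if not data:
                break
            received += len(data)
    return filename, received

# Keep several requests in flight so the pipe stays full while individual
# round trips wait on latency.
files = [f"cmip3/file_{i:04d}.nc" for i in range(16)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for name, size in pool.map(fetch, files):
        print(name, size, "bytes")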

Page 20: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

The lots-of-small-files problem! File-centric tools?

• Not necessarily high-speed (same distance): latency is still a problem

[Figure: 100Gbps pipe vs. 10Gbps pipe — "request a dataset" followed by "send data".]

Page 21: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Block-based

[Figure: block-based architecture — front-end threads on each side access memory blocks, and the blocks are exchanged between client and server over the network.]

Memory caches are logically mapped between client and server.

Page 22: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Moving climate files efficiently

Page 23: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Advantages

• Decoupling I/O and network operations (a minimal sketch of this decoupling follows)
  • Front-end (I/O, processing)
  • Back-end (networking layer)
• Not limited by the characteristics of the file sizes: an on-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management: the parallelism level can be increased or decreased both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
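The following is a simplified sketch of that decoupling, using a bounded queue of fixed-size blocks between a front-end reader thread and a back-end sender thread. It only illustrates the idea: MemzNet itself uses memory-mapped, zero-copy blocks rather than a Python queue, and the file names and destination below are hypothetical.

import queue
import socket
import threading

BLOCK_SIZE = 4 * 1024 * 1024            # 4MB blocks, as in the SC11 setup
blocks = queue.Queue(maxsize=256)       # bounded in-memory cache of blocks

def frontend_reader(paths):
    # Front-end thread: I/O only -- read files and cut them into blocks,
    # regardless of how many files there are or how small they are.
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                blocks.put(block)       # waits here if the cache is full
    blocks.put(None)                    # end-of-data marker

def backend_sender(host, port):
    # Back-end thread: network only -- stream whatever blocks are available.
    with socket.create_connection((host, port)) as s:
        while True:
            block = blocks.get()
            if block is None:
                break
            s.sendall(block)

threading.Thread(target=frontend_reader,
                 args=(["tas_A1.nc", "pr_A1.nc"],), daemon=True).start()
backend_sender("receiver.example.org", 5000)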

Page 24: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Demo Configuration

Disk to memory / reading from GPFS (NERSC), max 120Gbps read performance

Page 25: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

The  SC11  100Gbps  Demo  

• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB
  • Each block's data section was aligned according to the system page size (see the sketch after this list)
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel
• At ANL/ORNL, 4 front-end threads for processing received data blocks
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection
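A small sketch of the page alignment, assuming a block starts with a fixed-size header (the 64-byte header size here is hypothetical) and its data section is padded out to the next page boundary:

import mmap

PAGE_SIZE = mmap.PAGESIZE               # system page size (typically 4096)

def align_up(offset, page=PAGE_SIZE):
    # Round an offset up to the next page boundary.
    return (offset + page - 1) // page * page

HEADER_BYTES = 64                       # hypothetical block header size
data_offset = align_up(HEADER_BYTES)    # data section starts page-aligned
assert data_offset % PAGE_SIZE == 0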

Page 26: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

83Gbps    throughput  

Page 27: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

MemzNet: memory-mapped zero-copy network channel

[Figure: MemzNet architecture (same as the earlier block-based figure) — front-end threads on each side access memory blocks, which are exchanged between client and server over the network.]

Memory caches are logically mapped between client and server.

Page 28: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

ANI  100Gbps    testbed  

[Figure: ANI Middleware Testbed (updated December 11, 2011). At NERSC, hosts nersc-diskpt-1/2/3 and nersc-app connect through the nersc-C2940 switch and the site router (nersc-mr2) to an ANI 100G router; at ANL, hosts anl-mempt-1/2/3 and anl-app connect through the anl-C2940 switch and the ANL site router to another ANI 100G router. The two 100G routers are joined by the ANI 100G network, with 10G links to ESnet. Each data-transfer host has 4x10GE (MM) data links (eth2-5) plus a 1GE management link (eth0). NICs: anl-mempt-1 and anl-mempt-2: 2x 2x10G Myricom; anl-mempt-3: 1x 2x10G Myricom, 1x 2x10G Mellanox; nersc-diskpt-1: 2x 2x10G Myricom, 1x 4x10G HotLava; nersc-diskpt-2: 1x 2x10G Myricom, 1x 2x10G Chelsio, 1x 6x10G HotLava; nersc-diskpt-3: 1x 2x10G Myricom, 1x 2x10G Mellanox, 1x 6x10G HotLava. Note: the ANI 100G routers and 100G wave are available until summer 2012; testbed resources after that are subject to funding availability.]

Page 29: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

ANI 100Gbps testbed

[Figure: the same ANI 100Gbps testbed diagram as on the previous slide, shown again in the context of the SC11 100Gbps demo.]

Page 30: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Many TCP Streams

[Figure: ANI testbed at 100Gbps (10 x 10G NICs, three hosts) — throughput vs. the number of parallel streams (1, 2, 4, 8, 16, 32, 64 streams, 5-minute intervals); TCP buffer size is 50M.]

Page 31: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

[Figure: ANI testbed at 100Gbps (10 x 10G NICs, three hosts) — interface traffic vs. the number of concurrent transfers (1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals); TCP buffer size is 50M.]

Page 32: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Performance  at  SC11  demo  

TCP buffer size is set to 50MB

[Figure: throughput during the SC11 demo, comparing MemzNet and GridFTP.]

Page 33: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Performance  results  in  the  100Gbps  ANI  testbed  

Page 34: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Host tuning

With proper tuning, we achieved 98Gbps using only 3 sending hosts, 3 receiving hosts, 10 10GE NICs, and 10 TCP flows.

Page 35: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

NIC/TCP Tuning

• We are using Myricom 10G NICs (100Gbps testbed)
• Download the latest driver/firmware from the vendor site; the driver version in RHEL/CentOS is fairly old
• Enable MSI-X
• Increase txqueuelen:
  /sbin/ifconfig eth2 txqueuelen 10000
• Increase interrupt coalescing:
  /usr/sbin/ethtool -C eth2 rx-usecs 100
• TCP tuning:
  net.core.rmem_max = 67108864
  net.core.wmem_max = 67108864
  net.core.netdev_max_backlog = 250000

Page 36: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

100Gbps = It's full of frames!

• Problem:
  • Interrupts are very expensive
  • Even with jumbo frames and driver optimization, there are still too many interrupts
• Solution:
  • Turn off Linux irqbalance (chkconfig irqbalance off)
  • Use /proc/interrupts to get the list of interrupts
  • Dedicate an entire processor core to each 10G interface
  • Use /proc/irq/<irq-number>/smp_affinity to bind rx/tx queues to a specific core (a small binding sketch follows)
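A small binding sketch (requires root); the IRQ number would come from /proc/interrupts and is hypothetical here:

def bind_irq_to_core(irq, core):
    # Write a one-bit CPU mask (hex) into /proc/irq/<irq>/smp_affinity so the
    # NIC queue's interrupts are handled by a single dedicated core.
    mask = 1 << core
    with open("/proc/irq/%d/smp_affinity" % irq, "w") as f:
        f.write("%x\n" % mask)

# bind_irq_to_core(94, 4)   # e.g. pin IRQ 94 (a 10G rx/tx queue) to core 4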

Page 37: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Host Tuning Results (throughput in Gbps)

Tuning setting               Without tuning   With tuning   % improvement
Interrupt coalescing (TCP)   24.0             36.8          53.3
Interrupt coalescing (UDP)   21.1             38.8          83.9
IRQ binding (TCP)            30.6             36.8          20.3
IRQ binding (UDP)            27.9             38.5          38.0

[Figure: bar chart of the host tuning results above, without tuning vs. with tuning.]

Page 38: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Conclusion  

• Host tuning & host performance
  • Multiple NICs and multiple cores
  • The effect of the application design
• TCP/UDP buffer tuning, using jumbo frames, and interrupt coalescing
• Multi-core systems: IRQ binding is now essential for maximizing host performance

Page 39: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Acknowledgements

Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani

Page 40: HPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands

Questions?

Contact:
• Mehmet Balman: [email protected]
• Brian Tierney: [email protected]