An introduction and evaluations of a wide area distributed storage system

Upload: hiroki-kashiwazaki


DESCRIPTION

A presentation given at the Storage Developer Conference (SDC) 2014 in Santa Clara, California: a general overview of distcloud so far and of its future direction.

TRANSCRIPT

1. An introduction and evaluations of a wide area distributed storage system

2. DR: Disaster Recovery
3. 1978
4. Sun Information Systems
5. mainframe hot site
6. 80s
7. Real-time Processing
8. POS: point of sales
9. 90s
10. the Internet
11. 2001.9.11: September 11 attacks
12. 2003.8.14: Northeast blackout of 2003
13. in Japan
14. 2011.3.11: the aftermath of the 2011 Tohoku earthquake and tsunami
15. BCP: Business Continuity Plan
16. Eurasian Plate, North American Plate, Pacific Ocean Plate, Philippine Sea Plate; the epicenter of 3.11; the Nankai (South Sea) Trough
17. Gunma
18. Gunma
19. Gunma
20. Ishikari
21. Ishikari
22. Is two enough?
23. cost
24. National Institute of Informatics
25. Trans-Japan Inter-Cloud Testbed
26. Kitami Institute of Technology to University of the Ryukyus: the longest path on SINET
27. Cybermedia Center, Osaka University; Kitami Institute of Technology; University of the Ryukyus; XenServer 6.0.2 and CloudStack 4.0.0 at each site
28. problems
29. shared storage
30. 50 ms
31. RTT > 200 ms
32. Storage XenMotion: live migration without shared storage (XenServer 6.1 or later)
33. VSA: vSphere Storage Appliance
34. VMware VSAN
35. WIDE cloud (a different approach)
36. Distributed Storage
37. requirement
38. High random R/W performance. [iozone surface plot: throughput (KB/s) by file size and record size, both in 2^n KB]
39. POSIX file system interface; protocols: NFS, CIFS, iSCSI
40. RICC: Regional InterCloud Committee
41. Distcloud: a widely distributed virtualization infrastructure
42. Global VM migration is also available by sharing the "storage space" among VM host machines: real-time availability makes it possible, and the actual data copy follows. Live migration of a VM between distributed areas; the real-time, active-active behavior makes the system look like a simple "shared storage". Live migration is also possible between DR sites (it requires a virtually common Ethernet segment / subnet and a fat pipe for the memory copy, of course). [Diagram: TOYAMA, OSAKA and TOKYO sites; copies to the DR sites before and after migration]
43. Front-end servers aggregate client requests (READ / WRITE) so that many back-end servers can handle user data in a parallel, distributed manner. Both performance and storage space scale with the number of servers. Front end (access servers): access gateway for clients via NFS, CIFS or similar. Back end (core servers): write blocks on WRITE requests, read blocks on READ requests. Scalable performance and scalable storage size through parallel and distributed processing.
44. A file is split into blocks; metadata records the mapping, and a consistent hash distributes the blocks over the back end (core servers).
45. (1) Assign a new unique ID to every updated block, so that the ID ensures consistency; this is the most important step and the key to "distributed replication". (2) Create two copies in the local site (for a quick ACK), write the metadata, and return the ACK. (3-a) Make a copy in a different location right after the ACK. (3-b) Remove one of the two local blocks later. A file consists of many blocks; multiplicity across multiple locations makes each piece of user data redundant: locally at first, three distributed copies in the end. (See the sketch below.)
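Slides 44 and 45 are the core of the write path: a file is split into blocks, metadata and a consistent hash place the blocks on the core servers, every update gets a new unique block ID, two local copies are made so the ACK can be returned quickly, and a remote copy follows asynchronously. The Python sketch below only illustrates that flow; the class names, the in-memory servers, the threading model and the site names are assumptions made for the example and are not taken from distcloud's implementation.

import threading
import time

REDUNDANCY = 3  # target number of copies per block across sites


class CoreServer:
    """Toy in-memory core server, used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def remove(self, block_id):
        self.blocks.pop(block_id, None)


def write_block(path, block_no, version, data, local_site, remote_site, metadata):
    """Sketch of the write path described on slide 45 (not distcloud's code):
    (1) assign a new unique ID to the updated block to ensure consistency,
    (2) create two copies in the local site, update metadata, return the ACK,
    (3-a) copy the block to a different location right after the ACK,
    (3-b) later drop one of the two local copies, leaving REDUNDANCY copies
          distributed across sites."""
    block_id = f"{path}/block-{block_no}#v{version}"   # (1) unique per update

    local_site[0].store(block_id, data)                # (2) local copy A
    local_site[1].store(block_id, data)                # (2) local copy B
    metadata[(path, block_no)] = block_id              # (2) update the metadata
    print("ACK", block_id)                             # quick ACK to the client

    def replicate_remotely():
        remote_site[0].store(block_id, data)           # (3-a) remote copy
        local_site[1].remove(block_id)                 # (3-b) drop a local copy

    threading.Thread(target=replicate_remotely, daemon=True).start()


# Hypothetical sites for the demo.
osaka = [CoreServer("osaka-cs1"), CoreServer("osaka-cs2")]
tokyo = [CoreServer("tokyo-cs1")]
meta = {}

write_block("/vm/images/demo.img", 0, 42, b"...", osaka, tokyo, meta)
time.sleep(0.1)  # let the background replication finish in this toy example

Slides 47 and 48 below express the same idea as a countdown of the remaining redundancy r (with e counting external copies): the client is acknowledged once the two local copies exist, and the remaining wide-area copies are completed afterwards.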
46. NFS, CIFS, iSCSI
47. redundancy = 3: write; r = 2 (ACK), r = 1, r = 0
48. redundancy = 3: ACK at r = 2, e = 0; then r = 1, e = 0; r = 0, e = 1; r = -1, e = 2 (e: external copies)
49. 10 Gbps; a virtualization host running VMs; access server (nfsd); core servers; VM images stored as chunks
50. 316 km, 440 km, 690 km; Hiroshima Univ., Kanazawa Univ.
51. Hiroshima Univ., Kanazawa Univ. and NII. VMM: virtual machine monitor; CS: core servers; HS: hint servers; AS: access servers. [Network diagram: L2VPN and L3VPN links connecting the EXAGE-LAN, MIGRATION-LAN and admin LAN at each site]
52. iozone -aceI (a: full automatic mode; c: include close() in the timing calculations; e: include flush (fsync, fflush) in the timing calculations; I: use DIRECT_IO if possible for all file operations)
53. write: [iozone surface plots of throughput (KB/s) by file size and record size, both in 2^n KB]
54. [iozone panels: write, re-write, read, re-read, random read, random write, backwards read, record rewrite, strided read, fwrite, frewrite; throughput (MB/s) by file size and record size]
55. [throughput (MB/s) vs. file size from 10 MB to 10 GB on Exage/Storage for write, rewrite, read, reread, random read, random write, backward read, stride read, fwrite, fread and record rewrite]
56. SINET4: Hiroshima University (EXAGE, L3VPN) and Kanazawa University (EXAGE, L3VPN)
57. Kanazawa Univ. and Hiroshima Univ.: core servers, a KVM host, access servers (distcloud) and an NFS server
58. Proposed method (read) vs. NFS (read): the decline of throughput caused by latency after live migration starts
59. Proposed method vs. shared NFS: throughput (MB/s) for read and write, before and after migration
60. SC2013, 2013/11/17-22, @ Colorado Convention Center
61. Ikuo Nakagawa, INTEC Inc. / Osaka University
62. Kohei Ichikawa, Nara Institute of Science and Technology
63. We have been developing a widely distributed cluster storage system and evaluating the storage along with various applications. The main advantage of our storage is its very fast random I/O performance, even though it provides a POSIX-compatible file system interface on top of the distributed cluster storage.
64. an initial plan
65. Shinji Shimojo, Director of JGN-X, NICT
66. It's not so fun.
67. real stage
68. 24,000 km, RTT = 244 ms, 1 Gbps loopback: the real stage
69. Blocks (chunks) are located on the nearest
70. consistent hash: metadata is not suitable for a wide area
71. Required time to migrate: domestic, no load: 17.9 s; international, no load: 201.6 s; international, read load: 175.4 s; international, write load: 400.6 s. I/O performance (average throughput of dd): domestic (read): 64.6 MB/s; domestic (write): 58.7 MB/s; international (read): 25.4 MB/s; international (write): 20.9 MB/s
72. Live migration demo on an international line
73. Evaluations of distcloud on an international line
74. Disaster recovery: demonstration of a DC going down
75. A U.S. region will be built soon
76. Future Works
77. SC14, 2014/11/16-21, @ Ernest N. Morial Convention Center
78. Big Data Analysis
79. behavior data from mobile devices
80. data from non-electrified areas
81. mobile devices and sensor devices feeding a personal data aggregation service: high latency, high power consumption
82. mobile devices and sensor devices feeding regional exchanges on a low-latency, wide-area distributed platform, then the personal data aggregation service
83. route optimization
84. The Internet, distcloud storage, regions A, B and C: optimize routes after live migration while keeping each region independent, so that users from the Internet can still access the VM after it migrates
85. Route optimization methods by layer (method: outline; features):
  - L3 routing: update the routing table on each migration; routing per region, cannot route per VM, routing operation cost
  - routing + L2 extension: VPLS, IEEE 802.1ad PB (Q-in-Q), IEEE 802.1ah (MAC-in-MAC); stability and operation cost, but poor scalability
  - L2 over L3: VXLAN, OTV, NVGRE; stability, but tunneling overhead and IP multicast
  - SDN: OpenFlow; programmable, but operation cost of equipment
  - ID/locator separation: LISP; scalability and routing per VM, but cost and immediacy
  - IP mobility: MAT, NEMO, MIP (Kagemusha); scalability, but load on the router
  - L4: mSCTP (SCTP multipath); independent of L2/L3, but limited to SCTP
  - L7: DNS + reverse NAT (Dynamic DNS); independent of L2/L3, but altering the IP address closes connections (see the sketch after the transcript)
86. 2011.3.11: the aftermath of the 2011 Tohoku earthquake and tsunami
87. https://www.flickr.com/photos/idvsolutions/7439877658/sizes/o/in/photostream/
88. 2014 Storage Developer Conference, Osaka University. All Rights Reserved. Go on to the next stage.
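Slide 85 compares ways to keep routes optimal after a VM live-migrates between regions. The L7 row, DNS plus reverse NAT (Dynamic DNS), is the simplest one to sketch: when the VM lands in the new region, its public name is repointed at the new address so that users from the Internet reach it there. The snippet below is a minimal illustration using the dnspython library; the zone name, TSIG key, name server address and record are hypothetical, and, as the slide notes, changing the address closes existing connections.

import dns.query
import dns.tsigkeyring
import dns.update

# Hypothetical zone, key and name server; none of these come from the slides.
keyring = dns.tsigkeyring.from_text({"migrate-key": "c2VjcmV0LXNlY3JldC1zZWNyZXQ="})


def repoint_vm(hostname: str, new_ip: str) -> None:
    """Replace the A record of a migrated VM with its address in the new region."""
    update = dns.update.Update("example.org", keyring=keyring)
    update.replace(hostname, 60, "A", new_ip)            # short TTL for fast switchover
    response = dns.query.tcp(update, "192.0.2.53", timeout=5)
    print("DNS update rcode:", response.rcode())


# After live migration from region A to region B (addresses are illustrative):
repoint_vm("vm1", "203.0.113.25")

The other rows of the table avoid that connection reset at the cost of extra network machinery (tunnels, LISP mapping, SDN control), which is exactly the trade-off the slide summarizes.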
The main advantage of our storage is its very fast random I/O performance, even though it provides a POSIX compatible file system interface on the top of distributed cluster storage. 64. an initial plan 65. s Shinji Shimojo Director of JGN-X, NICT 66. It s not so fun. 67. real stage 68. 24,000 km RTT=244ms 1Gbps loop back real stage 69. Blocks (chunks) are located on the nearest 70. consistent hash Meta data is not suitable for wide area 71. type of line load condition required time (sec) domestic no load 17.9 international no load 201.6 read load 175.4 write load 400.6 required time to migration IO performance type of access pattern load condition domestic (read) 64.6 domestic (write) 58.7 international (read) 25.4 international write) 20.9 average throughput (MB/s) of dd 72. Live migration demo on an international line 73. Evaluations of distcloud on an international line 74. Disaster Recovery demonstration of DC down 75. U.S. region will be build soon 76. Future Works 77. SC142014/11/1621 @Ernest N. Morial Convention Center 78. Big Data Analysis 79. behavior data from mobile devices 80. data from non-electrication area 81. mobile devices sensor devices personal data aggregation service high latency power consumption 82. mobile devices sensor devices low latency wide-area distributed platform regional exchange regional exchange personal data aggregation service 83. route optimization 84. the Internet distcloud storage region A region B region C live migration optimize routes with remaining independence of each region users from the Internet can access the VM after live migration 85. Layer method outline features L3 routing update routing table by each migrations routing per region cannot routing per VM routing operation cost routing + L2 extension VPLS, IEEE802.1ad PB(Q in Q) IEEE802.1ah (Mac-in-Mac) stability, operation cost poor scalability L2 over L3 VXLAN, OTV, NVGRE stability overhead of tunneling IP multicast SDN OpenFlow programable operation cost of equipment ID/locator separation LISP scalability, routing per VM cost, immediacy IP mobility MAT, NEMO, MIP (Kagemusha) scalability load of router L4 mSCTP SCTP multipath independent from L2/L3 limited in SCTP L7 DNS + reverseNAT Dynamic DNS independent from L2/L3 altering IP addr. closing connection 86. 2011.3.11The aftermath of the 2011 Tohoku earthquake and tsunami 87. https://www.ickr.com/photos/idvsolutions/7439877658/sizes/o/in/photostream/ 88. 2014 Storage Developer Conference. Osaka University All Rights Reserved. go on to the next stage