Update on the Spider II File System
DESCRIPTION
In this presentation from the DDN User Meeting at SC13, Sarp Oral provides an update on the Spider II file system at Oak Ridge National Laboratory. Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/

TRANSCRIPT
16 Sarp Oral | SC’13, DDN User Meeting
Spider II Specs
Scalable Storage System
• 36 SFA12K-40 couplets, IB FDR
• 10 60-disk trays per couplet; 560 2 TB NL-SAS drives per couplet
• 20,160 drives total, 40 PB capacity (raw), > 1 TB/s performance

Test and Development System
• 1 SFA12K-40, IB FDR, 10 60-disk enclosures, 560 2 TB NL-SAS drives

Facts
• 32 PB capacity (after RAID)
• > 1 TB/s performance
• 288 Lustre OSS total, 8 OSS per couplet
• 4 MDS and 2 MGS
• Configured in 4 rows
• 2x 108-port FDR IB switches
• 36x 36-port FDR IB switches
• 440 Lustre LNET routers on Titan (432 for OSS, 8 for MDS)
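The headline drive and capacity numbers on this slide are internally consistent, as a quick sketch shows (units assumed decimal, i.e. 1 TB = 10^12 bytes, as storage vendors typically quote them):

```python
# Sanity check of the Spider II raw-capacity figures from the specs slide.
COUPLETS = 36                 # SFA12K-40 couplets
DRIVES_PER_COUPLET = 560      # 2 TB NL-SAS drives per couplet
DRIVE_TB = 2                  # capacity per drive, in TB (decimal)

total_drives = COUPLETS * DRIVES_PER_COUPLET
raw_pb = total_drives * DRIVE_TB / 1000  # TB -> PB (decimal units)

print(total_drives)  # 20160, matching the slide
print(raw_pb)        # 40.32, quoted as "40 PB capacity (raw)"
```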
Spider II - Architecture
Enterprise storage: controllers and large racks of disks are connected via InfiniBand. 36 DataDirect SFA12K-40 controller pairs with 2 TB NL-SAS drives and 8 InfiniBand FDR connections per pair.

Storage nodes: run parallel file system software and manage incoming FS traffic. 288 Dell servers with 64 GB of RAM each.

SION II network: provides connectivity between OLCF resources and primarily carries storage traffic. 1,600-port, 56 Gbit/s InfiniBand switch complex.

Lustre router nodes: run parallel file system client software and forward I/O operations from HPC clients. 432 XK7 XIO nodes configured as Lustre routers on Titan.

Clients: Titan XK7 and other OLCF resources.

Link rates: XK7 Gemini 3D torus, 9.6 GB/s per direction; InfiniBand, 56 Gbit/s; Serial ATA, 6 Gbit/s.
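A back-of-the-envelope estimate shows why these link counts comfortably support the > 1 TB/s target. The per-link usable rate below is an assumption, not from the slide: FDR 4x signals at 56 Gbit/s, and with 64b/66b encoding the usable data rate is roughly 54.5 Gbit/s, or about 6.8 GB/s per link.

```python
# Rough aggregate-bandwidth estimate for the controller-side FDR links.
PAIRS = 36                   # SFA12K-40 controller pairs (from the slide)
LINKS_PER_PAIR = 8           # FDR connections per pair (from the slide)
GBYTES_PER_LINK = 54.5 / 8   # ~6.8 GB/s usable per FDR link (assumption)

links = PAIRS * LINKS_PER_PAIR
aggregate_tb_s = links * GBYTES_PER_LINK / 1000  # GB/s -> TB/s

print(links)                       # 288 FDR links on the controller side
print(round(aggregate_tb_s, 2))    # ~1.96 TB/s, above the > 1 TB/s goal
```

The controller-side links are therefore not the bottleneck; delivered file system bandwidth is governed by the OSS servers, LNET routers, and Lustre software stack.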
Spider II - Facilities
• Sits on a 36" raised floor and is forced-air cooled
• 4 identical rows in hot-aisle/cold-aisle configuration
– 9 racks for DDN SFA12K-40 equipment
– 1 infrastructure rack
– Cold aisle is fully contained with overhead panels and sliding doors at each end of the rows; this prevents hot-air/cold-air mixing and increases cooling efficiency
• 25% perforated tiles are used to provide cold air to the cold aisles
• Fully compliant with the requisite National Fire Protection Association (NFPA) codes
• Total space required is 672 square feet
Spider II - Facilities (cont.)
• Ran a series of tests on a DDN SFA12K-40 testbed unit under various I/O mode and load scenarios
– 9 kW nominal load per DDN rack
• Total file system load, including infrastructure racks, is 400 kW; total cooling load is 114 tons
• Each rack is fed with a pair of 208 VAC 3-phase electrical feeds, protected by a 50 A 10%-rated breaker
– Fed from two different transformer sources
– The DDN SFA12K power distribution system is both load-balanced and supports fail-over, so OLCF can perform scheduled and unscheduled maintenance on one transformer without disrupting file system operation
– Neither electrical connection is protected by a UPS
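The power and cooling figures line up, as a quick cross-check shows. Two assumptions not stated on the slides: the 4 rows hold 9 DDN racks each (36 total), and 1 ton of refrigeration equals 12,000 BTU/h, approximately 3.517 kW.

```python
# Cross-check of the power and cooling figures from the facilities slides.
DDN_RACKS = 4 * 9          # 4 rows x 9 DDN racks per row (assumption)
KW_PER_DDN_RACK = 9        # nominal load per DDN rack (from the slide)
KW_PER_TON = 3.517         # 1 refrigeration ton ~= 3.517 kW (standard)

ddn_kw = DDN_RACKS * KW_PER_DDN_RACK  # DDN racks alone; infrastructure
                                      # racks make up the rest of 400 kW
cooling_kw = 114 * KW_PER_TON         # quoted 114-ton cooling load in kW

print(ddn_kw)              # 324
print(round(cooling_kw))   # ~401, matching the 400 kW electrical load
```

The near-equality of the cooling and electrical loads is expected: essentially every watt drawn by the equipment ends up as heat that the cooling plant must remove.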
Integration Efforts
• Lustre 2.4 testing
– Small-scale
• Round-the-clock testing for stability, regression, and performance on a single-cabinet Cray XK7 (Arthur)
• Home-built Cray Lustre 2.4 client as well as servers
• Early detection and correction of problems and bugs
– Large-scale
• Weekly testing on Titan
• Identified a number of problems at scale
• IB FDR testing on Cray
– Cray and Mellanox
Schedule
• System infrastructure delivery: completed
• Block storage delivery: completed
• Block acceptance: completed
– Achieved 1.3 TB/s for reads and 1.2 TB/s for writes at the block level
– A few items need to be revisited in Q1'14
• Lustre support with Intel: completed (Level 1, 2, and 3 support with Intel)
• File system integration: completed
• Rolling into production: completed
• Performance tuning: ongoing; to be completed by Q1'14
Questions? [email protected]
The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.