![Page 2: H P C A n a ly ticsplato.asu.edu/slides/stanzione.pdf · filesystems, we can create well-ordered files This can roughly be called “Data Management” Well-ordered data is a foundation,](https://reader034.vdocuments.site/reader034/viewer/2022042405/5f1c760b4f951c3cc12650d5/html5/thumbnails/2.jpg)
Theme
Gap between data and knowledge (as has been discussed here before)
High Performance Computing continues to exponentially increase our ability to generate data
This can be an enabler of new science...
...but also a huge obstacle
...or an excuse not to think
Outline
How much data is “large”?
Evolution of system design to deal with large data
What to do with it all - Analytics
How Much Data?
What is a “large” dataset nowadays?
My current machine:
2+ Tflops
Network bisection bandwidth ~1Tb/s
I/O subsystem writes ~500MB/s
(30 GB/minute)
How Much Data?
Mars project: ~60TB
One ASU faculty member has contacted me about a ~2 Petabyte dataset.
A Chilean observatory can produce more than 1TB an hour (12 hours of data must be processed before the next pass starts...)
A potential Australian array telescope would produce multiple EXABYTES per year by 2010.
Not unique to astronomy...
How Much Data?
Machines will be constructed in the next 12 months with several tens of thousands of processors (hundreds of TF)
Network bandwidth >10TB/sec
1PB/2 minutes
1 Exabyte per 30 hours
1 Zettabyte during machine 3 yr. lifetime (yottabytes are next, if anyone’s counting...)
Google has much more computation, much less network/flop
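The fill-time figures on this slide and the I/O rate on the earlier one can be checked with a little arithmetic. This is a rough sketch assuming decimal (SI) units and sustained peak rates; the function name `seconds_to_move` is ours, not the speaker's.

```python
# Decimal (SI) byte units, an assumption about how the slides count.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21


def seconds_to_move(n_bytes, bytes_per_second):
    """Time to write or transfer n_bytes at a sustained rate."""
    return n_bytes / bytes_per_second


# Current machine: ~500 MB/s of I/O, expressed per minute.
io_rate = 500 * 10**6
gb_per_minute = io_rate * 60 / 10**9

# Next-generation machine: >10 TB/s of network bandwidth.
network_rate = 10 * TB
pb_minutes = seconds_to_move(PB, network_rate) / 60
eb_hours = seconds_to_move(EB, network_rate) / 3600
zb_years = seconds_to_move(ZB, network_rate) / (3600 * 24 * 365)

print(f"{gb_per_minute:.0f} GB/minute at 500 MB/s")
print(f"1 PB in {pb_minutes:.1f} minutes")
print(f"1 EB in {eb_hours:.1f} hours")
print(f"1 ZB in {zb_years:.1f} years")
```

Under these assumptions the results land close to the rounded values on the slides: about two minutes per petabyte, a little under 30 hours per exabyte, and roughly three years per zettabyte.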
Evolution of Storage Systems
Evolution at all levels:
RAW/Text Files -> Hierarchical Formats -> Schemas -> Database
Filesystems -> LVM -> Parallel Filesystems -> Global Name Space/Storage Request Brokers
Single disk volumes -> RAID1-5 -> RAID 10 -> Storage Hierarchies
HPC Storage Hierarchy
Master Node
Compute nodes
Interconnection Network
Internet or Internal Network
Basic Beowulf
Master Node
Interconnection Network
Public Network
Compute Nodes
Parallel Filesystem I/O Nodes
Beowulf Cluster
Tier 1 Storage
In-Cluster High-Speed Scratch
Parallel Filesystems support this: PVFS, Panasas, Lustre, IBRIX -- MPI I/O is the interface
Master Node
Interconnection Network
Public Network
Compute Nodes
Parallel Filesystem I/O Nodes
Beowulf Cluster
Tier 2 Storage
Shared Home Directories
Home Directory Server (may be direct-attached to Master)
Master Node
Interconnection Network
Public Network
Compute Nodes
Parallel Filesystem I/O Nodes
Tier 3 Storage
Campus-Wide Research Storage
Interconnection Network
Public Network
Cluster B
Interconnection Network
Public Network
Cluster C
Other Research Servers (non-cluster)
Public Network
Campus Storage Servers
Campus Storage Mirrors
Campus Research Network
We can build Multi-PB Storage Systems - Now What?
Applications spit out lots of this data (or sensors/sequencers/instruments wrapped in applications).
Status Quo:
Application codes write FORTRAN unformatted or ASCII text data to a (multitude of) files
Some domain exceptions (.pdb, gridgen)
Three problems:
Too many files (my worst offender has 750,000 - find anything useful in that).
Files too big (one student generated 700GB in 18 hours)
Too many formats (can’t connect weather and ocean, application and visualization).
Things are happening
Broad Domain Frameworks taking hold:
e.g. ESMF (Earth System Modeling Framework) - Connect WRF (climate) to ADCIRC (Ocean)
Hierarchical, standard, descriptive data formats
Broader introduction of metadata is the key...
This is the right trend, but has costs...
Costs of Frameworks
Application complexity goes way up
Converse -> value of applications written “outside” the community goes way down.
XML is not the most efficient format in the world...
XML (~100 bytes):
<particle>
<coordinates> 10 0 10 </coordinates>
<velocity> <x> 12 </x> <y> 9 </y> <z> 8 </z> </velocity>
</particle>
FORTRAN Raw (6 bytes):
0a000a0c0908
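The size gap can be reproduced directly. The sketch below assumes one unsigned byte per value, which is what the 6-byte figure on the slide implies; real FORTRAN unformatted output typically uses wider values and adds record markers, so treat this as an illustration of the ratio only.

```python
import struct

# The particle from the slide: three coordinates and three velocity components.
coordinates = (10, 0, 10)
velocity = (12, 9, 8)

# Self-describing form: the same layout as the XML record on the slide.
xml_record = (
    "<particle> <coordinates> {} {} {} </coordinates> "
    "<velocity> <x> {} </x> <y> {} </y> <z> {} </z> </velocity>"
    "</particle>"
).format(*coordinates, *velocity)
xml_bytes = xml_record.encode("ascii")

# Raw form: six unsigned bytes, no tags and no metadata (an assumption,
# chosen to match the 6-byte figure on the slide).
raw_bytes = struct.pack("6B", *coordinates, *velocity)

print(len(xml_bytes), "bytes as XML")
print(len(raw_bytes), "bytes raw:", raw_bytes.hex())
print("expansion factor: about", len(xml_bytes) // len(raw_bytes))
```

The raw record carries no names or units, which is exactly the metadata the frameworks add; the XML pays for that description on every record.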
Costs of Frameworks
Enterprise-class backed-up storage:
~$10,000/Terabyte
Cost of a 10:1 inefficiency on one PB of raw data:
$100,000,000.00
In fairness, compressed XML mitigates a fair amount of this... but an app-specific binary format will always win
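The dollar figure follows from the numbers on the slide. A minimal sketch, assuming 1 PB = 1,000 TB and that the 10:1 inefficiency means the stored data is ten times the raw size; the function name `storage_cost_usd` is hypothetical.

```python
COST_PER_TB_USD = 10_000   # enterprise-class backed-up storage, from the slide
TB_PER_PB = 1_000          # decimal units, an assumption


def storage_cost_usd(raw_pb, expansion_factor):
    """Cost of storing raw_pb petabytes after a format inflates it."""
    stored_tb = raw_pb * expansion_factor * TB_PER_PB
    return stored_tb * COST_PER_TB_USD


compact_cost = storage_cost_usd(1, 1)     # app-specific binary format
inflated_cost = storage_cost_usd(1, 10)   # 10:1 self-describing format
overhead = inflated_cost - compact_cost

print(f"compact:  ${compact_cost:,}")
print(f"inflated: ${inflated_cost:,}")
print(f"overhead: ${overhead:,}")
```

Read this way, the slide's $100,000,000 is the total bill for the inflated data; the premium over the compact format alone would be $90,000,000 under the same assumptions.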
HPC Analytics
We can build systems, we can make filesystems, we can create well-ordered files
This can roughly be called “Data Management”
Well-ordered data is a foundation, but still not knowledge.
The next phase is the emerging field of Analytics
Analytics
SC05 - “HPC Analytics Challenge” 11/05
“...showcase innovative techniques of rigorous data analysis...”
Dept. of Energy - Visual Analytics Center solicitation 10/05.
PNNL NVAC (Nat’l Visualization and Analytics Center)
Recommended Reading:
“Illuminating the Path” - National R&D Agenda in Visual Analytics http://nvac.pnl.gov/agenda.stm
Analytics, according to the “Path”:
• The science of analytical reasoning
• Visual representations and interaction techniques
• Data representations and transformations
• Production, presentation, and dissemination.
State of Analytics
At SC05, all five finalists did Visualization
Not an expansive view of analytics...
One used data mining to produce visualizations
While much, much quality work has been done in visualization techniques,
...visualizations are still used as much for fundraising as science
Of course, *I* wouldn’t use visualizations for this...
HPC Applications
Visualization Advancing
3D visualization does add something
Decision Theater
Formats are a key to making this routine.
More tools beyond Excel, Matlab
Need to accelerate to real-time, “what-if” scenarios
Hierarchy matters here - don’t render whole earth at 30cm resolution
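The point about hierarchy can be made concrete with a resolution pyramid, where each level halves the resolution of the one below it. The sketch below is an illustration under our own assumptions (a 1920-pixel-wide display, an approximate equatorial circumference, and the hypothetical helper `pyramid_level`); it is not a description of any particular visualization system.

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_000  # approximate equatorial circumference


def pixels_across(resolution_m):
    """Pixels needed to span the equator at a given ground resolution."""
    return EARTH_CIRCUMFERENCE_M / resolution_m


def pyramid_level(native_resolution_m, screen_width_px):
    """Number of 2x downsampling steps until a whole-earth view fits the screen."""
    ratio = pixels_across(native_resolution_m) / screen_width_px
    return max(0, math.ceil(math.log2(ratio)))


native_px = pixels_across(0.30)
level = pyramid_level(0.30, 1920)
effective_resolution_m = 0.30 * 2**level

print(f"native width: {native_px:,.0f} pixels")
print(f"use pyramid level {level} (~{effective_resolution_m / 1000:.0f} km per pixel)")
```

A whole-earth view needs only the coarse top of the pyramid; the 30cm data is fetched only when the viewer zooms into a small region.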
Analytics Beyond Visualization
Databases are a key to the HPC future - See Dr. Chen’s earlier talk for an excellent introduction
Large databases of small records well understood
Large databases of large, sparse records of ill-conforming data not understood.
Experimental Management tools increasing in value
Frameworks for parameter study, goal-directed search
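A parameter study, at its simplest, runs the same code over a grid of inputs and records each result alongside the parameters that produced it. The sketch below assumes a toy `simulate` function standing in for an expensive HPC run; the parameter names and values are hypothetical.

```python
import itertools


def simulate(viscosity, timestep):
    """Hypothetical stand-in for an expensive run; returns an error metric."""
    return (viscosity - 0.3) ** 2 + (timestep - 0.01) ** 2


# Hypothetical parameter grid for the study.
grid = {
    "viscosity": [0.1, 0.2, 0.3, 0.4],
    "timestep": [0.005, 0.01, 0.02],
}


def parameter_study(model, grid):
    """Run the model on every combination and keep (result, parameters) pairs."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((model(**params), params))
    return results


results = parameter_study(simulate, grid)
best_error, best_params = min(results, key=lambda pair: pair[0])
print(len(results), "runs; best:", best_params, "error:", best_error)
```

Picking the minimum over a full grid is the crudest form of goal-directed search; a real framework would use earlier results to decide which runs to launch next, and would store the parameters as metadata with each output.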
Analytics Beyond Visualization
Two more technologies must be imported from other fields:
Data Mining (database-enabled)
In large datasets, the trends are the knowledge
Acxiom is a good model (and, gets them out of the junk mail business).
Search
One Word: Google
Pre-(multi)-indexing, divided search space
search multi-PB space in 0.01 seconds... by using a massive cluster to do most work ahead of time.
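The idea of doing most of the work ahead of time over a divided search space can be sketched as a sharded inverted index. This is a toy illustration under our own assumptions, not a description of Google's system; the shard contents and run names are invented for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Ahead-of-time step: map each word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(shard_indexes, query):
    """Query-time step: intersect posting sets per shard, then merge the shards."""
    words = query.lower().split()
    if not words:
        return []
    hits = set()
    for index in shard_indexes:
        postings = [index.get(word, set()) for word in words]
        hits |= set.intersection(*postings)
    return sorted(hits)


# Hypothetical dataset descriptions, divided into two shards.
shards = [
    {"run-001": "ocean temperature grid", "run-002": "wind speed grid"},
    {"run-003": "ocean salinity profile", "run-004": "ocean temperature profile"},
]
shard_indexes = [build_index(shard) for shard in shards]

print(search(shard_indexes, "ocean temperature"))
```

Each shard can be indexed and queried on a separate node, so query time depends on a few dictionary lookups per shard rather than on the size of the underlying data.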
Takeaways:
Intelligent I/O
Standard Formats
Hierarchy - multiple views of data
Database/Data Mining/ Search
Visualization
All of the above require more sophisticated application codes, more use of tools:
Computational Science Literacy