1
A distributed HSM solution
Building a Global Namespace
with Nirvana
Sept 2016 Igor Sfiligoi
2
• Nirvana® is a metadata, data placement
and data management solution optimized for
managing distributed unstructured data
• It supports many modes of operation
– In this talk we explore only its
global namespace/HSM capabilities
– All the other capabilities can be used alongside, but will not be discussed
• Nirvana is a commercial software product,
developed by General Atomics
• More information at:
– http://www.ga.com/nirvana
– https://en.Wikipedia.org/wiki/Nirvana_(software)
What is Nirvana
3
• Distributing the data over
several independent file systems allows for
– Easier scalability
– Tailor file system properties to the value of the stored data
– Support a geographically dispersed workforce
– High availability and disaster recovery
– Avoiding vendor lock-in
• But users still need a single logical view of the entire
system, i.e. a global namespace
– Every file must be accessible,
no matter in which physical location it resides
– The system should be able to transparently replicate data
in response to changing use patterns
Why a Global Namespace
4
• Most files follow
the same lifecycle:
– Shortly after it is created, it is
accessed by multiple users and
applications on a regular basis
– As time goes by, users tend to
forget about the file, and it may
not be accessed for months
– For some files
• A trigger event makes the
file very relevant again
• It is again accessed by
multiple users/applications
on a regular basis
• As more time goes by, the
file is forgotten again
• Frequently used (hot) files must be
kept on high performance file systems
– To keep user productivity high
• But keeping seldom used files on the
high performance file system is
not a good idea
– Expensive
– May slow down access to the
important files
• Seldom used (cold) files should be
stored on a separate,
possibly cheaper, file system
• Nirvana can be used to automate
the movement of files
– Based on usage patterns
– With transparent user access
Use Case 1: Hot and Cold Data
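The hot/cold placement described on this slide can be sketched as a simple rule on the time since last access. This is only an illustration, not Nirvana's policy language: the `FileRecord` type, the tier names, and the 30-day threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

HOT_WINDOW_DAYS = 30  # assumed threshold, not a Nirvana default


@dataclass
class FileRecord:
    logical_path: str       # path in the global namespace; it never changes
    tier: str               # physical tier: "fast" or "cheap" (hypothetical names)
    days_since_access: int


def choose_tier(record, hot_window_days=HOT_WINDOW_DAYS):
    """Return the tier a file should live on, based only on recent use."""
    return "fast" if record.days_since_access <= hot_window_days else "cheap"


def plan_moves(records):
    """List (logical_path, from_tier, to_tier) for files on the wrong tier."""
    moves = []
    for record in records:
        target = choose_tier(record)
        if target != record.tier:
            moves.append((record.logical_path, record.tier, target))
    return moves


records = [
    FileRecord("/proj/a.dat", "fast", 2),     # hot, already on fast storage
    FileRecord("/proj/b.dat", "fast", 200),   # forgotten, should move to cheap
    FileRecord("/proj/c.dat", "cheap", 1),    # relevant again, should move back
]
print(plan_moves(records))
```

Only the physical tier changes in the plan; the logical path stays the same, which is what keeps the movement transparent to users.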
5
• Most large companies nowadays have
offices spread across the world
• When valuable data is produced
at one office, users in other offices
likely want access to it, too
• Nirvana provides a global
namespace, so the moment a
new file is created anywhere, it
is immediately accessible to all
the authorized users
– Files are still protected from
un-authorized use
• Nirvana can also replicate hot
data, so a copy of a file exists in all
the locations where it is being used
– For faster access
– Can be completely automated
– Completely transparent to users
Use Case 2: Distributed Workforce
6
• Cloud computing is a great
resource for spikes in CPU needs
– But what about data?
• Staging all the data in advance
is not practical
– Only a fraction of all the data
we have will likely be accessed
– Per GByte fees
• Fetching on demand has its
drawbacks
– Potentially slow
– Network fees
• Nirvana can be used for data
caching
– Seldom used files are
fetched on-demand
– Hot files are staged into Cloud
storage and accessed from there
– As files grow colder, the replicas
are removed from Cloud storage
– Quotas can be established
• Completely transparent to users
• File movement can be
fully automated
– Based on use patterns
Use Case 3: Data Caching
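The quota-driven removal of cold Cloud replicas can be illustrated with a small eviction routine. This is a sketch under stated assumptions: the replica table layout (size, last-access timestamp) and the function name `evict_to_quota` are hypothetical, and the coldest-first order is one plausible choice rather than Nirvana's documented behaviour.

```python
def evict_to_quota(replicas, quota_bytes):
    """Drop the coldest cached replicas until total size fits the quota.

    replicas maps logical path -> (size_bytes, last_access_timestamp).
    Returns (kept, evicted); originals elsewhere are never touched.
    """
    kept = dict(replicas)
    evicted = []
    used = sum(size for size, _ in kept.values())
    # Visit replicas from least to most recently accessed.
    for path in sorted(replicas, key=lambda p: replicas[p][1]):
        if used <= quota_bytes:
            break
        size, _ = kept.pop(path)
        used -= size
        evicted.append(path)
    return kept, evicted


cloud_replicas = {
    "/sim/a.bin": (50, 100),   # oldest access
    "/sim/b.bin": (30, 300),   # most recent access
    "/sim/c.bin": (40, 200),
}
kept, evicted = evict_to_quota(cloud_replicas, quota_bytes=80)
print(evicted)
```

An evicted file remains reachable through the global namespace; the next access would simply fetch it on demand again.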
7
• Access to classified files
must be tracked
– Must be stored on secure
storage system
– Making a copy of the
whole file is not an option
• Workflows may need
access to a mix of classified and public files
– Users expect the
maximum efficiency from
the system
• Nirvana allows for different policies
for different files
– Classified files will never be replicated
(so access is always against original)
– Public files will be aggressively
cached (for performance)
• A file can be classified by its
– Location
– File name
– Content
– User-provided metadata
Use Case 4: Confidentiality-based policy
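The four classification criteria listed on this slide (location, file name, content, user-provided metadata) can be combined in a single predicate. The sketch below is illustrative only: the `/secure/` prefix, the `*.secret` pattern, the `CLASSIFIED` content marker, and the `classification` metadata key are invented for the example and are not Nirvana conventions.

```python
import fnmatch

SECURE_PREFIX = "/secure/"          # assumed location rule
SECRET_NAME_PATTERN = "*.secret"    # assumed file-name rule
CONTENT_MARKER = b"CLASSIFIED"      # assumed content rule


def is_classified(path, content=b"", metadata=None):
    """True if any of the four criteria marks the file as classified."""
    metadata = metadata or {}
    name = path.rsplit("/", 1)[-1]
    return (
        path.startswith(SECURE_PREFIX)
        or fnmatch.fnmatchcase(name, SECRET_NAME_PATTERN)
        or CONTENT_MARKER in content
        or metadata.get("classification") == "classified"
    )


def replication_policy(path, content=b"", metadata=None):
    """Classified files are never replicated; public files are cached."""
    if is_classified(path, content, metadata):
        return "never-replicate"
    return "cache-aggressively"


print(replication_policy("/secure/plan.doc"))
print(replication_policy("/public/readme.txt"))
```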
8
• Cloud computing can be used
to produce data, too
– E.g. rendering, simulation
• Once the computation is done,
the file has to be returned to
the corporate storage for
long term use
• Cloud CPUs should be released as
soon as the computing is finished
– But transferring the file back
may take a while
• Nirvana can optimize the big data
file creation workflows
• Users register the file(s) in the global
namespace, and then release the Cloud CPUs
– Without moving the file(s) to the
corporate owned storage
– The file(s) still reside
in the Cloud storage
• Nirvana will asynchronously move
the file(s) on a best-effort basis
– Then remove the Cloud replica
– Users can access the file at any time
• In case of error, the owner
will be notified
Use Case 5: Data Creation
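The register-now, move-later workflow on this slide can be sketched as two steps: a cheap registration that records the Cloud replica, and a later migration that adds the corporate replica before dropping the Cloud one. Everything here is hypothetical scaffolding (the `catalog` dictionary, `register`, `migrate`, and the injected `copy_fn` and `notify_fn` callbacks); it shows the ordering and the error path, not Nirvana's API.

```python
def register(catalog, logical_path, owner):
    """Make the file visible in the namespace while it still lives in the Cloud."""
    catalog[logical_path] = {"owner": owner, "replicas": {"cloud"}}


def migrate(catalog, logical_path, copy_fn, notify_fn):
    """Best-effort move to corporate storage; returns True on success.

    copy_fn(logical_path) performs the transfer and raises OSError on failure.
    notify_fn(owner, message) tells the owner about an error.
    """
    entry = catalog[logical_path]
    try:
        copy_fn(logical_path)
    except OSError as err:
        # The Cloud replica is left in place, so the file stays accessible.
        notify_fn(entry["owner"], f"move of {logical_path} failed: {err}")
        return False
    entry["replicas"].add("corporate")    # new replica is recorded first
    entry["replicas"].discard("cloud")    # only then is the Cloud copy dropped
    return True


catalog = {}
register(catalog, "/renders/shot1.exr", "alice")
migrate(catalog, "/renders/shot1.exr", lambda p: None, lambda owner, msg: None)
print(catalog["/renders/shot1.exr"]["replicas"])
```

At every point in this sketch the file has at least one replica, which is why users can access it at any time during the move.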
9
• New purchases are needed
to expand the capability of
the storage system
• Just adding more hardware
to the existing solution
is not always possible
– Nor desirable
• The new system has to be
brought into production
– With minimal disruption
– Incrementally, if possible
• If the company uses Nirvana, adding
a new resource is non-disruptive
• Nirvana can be configured to move a
fraction of the files to the new system
– Can be based on file type,
user type, or both
– A copy of the file can be kept
on the old system,
to avoid user-visible problems
in case of failures in the new system
• System administrators can gradually
move a larger fraction of the
workload to the new system
– While monitoring the performance
– All without users ever noticing
Use Case 6: Evaluate new technology
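One way to move "a fraction of the files" and then grow that fraction gradually is to hash each logical path into a stable bucket in [0, 1). This is a generic technique offered as an assumption, not a description of how Nirvana selects files; `bucket` and `goes_to_new_system` are hypothetical names.

```python
import hashlib


def bucket(logical_path):
    """Map a path to a stable value in [0, 1) using the first 4 digest bytes."""
    digest = hashlib.sha256(logical_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def goes_to_new_system(logical_path, fraction):
    """True if this file belongs to the share routed to the new system."""
    return bucket(logical_path) < fraction


paths = [f"/data/file{i}" for i in range(1000)]
at_10_percent = {p for p in paths if goes_to_new_system(p, 0.10)}
at_50_percent = {p for p in paths if goes_to_new_system(p, 0.50)}
print(len(at_10_percent), len(at_50_percent))
```

Because each file's bucket never changes, raising the fraction only adds files to the new system; files already moved at 10% are still selected at 50%, so nothing bounces back and forth during the rollout.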
10
• Nirvana does not manage the physical storage
– It relies on external file system solutions for that
(e.g. ext4, NAS, Spectrum Storage, Cloud storage)
– Nirvana’s global namespace is a logical entity
• Nirvana has two modes of operation
– Explicit ingest of files (transactional)
– External file system scanning (eventual consistency)
• Speed-wise, internal testing shows latencies of about a minute
for file systems with a few million files
• The two modes can co-exist
Keeping logical and physical info in sync
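The scanning mode can be pictured as a periodic comparison between what the logical catalog believes and what a walk of the physical file system finds. The sketch below assumes both sides are plain dictionaries of path to modification time; the function name `reconcile` and this representation are illustrative, not Nirvana internals.

```python
def reconcile(catalog, scanned):
    """Compare catalog state with a fresh scan of one file system.

    Both arguments map physical path -> modification time.
    Returns (new, changed, gone) as sorted lists of paths.
    """
    new = sorted(p for p in scanned if p not in catalog)
    changed = sorted(
        p for p in scanned if p in catalog and scanned[p] != catalog[p]
    )
    gone = sorted(p for p in catalog if p not in scanned)
    return new, changed, gone


catalog_view = {"/fs1/a": 10, "/fs1/b": 20, "/fs1/c": 30}
scan_result = {"/fs1/a": 10, "/fs1/b": 25, "/fs1/d": 40}
print(reconcile(catalog_view, scan_result))
```

Between two scans the catalog may lag behind the file system, which is the eventual-consistency trade-off named on the slide; explicit ingest avoids that lag for the files it handles.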
11
• Nirvana also allows for direct access
to underlying physical storage
– E.g. for performance reasons
– Of course, users will only see
the local namespace at that point
• All operations allowed
– Including file creation and removal
Direct access to physical storage still allowed
12
• Nirvana provides a policy engine
for data movement automation
– Make additional replicas
– Transactionally move data
– Reduce the number of replicas
– Enforce quotas
• Acts on files in physical locations
– While keeping the
global, logical view unchanged
• Rules based on any
combination of
– File properties
– File content
(automatically
extracted
metadata)
– User-provided
metadata
– Ownership
Data movement automation
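Rules built from "any combination" of properties can be modelled as a list of conditions that must all hold, paired with an action. The following is a minimal sketch of that idea; the `Rule` class, the field names in the file description (`size`, `days_idle`, `mime`, `owner`), and the action strings are all assumptions for illustration and do not reflect Nirvana's actual rule syntax.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

FileInfo = Dict[str, object]  # hypothetical flat description of one file


@dataclass
class Rule:
    name: str
    conditions: List[Callable[[FileInfo], bool]]
    action: str

    def matches(self, info):
        """A rule fires only when every one of its conditions holds."""
        return all(condition(info) for condition in self.conditions)


def first_action(rules, info, default="leave-in-place"):
    """Return the action of the first matching rule, or the default."""
    for rule in rules:
        if rule.matches(info):
            return rule.action
    return default


rules = [
    Rule(
        "old-large-video",
        [
            lambda f: f["size"] > 10**9,                 # file property
            lambda f: f["mime"].startswith("video/"),    # extracted from content
            lambda f: f["days_idle"] > 90,               # usage
        ],
        "move-to-archive",
    ),
    Rule(
        "active-genomics",
        [lambda f: f["owner"] == "genomics", lambda f: f["days_idle"] <= 7],
        "add-replica",
    ),
]

info = {"size": 5 * 10**9, "mime": "video/mp4", "days_idle": 200, "owner": "media"}
print(first_action(rules, info))
```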
13
• Each file has a unique normative path
in the global namespace
• However, the same file can also be accessed
through many Virtual Collections
– Allowing for grouping on semantic meaning
– E.g. sequencing data could be stored by year,
and have one virtual collection per strain and
one virtual collection per sequencer model
• Any change to the normative path will be
atomically propagated to all the Virtual Collections
Virtual collections
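A simple way to see why a path change can reach every Virtual Collection at once is to have collections reference files by an internal identifier instead of by path. The sketch below assumes that design; the `Namespace` class and its methods are hypothetical, and the single in-memory update stands in for whatever atomic mechanism the real system uses.

```python
class Namespace:
    """Toy namespace: one normative path per file, many collections per file."""

    def __init__(self):
        self._path_by_id = {}   # file id -> normative path (single source of truth)
        self._members = {}      # collection name -> set of file ids
        self._next_id = 1

    def add_file(self, normative_path):
        file_id = self._next_id
        self._next_id += 1
        self._path_by_id[file_id] = normative_path
        return file_id

    def link(self, collection, file_id):
        """Add an existing file to a virtual collection."""
        if file_id not in self._path_by_id:
            raise KeyError(file_id)
        self._members.setdefault(collection, set()).add(file_id)

    def rename(self, file_id, new_path):
        """Change the normative path; collections hold ids, so all see it."""
        self._path_by_id[file_id] = new_path

    def list_collection(self, collection):
        ids = self._members.get(collection, ())
        return sorted(self._path_by_id[i] for i in ids)


ns = Namespace()
run = ns.add_file("/seq/2016/run42.fastq")
ns.link("strain-K12", run)
ns.link("sequencer-modelX", run)
ns.rename(run, "/seq/2016/run42_v2.fastq")
print(ns.list_collection("strain-K12"), ns.list_collection("sequencer-modelX"))
```

The example mirrors the slide's sequencing scenario: data stored by year, with one collection per strain and one per sequencer model (the collection names are made up).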