building a global namespace with nirvana

Upload: igor-sfiligoi

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Building a Global Namespace with Nirvana

A distributed HSM solution

Building a Global Namespace

with Nirvana

Sept 2016 Igor Sfiligoi

Page 2

• Nirvana® is a metadata, data placement

and data management solution optimized for

managing distributed unstructured data

• It supports many modes of operation

– In this talk we explore only its

global namespace/HSM capabilities

– All the other capabilities can be used alongside, but will not be discussed

• Nirvana is a commercial software product,

developed by General Atomics

• More information at:

– http://www.ga.com/nirvana

– https://en.Wikipedia.org/wiki/Nirvana_(software)

What is Nirvana

Page 3

• Distributing the data over

several independent file systems allows for

– Easier scalability

– Tailoring file system properties to the value of the stored data

– Supporting a geographically dispersed workforce

– High availability and disaster recovery

– Avoiding vendor lock-in

• But users still need a single logical view of the entire

system, i.e. a global namespace

– Every file must be accessible,

no matter in which physical location it resides

– The system should be able to transparently replicate data

in response to changing use patterns

Why a Global Namespace

Page 4

• Most files follow

the same lifecycle:

– Shortly after it is created, it is

accessed by multiple users and

applications on a regular basis

– As time goes by, users tend to

forget about the file, and it may

not be accessed for months

– For some files

• A trigger event makes the

file very relevant again

• It is again accessed by

multiple users/applications

on a regular basis

• As more time goes by, the

file is forgotten again

• Frequently used (hot) files must be

kept on high performance file systems

– To keep user productivity high

• But keeping seldom used files on the

high performance file system is

not a good idea

– Expensive

– May slow down access to the

important files

• Seldom used (cold) files should be

stored on a separate,

possibly cheaper, file system

• Nirvana can be used to automate

the movement of files

– Based on usage patterns

– With transparent user access

Use Case 1: Hot and Cold Data
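
The hot/cold placement idea on this slide can be illustrated with a small, self-contained sketch. This is not Nirvana's API or policy syntax; the `FileRecord` class, the `apply_tiering` function and the 90-day threshold are all hypothetical names and values chosen for illustration. The point it shows is that only the physical tier changes while the logical path stays the same.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    logical_path: str      # path in the global namespace (never changes here)
    tier: str              # "hot" or "cold": where the bytes currently live
    last_access_days: int  # days since the file was last accessed


def apply_tiering(files, cold_after_days=90):
    """Re-tier files by recency of use; return the logical paths that moved.

    Hypothetical policy: files idle for `cold_after_days` or more belong on
    the cold tier, everything else belongs on the hot tier.
    """
    moved = []
    for f in files:
        wanted = "cold" if f.last_access_days >= cold_after_days else "hot"
        if f.tier != wanted:
            f.tier = wanted            # physical placement changes...
            moved.append(f.logical_path)  # ...the logical path does not
    return moved


files = [
    FileRecord("/proj/a.dat", "hot", 2),     # recently used, stays hot
    FileRecord("/proj/b.dat", "hot", 200),   # forgotten, goes cold
    FileRecord("/proj/c.dat", "cold", 1),    # trigger event, becomes hot again
]
moved = apply_tiering(files)
```

The third record models the "trigger event" case from the slide: a cold file that is accessed again is promoted back to the hot tier on the next policy run.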

Page 5

• Most large companies now have

offices spread across the world

• When valuable data is produced

at one office, users in other offices

likely want access to it, too

• Nirvana provides a global

namespace, so the moment a

new file is created anywhere, it

is immediately accessible to all

the authorized users

– Files are still protected from

un-authorized use

• Nirvana can also replicate hot

data, so a copy of a file exists in all

the locations where it is being used

– For faster access

– Can be completely automated

– Completely transparent to users

Use Case 2: Distributed Workforce

Page 6

• Cloud computing is a great

resource for spikes in CPU needs

– But what about data?

• Staging all the data in advance

is not practical

– Only a fraction of all the data

we have will likely be accessed

– Per GByte fees

• Fetching on demand has its

drawbacks

– Potentially slow

– Network fees

• Nirvana can be used for data

caching

– Seldom used files are

fetched on-demand

– Hot files are staged into Cloud

storage and accessed from there

– As files grow colder, the replicas

are removed from Cloud storage

– Quotas can be established

• Completely transparent to users

• File movement can be

fully automated

– Based on use patterns

Use Case 3: Data Caching
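
A quota-bounded replica cache of the kind described on this slide might look like the following sketch. It is a generic least-recently-used cache, not Nirvana's implementation; the `QuotaCache` class and its methods are hypothetical. As a simplification, it stages a replica on the first access, whereas the slide describes staging only files that have become hot.

```python
from collections import OrderedDict


class QuotaCache:
    """Toy model of Cloud-side replicas limited by a byte quota."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        # logical path -> replica size; coldest entry first, hottest last
        self.replicas = OrderedDict()

    def used(self):
        return sum(self.replicas.values())

    def access(self, path, size):
        """Return "hit" if a replica exists, otherwise "miss" (fetched on demand)."""
        if path in self.replicas:
            self.replicas.move_to_end(path)  # mark as most recently used
            return "hit"
        if size > self.quota_bytes:
            return "miss"  # too large to stage; always served from the origin
        # Remove the coldest replicas until the new one fits in the quota.
        while self.used() + size > self.quota_bytes:
            self.replicas.popitem(last=False)
        self.replicas[path] = size
        return "miss"


cache = QuotaCache(quota_bytes=100)
results = [
    cache.access("/a", 60),
    cache.access("/b", 30),
    cache.access("/a", 60),  # second access is served from the replica
    cache.access("/c", 40),  # evicts "/b", the coldest replica
]
```

Removing a replica here never removes the file itself: the original copy is assumed to remain on the corporate storage, so an evicted file is simply fetched on demand again.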

Page 7

• Access to classified files

must be tracked

– Must be stored on secure

storage system

– Making a copy of the

whole file is not an option

• Workflows may need

access to a mix of classified and public files

– Users expect the

maximum efficiency from

the system

• Nirvana allows for different policies

for different files

– Classified files will never be replicated

(so access is always against original)

– Public files will be aggressively

cached (for performance)

• A file can be classified by its

– Location

– File name

– Content

– User-provided metadata

Use Case 4: Confidentiality-based Policy
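
The per-file policy choice can be sketched as a small classification function. Everything below is an assumption made for illustration: the `/secure/` prefix, the `*.secret` name pattern, the `classification` metadata key and the function names are hypothetical, and content-based classification is left out to keep the example short. It does not reflect how Nirvana expresses policies.

```python
from fnmatch import fnmatch


def is_classified(path, metadata):
    """Classify a file by location, file name, or user-provided metadata."""
    if path.startswith("/secure/"):               # by location
        return True
    if fnmatch(path.rsplit("/", 1)[-1], "*.secret"):  # by file name
        return True
    return metadata.get("classification") == "classified"  # by metadata


def replication_policy(path, metadata=None):
    """Pick a policy: classified files are never replicated and are tracked."""
    metadata = metadata or {}
    if is_classified(path, metadata):
        return {"replicate": False, "audit_access": True}
    return {"replicate": True, "audit_access": False}


secure_policy = replication_policy("/secure/plan.doc")
public_policy = replication_policy("/pub/readme.txt")
```

A workflow that mixes both kinds of files would then read classified inputs from their single original copy while public inputs are served from the nearest cached replica.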

Page 8

• Cloud computing can be used

to produce data, too

– E.g. rendering, simulation

• Once the computation is done,

the file has to be returned to

the corporate storage for

long term use

• Cloud CPUs should be released as

soon as the computing is finished

– But transferring the file back

may take a while

• Nirvana can optimize the big data

file creation workflows

• Users register the file(s) in the global

namespace, and terminate

– Without moving the file(s) to the

corporate owned storage

– The file(s) still reside

in the Cloud storage

• Nirvana will asynchronously move

the file(s) on a best-effort basis

– Then remove the Cloud replica

– Users can access the file at any time

• In case of error, the owner

will be notified

Use Case 5: Data Creation

Page 9

• New purchases are needed

to expand the capability of

the storage system

• Just adding more hardware

to the existing solution

is not always possible

– Nor desirable

• The new system has to be

brought into production

– With minimal disruption

– Incrementally, if possible

• If the company uses Nirvana, adding

a new resource is non-disruptive

• Nirvana can be configured to move a

fraction of the files to the new system

– Can be based on file type,

user type, or both

– A copy of the file can be kept

on the old system,

to avoid user-visible problems

in case of failures in the new system

• System administrators can gradually

move a larger fraction of the

workload to the new system

– While monitoring the performance

– All without users ever noticing

Use Case 6: Evaluate new technology
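
One way to picture "move a fraction of the files, then gradually increase it" is a stable hash of the logical path. Note that this swaps in a different selection technique from the one on the slide, which selects by file or user type; hash-based bucketing is used here only because it makes the fraction explicit. The function names and the percentage parameter are hypothetical.

```python
import hashlib


def bucket(path):
    """Map a logical path to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def target_system(path, percent_on_new):
    """Decide which physical system should hold the file.

    Raising `percent_on_new` only ever adds files to the new system; a file
    already placed there is never sent back to the old one.
    """
    return "new" if bucket(path) < percent_on_new else "old"


paths = [f"/data/file{i}.dat" for i in range(200)]
on_new_at_10 = [p for p in paths if target_system(p, 10) == "new"]
on_new_at_50 = [p for p in paths if target_system(p, 50) == "new"]
```

Because the bucket depends only on the path, an administrator could raise the percentage step by step while monitoring the new system, and every file selected at 10% would still be selected at 50%.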

Page 10

• Nirvana does not manage the physical storage

– It relies on external file system solutions for that

(e.g. ext4, NAS, Spectrum Storage, Cloud storage)

– Nirvana’s global namespace is a logical entity

• Nirvana has two modes of operation

– Explicit ingest of files (transactional)

– External file system scanning (eventual consistency)

(speed-wise, internal testing shows latencies of about a minute for file systems with a few million files)

• The two modes can co-exist

Keeping logical and physical info in sync
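
The difference between the two modes can be shown with a toy in-memory catalog. This is a conceptual sketch only: the `Catalog` class, its method names and the way scanned files are given a logical path (`/<storage><physical path>`) are assumptions, not Nirvana's behavior.

```python
class Catalog:
    """Toy logical namespace: logical path -> (storage name, physical path)."""

    def __init__(self):
        self.entries = {}

    def ingest(self, logical, storage, physical):
        """Explicit ingest: the file is registered as part of the call."""
        if logical in self.entries:
            raise ValueError(f"{logical} is already registered")
        self.entries[logical] = (storage, physical)

    def scan(self, storage, listing):
        """Scanning: reconcile the catalog with a listing of one storage.

        Changes made directly on the storage only become visible in the
        catalog after the next scan (eventual consistency).
        """
        known = {loc[1]: logical
                 for logical, loc in self.entries.items() if loc[0] == storage}
        added = sorted(set(listing) - set(known))
        removed = sorted(set(known) - set(listing))
        for physical in added:
            self.entries[f"/{storage}{physical}"] = (storage, physical)
        for physical in removed:
            del self.entries[known[physical]]
        return added, removed


catalog = Catalog()
catalog.ingest("/projects/run1.dat", "nas1", "/vol/run1.dat")
# A second file was created directly on the storage, bypassing the catalog.
added, removed = catalog.scan("nas1", ["/vol/run1.dat", "/vol/run2.dat"])
# Later the first file is deleted directly on the storage.
added2, removed2 = catalog.scan("nas1", ["/vol/run2.dat"])
```

Both modes write to the same catalog, which is why they can co-exist: ingested files are already known when the scanner runs, so the scan only picks up what changed behind the catalog's back.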

Page 11

• Nirvana also allows for direct access

to underlying physical storage

– E.g. for performance reasons

– Of course, users will only see

the local namespace at that point

• All operations allowed

– Including file creation and removal

Direct access to physical storage still allowed

Page 12

• Nirvana provides a policy engine

for data movement automation

– Can make additional replicas,

– transactionally move data,

– reduce the number of replicas,

– enforce quotas

• Acts on files in physical locations

– While keeping the

global, logical view unchanged

• Rules based on any

combination of

– File properties

– File content

(automatically

extracted

metadata)

– User-provided

metadata

– Ownership

Data movement automation
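
A rule that combines several of the criteria listed above can be modeled as a predicate plus an action. The sketch below is a generic rule matcher under stated assumptions: the `Rule` class, the field names in the file description (`size`, `mime`, `owner`, `tags`) and the action strings are hypothetical and do not correspond to Nirvana's rule language.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # predicate over a file description
    action: str                      # what the engine should do on a match


RULES = [
    # File properties combined with automatically extracted content metadata
    Rule("large-video-to-archive",
         lambda f: f["size"] > 10**9 and f["mime"] == "video/mp4",
         "move:archive"),
    # Ownership combined with user-provided metadata
    Rule("project-x-replicate",
         lambda f: f["owner"] == "genomics" and f["tags"].get("project") == "x",
         "replicate:site-b"),
]


def plan(file_info, rules=RULES):
    """Return the actions of every rule that matches the file description."""
    return [rule.action for rule in rules if rule.matches(file_info)]


info = {"size": 2 * 10**9, "mime": "video/mp4",
        "owner": "genomics", "tags": {"project": "x"}}
actions = plan(info)
```

The actions only describe physical placement (where replicas live and how many exist), so applying them leaves the logical path of the file untouched.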

Page 13

• Each file has a unique normative path

in the global namespace

• The same file can however be also accessed

through many Virtual Collections

– Allowing for grouping on semantic meaning

– E.g. sequencing data could be stored by year,

and have one virtual collection per strain and

one virtual collection per sequencer model

• Any change to the normative path will be

atomically propagated to all the Virtual Collections

Virtual collections
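
One way to see why a path change can reach every collection at once is to have collections refer to files by an identifier rather than by path. This is an illustrative data model only, not a description of Nirvana's internals; the `Namespace` class, its methods and the example paths are hypothetical.

```python
class Namespace:
    """Toy model: one normative path per file, many collections per file."""

    def __init__(self):
        self.paths = {}        # file id -> normative path (exactly one)
        self.collections = {}  # collection name -> set of file ids

    def add_file(self, file_id, path):
        self.paths[file_id] = path

    def tag(self, collection, file_id):
        self.collections.setdefault(collection, set()).add(file_id)

    def rename(self, file_id, new_path):
        # A single update; collections hold ids, so they all see it at once.
        self.paths[file_id] = new_path

    def list_collection(self, collection):
        return sorted(self.paths[i] for i in self.collections.get(collection, ()))


ns = Namespace()
ns.add_file(1, "/sequencing/2016/sample1.fastq")   # stored by year
ns.tag("strain-A", 1)                               # grouped by strain
ns.tag("sequencer-model-X", 1)                      # grouped by sequencer model
ns.rename(1, "/sequencing/2016/sample1_v2.fastq")
by_strain = ns.list_collection("strain-A")
by_model = ns.list_collection("sequencer-model-X")
```

After the rename, both collections list the new normative path without any per-collection update, which mirrors the sequencing example on the slide.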