
Upload: vuonghanh

Post on 19-Jul-2018


Introduction to NCI - Part 2

National Computational Infrastructure

Download training materials here: http://nci.org.au/services-support/training/

Filesystems II VDI Data Collections

Outline

1 Filesystems II

2 VDI

3 Data Collections


NCI storage architecture


Massdata

The Mass Data Store was migrated to a new SGI Hierarchical Storage Management System in January 2012.

MDSS is used for long term storage of large datasets.

Every project has a directory on the MDSS.

All members of the project group have read and write access to the top project directory.

If you have numerous small files to archive, bundle them into a tarfile FIRST.

mdss dmls -l shows which files are online (in the disk cache) and which are on tape.
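The advice to bundle small files first can be sketched as below. This is a minimal illustration that runs on any machine: the temporary directory stands in for a directory under /short, the file names are hypothetical, and the final mdss put line is left as a comment because its destination is an assumption.

```shell
#!/bin/sh
# Sketch: bundle a directory of small files into one tarfile before archiving.

workdir=$(mktemp -d)          # stand-in for a directory under /short
mkdir "$workdir/results"

# Create a few small example files (hypothetical names).
for i in 1 2 3 4 5; do
    echo "sample $i" > "$workdir/results/run_$i.txt"
done

# Bundle and compress the whole directory into a single archive.
tar -czf "$workdir/results.tar.gz" -C "$workdir" results

# List the archive contents to confirm every file was captured.
file_count=$(tar -tzf "$workdir/results.tar.gz" | grep -c 'run_.*\.txt$')
echo "archived $file_count files into results.tar.gz"

# On Raijin the single archive could then be sent to the MDSS, for example:
#   mdss put "$workdir/results.tar.gz" "$USER"    (destination is an assumption)
```

One large archive is handled far more efficiently by a tape-backed store than thousands of small files.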


Using the MDSS

The mdss command can be used to “get” and “put” data between the interactive nodes of Raijin and the MDSS, as well as to list files and directories on the MDSS.

netcp and netmv can be used from within batch jobs to

Generate a batch script for copying/moving files to the MDSS

Submit the generated batch script to the special copyq, which runs the copy/move job on an interactive node.

netcp and netmv can also be used interactively to save you the work of creating tarfiles and generating mdss commands.

-t create a tarfile to transfer

-z/-Z gzip/compress the file to be transferred

Caution!

Always use -l other=mdss when using mdss commands in copyq. This ensures that jobs only run when the mdss system is available.
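A hand-written copyq job that follows the caution above might look like the following sketch. It only writes the job script to a temporary file and prints it, so it can be tried anywhere. The walltime value, the archive name results.tar.gz and the use of $USER as the MDSS directory are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: write out a copyq job script that archives a tarfile to the MDSS.
jobscript=$(mktemp)

cat > "$jobscript" <<'EOF'
#!/bin/bash
#PBS -q copyq
#PBS -l other=mdss
#PBS -l walltime=00:30:00

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Hypothetical file and directory names; adjust for your project.
mdss mkdir "$USER"
mdss put results.tar.gz "$USER"
mdss ls "$USER"
EOF

cat "$jobscript"
# On Raijin, submit with: qsub "$jobscript"
```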


Exercise 4: Using the MDSS

To see these commands in action do

cd /short/$PROJECT/$USER
mdss get Data/data.tar
ls -l
tar xvf data.tar
ls
rm data.tar
mdss mkdir $USER
netmv -t $USER.tar DATA $USER
watch qstat -u $USER
... (wait until job finishes, use Ctrl+C to quit) ...
less DATA.o*
mdss ls $USER
mdss rm $USER/$USER.tar


Using /jobfs

Only available through queueing system:

Request like -ljobfs=1GB

Access via the $PBS_JOBFS environment variable

All files are deleted at end of job. Copy what you need to /short or another global filesystem in the job script.

Cannot use mdss or netcp commands for files on /jobfs.


Exercise 5: Managing Files between /short, /jobfs and MDSS

Submit a batch job with a /jobfs request, where the job:

Copies an input file from /short to /jobfs

Runs a code to use the input file and generate some output

Saves the output data back to the /short area

Uses the netcp command to archive the data to the MDSS
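The four steps above can be sketched as a job script. This is not the course's runjobfs script, just a hypothetical outline of the same pattern. It falls back to temporary directories when $PBS_JOBFS is not set, so it also runs outside the queueing system. The file names, the walltime and the SHORT_DIR variable are assumptions, and the netcp call reuses the argument order shown for netmv in Exercise 4.

```shell
#!/bin/bash
#PBS -l jobfs=1GB
#PBS -l walltime=00:10:00

# Fall back to temporary directories when run outside PBS.
jobfs="${PBS_JOBFS:-$(mktemp -d)}"
short_dir="${SHORT_DIR:-$(mktemp -d)}"   # stand-in for /short/$PROJECT/$USER

# Hypothetical input file; a real job would already have one on /short.
[ -f "$short_dir/input.dat" ] || printf 'a\nb\nc\n' > "$short_dir/input.dat"

# 1. Copy the input from /short to /jobfs.
cp "$short_dir/input.dat" "$jobfs/"

# 2. Run the "code" on the job-local disk (here just a line count).
wc -l < "$jobfs/input.dat" | tr -d ' ' > "$jobfs/output.dat"

# 3. Save the output back to /short before the job ends and /jobfs is wiped.
cp "$jobfs/output.dat" "$short_dir/"

# 4. Archive from /short (never from /jobfs) to the MDSS, if netcp exists.
if command -v netcp >/dev/null 2>&1; then
    (cd "$short_dir" && netcp -t "$USER.tar" output.dat "$USER")
else
    echo "netcp not found; skipping MDSS archive step"
fi
```

Step 4 works on the copy in /short because mdss and netcp cannot see files on /jobfs.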

Read the runjobfs script then submit it to the queueing system, monitor the job with qstat, and examine the job output files:

cd /short/$PROJECT/$USER/INTRO_COURSE
qsub runjobfs
watch qstat -u $USER
... (wait until job finishes, use Ctrl+C to quit) ...
cat runjobfs.e*
cat runjobfs.o*


Exercise 5: Managing Files between /short, /jobfs and MDSS (cont)

Check out the output file that this job created on /short and the copy on the MDSS

cd /short/$PROJECT/$USER
ls -ltr
less save_data.o*
mdss ls $USER
mdss rm -r $USER


Outline

1 Filesystems II

2 VDI

3 Data Collections


What is a virtual laboratory?

A Virtual Laboratory is an interactive environment for creating and conducting simulated experiments via a computer interface. It provides a range of domain-specific digitally enabled data, programs and tools.

The National eResearch Collaboration Tools and Resources (NeCTAR) project provides infrastructure and project funding enabling Virtual Laboratories and other eResearch tools.


CSIRO - Virtual Geophysics Laboratory

Genomics Virtual Laboratory

University of Tasmania - Marine Virtual Laboratory

The All Sky Virtual Observatory

CWSLab Climate and Weather Science Laboratory

Humanities Networked Infrastructure (HuNI) unlocking and uniting Australia’s cultural data

The Characterisation Virtual Laboratory: research environments for exploring inner space

Endocrine Genomics Virtual Laboratory (EndoVL)

Biodiversity and Climate Change Virtual Laboratory

Above and beyond speech, language and music: a Virtual Laboratory for human communication science

The Industrial Ecology Virtual Laboratory


The Virtual Desktop Infrastructure (VDI)

The Virtual Desktop Infrastructure at NCI offers Australian researchers

access to spatial data analysis software

an extensive data library including climate, weather and satellite data

new research, analysis and visualisation tools

integration with NCI’s HPC infrastructure

a platform for sharing code and results across the community.

The VDI makes it easier for new scientists to get started at NCI, and to collaborate with others.


Why use the VDI?

NCI provides a Virtual Desktop Infrastructure (VDI) supporting virtual laboratories including the CWSLab and the AGDC. It provides useful systems and tools for Earth systems related researchers:

The VDI allows users to interact with parts of the NCI system as though they were on a local machine, with fast scratch space and dedicated CPUs

Interact with data - perform analyses and plot data in real time without relying on qsub -I jobs on Raijin

Familiar desktop environment for code development etc.

Not a supercomputer - think of it as your own desktop, but connected to NCI systems.

Nodes are no more powerful than a standard PC, but they are connected to /g/data and share standard Raijin modules for Python, R, Matlab, QGIS, etc.


VDI Desktop Specs

Each node has 8 vCPUs, 32GB RAM and 148GB local space.

Looks like (and is) a normal Linux UI - CentOS6

Some software packages are available through menus, others as command line modules

Session time limits apply, see help documentation for details


How to access the VDI

Need an NCI account and access to relevant project(s) and datasets

Note only a limited number of NCI projects can access the VDIs at this time

Install TurboVNC and Strudel software (already done for this course)

Follow instructions here http://training.nci.org.au/

For assistance or to request additional data mounts or software packages please contact [email protected]

Really, do work through the VDI training course or read the Google document; it has the answers to lots of questions!


Exercise 6: Let’s get acquainted!

Start Strudel

select site “NCI Virtual Desktop”

supply your username

click “login”.

select “Don’t remember me” when in a lab. On your own computer, create a passphrase at this step instead.

VNC session starts in a new window

Start a terminal by navigating

Applications → System Tools → Terminal

The VDIs share some modules with Raijin, commonly used in HPC Earth Systems work.

In the terminal, type module avail

Is the software you’re likely to want listed there?

Python (and iPython notebooks), Matlab, R, GDAL, QGIS, NCO, CDO, Ferret, UV-CDAT...


Caveats

/home IS NOT THE SAME

/home on Raijin is different to /home on the VDI. Within the VDI, /home is shared, so you can log out and back in and your data/code will still be there. However the content of your Raijin /home will not be visible from the cloud and must be copied across as needed.

The same goes for /local and other temporary space...

Remember /home and /g/data are the only persistent spaces on the VDI, everything else is wiped on log-out

Quotas apply to /home. Do not ignore any quota warnings which appear when you connect to a desktop session. Take action immediately - after the grace period expires you will be unable to start a new session.


Caveats (cont)

The VDI can access /g/data, but submitting PBS jobs to the Raijin queue requires an ssh qsub script. This means all dependencies MUST be in standard modules (on both systems) and /g/data.

If developing code to run on Raijin, the code needs to be copied back to Raijin’s /home and run from there, or run from a /g/data space via an ssh script. Note that not all libraries are the same so further development and testing is likely to be required.

Limited availability

Only 32 desktops are available on the current system. Multiple users may be allocated to share resources after this limit is reached. This will not be apparent to the user but performance may be affected. Maximum session times apply; please log out completely whenever a session is not in ongoing use.

Note

If the “submit a debug report” dialog appears when using Strudel, clicking the “Submit” button does not send the report to NCI and so we never see it. Contact us if you are having problems.


Exercise 7: Finding Data

Easiest through the command line

(You can also use the file system explorer GUI, and we’re working on web services like GeoNetwork and command line search tools.)

Change to where the data lives:

cd /g/data2/rr5/satellite/obs/himawari8/FLDK/

Can find the data we want from here

cd 2015/09/23/0900

ls gives a list of all files for this time step.
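The directory layout used in this exercise can be wrapped in a small helper. The sketch below assumes the layout <base>/<YYYY>/<MM>/<DD>/<HHMM> seen in the cd commands above; the function name list_timestep is hypothetical. Off the VDI the final call simply reports that the directory is missing.

```shell
#!/bin/sh
# Sketch: build the directory for one time step and list its files.
# Layout assumed from the exercise: <base>/<YYYY>/<MM>/<DD>/<HHMM>

list_timestep() {
    base=$1
    year=$2
    month=$3
    day=$4
    hhmm=$5
    dir="$base/$year/$month/$day/$hhmm"
    if [ -d "$dir" ]; then
        ls "$dir"
    else
        echo "no such time step: $dir" >&2
        return 1
    fi
}

himawari_base=/g/data2/rr5/satellite/obs/himawari8/FLDK
list_timestep "$himawari_base" 2015 09 23 0900 || true
```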


Exercise 8: Interacting with Data

Let’s plot some data!

We’ll look at Himawari8 satellite data.

First quickly with ncview

module load netcdf
ncview 20150923090000-P1S-ABOM_OBS_B09-PRJ_GEOS141_2000-HIMAWARI8-AHI.nc


Using Python in the VDI

You’re likely to want to do analysis in something like Python.

May also want the python NetCDF4 library, other non-standard libraries, to update versions, etc.

The VDI has the python module ’virtualenv’, which lets you define isolated python environments

Different environments can be defined for different projects or analysis workflows

This is particularly helpful when some libraries conflict with one another for a particular task, but not others

If something breaks or goes wrong within an environment, you can just delete it and start over


Exercise 9: Interacting with Data - Python virtualenv

Create a ’virtualenv’

On the VDI, load the ’python’ module followed by the virtualenv module.

$ module load python
$ module load virtualenv

Make a directory for the virtualenv (give it any name you’d like)

$ mkdir <directory>

Create the virtualenv inside the new directory. Note the name you enter here for <venv> will be the name that your terminal displays when you have activated this virtualenv.

$ cd <directory>
$ virtualenv <venv>


Exercise 9: Interacting with Data - Virtualenv (cont)

Activate/deactivate ’virtualenv’

To activate (to enter the virtualenv):

$ source <directory>/<venv>/bin/activate

To deactivate (leave the virtualenv, not delete it):

$ deactivate
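Because activating sets the VIRTUAL_ENV environment variable and deactivating unsets it, a script can check whether it is inside a virtualenv before installing anything. The following is a minimal sketch; the helper names in_virtualenv and safe_pip_install are hypothetical.

```shell
#!/bin/sh
# Sketch: refuse to pip-install unless a virtualenv is active.
# The activate script sets VIRTUAL_ENV; deactivate unsets it.

in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

safe_pip_install() {
    if in_virtualenv; then
        pip install "$@"
    else
        echo "no virtualenv active; source <directory>/<venv>/bin/activate first" >&2
        return 1
    fi
}
```

Calling safe_pip_install numpy --upgrade then behaves like pip install inside an environment and fails with a message outside one.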


Exercise 9: Interacting with Data - Virtualenv (cont)

At some point, you will probably need to update a library or install one that is not already included.

Updating a python library (e.g., NumPy) within the ’virtualenv’:

Use ’pip install’ along with ’--upgrade’ or, alternatively, ’--ignore-installed’

$ pip install numpy --upgrade

Installing a new python library (e.g., netCDF4) within the ’virtualenv’:

This package requires additional modules within the VDI to build.

$ module load netcdf/4.3.3.1
$ module load hdf5/1.8.14
$ module load szip

Use ’pip install’ along with defined paths to dependent libraries (exported so that pip can see them)

$ export HDF5_DIR=/apps/hdf5/1.8.14/
$ export NETCDF4_DIR=/apps/netcdf/4.3.3.1/
$ pip install netCDF4
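The library paths only help if they reach the environment of the pip process. As a minimal sketch of shell behaviour (nothing here is specific to NCI), a plain assignment stays in the current shell while an exported variable is passed on to child processes:

```shell
#!/bin/sh
# Sketch: a plain assignment stays in the current shell; export passes it on.

unset HDF5_DIR                       # start from a known state
HDF5_DIR=/apps/hdf5/1.8.14/
unexported=$(sh -c 'echo "${HDF5_DIR:-unset}"')

export HDF5_DIR
exported=$(sh -c 'echo "${HDF5_DIR:-unset}"')

echo "before export: $unexported"    # prints "before export: unset"
echo "after export:  $exported"      # prints "after export:  /apps/hdf5/1.8.14/"
```

Writing the assignments on the same line as the command, as in HDF5_DIR=... NETCDF4_DIR=... pip install netCDF4, has the same effect for that single command.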


Exercise 9: Interacting with Data - Notes (cont)

If you are not using ’virtualenv’:

Remember to include ’--user’ and ’--build’ to install locally.

$ export HDF5_DIR=/apps/hdf5/1.8.14/
$ export NETCDF4_DIR=/apps/netcdf/4.3.3.1/
$ pip install --user --build $TMPDIR/pip_build netcdf4


Exercise 10: VDI - iPython Notebook

Need two more python libraries (inside your virtualenv):

Update ’ipython’

$ pip install --upgrade ipython

Install ’jupyter’ notebook

$ pip install jupyter

Now let’s look at an IPython Notebook example:

/home/900/kad900/NCI_Training

To start notebook:

$ jupyter notebook


Outline

1 Filesystems II

2 VDI

3 Data Collections


NCI data collections

Datasets which are of national significance, or are otherwise useful reference data which should be securely stored and assigned a DOI, may be hosted at NCI via the RDSI project.

Data is transferred to NCI, and for RDSI projects is stored in /g/data

Data must be curated - a Data Management Plan is required, made in conjunction with Jingbo Wang at NCI and Irina Bastrakova at GA

https://datamgt.nci.org.au

Hosted data collections appear in our collection level GeoNetwork (a web based tool for searching data holdings at NCI)

http://geonetwork.nci.org.au/

Some data collections also have their own geonetwork for data, e.g. http://geonetworkrs0.nci.org.au/

For data publishing and geonetwork requests, [email protected]


Research data collections at NCI

Collection Name  Research Data Approved (TB)
Australian Data Archive (Social Sciences)  4
TERN eMAST Data Assimilation  110
Phenology Monitoring: Near Surface Remote Sensing  12
Satellite Soil Moisture Products  5
Global Navigation Satellite System (Geodesy) Data Archive  5
Australian Natural Hazards Data Archive - Tropical Cyclone, Earthquakes, Tsunami  27
Synthetic Aperture Radar Data  118
Key Water Assets  44
CSIRO Coastal Modelling Products  2
High Altitude Ice Crystals - High Ice Water Content  2
3D Geological Models of Australia  3
Australian Marine Video and Imagery Collection  7
Digitised Australian Aerial Survey Photography Collection  74
Models of Land and Water Dynamics from Space Data Collection  22
National CT-Lab Tomographic Data Collection  205
SkyMapper Southern Sky Survey  227
Plant Phenomics Digital Data Repository  10
Ocean Model for the Earth Simulator Re-analysis Datasets  27
Year Of Tropical Convection Re-analysis Datasets  90
CORDEX Australasia  57
Australian Bathymetry and Elevation Reference Dataset  113
Australian Geophysical Data Collection  175
Severe Weather Case Studies  50
Tropical Cyclone Scenarios  250
Atmospheric Forcing Products  5
ACCESS-CM 0.25 degrees Simulations  30
ACCESS Numerical Weather Prediction Models  3000


Research data collections at NCI (cont.)

National Resource Management data (post-processed CMIP5)  4
Remote and In-situ Observations Products for Earth System Modelling  366
Ocean and Marine Modelling and Forecast Products  220
Ocean Forecasting Australia Model  150
Atmospheric Re-analysis Products  2
ARC Centre of Excellence for Climate System Science Datasets Collection  166
Seasonal Climate Prediction Data Collection  595
Ecosystem Modelling and Scaling Infrastructure Facility (eMAST) Data  90
Australian Earth Observation Data (Landsat)  1474
Australian Moderate Resolution Satellite Products (NOAA/AVHRR, MODIS, VIIRS and AusCover)  428
Bioplatforms Australia Melanoma Collection  278
Coupled Model Intercomparison Project (CMIP5)  2322
Community Atmosphere Biosphere Land Exchange (CABLE) Model Collection  9


NCI data collections (cont)

NCI already hosts a number of data collections, many of which (though not all!) are of interest to GA.

Earth system sciences, climate and weather model data assets and products

Earth and marine observations and products

Geosciences

Terrestrial ecosystems

Water management and hydrology

Astronomy, social sciences and biosciences


NCI data collections (cont)

License (Thanks to Irina for providing the following information):

GA will be transitioning from Creative Commons Attribution 3.0 Australia (CC-BY 3.0) to Creative Commons Attribution 4.0 International (CC-BY 4.0) from 31 March 2015. The CC-BY 4.0 licence is compatible internationally and is similar to other international copyright licences.

You can go and view the information on the new license and if you have any queries please contact [email protected] or phone Elizabeth Fredericks on ext 9367 or Jeanette Holland ext 9731


Backup and recovery

Backup strategy forms are signed by the CI/data managers (A big thank-you to all RDS data managers in GA and Irina Bastrakova).

The backup will be done by the operational team at NCI based on the frequency requirement.

By default, the latest backup copy will replace the older copy. Multiple copies can be maintained as requested. However, this depends on the size of the storage.

NCI will inform the data managers about the backup status in due course.


Exercise 11a: Data discovery

Visit our data collections geonetwork at geonetwork.nci.org.au

Select Advanced Search tab, click Search

Shows 40+ data collections, including 4000+ records.

Try searching for data that may be of interest to you, can you find...?

MODIS satellite data

Elevation or bathymetry data

Hazards (eg earthquake) data

Enter a catalogue entry to see metadata


Exercise 11b: Interrogating metadata

Select a GeoNetwork entry to see collection metadata

Eg Water observations from space (WOfS)

Models of Land and Water Dynamics from Space Data Collection

Note fields include abstract, custodial information, keywords, access and use constraints (licence info), geographic extent, access information, and a heap of other data - all data provided for the DMP and reflected in the Geonetwork is ISO19115 or ISO19139 compliant.


Ways to access data

There are a number of ways data held at NCI may be accessed

Via a web service (if published)

Filesystem on Raijin

In a virtual desktop environment (e.g. AGDC, CWSLab)

The data listed in the geonetwork are all technically public; some may be published via a service like THREDDS or Geoserver

http://dap.nci.org.au

If the data are not published online or you need to access the data on Raijin for computing purposes, you will need to request membership of the appropriate project

Check in the collection geonetwork for the dataset of interest

Under "transfer options" there now appears a field like:

Transfer options
OnLine resource http://dap.nci.org.au
OnLine resource The data is available at NCI raijin.nci.org.au:/g/data2/<proj>

where the 2nd link shows the project that needs to be joined in order to access the data (last 3 characters).

If in doubt talk to the custodian or email [email protected].

Join data projects as needed using Mancini: https://my.nci.org.au/mancini/project

Add <project code>/join to go directly to the membership request page.
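Deriving the project code from a data path and building the membership request URL can be sketched as follows. The helper names project_from_path and join_url are hypothetical, and the sketch assumes paths of the form /g/dataN/<proj>/... as in the transfer options above.

```shell
#!/bin/sh
# Sketch: derive the project code from a /g/data path and build the join URL.

project_from_path() {
    # Fourth "/"-separated field, e.g. /g/data2/rr5/... -> rr5
    echo "$1" | cut -d/ -f4
}

join_url() {
    printf 'https://my.nci.org.au/mancini/project/%s/join\n' "$1"
}

code=$(project_from_path /g/data2/rr5/satellite/obs/himawari8/FLDK)
join_url "$code"    # prints https://my.nci.org.au/mancini/project/rr5/join
```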


Exercise 12: Data collections

On Raijin, data collections (excluding long term storage data on MDSS) are found in /g/data

ssh raijin.nci.org.au -l abc123
cd /g/data1/rr4
cd /g/data1/rr9
cd /g/data2/u39
cd /g/data2/rs0
ls

Explore data on file-system

Note

Not all datasets are globally readable; in general you will need to be in the appropriate group to read them.

Published data spaces should be well maintained and are generally not appropriate areas to create private working directories (be careful with permissions, too).
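Before exploring a collection it can help to check whether your account can actually read it. The sketch below uses only standard shell tests; the function name check_access is hypothetical, and the directories in the loop are the ones from this exercise, so off Raijin they are simply reported as not found.

```shell
#!/bin/sh
# Sketch: check whether the current user can enter and read a directory.

check_access() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "$dir: not found"
        return 2
    elif [ -r "$dir" ] && [ -x "$dir" ]; then
        echo "$dir: readable"
    else
        # Show current group memberships to compare with the project code.
        echo "$dir: not readable; your groups are: $(id -Gn)"
        return 1
    fi
}

for d in /g/data1/rr4 /g/data1/rr9 /g/data2/u39 /g/data2/rs0; do
    check_access "$d" || true
done
```

If a directory is reported as not readable and its project code is missing from your groups, request membership of that project.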
