Page 1: HDF & HDF-EOS Workshop XV 17 April 2012

HDF & HDF-EOS Workshop XV 17 April 2012

Acknowledgement: Thanks to Ed Masuoka, NASA Contract NNG06HX18C

Using HDF5 and Python: The H5py module

Daniel Kahn

Science Systems and Applications, Inc.

Page 2: HDF & HDF-EOS Workshop XV 17 April 2012

Python has lists:

>>> for elem in ['FirstItem','SecondItem','ThirdItem']:
...     print elem
...
FirstItem
SecondItem
ThirdItem
>>>

We can assign the list to a variable.

>>> MyList = ['FirstItem','SecondItem','ThirdItem']
>>> for elem in MyList:
...     print elem
...
FirstItem
SecondItem
ThirdItem
>>>

Page 3: HDF & HDF-EOS Workshop XV 17 April 2012

Lists can contain a mix of objects:

>>> MixedList = ['MyString',5,[72, 99.44]]
>>> for elem in MixedList:
...     print elem
...
MyString
5
[72, 99.44]

Lists can be addressed by index:

>>> MixedList[0]
'MyString'
>>> MixedList[2]      # a list inside a list
[72, 99.44]

Page 4: HDF & HDF-EOS Workshop XV 17 April 2012

A note about Python lists:

Python lists are one dimensional.

Arithmetic operations don’t work on them element by element: + concatenates two lists and * repeats a list.

Don’t be tempted to use them for scientific array based data sets. More on the ‘right way’ later...
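To illustrate the note about lists, here is a minimal sketch (not from the original slides) of how the same operators behave on a plain list and on a Numpy array. The variable names are made up for illustration.

```python
import numpy as np

values = [1, 2, 3]

# On a list, + concatenates and * repeats; neither does element-wise math.
concatenated = values + values
repeated = values * 2

# A Numpy array applies the same operators element by element.
array = np.array(values)
summed = array + array
doubled = array * 2

print(concatenated, repeated)
print(summed, doubled)
```

The list results have six elements each, while the array results keep three elements, each doubled.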

Page 5: HDF & HDF-EOS Workshop XV 17 April 2012

Python has dictionaries.

Dictionaries are key, value pairs:

>>> Dictionary = {'FirstKey':'FirstValue',
...               'SecondKey':'SecondValue',
...               'ThirdKey':'ThirdValue'}
>>> Dictionary
{'SecondKey': 'SecondValue', 'ThirdKey': 'ThirdValue', 'FirstKey': 'FirstValue'}
>>>

Notice that Python prints the key, value pairs in a different order than I typed them.

The key, value pairs in a dictionary are unordered in the Python 2 interpreter used here (since Python 3.7, a dictionary keeps its insertion order).

Page 6: HDF & HDF-EOS Workshop XV 17 April 2012

Dictionaries are not lists; however, we can easily create a list of the dictionary keys:

>>> list(Dictionary)
['SecondKey', 'ThirdKey', 'FirstKey']
>>>

We can use a dictionary in a loop without additional elaboration:

>>> for Key in Dictionary:
...     print Key,"---->",Dictionary[Key]
...
SecondKey ----> SecondValue
ThirdKey ----> ThirdValue
FirstKey ----> FirstValue
>>>
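As a small addition to the dictionary loop, the sketch below (Python 3 syntax, not from the original slides) shows two common variations: items() to get key and value together, and sorted() to get a reproducible key order. The dictionary mirrors the one in the slides; the other names are illustrative.

```python
# Hypothetical dictionary, mirroring the one in the slides.
dictionary = {'FirstKey': 'FirstValue',
              'SecondKey': 'SecondValue',
              'ThirdKey': 'ThirdValue'}

# items() yields (key, value) pairs, so no lookup is needed inside the loop.
pairs = [(key, value) for key, value in dictionary.items()]

# sorted() gives a reproducible key order, however the dictionary stores them.
ordered_keys = sorted(dictionary)

for key in ordered_keys:
    print(key, "---->", dictionary[key])
```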

Page 7: HDF & HDF-EOS Workshop XV 17 April 2012

An HDF5 file is made of “Dictionaries”: a dataset name is the key, and the array is the value.

HDFView is a tool which shows us the keys (TreeView) and the values (TableView) of an HDF5 file.

[Figure: HDFView screenshot, with the TreeView labeled “Keys” and the TableView labeled “Value”]

Page 8: HDF & HDF-EOS Workshop XV 17 April 2012

Andrew Collette’s H5py module allows us to use Python and HDF5 together.

We can use H5py to manipulate HDF5 files as if they were Python Dictionaries:

>>> import h5py
>>> in_fid = h5py.File('DansExample1.h5','r')
>>> for DS in in_fid:
...     print DS,"------->",in_fid[DS]
...
FirstDataset -------> <HDF5 dataset "FirstDataset": shape (25,), type "<i4">
SecondDataset -------> <HDF5 dataset "SecondDataset": shape (3, 3), type "<i4">
ThirdDataset -------> <HDF5 dataset "ThirdDataset": shape (5, 5), type "<i4">
>>>

In the output, the dataset names on the left are the keys and the dataset objects on the right are the values.
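For readers without DansExample1.h5, here is a self-contained sketch of the same dictionary-style access. It is not from the original slides: it assumes an in-memory file (the 'core' driver with backing_store=False), and the file name and dataset contents are placeholders.

```python
import h5py
import numpy as np

# driver='core' with backing_store=False keeps the file in memory only,
# so nothing is written to disk; the file name is a placeholder.
with h5py.File('sketch_dict.h5', 'w', driver='core', backing_store=False) as fid:
    # Assigning an array to a key creates a dataset of that name.
    fid['FirstDataset'] = np.arange(25, dtype='int32')
    fid['SecondDataset'] = np.zeros((3, 3), dtype='int32')

    names = sorted(fid.keys())               # dataset names act as keys
    has_first = 'FirstDataset' in fid        # membership test, as with a dict
    shapes = {name: ds.shape for name, ds in fid.items()}

print(names, has_first, shapes)
```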

Page 9: HDF & HDF-EOS Workshop XV 17 April 2012

So What? We need to be able to manipulate the arrays, not just the file.

The Numpy module by Travis Oliphant allows the manipulation of arrays in Python.

We will see examples of writing arrays later, but to get arrays from the H5py object we use the ellipsis, [...]:

>>> import h5py
>>> fid = h5py.File('DansExample1.h5','r')
>>> fid['FirstDataset']
<HDF5 dataset "FirstDataset": shape (25,), type "<i4">
>>> fid['FirstDataset'][...]
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])
>>> type(fid['FirstDataset'][...])
<type 'numpy.ndarray'>
>>>
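The ellipsis reads the whole dataset; ordinary slice syntax reads only part of it. The sketch below is not from the slides and assumes an in-memory file with a placeholder name, holding a stand-in for FirstDataset.

```python
import h5py
import numpy as np

with h5py.File('sketch_read.h5', 'w', driver='core', backing_store=False) as fid:
    dataset = fid.create_dataset('FirstDataset',
                                 data=np.arange(25, dtype='int32'))

    everything = dataset[...]     # whole dataset as a Numpy array
    first_five = dataset[0:5]     # only elements 0..4 are read
    total = int(everything.sum())

print(type(everything).__name__, first_five, total)
```

Both results are Numpy arrays that remain usable after the file is closed, because the data has already been copied into memory.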

Page 10: HDF & HDF-EOS Workshop XV 17 April 2012

Reasons to use Python and HDF5 instead of C or Fortran

The basic Python Dictionary object has a close similarity to the HDF5 Group. The object-oriented and dynamic nature of Python allows the existing Dictionary syntax to be repurposed for HDF5 manipulation.

In short, working with HDF5 in Python requires much less code than C or Fortran, which means faster development and fewer errors.

Page 11: HDF & HDF-EOS Workshop XV 17 April 2012

Comparison to C, h5_gzip (from THG site):

                    C      Python
# Lines of code     106    37

Fewer lines of code means fewer places to make mistakes.

The 37 line h5_gzip.py example is a “direct” translation of the C version. Some more advanced techniques offer insight into advantages of Python/H5py programming. Text in next slides is color coded to help match code with same functionality. First writing a file…

Page 12: HDF & HDF-EOS Workshop XV 17 April 2012

Original h5_gzip.py:

# This example creates and writes GZIP compressed dataset.
import h5py
import numpy as np
# Create gzip.h5 file.
#
file = h5py.File('gzip.h5','w')
#
# Create /DS1 dataset; in order to use compression, dataset has to be chunked.
#
dataset = file.create_dataset('DS1',(32,64),'i',chunks=(4,8),compression='gzip',compression_opts=9)
#
# Initialize data.
#
data = np.zeros((32,64))
for i in range(32):
    for j in range(64):
        data[i][j]= i*j-j
# Write data.
print "Writing data..."
dataset[...] = data
file.close()

Pythonic h5_gzip.py:

#!/usr/bin/env python
# It's a UNIX thing.....

from __future__ import print_function # Code will work with python 3 as well....

# This example creates and writes GZIP compressed dataset.
import h5py # load the HDF5 interface module
import numpy as np # Load the array processing module

# Initialize data. Note the numbers 32 and 64 only appear ONCE in the code!
LeftVector = np.arange(-1,32-1,dtype='int32')
RightVector = np.arange(64,dtype='int32')
DataArray = np.outer(LeftVector,RightVector) # create 32x64 array of i*j-j
# The _with_ construct will automatically create and close the HDF5 file
with h5py.File('gzip-pythonic.h5','w') as h5_fid:
    # Create and write /DS1 dataset; in order to use compression, dataset has to be chunked.
    h5_fid.create_dataset('DS1',data=DataArray,chunks=(4,8),compression='gzip',compression_opts=9)
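The Pythonic version relies on the identity i*j - j = (i - 1)*j, which makes the 32x64 array an outer product of two vectors. The following check is a sketch added here, not part of the original slides; it builds the array both ways and compares them. The variable names are illustrative.

```python
import numpy as np

rows, cols = 32, 64

# Loop version, as in the direct translation of the C example.
looped = np.zeros((rows, cols), dtype='int32')
for i in range(rows):
    for j in range(cols):
        looped[i][j] = i * j - j

# Vectorized version: i*j - j == (i - 1) * j, which is an outer product.
left = np.arange(-1, rows - 1, dtype='int32')    # i - 1 for i = 0..31
right = np.arange(cols, dtype='int32')           # j for j = 0..63
vectorized = np.outer(left, right)

print(np.array_equal(looped, vectorized))
```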

Page 13: HDF & HDF-EOS Workshop XV 17 April 2012

Reading data…

Original h5_gzip.py:

# Read data back; display compression properties and dataset max value.
#
file = h5py.File('gzip.h5','r')
dataset = file['DS1']
print "Compression method is", dataset.compression
print "Compression parameter is", dataset.compression_opts
data = dataset[...]
print "Maximum value in", dataset.name, "is:", max(data.ravel())
file.close()

Pythonic h5_gzip.py:

# Read data back; display compression properties and dataset max value.
#
with h5py.File('gzip-pythonic.h5','r') as h5_fid:
    dataset = h5_fid['DS1']
    print("Compression method is", dataset.compression)
    print("Compression parameter is", dataset.compression_opts)
    # dataset[...] reads the array; the older dataset.value attribute
    # was removed in h5py 3.0.
    print("Maximum value in", dataset.name, "is:",
          dataset[...].max())
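The write and read steps can be tried together without leaving a file behind. The sketch below is an assumption-laden variant, not the author's script: it uses an in-memory file (the 'core' driver with backing_store=False) and a placeholder file name, and queries the same dataset properties as the slides.

```python
import h5py
import numpy as np

# Same 32x64 array of i*j - j as in the write example.
data = np.outer(np.arange(-1, 31, dtype='int32'),
                np.arange(64, dtype='int32'))

with h5py.File('sketch_gzip.h5', 'w', driver='core', backing_store=False) as fid:
    # Compression requires a chunked dataset.
    fid.create_dataset('DS1', data=data, chunks=(4, 8),
                       compression='gzip', compression_opts=9)

    dataset = fid['DS1']
    method = dataset.compression          # name of the compression filter
    level = dataset.compression_opts      # gzip level passed at creation
    maximum = int(dataset[...].max())     # read the array, then take the max

print(method, level, maximum)
```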

Page 14: HDF & HDF-EOS Workshop XV 17 April 2012

And finally, just to see what the file looks like…

Page 15: HDF & HDF-EOS Workshop XV 17 April 2012

Real world example: Table Comparison

Background:

For the OMPS instruments we need to design binary arrays to be uploaded to the satellite to sub-sample the CCD to reduce the data rate.

For ground processing use we store these arrays in HDF5.

As part of the design process we want to be able to compare arrays in two different files.

Page 16: HDF & HDF-EOS Workshop XV 17 April 2012

Here is an example of a Sample Table

Page 17: HDF & HDF-EOS Workshop XV 17 April 2012

Here is another example:

Page 18: HDF & HDF-EOS Workshop XV 17 April 2012

Here is the “difference” of the arrays. Red pixels are unique to the first array.

Page 19: HDF & HDF-EOS Workshop XV 17 April 2012

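The code: CompareST.py

#!/usr/bin/env python
""" Documentation """
from __future__ import print_function,division

import h5py
import numpy
import ViewFrame

def CompareST(ST1,ST2,IntTime):

    with h5py.File(ST1,'r') as st1_fid,h5py.File(ST2,'r') as st2_fid:

        # Read both sample tables into Numpy arrays ([...] replaces the
        # older .value attribute, which was removed in h5py 3.0).
        ST1 = st1_fid['/DATA/'+IntTime+'/SampleTable'][...]
        ST2 = st2_fid['/DATA/'+IntTime+'/SampleTable'][...]

    # Reduce both tables to 0/1 masks and take their difference.
    ST1[ST1!=0] = 1
    ST2[ST2!=0] = 1
    Diff = (ST1 - ST2)

    # Mark the pixels that are set only in the first table.
    ST1[Diff == 1] = 2

    ViewFrame.ViewFrame(ST1)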
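The masking and differencing steps in CompareST can be followed on a tiny example. This is a sketch with made-up 2x3 tables, not OMPS data; a signed dtype is assumed so that the subtraction can go negative.

```python
import numpy as np

# Two made-up 2x3 sample tables; nonzero means the pixel is sampled.
st1 = np.array([[0, 7, 7],
                [3, 0, 0]], dtype='int16')
st2 = np.array([[0, 7, 0],
                [0, 0, 5]], dtype='int16')

# Reduce both tables to 0/1 masks.
st1[st1 != 0] = 1
st2[st2 != 0] = 1

# +1 where only st1 is set, -1 where only st2 is set, 0 where they agree.
diff = st1 - st2

# Flag the pixels unique to st1 with the value 2.
st1[diff == 1] = 2

print(st1)
```

After these steps st1 holds 0 for unsampled pixels, 1 for pixels set in st1 that are also set in st2, and 2 for pixels unique to st1, which is what the viewer colors differently.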

Page 20: HDF & HDF-EOS Workshop XV 17 April 2012

...and the command line argument parsing.

if __name__ == "__main__":

    import argparse

    OptParser = argparse.ArgumentParser(description = __doc__)

    OptParser.add_argument("--ST1",help="SampleTableFile1")
    OptParser.add_argument("--ST2",help="SampleTableFile2")
    OptParser.add_argument("--IntTime",help="Integration Time",
                           default='Long')

    options = OptParser.parse_args()

    CompareST(options.ST1,options.ST2,options.IntTime)
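The parser can be exercised without a shell by passing a list of arguments to parse_args(). The sketch below reuses the three options from the slide; the description string and file names are placeholders.

```python
import argparse

parser = argparse.ArgumentParser(description="Compare two sample tables.")
parser.add_argument("--ST1", help="SampleTableFile1")
parser.add_argument("--ST2", help="SampleTableFile2")
parser.add_argument("--IntTime", help="Integration Time", default='Long')

# Passing a list to parse_args() stands in for the real command line;
# the file names are placeholders.
options = parser.parse_args(["--ST1", "first.h5", "--ST2", "second.h5"])

print(options.ST1, options.ST2, options.IntTime)
```

Because --IntTime is not given, it falls back to its default, 'Long'.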

Page 21: HDF & HDF-EOS Workshop XV 17 April 2012

Recursive descent into HDF5 file

Print group names, number of children and dataset names.

#!/usr/bin/env python
from __future__ import print_function
import h5py

def print_num_children(obj):
    # h5py.Group is the supported name; the h5py.highlevel module
    # was removed in h5py 3.0.
    if isinstance(obj,h5py.Group):
        print(obj.name,"Number of Children:",len(obj))
        for ObjName in obj: # ObjName will be a string
            print_num_children(obj[ObjName])
    else:
        print(obj.name,"Not a group")

with h5py.File('OMPS-NPP-NPP-LP_STB', 'r+') as f:
    print_num_children(f)
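As an alternative to writing the recursion by hand, h5py groups provide a visititems() method that walks the tree and calls a function for each object. The sketch below is not from the slides; it assumes a small in-memory file with a placeholder name, and the group layout only imitates the real file.

```python
import h5py
import numpy as np

visited = []

def record(name, obj):
    # visititems() passes the path relative to the starting group and the object.
    kind = "group" if isinstance(obj, h5py.Group) else "dataset"
    visited.append((name, kind))
    # Returning None tells h5py to keep visiting.

with h5py.File('sketch_tree.h5', 'w', driver='core', backing_store=False) as f:
    # Intermediate groups (DATA, DATA/Long) are created automatically.
    f.create_dataset('DATA/Long/Gain', data=np.ones(4))
    f.create_dataset('DATA/AutoSplitLong', data=np.zeros(2))
    f.visititems(record)

print(sorted(visited))
```

Unlike the hand-written version, visititems() does not call the function on the starting group itself, so the root "/" does not appear in the list.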

Page 22: HDF & HDF-EOS Workshop XV 17 April 2012

The Result….

ssai-s01033@dkahn: ~/python % ./print_num_children.py
/ Number of Children: 1
/DATA Number of Children: 10
/DATA/AutoSplitLong Not a group
/DATA/AutoSplitShort Not a group
/DATA/AuxiliaryData Number of Children: 6
/DATA/AuxiliaryData/FeatureNames Not a group
/DATA/AuxiliaryData/InputSpecification Not a group
/DATA/AuxiliaryData/LongLowEndSaturationEstimate Not a group
/DATA/AuxiliaryData/ShortLowEndSaturationEstimate Not a group
/DATA/AuxiliaryData/Timings Number of Children: 2
/DATA/AuxiliaryData/Timings/Long Not a group
/DATA/AuxiliaryData/Timings/Short Not a group
/DATA/AuxiliaryData/dummy Not a group
/DATA/Long Number of Children: 14
/DATA/Long/BadPixelTable Not a group
/DATA/Long/BinTransitionTable Not a group
/DATA/Long/FeatureNamesIndexes Not a group
/DATA/Long/Gain Not a group
/DATA/Long/InverseOMPSColumns Not a group

Page 23: HDF & HDF-EOS Workshop XV 17 April 2012

Summary

Python with the H5py and Numpy modules makes developing programs to manipulate HDF5 files and perform calculations with HDF5 arrays simpler, which increases development speed and reduces errors.