Python module - HDF5

HDF5 => stands for Hierarchical Data Format 5, often called h5 for short. The h5py package is a Pythonic interface to this binary data format.

It is an open file format that comes in handy for storing large amounts of data. As the name suggests, it stores data in a hierarchical structure within a single file, so if we want to quickly access a particular part of the file rather than the whole file, we can easily do that using HDF5. This functionality is not available in normal text files.

HDF5 files are widely used in AI projects, since they can store terabytes of data and can easily be sliced as if they were NumPy arrays.

We'll need to install the h5py module in Python. To use h5py, numpy also needs to be imported; look in the numpy section for its installation.

Installation:

CentOS: We install it using pip.

sudo python3.6 -m pip install h5py => installs h5py for python 3.6

HDF5 Format:

Very good tutorial on HDF5 is on this link: https://twiki.cern.ch/twiki/pub/Sandbox/JaredDavidLittleSandbox/PythonandHDF5.pdf

or from local link HDF5

HDF5 includes only two basic structures: a multidimensional array of record structures, and a grouping structure. h5py uses straightforward NumPy array and Python dictionary syntax. For example, you can iterate over the datasets in an HDF5 file, or check out the .shape or .dtype attributes of a dataset.

HDF5 files are organized in a hierarchical structure, with two primary structures: groups and datasets.

  • HDF5 group: a grouping structure containing instances of zero or more groups or datasets, together with supporting metadata. A group has two parts:
    • A group header, which contains a group name and a list of group attributes.
    • A group symbol table, which is a list of the HDF5 objects that belong to the group.
  • HDF5 dataset: a multidimensional array of data elements, together with supporting metadata. A dataset is stored in a file in two parts: a header and a data array.
    • dataset header: contains the information needed to interpret the array portion of the dataset, as well as metadata (or pointers to metadata) that describes or annotates the dataset. Header information includes the name of the object, its dimensionality, its number-type, information about how the data itself is stored on disk, and other information used by the library to speed up access to the dataset or maintain the file's integrity.
    • data array: where the actual data is stored.
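The group/dataset structure above can be sketched in h5py. This is a minimal, self-contained example; the file name "example.h5", the group/dataset names, and the "units" attribute are made up for illustration:

```python
import numpy as np
import h5py

# Write a file with one group containing one dataset plus header metadata.
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("Group1")                              # an HDF5 group
    dset = grp.create_dataset("Dataset1",
                              data=np.arange(12).reshape(3, 4)) # an HDF5 dataset
    dset.attrs["units"] = "counts"                              # metadata attribute

# Read it back: dictionary-style paths, NumPy-style attributes.
with h5py.File("example.h5", "r") as f:
    dset = f["Group1/Dataset1"]
    shape = dset.shape              # dimensionality, from the dataset header
    dtype = str(dset.dtype)         # number-type, from the dataset header
    units = dset.attrs["units"]     # attribute stored alongside the data
```

Note how the path "Group1/Dataset1" is addressed like a dictionary key, while .shape and .dtype behave like their NumPy counterparts.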
Ex of HDF5 file: trefer1.h5

HDF5 "trefer1.h5" {
GROUP "/" {
   DATASET "Dataset3" {
      DATATYPE { H5T_REFERENCE }
      DATASPACE { SIMPLE ( 4 ) / ( 4 ) }
      DATA { DATASET 0:1696, DATASET 0:2152, GROUP 0:1320, DATATYPE 0:2268 }
   }
   GROUP "Group1" {
      DATASET "Dataset1" {
         DATATYPE { H5T_STD_U32LE }
         DATASPACE { SIMPLE ( 4 ) / ( 4 ) }
         DATA { 0, 3, 6, 9 }
      }
      DATASET "Dataset2" {
         DATATYPE { H5T_STD_U8LE }
         DATASPACE { SIMPLE ( 4 ) / ( 4 ) }
         DATA { 0, 0, 0, 0 }
      }
      DATATYPE "Datatype1" {
         H5T_STD_I32BE "a";
         H5T_STD_I32BE "b";
         H5T_IEEE_F32BE "c";
      }
   }
}
}
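A hierarchy like the one dumped above can be walked from Python with visititems(). This sketch builds a small file ("tree.h5", names chosen to mimic the dump) and collects every group and dataset under the root:

```python
import numpy as np
import h5py

# Build a file with the same layout as Group1 in the dump above.
with h5py.File("tree.h5", "w") as f:
    g = f.create_group("Group1")
    g.create_dataset("Dataset1", data=np.array([0, 3, 6, 9], dtype="uint32"))
    g.create_dataset("Dataset2", data=np.zeros(4, dtype="uint8"))

names = []

def collect(name, obj):
    # visititems() calls this once per object below the root group.
    kind = "GROUP" if isinstance(obj, h5py.Group) else "DATASET"
    names.append((kind, name))

with h5py.File("tree.h5", "r") as f:
    f.visititems(collect)
```

This is handy when you receive an unfamiliar h5 file and want to see its structure before reading anything.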

Usage:

Most of the time when doing an AI project, we won't be doing anything more than reading or writing HDF5 files. Let's look at these 2 operations.

ex: reading an h5 file

import numpy as np
import h5py

test_dataset = h5py.File('dir1/test.h5', "r")    # opens the file in read mode
test_set_x = np.array(test_dataset["test_x"][:]) # [:] reads the whole dataset into a NumPy array
test_dataset.close()                             # close the file when done
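The real payoff of HDF5 is that you don't have to read the whole dataset: slicing a dataset object reads only the requested part from disk. A self-contained sketch (the file name "slices.h5" and dataset name "test_x" are illustrative):

```python
import numpy as np
import h5py

# Create a sample file to read from.
with h5py.File("slices.h5", "w") as f:
    f["test_x"] = np.arange(100).reshape(20, 5)

with h5py.File("slices.h5", "r") as f:
    first_rows = f["test_x"][:3]       # only rows 0..2 are read from disk
    whole = np.array(f["test_x"][:])   # [:] pulls the full array into memory
```

For TB-scale files, prefer slicing (f["test_x"][:3]) over the full read, since only the slice is loaded into RAM.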
 
ex: writing an h5 file

f = h5py.File("testfile.hdf5", "w") # "w" creates the file (truncates it if it exists)

arr = np.ones((5,2))

f["my dataset"] = arr # this stores the 5x2 array into file testfile.hdf5
f.close()
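Besides the shorthand assignment, create_dataset() gives finer control when writing, e.g. compression. A hedged sketch (file and dataset names here are made up; "gzip" is a compression filter built into HDF5):

```python
import numpy as np
import h5py

arr = np.ones((5, 2))

with h5py.File("write_demo.hdf5", "w") as f:       # "with" closes the file automatically
    f["my dataset"] = arr                          # shorthand: stores the 5x2 array
    f.create_dataset("compressed",
                     data=np.zeros((1000, 1000)),
                     compression="gzip")           # chunked and gzip-compressed on disk

with h5py.File("write_demo.hdf5", "r") as f:
    stored = f["my dataset"][:]
    comp = f["compressed"].compression             # reports "gzip"
```

Compression matters for the large AI datasets mentioned earlier: gzip-compressed zeros like these take almost no disk space.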