HDF5 data file export

Datasets captured using PicoLog 6 can run into the gigabytes, and exporting that amount of data for processing in third-party applications becomes impractical.

Exporting data in CSV format

PicoLog already has a CSV export format, which consists of an ASCII file where values are separated by commas (in the UK and US) or semicolons (in much of Europe). Because it's an open format, CSV is commonly used to exchange data between businesses, consumers and scientific applications, and is human-readable using simple text editors, spreadsheets and even data processing applications such as MATLAB, making it a popular choice. CSV format is operating-system-agnostic, meaning it can be first created on a Linux or macOS computer and then seamlessly opened on a Windows PC, thus covering all the OSs that PicoLog 6 supports.

Great! So why would I need any other formats? It sounds like CSV is perfect.

Size and speed versus ease of use

For CSVs to be human-readable, the data is displayed in the ASCII character set. This contains the Roman alphabet, numbers and a few symbols. Each ASCII character can be represented by 7 bits, which means it fits in one byte of storage.

So we come to the first significant factor in the size of a CSV file: the number of characters that represent one sample point. Measured temperatures, for example, can be short, often 3 to 5 significant digits. In the case of a thermocouple temperature sensor, 3 significant digits such as "20.3 °C" (the units aren't included in the CSV) would be practical for thermocouple accuracy, but for higher-precision applications using a PT100 sensor, 5 significant digits such as "20.305 °C" would better fit the accuracy of the sensor and logger. But what if I told you that "20.305" was 8 characters?

  • Some countries and regions use the comma as a decimal separator. Many applications would read this as a column delimiter, so we use quoted CSV values to encapsulate the value. Your system settings will choose the decimal separator based on your regional settings. The quotes add two characters.
  • Don't forget that the decimal separator is also a character. 

So even with only 5 numerical digits, each sample point is 8 bytes (5 numerical characters + 3 formatting characters, each 1 byte). This is the Achilles' heel of the CSV format when used with very large datasets. Making it human-readable means more bytes are used to encode each sample.

If your system has multiple channels and devices, recording for many hours, days or even weeks, you can start to see how the size of CSVs can grow quickly.
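The arithmetic above can be checked with a short Python sketch. The function name csv_field_bytes, the comma decimal separator, and the capture of 3 channels at 1 sample per second for 30 days are illustrative assumptions, not PicoLog settings.

```python
def csv_field_bytes(value, decimals=3, decimal_sep=",", quoted=True):
    """Bytes used by one value written as an ASCII CSV field (column delimiter not counted)."""
    text = f"{value:.{decimals}f}".replace(".", decimal_sep)
    if quoted:
        text = '"' + text + '"'
    return len(text.encode("ascii"))

# "20,305" in quotes: 5 digits + 1 decimal separator + 2 quotes
per_sample = csv_field_bytes(20.305)
print(per_sample)  # 8

# Hypothetical capture: 3 channels, 1 sample per second each, for 30 days.
channels, rate_hz, days = 3, 1, 30
total_samples = channels * rate_hz * 60 * 60 * 24 * days
print(total_samples * per_sample)  # 62208000 bytes for the sample values alone
```

Raising the sample rate or the channel count scales the total linearly, which is why long multi-channel captures produce very large CSV files.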

The second issue is that CSV data is uncompressed. We'll go into a bit more detail on compression later.

The .picolog file format

PicoLog also exports data in our proprietary .picolog format, which is a closed format. It's useful for storing and sharing large data files within the PicoLog application, lets the PicoLog file viewer page preview a thumbnail of the waveform, and embeds user-definable search tags in the file.

The .picolog file format stores all values as binary 32-bit floating-point numbers, which take up 4 bytes, and although that still sounds big, .picolog files are compressed. This can reduce the size by up to 4 times compared with CSV format.
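To illustrate the fixed 4-byte cost of a 32-bit float, here is a small sketch using Python's standard struct module. It shows only the generic IEEE 754 single-precision encoding; the internal layout of a .picolog file is not public, and the little-endian byte order used here is an assumption.

```python
import struct

values = [20.3, 20.305, -123456.789]

for v in values:
    packed = struct.pack("<f", v)              # little-endian 32-bit float
    (restored,) = struct.unpack("<f", packed)  # value after the round trip
    # Every value takes 4 bytes, however many digits it has as text;
    # the restored value is accurate to roughly 7 significant digits.
    print(len(packed), restored)

# A block of samples is simply 4 bytes per value before any compression.
block = struct.pack(f"<{len(values)}f", *values)
print(len(block))  # 12
```

Unlike a CSV field, the binary size does not grow with the number of digits, at the cost of no longer being readable in a text editor.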

But this is a problem for users who want to import a very large dataset into third-party applications such as MATLAB. The .picolog file format is closed to help prevent data being manipulated. 

Introducing ... HDF5 file export in PicoLog 6

PicoLog 6.1.16 and later can export raw binary data in HDF5 format.

We chose the HDF5 (Hierarchical Data Format 5) file format for a number of reasons.

  1. It is designed from the ground up to store and organize “big data”.
  2. It is available under a BSD-style license for general use.
  3. It uses very effective compression, making the file size much smaller than CSV. 
  4. The freely available, open format HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).
  5. It's ideal for use with third-party applications such as MATLAB for data analysis and number crunching.

Speaking of compression, HDF5 is compatible with a number of compression methods, which is another bonus for developers using HDF5 in their applications. When you select HDF5 file export in PicoLog, the app tells the HDF5 driver what block size to use and away it goes, exporting and compressing the data. The compression method we use works well with repeated samples, so captures with slow-moving values will compress well. If you're recording fast-moving data, it compresses less well. The worst-case scenario is white noise, which will hardly compress at all.
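The effect of signal type on compression can be seen in a rough sketch. The compression filter and block size PicoLog uses aren't stated here, so this example uses zlib (DEFLATE) from the Python standard library purely as a stand-in, and the three test signals are made up for illustration.

```python
import random
import struct
import zlib

def compression_ratio(samples):
    """Raw size divided by zlib-compressed size for samples stored as 32-bit floats."""
    raw = struct.pack(f"<{len(samples)}f", *samples)
    return len(raw) / len(zlib.compress(raw))

n = 10_000
rng = random.Random(0)  # fixed seed so the run is repeatable

steady = [20.3] * n                                  # the same sample repeated
slow = [20.0 + 0.1 * (i // 500) for i in range(n)]   # steps by 0.1 every 500 samples
noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # white noise

for name, signal in [("steady", steady), ("slow", slow), ("noise", noise)]:
    print(f"{name}: {compression_ratio(signal):.1f}x")
```

The repeated and slow-moving signals shrink by a large factor, while the noise signal stays close to its raw size. The exact ratios depend on the compressor and its settings, so they should not be read as PicoLog's figures.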

Example code and demonstrations

We've put together two code examples and short screen capture videos to demonstrate importing an HDF5 file into both MATLAB and Python and displaying the data in a graph. The HDF5 file is a 30-day capture of three-phase current using a PicoLog CM3 data logger.

You can find the MATLAB example here. Here is a short video showing how it works:

You can also find the Python example here.
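As a starting point before opening the linked examples, here is a minimal sketch of inspecting an HDF5 file in Python. It is not the Pico example code. It assumes the third-party h5py package is installed, and because the group and dataset names inside a PicoLog export aren't described here, it discovers them rather than hard-coding any. The helper names list_datasets and read_dataset are hypothetical.

```python
import h5py

def list_datasets(path):
    """Return (name, shape, dtype, compression) for every dataset in an HDF5 file."""
    found = []

    def visit(name, obj):
        # visititems walks every group and dataset; keep only the datasets.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, str(obj.dtype), obj.compression))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found

def read_dataset(path, name):
    """Load one dataset fully into memory as a NumPy array."""
    with h5py.File(path, "r") as f:
        return f[name][()]
```

Calling list_datasets on an exported file (for example a hypothetical "capture.hdf5") shows which dataset names to pass to read_dataset. For very large captures, slicing the h5py dataset while the file is open avoids loading everything into memory at once.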