RT-DC datasets

Knowing and understanding the RT-DC dataset classes is an important prerequisite when working with dclab. They are all derived from RTDCBase, which gives access to features with a dictionary-like interface, facilitates data export and filtering, and comes with several convenience methods that are useful for data visualization. RT-DC datasets can be based on a data file format (RTDC_TDMS and RTDC_HDF5), accessed from an online repository (RTDC_DCOR), created from user-defined dictionaries (RTDC_Dict), or derived from other RT-DC datasets (RTDC_Hierarchy).

Basic usage

The convenience function dclab.new_dataset() takes care of determining the data format and returns the corresponding derived class.

In [1]: import dclab

In [2]: ds = dclab.new_dataset("data/example.rtdc")

In [3]: ds.__class__.__name__
Out[3]: 'RTDC_HDF5'
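
The dispatch can be pictured roughly as in the following sketch. It is not dclab's implementation: guess_dataset_class is a hypothetical helper, and the mapping of the .tdms file extension to RTDC_TDMS is an assumption based on the class names listed above.

```python
import pathlib


def guess_dataset_class(source):
    """Return the name of the dataset class a loader might pick.

    Hypothetical helper for illustration; not part of dclab.
    """
    if isinstance(source, dict):
        # user-defined dictionaries of feature arrays
        return "RTDC_Dict"
    suffix = pathlib.Path(str(source)).suffix.lower()
    if suffix == ".rtdc":
        # HDF5-based file format (see the example above)
        return "RTDC_HDF5"
    if suffix == ".tdms":
        # assumption: TDMS-based measurements use this extension
        return "RTDC_TDMS"
    raise ValueError(f"Cannot determine data format of {source!r}")


print(guess_dataset_class("data/example.rtdc"))       # RTDC_HDF5
print(guess_dataset_class({"deform": [0.01, 0.02]}))  # RTDC_Dict
```

The point of such a single entry point is that scripts do not need to know which derived class handles a given input.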

Working with other data

It is also possible to load other data into dclab from a dictionary.

In [4]: import numpy as np
   ...: data = dict(deform=np.random.rand(100),
   ...:             area_um=np.random.rand(100))
   ...: 

In [5]: ds_dict = dclab.new_dataset(data)

In [6]: ds_dict.__class__.__name__
Out[6]: 'RTDC_Dict'

Using filters

Filters are used to mask unwanted events, e.g. debris or doublets, in a dataset.

# Restrict the deformation to the range [0, 0.15]
In [7]: ds.config["filtering"]["deform min"] = 0

In [8]: ds.config["filtering"]["deform max"] = .15

# Manually excluding events using array indices is also possible:
# `ds.filter.manual` is a 1D boolean array of size `len(ds)`
# where `False` values mean that the events are excluded.
In [9]: ds.filter.manual[[0, 400, 345, 1000]] = False

In [10]: ds.apply_filter()

# The boolean array `ds.filter.all` represents the applied filter
# and can be used for indexing.
In [11]: ds["deform"].mean(), ds["deform"][ds.filter.all].mean()
Out[11]: (0.0287258, 0.026486598)

Note that ds.apply_filter() must be called, otherwise ds.filter.all will not be updated.
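
How the individual filters combine into a single mask can be illustrated without dclab. The following is a minimal sketch under two assumptions: the box filter bounds are inclusive, and an event is kept only if every filter keeps it. Plain Python lists stand in for the boolean arrays, and box_mask is a hypothetical helper.

```python
def box_mask(values, vmin, vmax):
    """Boolean mask that is True where vmin <= value <= vmax."""
    return [vmin <= v <= vmax for v in values]


deform = [0.01, 0.20, 0.05, 0.12, 0.30]

# box filter, analogous to "deform min" / "deform max"
box = box_mask(deform, 0, 0.15)

# manual filter: start with all events included, then exclude index 2
manual = [True] * len(deform)
manual[2] = False

# an event passes only if every individual filter keeps it
combined = [b and m for b, m in zip(box, manual)]

kept = [v for v, keep in zip(deform, combined) if keep]
print(combined)               # [True, False, False, True, False]
print(sum(kept) / len(kept))  # mean of the remaining events
```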

Creating hierarchies

When applying filtering operations, it is sometimes helpful to use hierarchies for keeping track of the individual filtering steps.

In [12]: child = dclab.new_dataset(ds)

In [13]: child.config["filtering"]["area_um min"] = 0

In [14]: child.config["filtering"]["area_um max"] = 80

In [15]: grandchild = dclab.new_dataset(child)

In [16]: grandchild.apply_filter()

In [17]: len(ds), len(child), len(grandchild)
Out[17]: (5000, 4933, 4778)

In [18]: ds.filter.all.sum(), child.filter.all.sum(), grandchild.filter.all.sum()
Out[18]: (4933, 4778, 4778)

Note that calling grandchild.apply_filter() automatically calls child.apply_filter() and ds.apply_filter(). Also note that, as expected, the size of each hierarchy child is identical to the sum of the boolean filtering array from its hierarchy parent.
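
The relationship between parent filters and child sizes can be reproduced with a toy model. ToyDataset below is a hypothetical stand-in with a single box filter, written only to illustrate the propagation described above; it does not reflect how dclab implements hierarchies.

```python
class ToyDataset:
    """Minimal stand-in for a dataset with a single box filter.

    Hypothetical class for illustration; not part of dclab.
    """

    def __init__(self, values=None, parent=None):
        self.values = values
        self.parent = parent
        self.vmin, self.vmax = float("-inf"), float("inf")
        self.mask = []

    def apply_filter(self):
        if self.parent is not None:
            # refresh the ancestors first, then keep only the events
            # that pass the parent's filter
            self.parent.apply_filter()
            self.values = [v for v, keep
                           in zip(self.parent.values, self.parent.mask)
                           if keep]
        self.mask = [self.vmin <= v <= self.vmax for v in self.values]

    def __len__(self):
        return len(self.values)


root = ToyDataset(values=[10, 40, 75, 90, 120])
root.vmax = 100            # excludes 120
child = ToyDataset(parent=root)
child.vmax = 80            # additionally excludes 90
grandchild = ToyDataset(parent=child)

grandchild.apply_filter()  # also updates child and root

print(len(root), len(child), len(grandchild))                 # 5 4 3
print(sum(root.mask), sum(child.mask), sum(grandchild.mask))  # 4 3 3
```

As in the dclab session, the length of each child equals the number of True entries in its parent's mask.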

Scripting goodies

Here are a few attributes and idioms that are useful for scripting with dclab.

# unique identifier of the RTDCBase instance (not reproducible)
In [19]: ds.identifier
Out[19]: 'mm-hdf5_e2d6dd4'

# reproducible hash of the dataset
In [20]: ds.hash
Out[20]: '8ff19f702a236cbf91e13667e144e722'

# dataset format
In [21]: ds.format
Out[21]: 'hdf5'

# available features
In [22]: ds.features
Out[22]: 
['area_cvx',
 'area_msd',
 'area_ratio',
 'area_um',
 'aspect',
 'bright_avg',
 'bright_sd',
 'circ',
 'deform',
 'frame',
 'index',
 'inert_ratio_cvx',
 'inert_ratio_raw',
 'nevents',
 'pos_x',
 'pos_y',
 'size_x',
 'size_y',
 'time']

# test feature availability (success)
In [23]: "area_um" in ds
Out[23]: True

# test feature availability (failure)
In [24]: "image" in ds
Out[24]: False

# accessing a feature and computing its mean
In [25]: ds["area_um"].mean()
Out[25]: 49.728645

# accessing the measurement configuration
In [26]: ds.config.keys()
Out[26]: dict_keys(['filtering', 'experiment', 'imaging', 'online_contour', 'setup'])

In [27]: ds.config["experiment"]
Out[27]: 
{'date': '2017-07-16',
 'event count': 5000,
 'run index': 1,
 'sample': 'docs-data',
 'time': '19:01:36'}

# determine the identifier of the hierarchy parent
In [28]: child.config["filtering"]["hierarchy parent"]
Out[28]: 'mm-hdf5_e2d6dd4'
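
The availability test and the dictionary-like access shown above combine naturally into a guard for optional features. The helper feature_mean below is hypothetical and relies only on `feature in ds` and `ds[feature]`; a plain dictionary stands in for a dataset so that the sketch runs on its own.

```python
def feature_mean(ds, feature, default=None):
    """Return the mean of `feature`, or `default` if it is missing.

    Hypothetical helper; `ds` can be any object that supports
    `feature in ds` and `ds[feature]`.
    """
    if feature not in ds:
        return default
    values = ds[feature]
    return sum(values) / len(values)


# a plain dictionary stands in for a dataset here
toy = {"area_um": [40.0, 50.0, 60.0]}
print(feature_mean(toy, "area_um"))  # 50.0
print(feature_mean(toy, "image"))    # None
```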

Statistics

The statistics module comes with a predefined set of methods to compute simple feature statistics.

In [29]: import dclab

In [30]: ds = dclab.new_dataset("data/example.rtdc")

In [31]: stats = dclab.statistics.get_statistics(ds,
   ....:                                         features=["deform", "aspect"],
   ....:                                         methods=["Mode", "Mean", "SD"])
   ....: 

In [32]: dict(zip(*stats))
Out[32]: 
{'Mode Deformation': 0.016635261,
 'Mean Deformation': 0.0287258,
 'SD Deformation': 0.028740086,
 'Mode Aspect ratio of bounding box': 1.1091421916433233,
 'Mean Aspect ratio of bounding box': 1.2719607587337494,
 'SD Aspect ratio of bounding box': 0.2523385371130096}
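
The idiom dict(zip(*stats)) suggests that the result consists of two sequences of equal length, one holding the column headers and one holding the values. The sketch below mimics such a pair with numbers copied from the output above; the exact return type of get_statistics is an assumption here.

```python
# values copied from the output above; `stats` mimics the
# (headers, values) pair that the call is assumed to return
headers = ["Mode Deformation", "Mean Deformation", "SD Deformation"]
values = [0.016635261, 0.0287258, 0.028740086]
stats = (headers, values)

# zip(*stats) is equivalent to zip(headers, values)
table = dict(zip(*stats))

for name, value in table.items():
    print(f"{name:<20} {value:.4f}")
```

Keeping headers and values as separate sequences makes it easy to write them as a table header and a table row, while the dictionary form is more convenient for lookups by name.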

Note that the statistics take into account the applied filters:

In [33]: ds.config["filtering"]["deform min"] = 0

In [34]: ds.config["filtering"]["deform max"] = .1

In [35]: ds.apply_filter()

In [36]: stats2 = dclab.statistics.get_statistics(ds,
   ....:                                          features=["deform", "aspect"],
   ....:                                          methods=["Mode", "Mean", "SD"])
   ....: 

In [37]: dict(zip(*stats2))
Out[37]: 
{'Mode Deformation': 0.017006295,
 'Mean Deformation': 0.02476519,
 'SD Deformation': 0.015638638,
 'Mode Aspect ratio of bounding box': 1.1232223188589807,
 'Mean Aspect ratio of bounding box': 1.240720618624576,
 'SD Aspect ratio of bounding box': 0.15993707940243287}

These are the available statistics methods:

In [38]: dclab.statistics.Statistics.available_methods.keys()
Out[38]: dict_keys(['Mean', 'Median', 'Mode', 'SD', 'Events', '%-gated', 'Flow rate'])

Export

The RTDCBase class has the attribute RTDCBase.export, which allows exporting event data to several data file formats. See export for more information.

In [39]: ds.export.tsv(path="export_example.tsv",
   ....:               features=["area_um", "deform"],
   ....:               filtered=True,
   ....:               override=True)
   ....: 

In [40]: ds.export.hdf5(path="export_example.rtdc",
   ....:                features=["area_um", "aspect", "deform"],
   ....:                filtered=True,
   ....:                override=True)
   ....: 

Note that data exported as HDF5 files can be loaded with dclab. Because only the filtered events were exported, the new dataset reproduces the previously computed statistics without any filters applied.

In [41]: ds2 = dclab.new_dataset("export_example.rtdc")

In [42]: ds2["deform"].mean()
Out[42]: 0.02476519