RT-DC datasets

Understanding the RT-DC dataset classes is an important prerequisite for working with dclab. They are all derived from RTDCBase, which provides read-only access to all features via a dictionary-like interface, facilitates data export and filtering, and comes with several convenience methods that are useful for data visualization. RT-DC datasets can be based on a data file format (RTDC_TDMS and RTDC_HDF5), accessed from an online repository (RTDC_DCOR), created from user-defined dictionaries (RTDC_Dict), or derived from other RT-DC datasets (RTDC_Hierarchy).

Opening a dataset

The convenience function dclab.new_dataset() takes care of determining the data format and returns the corresponding derived class.

In [1]: import dclab
   ...: import numpy as np
   ...: 

In [2]: ds = dclab.new_dataset("data/example.rtdc")

In [3]: ds.__class__.__name__
Out[3]: 'RTDC_HDF5'

Creating an in-memory dataset

It is also possible to load other data into dclab from a dictionary.

In [4]: data = dict(deform=np.random.rand(100),
   ...:             area_um=np.random.rand(100))
   ...: 

In [5]: ds_dict = dclab.new_dataset(data)

In [6]: ds_dict.__class__.__name__
Out[6]: 'RTDC_Dict'

If you would like to create your own .rtdc files, you can make use of the RTDCWriter class.

In [7]: with dclab.RTDCWriter("my-data.rtdc", mode="reset") as hw:
   ...:     hw.store_metadata({"experiment": {"sample": "my sample",
   ...:                                       "run index": 1}})
   ...:     hw.store_feature("deform", np.random.rand(100))
   ...:     hw.store_feature("area_um", np.random.rand(100))
   ...: 

In [8]: ds_custom = dclab.new_dataset("my-data.rtdc")

In [9]: print(ds_custom.features)
['area_um', 'deform', 'index']

In [10]: print(ds_custom.config["experiment"])
{'event count': 100, 'run index': 1, 'sample': 'my sample'}

Exporting data

The RTDCBase class has the attribute RTDCBase.export, which allows event data to be exported to several data file formats. See Export for more information.

In [11]: ds.export.tsv(path="export_example.tsv",
   ....:               features=["area_um", "deform"],
   ....:               filtered=True,
   ....:               override=True)
   ....: 

In [12]: ds.export.hdf5(path="export_example.rtdc",
   ....:                features=["area_um", "aspect", "deform"],
   ....:                filtered=True,
   ....:                override=True)
   ....: 

Note that data exported as HDF5 files can be loaded again with dclab, reproducing the previously computed statistics without the need to re-apply any filters.

In [13]: ds2 = dclab.new_dataset("export_example.rtdc")

In [14]: ds2["deform"][:].mean()
Out[14]: 0.0287258

Filtering (Gating)

Filters are used to mask unwanted events, e.g. debris or doublets, in a dataset.

# Restrict the deformation to the range from 0 to 0.15
In [15]: ds.config["filtering"]["deform min"] = 0

In [16]: ds.config["filtering"]["deform max"] = .15

# Manually excluding events using array indices is also possible:
# The writeable `ds.filter.manual` is a 1D boolean array of size `len(ds)`
# in which entries can be set to `False` manually, implying that these
# events are excluded.
# The following line sets the four events located at indices
# 0, 345, 400, and 1000 to False, so that they are excluded in
# `ds.filter.all` once `ds.apply_filter()` is called.
In [17]: ds.filter.manual[[0, 400, 345, 1000]] = False

In [18]: ds.apply_filter()

In [19]: assert not ds.filter.all[345]

# The read-only boolean array `ds.filter.all` represents the applied filter
# and can be used for indexing.
In [20]: ds["deform"][:].mean(), ds["deform"][ds.filter.all].mean()
Out[20]: (0.0287258, 0.026486598)

Note that ds.apply_filter() must be called, otherwise ds.filter.all is not updated.
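
How a minimum/maximum filter and the manual filter combine into a single boolean array can be pictured with plain NumPy. The following is only a conceptual sketch under stated assumptions, not dclab's internal implementation: the array `deform`, the helper `combined_filter`, and the random data are hypothetical stand-ins for a real dataset, and treating the limits as inclusive is an assumption.

```python
import numpy as np

# Hypothetical stand-in for ds["deform"]; not real measurement data
rng = np.random.default_rng(seed=0)
deform = rng.uniform(0.0, 0.3, size=1001)


def combined_filter(values, vmin, vmax, manual):
    """Combine a min/max filter with a manual boolean mask.

    An event is kept (True) only if its value lies within
    [vmin, vmax] (assumed inclusive) and it has not been
    excluded manually.
    """
    box = (values >= vmin) & (values <= vmax)
    return box & manual


# Manual mask: all events are kept except those at the listed indices
manual = np.ones(deform.size, dtype=bool)
manual[[0, 400, 345, 1000]] = False

mask = combined_filter(deform, vmin=0.0, vmax=0.15, manual=manual)

# The mask can be used for indexing, analogous to ds.filter.all
filtered_mean = deform[mask].mean()
```

In this picture, an event is kept only if it passes every individual filter, which is why a manually excluded event stays excluded regardless of its deformation value.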

Creating hierarchies

When applying filtering operations, it is sometimes helpful to use hierarchies for keeping track of the individual filtering steps.

In [21]: child = dclab.new_dataset(ds)

In [22]: child.config["filtering"]["area_um min"] = 0

In [23]: child.config["filtering"]["area_um max"] = 80

In [24]: grandchild = dclab.new_dataset(child)

In [25]: grandchild.rejuvenate()

In [26]: len(ds), len(child), len(grandchild)
Out[26]: (5000, 4933, 4778)

In [27]: ds.filter.all.sum(), child.filter.all.sum(), grandchild.filter.all.sum()
Out[27]: (4933, 4778, 4778)

Note that calling grandchild.rejuvenate() automatically calls child.rejuvenate() and ds.apply_filter(). Also note that, as expected, the size of each hierarchy child is identical to the sum of the boolean filtering array from its hierarchy parent.

Always make sure to call rejuvenate on the youngest members of your hierarchy (here grandchild) whenever you have changed a filter in the hierarchy or modified an ancillary feature or the dataset metadata/configuration. Otherwise, you cannot be sure that all information has propagated properly through your hierarchy (your grandchild might be an orphan).
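
The relation between the size of a hierarchy child and the filter of its parent can be illustrated with plain NumPy. This is a conceptual sketch only, not how dclab implements hierarchies: `parent`, `make_child`, and the random data are hypothetical, and the sketch copies the selected events once, whereas a real hierarchy child has to be updated with rejuvenate after each change.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
# Hypothetical parent data (stand-ins for ds["deform"] and ds["area_um"])
parent = {
    "deform": rng.uniform(0.0, 0.3, size=500),
    "area_um": rng.uniform(20.0, 120.0, size=500),
}


def make_child(data, mask):
    """Return a feature dictionary holding only the events where mask is True."""
    return {name: values[mask] for name, values in data.items()}


# Filter set in the parent: deformation between 0 and 0.15
parent_mask = (parent["deform"] >= 0) & (parent["deform"] <= 0.15)
child = make_child(parent, parent_mask)

# Filter set in the child: area between 0 and 80
child_mask = (child["area_um"] >= 0) & (child["area_um"] <= 80)
grandchild = make_child(child, child_mask)

# Sizes of parent, child, and grandchild
sizes = (len(parent["deform"]), len(child["deform"]), len(grandchild["deform"]))
```

Here, the second entry of `sizes` equals `parent_mask.sum()` and the third equals `child_mask.sum()`, mirroring the relation between `len(...)` and `filter.all.sum()` in the session above.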

Computing feature statistics

The statistics module comes with a predefined set of methods to compute simple feature statistics.

In [28]: import dclab

In [29]: ds = dclab.new_dataset("data/example.rtdc")

In [30]: stats = dclab.statistics.get_statistics(ds,
   ....:                                         features=["deform", "aspect"],
   ....:                                         methods=["Mode", "Mean", "SD"])
   ....: 

In [31]: dict(zip(*stats))
Out[31]: 
{'Mode Deformation': 0.016635261,
 'Mean Deformation': 0.0287258,
 'SD Deformation': 0.028740086,
 'Mode Aspect ratio of bounding box': 1.1091421916433233,
 'Mean Aspect ratio of bounding box': 1.2719607587337494,
 'SD Aspect ratio of bounding box': 0.2523385371130096}

Note that the statistics take into account the applied filters:

In [32]: ds.config["filtering"]["deform min"] = 0

In [33]: ds.config["filtering"]["deform max"] = .1

In [34]: ds.apply_filter()

In [35]: stats2 = dclab.statistics.get_statistics(ds,
   ....:                                          features=["deform", "aspect"],
   ....:                                          methods=["Mode", "Mean", "SD"])
   ....: 

In [36]: dict(zip(*stats2))
Out[36]: 
{'Mode Deformation': 0.017006295,
 'Mean Deformation': 0.02476519,
 'SD Deformation': 0.015638638,
 'Mode Aspect ratio of bounding box': 1.1232223188589807,
 'Mean Aspect ratio of bounding box': 1.240720618624576,
 'SD Aspect ratio of bounding box': 0.15993707940243287}

These are the available statistics methods:

In [37]: dclab.statistics.Statistics.available_methods.keys()
Out[37]: dict_keys(['Mean', 'Median', 'Mode', 'SD', 'Events', '%-gated', 'Flow rate'])
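
To make the effect of filters on such statistics concrete, here is a minimal NumPy sketch. It is not dclab's implementation: the helper `simple_statistics` and the random data are hypothetical, and the definitions used here (population standard deviation for "SD", the number of unfiltered events for "Events", and the percentage of events passing the filter for "%-gated") are assumptions about what these method names mean.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
# Hypothetical stand-ins for ds["deform"] and ds.filter.all
deform = rng.uniform(0.0, 0.3, size=1000)
filter_all = deform <= 0.1


def simple_statistics(values, mask):
    """Compute a few summary statistics over the events selected by mask."""
    selected = values[mask]
    return {
        "Mean": float(np.mean(selected)),
        "Median": float(np.median(selected)),
        "SD": float(np.std(selected)),  # population SD (assumption)
        "Events": int(mask.sum()),
        "%-gated": 100.0 * mask.sum() / mask.size,
    }


# Statistics without any filter (all events selected) and with the filter
stats_all = simple_statistics(deform, np.ones(deform.size, dtype=bool))
stats_filtered = simple_statistics(deform, filter_all)
```

Comparing `stats_all` with `stats_filtered` shows the same kind of shift as between the two sessions above: restricting the deformation range lowers both the mean and the standard deviation.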

Commonly used scripting examples

Here are a few attributes and operations that are useful for scripting with dclab.

# unique identifier of the RTDCBase instance (not reproducible)
In [38]: ds.identifier
Out[38]: 'mm-hdf5_c3b57ec'

# reproducible hash of the dataset
In [39]: ds.hash
Out[39]: 'e9726c7be19cde2cc415a56a69272a19'

# dataset format
In [40]: ds.format
Out[40]: 'hdf5'

# all available features
In [41]: ds.features
Out[41]: 
['area_cvx',
 'area_msd',
 'area_ratio',
 'area_um',
 'aspect',
 'bright_avg',
 'bright_sd',
 'circ',
 'circ_times_area',
 'deform',
 'frame',
 'index',
 'inert_ratio_cvx',
 'inert_ratio_raw',
 'nevents',
 'pos_x',
 'pos_y',
 'size_x',
 'size_y',
 'time']

# scalar (one number per event) features
In [42]: ds.features_scalar
Out[42]: 
['area_cvx',
 'area_msd',
 'area_ratio',
 'area_um',
 'aspect',
 'bright_avg',
 'bright_sd',
 'circ',
 'circ_times_area',
 'deform',
 'frame',
 'index',
 'inert_ratio_cvx',
 'inert_ratio_raw',
 'nevents',
 'pos_x',
 'pos_y',
 'size_x',
 'size_y',
 'time']

# innate (present in the underlying data file) features
In [43]: ds.features_innate
Out[43]: 
['area_cvx',
 'area_msd',
 'bright_avg',
 'bright_sd',
 'circ',
 'frame',
 'inert_ratio_cvx',
 'inert_ratio_raw',
 'nevents',
 'pos_x',
 'pos_y',
 'size_x',
 'size_y']

# loaded (innate and computed ancillaries) features
In [44]: ds.features_loaded
Out[44]: 
['area_cvx',
 'area_msd',
 'area_ratio',
 'area_um',
 'aspect',
 'bright_avg',
 'bright_sd',
 'circ',
 'deform',
 'frame',
 'index',
 'inert_ratio_cvx',
 'inert_ratio_raw',
 'nevents',
 'pos_x',
 'pos_y',
 'size_x',
 'size_y',
 'time']

# test feature availability (success)
In [45]: "area_um" in ds
Out[45]: True

# test feature availability (failure)
In [46]: "image" in ds
Out[46]: False

# accessing a feature and computing its mean
In [47]: ds["area_um"][:].mean()
Out[47]: 49.728645

# accessing the measurement configuration
In [48]: ds.config.keys()
Out[48]: KeysView({'filtering': {'remove invalid events': False, 'enable filters': True, 'limit events': 0, 'polygon filters': [], 'hierarchy parent': 'none', 'deform min': 0, 'deform max': 0.1}, 'experiment': {'date': '2017-07-16', 'event count': 5000, 'run index': 1, 'sample': 'docs-data', 'time': '19:01:36'}, 'imaging': {'flash device': 'LED (undefined)', 'flash duration': 2.0, 'frame rate': 2000.0, 'pixel size': 0.34, 'roi position x': 504, 'roi position y': 472, 'roi size x': 256, 'roi size y': 96}, 'online_contour': {'bin area min': 50, 'bin kernel': 5, 'bin threshold': 6, 'image blur': 0, 'no absdiff': True}, 'setup': {'channel width': 20.0, 'chip region': 'channel', 'flow rate': 0.06, 'flow rate sample': 0.041999999999999996, 'flow rate sheath': 0.018, 'identifier': 'ZMDD-AcC-8ecba5-cd57e2', 'medium': 'CellCarrier', 'module composition': 'Cell_Flow_2 + Fluor', 'software version': 'dclab 0.5.2.dev13'}})

In [49]: ds.config["experiment"]
Out[49]: {'date': '2017-07-16', 'event count': 5000, 'run index': 1, 'sample': 'docs-data', 'time': '19:01:36'}

# determine the identifier of the hierarchy parent
In [50]: child.config["filtering"]["hierarchy parent"]
Out[50]: 'mm-hdf5_95c8025'