Using DC data
Knowing and understanding the RT-DC dataset classes is an important
prerequisite when working with dclab. They are all derived from RTDCBase,
which allows read-only access to all features with a dictionary-like interface,
facilitates data export or filtering, and comes with several convenience
methods that are useful for data visualization.
RT-DC datasets can be based on a data file format (RTDC_TDMS and RTDC_HDF5),
accessed from an online repository (RTDC_DCOR or RTDC_S3), created from
user-defined dictionaries (RTDC_Dict), or derived from other RT-DC datasets
(RTDC_Hierarchy).
Opening a file
The convenience function dclab.new_dataset() takes care of determining
the data format and returns the corresponding derived class.
In [1]: import dclab
In [2]: ds = dclab.new_dataset("data/example.rtdc")
In [3]: ds.__class__.__name__
Out[3]: 'RTDC_HDF5'
Creating an in-memory dataset
It is also possible to load other data into dclab from a dictionary. The example below assumes that NumPy has already been imported with `import numpy as np`.
In [4]: data = dict(deform=np.random.rand(100),
...: area_um=np.random.rand(100))
...:
In [5]: ds_dict = dclab.new_dataset(data)
In [6]: ds_dict.__class__.__name__
Out[6]: 'RTDC_Dict'
Filtering (Gating)
Filters are used to mask e.g. debris or doublets from a dataset.
# Restrict the deformation to the range from 0 to 0.15
In [7]: ds.config["filtering"]["deform min"] = 0
In [8]: ds.config["filtering"]["deform max"] = .15
# Manually excluding events using array indices is also possible:
# The writeable `ds.filter.manual` is a 1D boolean array of size `len(ds)`
# in which entries can be set to `False` to exclude the corresponding events.
# The following line sets the four events located at indices
# 0, 345, 400, and 1000 to False, so that they are also excluded in
# `ds.filter.all` once `ds.apply_filter()` is called.
In [9]: ds.filter.manual[[0, 400, 345, 1000]] = False
In [10]: ds.apply_filter()
In [11]: assert not ds.filter.all[345]
# The read-only boolean array `ds.filter.all` represents the applied filter
# and can be used for indexing.
In [12]: ds["deform"][:].mean(), ds["deform"][ds.filter.all].mean()
Out[12]: (np.float32(0.0287258), np.float32(0.026486598))
Note that ds.apply_filter() must be called, otherwise ds.filter.all is not updated.
Creating hierarchies
When applying filtering operations, it is sometimes helpful to use hierarchies for keeping track of the individual filtering steps.
In [13]: child = dclab.new_dataset(ds)
In [14]: child.config["filtering"]["area_um min"] = 0
In [15]: child.config["filtering"]["area_um max"] = 80
In [16]: grandchild = dclab.new_dataset(child)
In [17]: grandchild.rejuvenate()
In [18]: len(ds), len(child), len(grandchild)
Out[18]: (5000, 4933, 4778)
In [19]: ds.filter.all.sum(), child.filter.all.sum(), grandchild.filter.all.sum()
Out[19]: (np.int64(4933), np.int64(4778), np.int64(4778))
Note that calling grandchild.rejuvenate() automatically calls
child.rejuvenate() and ds.apply_filter(). Also note that,
as expected, the size of each hierarchy child is identical to the sum of the
boolean filtering array from its hierarchy parent.
Always make sure to call rejuvenate on the youngest member of your hierarchy (here grandchild) whenever you have changed a filter in the hierarchy or modified an ancillary feature or the dataset metadata/configuration. Otherwise, you cannot be sure that all information has properly propagated through your hierarchy (your grandchild might be an orphan).
Computing feature statistics
The statistics module comes with a predefined set of methods to compute simple feature statistics.
In [20]: import dclab
In [21]: ds = dclab.new_dataset("data/example.rtdc")
In [22]: stats = dclab.statistics.get_statistics(ds,
....: features=["deform", "aspect"],
....: methods=["Mode", "Mean", "SD"])
....:
In [23]: dict(zip(*stats))
Out[23]:
{'Mode Deformation': np.float32(0.01663526),
'Mean Deformation': np.float32(0.0287258),
'SD Deformation': np.float32(0.028740086),
'Mode Aspect ratio of bounding box': np.float64(1.1091421916433233),
'Mean Aspect ratio of bounding box': np.float64(1.2719607587337494),
'SD Aspect ratio of bounding box': np.float64(0.2523385371130096)}
Note that the statistics take into account the applied filters:
In [24]: ds.config["filtering"]["deform min"] = 0
In [25]: ds.config["filtering"]["deform max"] = .1
In [26]: ds.apply_filter()
In [27]: stats2 = dclab.statistics.get_statistics(ds,
....: features=["deform", "aspect"],
....: methods=["Mode", "Mean", "SD"])
....:
In [28]: dict(zip(*stats2))
Out[28]:
{'Mode Deformation': np.float32(0.017006295),
'Mean Deformation': np.float32(0.02476519),
'SD Deformation': np.float32(0.015638638),
'Mode Aspect ratio of bounding box': np.float64(1.1232223188589807),
'Mean Aspect ratio of bounding box': np.float64(1.240720618624576),
'SD Aspect ratio of bounding box': np.float64(0.15993707940243287)}
These are the available statistics methods:
In [29]: dclab.statistics.Statistics.available_methods.keys()
Out[29]: dict_keys(['Mean', 'Median', 'Mode', 'SD', 'Events', '%-gated', 'Flow rate'])
Commonly used scripting examples
Here are a few attributes and operations that are useful when scripting with dclab.
# unique identifier of the RTDCBase instance (not reproducible)
In [30]: ds.identifier
Out[30]: 'mm-hdf5_44d56b2'
# reproducible hash of the dataset
In [31]: ds.hash
Out[31]: 'e9726c7be19cde2cc415a56a69272a19'
# dataset format
In [32]: ds.format
Out[32]: 'hdf5'
# all available features
In [33]: ds.features
Out[33]:
['area_cvx',
'area_msd',
'area_ratio',
'area_um',
'aspect',
'bright_avg',
'bright_sd',
'circ',
'deform',
'frame',
'index',
'inert_ratio_cvx',
'inert_ratio_raw',
'nevents',
'pos_x',
'pos_y',
'size_x',
'size_y',
'time']
# scalar (one number per event) features
In [34]: ds.features_scalar
Out[34]:
['area_cvx',
'area_msd',
'area_ratio',
'area_um',
'aspect',
'bright_avg',
'bright_sd',
'circ',
'deform',
'frame',
'index',
'inert_ratio_cvx',
'inert_ratio_raw',
'nevents',
'pos_x',
'pos_y',
'size_x',
'size_y',
'time']
# innate (present in the underlying data file) features
In [35]: ds.features_innate
Out[35]:
['area_cvx',
'area_msd',
'bright_avg',
'bright_sd',
'circ',
'frame',
'inert_ratio_cvx',
'inert_ratio_raw',
'nevents',
'pos_x',
'pos_y',
'size_x',
'size_y']
# loaded (innate and computed ancillaries) features
In [36]: ds.features_loaded
Out[36]:
['area_cvx',
'area_msd',
'area_ratio',
'area_um',
'aspect',
'bright_avg',
'bright_sd',
'circ',
'deform',
'frame',
'index',
'inert_ratio_cvx',
'inert_ratio_raw',
'nevents',
'pos_x',
'pos_y',
'size_x',
'size_y',
'time']
# test feature availability (success)
In [37]: "area_um" in ds
Out[37]: True
# test feature availability (failure)
In [38]: "image" in ds
Out[38]: False
# accessing a feature and computing its mean
In [39]: ds["area_um"][:].mean()
Out[39]: np.float32(49.728645)
# accessing the measurement configuration
In [40]: ds.config.keys()
Out[40]: KeysView({'filtering': {'remove invalid events': False, 'enable filters': True, 'limit events': 0, 'polygon filters': [], 'hierarchy parent': 'none', 'deform min': 0, 'deform max': 0.1}, 'experiment': {'date': '2017-07-16', 'event count': 5000, 'run index': 1, 'sample': 'docs-data', 'time': '19:01:36'}, 'imaging': {'flash device': 'LED (undefined)', 'flash duration': 2.0, 'frame rate': 2000.0, 'pixel size': 0.34, 'roi position x': 504, 'roi position y': 472, 'roi size x': 256, 'roi size y': 96}, 'online_contour': {'bin area min': 50, 'bin kernel': 5, 'bin threshold': 6, 'image blur': 0, 'no absdiff': True}, 'setup': {'channel width': 20.0, 'chip region': 'channel', 'flow rate': 0.06, 'flow rate sample': 0.041999999999999996, 'flow rate sheath': 0.018, 'identifier': 'ZMDD-AcC-8ecba5-cd57e2', 'medium': 'CellCarrier', 'module composition': 'Cell_Flow_2 + Fluor', 'software version': 'dclab 0.5.2.dev13'}})
In [41]: ds.config["experiment"]
Out[41]: {'date': '2017-07-16', 'event count': 5000, 'run index': 1, 'sample': 'docs-data', 'time': '19:01:36'}
# determine the identifier of the hierarchy parent
In [42]: child.config["filtering"]["hierarchy parent"]
Out[42]: 'mm-hdf5_da19157'