RT-DC datasets
Knowing and understanding the RT-DC dataset classes is an important
prerequisite when working with dclab. They are all derived from RTDCBase,
which gives access to features with a dictionary-like interface,
facilitates data export and filtering, and comes with several convenience
methods that are useful for data visualization.
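To make "dictionary-like interface" concrete, here is a minimal sketch of a class that supports item access, membership tests, and `len()`. The class `FeatureStore` is hypothetical and only illustrates the access pattern; it is not part of dclab and does not reflect how RTDCBase is implemented.

```python
import numpy as np


class FeatureStore:
    """Hypothetical stand-in for a dataset; not part of dclab."""

    def __init__(self, features):
        # Map feature name -> 1D array with one value per event
        self._features = {k: np.asarray(v) for k, v in features.items()}

    def __getitem__(self, name):
        # store["deform"] returns the array of that feature
        return self._features[name]

    def __contains__(self, name):
        # "deform" in store tests feature availability
        return name in self._features

    def __len__(self):
        # Number of events, taken from the first feature (0 if empty)
        first = next(iter(self._features.values()), None)
        return 0 if first is None else len(first)


store = FeatureStore({"deform": [0.01, 0.02, 0.03],
                      "area_um": [40.0, 50.0, 60.0]})
has_deform = "deform" in store
has_image = "image" in store
mean_area = store["area_um"].mean()
```

The same three operations (indexing by feature name, `in`, and `len()`) are the ones used on `ds` throughout this page.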
RT-DC datasets can be based on a data file format (RTDC_TDMS and
RTDC_HDF5), created from user-defined dictionaries (RTDC_Dict), or derived
from other RT-DC datasets (RTDC_Hierarchy).
Basic usage
The convenience function dclab.new_dataset() determines the data file
format (tdms or hdf5) and returns an instance of the corresponding derived
class.
In [1]: import dclab
In [2]: ds = dclab.new_dataset("data/example.rtdc")
In [3]: ds.__class__.__name__
Out[3]: 'RTDC_HDF5'
Working with other data
It is also possible to load other data into dclab from a dictionary.
In [4]: import numpy as np
...: data = dict(deform=np.random.rand(100),
...: area_um=np.random.rand(100))
...:
In [5]: ds_dict = dclab.new_dataset(data)
In [6]: ds_dict.__class__.__name__
Out[6]: 'RTDC_Dict'
Using filters
Filters are used to mask e.g. debris or doublets from a dataset.
# Restrict the deformation to a maximum of 0.15
In [7]: ds.config["filtering"]["deform max"] = .15
# Manually excluding events using array indices is also possible:
# `ds.filter.manual` is a 1D boolean array of size `len(ds)`
# where `False` values mean that the events are excluded.
In [8]: ds.filter.manual[[0, 400, 345, 1000]] = False
In [9]: ds.apply_filter()
# The boolean array `ds.filter.all` represents the applied filter
# and can be used for indexing.
In [10]: ds["deform"].mean(), ds["deform"][ds.filter.all].mean()
Out[10]: (0.0287258, 0.026486598)
Note that ds.apply_filter() must be called, otherwise ds.filter.all will
not be updated.
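The need to re-apply filters can be pictured with plain NumPy arrays. The sketch below is not dclab's implementation; it uses synthetic data and assumes that the maximum is inclusive. It only shows that a combined boolean mask is a snapshot of the settings at the time it was computed.

```python
import numpy as np

rng = np.random.default_rng(42)
deform = rng.random(1000) * 0.3       # synthetic deformation values

# Box filter: keep events with deformation <= 0.15 (inclusive by assumption)
box_mask = deform <= 0.15

# Manual filter: True keeps an event, False excludes it
manual_mask = np.ones(len(deform), dtype=bool)
manual_mask[[0, 400, 345, 999]] = False

# The combined mask is a snapshot computed from the current settings
combined = box_mask & manual_mask

# Changing a setting afterwards does not touch the snapshot ...
manual_mask[1] = False
stale_count = combined.sum()

# ... until the mask is computed again
combined = box_mask & manual_mask
filtered_mean = deform[combined].mean()
```

In this picture, `combined` plays the role of `ds.filter.all`, and recomputing it corresponds to calling `ds.apply_filter()`.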
Creating hierarchies
When applying filtering operations, it is sometimes helpful to use hierarchies for keeping track of the individual filtering steps.
In [11]: child = dclab.new_dataset(ds)
In [12]: child.config["filtering"]["area_um max"] = 80
In [13]: grandchild = dclab.new_dataset(child)
In [14]: grandchild.apply_filter()
In [15]: len(ds), len(child), len(grandchild)
Out[15]: (5000, 4933, 4778)
In [16]: ds.filter.all.sum(), child.filter.all.sum(), grandchild.filter.all.sum()
Out[16]: (4933, 4778, 4778)
Note that calling grandchild.apply_filter() automatically calls
child.apply_filter() and ds.apply_filter(). Also note that, as expected,
the size of each hierarchy child is identical to the sum of the boolean
filtering array from its hierarchy parent.
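The relation between child size and parent filter can be reproduced with plain NumPy arrays. This is a sketch on synthetic data with arbitrary thresholds, not dclab's hierarchy implementation: each level simply holds the events that pass the filter of the level above it.

```python
import numpy as np

rng = np.random.default_rng(0)
parent_area = rng.uniform(20, 120, size=5000)   # synthetic area_um values

# Parent filter: arbitrary example criterion
parent_mask = parent_area <= 100

# The child only contains events that pass the parent's filter
child_area = parent_area[parent_mask]
child_mask = child_area <= 80

# The grandchild only contains events that pass the child's filter
grandchild_area = child_area[child_mask]
grandchild_mask = np.ones(len(grandchild_area), dtype=bool)  # no filter set

sizes = (len(parent_area), len(child_area), len(grandchild_area))
passed = (parent_mask.sum(), child_mask.sum(), grandchild_mask.sum())
```

Here `sizes[1] == passed[0]` and `sizes[2] == passed[1]`, which mirrors the pattern in the two output tuples above.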
Scripting goodies
Here are a few attributes and operations that are useful when scripting with dclab.
# unique identifier of the RTDCBase instance (not reproducible)
In [17]: ds.identifier
Out[17]: 'mm-hdf5_796563a'
# reproducible hash of the dataset
In [18]: ds.hash
Out[18]: '8ff19f702a236cbf91e13667e144e722'
# dataset format
In [19]: ds.format
Out[19]: 'hdf5'
# available features
In [20]: ds.features
Out[20]:
['area_cvx',
'area_msd',
'area_ratio',
'area_um',
'aspect',
'bright_avg',
'bright_sd',
'circ',
'deform',
'frame',
'index',
'inert_ratio_cvx',
'inert_ratio_raw',
'nevents',
'pos_x',
'pos_y',
'size_x',
'size_y',
'time']
# test feature availability (success)
In [21]: "area_um" in ds
Out[21]: True
# test feature availability (failure)
In [22]: "image" in ds
Out[22]: False
# accessing a feature and computing its mean
In [23]: ds["area_um"].mean()
Out[23]: 49.728645
# accessing the measurement configuration
In [24]: ds.config.keys()
Out[24]: dict_keys(['filtering', 'experiment', 'imaging', 'online_contour', 'setup'])
In [25]: ds.config["experiment"]
Out[25]:
{'date': '2017-07-16',
'event count': 5000,
'run index': 1,
'sample': 'docs-data',
'time': '19:01:36'}
# determine the identifier of the hierarchy parent
In [26]: child.config["filtering"]["hierarchy parent"]
Out[26]: 'mm-hdf5_796563a'
Statistics
The statistics module comes with a predefined set of methods to compute simple feature statistics.
In [27]: import dclab
In [28]: ds = dclab.new_dataset("data/example.rtdc")
In [29]: stats = dclab.statistics.get_statistics(ds,
....: features=["deform", "aspect"],
....: methods=["Mode", "Mean", "SD"])
....:
In [30]: dict(zip(*stats))
Out[30]:
{'Mode Deformation': 0.016635261,
'Mean Deformation': 0.0287258,
'SD Deformation': 0.028740086,
'Mode Aspect ratio of bounding box': 1.1091421916433233,
'Mean Aspect ratio of bounding box': 1.2719607587337494,
'SD Aspect ratio of bounding box': 0.2523385371130096}
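The `dict(zip(*stats))` idiom used above works for any pair of parallel sequences. The sketch below assumes that the result consists of a sequence of names followed by a sequence of values, which is what the output above suggests; the numbers are copied from that output.

```python
# Parallel sequences: names first, values second
stats = (
    ["Mode Deformation", "Mean Deformation", "SD Deformation"],
    [0.016635261, 0.0287258, 0.028740086],
)

# zip(*stats) is the same as zip(stats[0], stats[1]):
# it pairs each name with the value at the same position
as_dict = dict(zip(*stats))
mean_deform = as_dict["Mean Deformation"]
```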
Note that the statistics take into account the applied filters:
In [31]: ds.config["filtering"]["deform max"] = .1
In [32]: ds.apply_filter()
In [33]: stats2 = dclab.statistics.get_statistics(ds,
....: features=["deform", "aspect"],
....: methods=["Mode", "Mean", "SD"])
....:
In [34]: dict(zip(*stats2))
Out[34]:
{'Mode Deformation': 0.017006295,
'Mean Deformation': 0.02476519,
'SD Deformation': 0.015638638,
'Mode Aspect ratio of bounding box': 1.1232223188589807,
'Mean Aspect ratio of bounding box': 1.240720618624576,
'SD Aspect ratio of bounding box': 0.15993707940243287}
These are the available statistics methods:
In [35]: dclab.statistics.Statistics.available_methods.keys()
Out[35]: dict_keys(['Mean', 'Median', 'Mode', 'SD', 'Events', '%-gated', 'Flow rate'])
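As a rough illustration of how a filter changes such statistics, the following sketch computes a few of them with NumPy on synthetic data. It is not dclab's implementation: the function `simple_stats` is hypothetical, the standard deviation uses NumPy's default (population) convention, and "%-gated" is assumed here to mean the percentage of events that pass the filter.

```python
import numpy as np

rng = np.random.default_rng(1)
deform = rng.random(2000) * 0.2       # synthetic deformation values
mask = deform <= 0.1                  # stand-in for an applied filter


def simple_stats(values, mask):
    """Compute a few statistics over the events that pass the mask."""
    kept = values[mask]
    return {
        "Mean": float(np.mean(kept)),
        "Median": float(np.median(kept)),
        "SD": float(np.std(kept)),
        "Events": int(mask.sum()),
        # Assumption: percentage of events that pass the filter
        "%-gated": float(100.0 * mask.sum() / len(mask)),
    }


unfiltered = simple_stats(deform, np.ones(len(deform), dtype=bool))
filtered = simple_stats(deform, mask)
```

Comparing `unfiltered` and `filtered` shows the same kind of shift as the two `get_statistics` outputs above: restricting the maximum deformation lowers the mean and the standard deviation.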
Export
The RTDCBase class has the attribute RTDCBase.export, which allows
exporting event data to several data file formats. See export for more
information.
In [36]: ds.export.tsv(path="export_example.tsv",
....: features=["area_um", "deform"],
....: filtered=True,
....: override=True)
....:
In [37]: ds.export.hdf5(path="export_example.rtdc",
....: features=["area_um", "aspect", "deform"],
....: filtered=True,
....: override=True)
....:
Note that data exported as HDF5 files can be loaded with dclab. Because only the filtered events were exported, the previously computed statistics are reproduced without setting any filters.
In [38]: ds2 = dclab.new_dataset("export_example.rtdc")
In [39]: ds2["deform"].mean()
Out[39]: 0.02476519
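Why a filtered export reproduces the filtered statistics can be sketched with NumPy alone. The example below is not dclab's exporter; the file name and data are made up. It writes only the events that pass a mask to a tab-separated file and reads them back.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(2)
area_um = rng.uniform(20, 120, size=200)   # synthetic features
deform = rng.random(200) * 0.2
mask = deform <= 0.1                       # stand-in for an applied filter

# Stack only the events that pass the filter, one column per feature
columns = np.column_stack([area_um[mask], deform[mask]])

# Write a header line with the feature names, then one row per event
path = os.path.join(tempfile.mkdtemp(), "export_sketch.tsv")
np.savetxt(path, columns, delimiter="\t",
           header="area_um\tdeform", comments="")

# Reading the file back yields the filtered events without any mask
loaded = np.loadtxt(path, delimiter="\t", skiprows=1)
```

The excluded events never reach the file, so `loaded[:, 1].mean()` matches `deform[mask].mean()` without any filter being set on the loaded data.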