Machine learning
To simplify machine-learning (ML) tasks in the context of RT-DC, dclab offers
a few convenience methods. This section describes the recommended way
of implementing and distributing ML models based on RT-DC data. Please
make sure that you have installed dclab with the ml extra
(pip install dclab[ml]).
Using models in dclab
For RT-DC analysis, the most common task for ML is to determine the probability for a specific event (e.g. a cell) to belong to a specific class (e.g. red blood cell). Since RT-DC data always has a very specific format, it is worthwhile to standardize this regression/classification process.
In dclab, you do not directly use the bare models that you would get from,
e.g., tensorflow/keras. Instead, models are wrapped via a specific
dclab.rtdc_dataset.feat_anc_ml.ml_model.BaseModel class that holds
additional information about the features from which and to which a model maps.
For instance, a model might have the inputs deform and area_um and make
predictions regarding a defined output feature, e.g. ml_score_rbc.
Output features for machine learning are always of the form ml_score_xxx
where x can be any alphanumeric character (you are free to choose).
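The naming rule can be checked before a model is wrapped. The following is a minimal sketch, not dclab's own validation: it assumes that ml_score_xxx means the prefix ml_score_ followed by exactly three ASCII letters or digits, and the helper name is_ml_score_name is made up for this illustration. dclab itself may apply stricter rules (for example regarding letter case).

```python
import re

# Assumption: "ml_score_xxx" means the prefix "ml_score_" followed by
# exactly three ASCII letters or digits. dclab's own check may be stricter.
ML_SCORE_PATTERN = re.compile(r"ml_score_[A-Za-z0-9]{3}")


def is_ml_score_name(name):
    """Return True if `name` matches the assumed ml_score_xxx pattern."""
    return ML_SCORE_PATTERN.fullmatch(name) is not None


# Print the check result for a few candidate feature names.
for candidate in ["ml_score_rbc", "ml_score_r2d", "ml_score_rb", "score_rbc"]:
    print(candidate, is_ml_score_name(candidate))
```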
import dclab
from dclab.rtdc_dataset.feat_anc_ml import hook_tensorflow
import tensorflow as tf

# do your magic
bare_model = tf.keras.Sequential(...)
bare_model.compile(...)
bare_model.fit(...)

# create a dclab model
dc_model = hook_tensorflow.TensorflowModel(
    bare_model=bare_model,
    inputs=["deform", "area_um"],
    outputs=["ml_score_rbc"],
    info={
        "description": "RBC identification",
        "output labels": ["Red Blood Cells"]
    }
)

# once you get here, you can use your model directly for inference
ds = dclab.new_dataset("path/to/a/dataset")
# `dc_model.predict(ds)` returns a dictionary with the key "ml_score_rbc"
# mapping to a 1D ndarray of length `len(ds)`, holding the probability data.
prediction = dc_model.predict(ds)["ml_score_rbc"]
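To make the idea of the wrapper more concrete, here is a minimal, library-free sketch of a class that remembers which input features a model consumes and which output features it produces. The class SketchModel, the function toy_model, and the threshold rule are all hypothetical and serve only as an illustration; this is not how dclab implements BaseModel, and a plain dictionary stands in for an RT-DC dataset.

```python
class SketchModel:
    """Hypothetical stand-in for a wrapped model (not part of dclab)."""

    def __init__(self, bare_model, inputs, outputs):
        self.bare_model = bare_model  # callable: list of rows -> list of rows
        self.inputs = list(inputs)    # names of the input features
        self.outputs = list(outputs)  # names of the output features

    def predict(self, ds):
        # Collect one row of input feature values per event.
        columns = [ds[name] for name in self.inputs]
        rows = [list(row) for row in zip(*columns)]
        scores = self.bare_model(rows)
        # Return one list per output feature, keyed by feature name.
        return {name: [row[i] for row in scores]
                for i, name in enumerate(self.outputs)}


def toy_model(rows):
    # Made-up rule for illustration: score 1.0 if deform < 0.05, else 0.0.
    return [[1.0 if deform < 0.05 else 0.0] for deform, _area in rows]


# A plain dictionary stands in for an RT-DC dataset here.
ds = {"deform": [0.01, 0.10, 0.03], "area_um": [40.0, 85.0, 55.0]}
model = SketchModel(toy_model, inputs=["deform", "area_um"],
                    outputs=["ml_score_rbc"])
prediction = model.predict(ds)["ml_score_rbc"]
print(prediction)
```

Keeping the feature names next to the model is what allows the same object to be applied to any dataset that provides those input features.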
For user convenience, a model can also be registered with dclab as an ancillary feature.
dclab.MachineLearningFeature(feature_name="ml_score_rbc",
                             dc_model=dc_model)
prediction = ds["ml_score_rbc"] # same result as above
Please have a look at this example to see dclab models in action.
The .modc file format
The .modc file format is not a reinvention of the wheel. It is merely
a wrapper around other ML file formats and describes which input
features (e.g. deform, area_um, image, etc.) a machine learning
method maps onto which output features (e.g. ml_score_rbc). A .modc file is
just a .zip file containing an index.json file that lists all
models. A model may be stored in multiple file formats (e.g. as a
tensorflow SavedModel and as a Frozen Graph). Alongside the models, the
.modc file format also contains human-readable versions of the output
features, SHA256 checksums, and the creation date:
example.modc (ZIP file contents)
├── index.json
├── model_0
│   ├── another-format
│   │   └── another_formats_file.suffix
│   └── tensorflow-SavedModel.tf
│       ├── assets
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index
└── model_1
    └── tensorflow-SavedModel.tf
        ├── assets
        ├── saved_model.pb
        └── variables
            ├── variables.data-00000-of-00001
            └── variables.index
The corresponding index.json file could look like this:
{
  "model count": 2,
  "models": [
    {
      "date": "2020-11-03 17:01",
      "description": "Determine sensitivity X",
      "formats": {
        "tensorflow-SavedModel": "tensorflow-SavedModel.tf",
        "library-OtherFormat": "another-format"
      },
      "index": 0,
      "input features": [
        "deform"
      ],
      "output features": [
        "ml_score_low",
        "ml_score_hig"
      ],
      "output labels": [
        "Low",
        "High"
      ],
      "path": "model_0",
      "sha256": "ec11c73ae870da4551d9fa9cc73271566b8f2356f284d4c2cb02057ecb5bf6ce"
    },
    {
      "date": "2020-11-03 17:02",
      "description": "Find RBCs and sad cells",
      "formats": {
        "tensorflow-SavedModel": "tensorflow-SavedModel.tf"
      },
      "index": 1,
      "input features": [
        "area_um",
        "image"
      ],
      "output features": [
        "ml_score_rbc",
        "ml_score_sad"
      ],
      "output labels": [
        "red blood cells",
        "sad cells"
      ],
      "path": "model_1",
      "sha256": "ac43c73ae870da4551d9fa9cc73271566b8f2356f284d4c2cb02057ecb5ba812"
    }
  ]
}
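Since a .modc file is a ZIP archive with an index.json, its index can be inspected with the Python standard library alone. The sketch below builds a small stand-in archive in memory so that it runs without dclab or a real model; the index is a shortened version of the layout shown above, and the archive contains no model data. For actual use, the dclab functions for loading .modc files are the appropriate route.

```python
import io
import json
import zipfile

# Toy index using key names from the example above (values shortened).
index = {
    "model count": 1,
    "models": [{
        "index": 0,
        "path": "model_0",
        "input features": ["deform"],
        "output features": ["ml_score_low", "ml_score_hig"],
        "output labels": ["Low", "High"],
    }],
}

# Write a stand-in archive to memory; a real file would come from dclab.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("index.json", json.dumps(index))

# Read the index back and summarize each listed model.
summary = []
with zipfile.ZipFile(buffer) as archive:
    loaded = json.loads(archive.read("index.json"))
for model in loaded["models"]:
    summary.append((model["path"], model["input features"],
                    model["output features"]))
    print(model["path"], model["input features"], "->",
          model["output features"])
```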
The great advantage of such a file format is that users can transparently exchange machine learning methods and apply them in a reproducible manner to any RT-DC dataset using dclab or Shape-Out.
To save a machine learning model to a .modc file, you can use the
dclab.save_modc function:
dclab.save_modc("path/to/file.modc", dc_model)
Conversely, you can load such a model at any time and use it for inference
using the dclab.load_modc function.
To directly load the model as an ancillary feature, use
dclab.load_ml_feature:
dclab.load_ml_feature("path/to/file.modc")
prediction = ds["ml_score_rbc"] # same result as above
The methods for saving and loading .modc files are described in the code reference.
Helper functions
If you are working with tensorflow, you might find the functions in the dclab.rtdc_dataset.feat_anc_ml.hook_tensorflow submodule helpful. Please also have a look at the machine-learning examples.