# Machine learning

To simplify machine-learning (ML) tasks in the context of RT-DC, dclab offers a few convenience methods. This section describes the recommended way of implementing and distributing ML models based on RT-DC data. Please make sure that you have installed dclab with the ml extra (pip install dclab[ml]).

## Using models in dclab

For RT-DC analysis, the most common ML task is to determine the probability that a specific event (e.g. a cell) belongs to a specific class (e.g. red blood cell). Since RT-DC data always has a very specific format, it is worthwhile to standardize this regression/classification process.

In dclab, you do not use bare models (such as those you would get from tensorflow/keras) directly. Instead, models are wrapped via a specific dclab.ml.models.BaseModel class that holds additional information about the features from which and to which a model maps. For instance, a model might have the inputs deform and area_um and make predictions for a defined output feature, e.g. ml_score_rbc. Output features for machine learning are always of the form ml_score_xxx, where each x can be any alphanumeric character (you are free to choose).

import dclab.ml
import tensorflow as tf

bare_model = tf.keras.Sequential(...)
bare_model.compile(...)
bare_model.fit(...)

# create a dclab model
dc_model = dclab.ml.models.TensorflowModel(
    bare_model=bare_model,
    inputs=["deform", "area_um"],
    outputs=["ml_score_rbc"],
    model_name="RBC identification",
    output_labels="red blood cells")

# once you get here, you can use your model directly for inference
ds = dclab.new_dataset("path/to/a/dataset")
# prediction is a dictionary with the key "ml_score_rbc" mapping
# to a 1D ndarray of length len(ds), holding the probability data.
prediction = dc_model.predict(ds)["ml_score_rbc"]
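
The naming rule for output features described above can be checked before a model is wrapped. The following is a minimal sketch, not part of dclab: the helper name is_ml_score_feature is hypothetical, and the pattern assumes that the xxx placeholder stands for exactly three alphanumeric characters.

```python
import re

# Assumption: "ml_score_xxx" means the prefix "ml_score_" followed by
# exactly three alphanumeric characters. Adjust if dclab is stricter.
ML_SCORE_PATTERN = re.compile(r"ml_score_[A-Za-z0-9]{3}")


def is_ml_score_feature(name):
    """Return True if `name` looks like an ML output feature name."""
    return ML_SCORE_PATTERN.fullmatch(name) is not None


for candidate in ["ml_score_rbc", "ml_score_r-c", "deform"]:
    print(candidate, is_ml_score_feature(candidate))
```

Only the first candidate satisfies the assumed pattern; the other two are rejected because of the hyphen and the missing prefix, respectively.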


For user convenience, a model can also be registered with dclab as an ancillary feature.

dc_model.register()
prediction = ds["ml_score_rbc"]  # same result as above
dc_model.unregister()  # optional


If it is inconvenient for you to call the register() and unregister() methods explicitly (e.g. when you would like to perform predictions with multiple models), you can use dc_model in combination with the with statement:

with dc_model:
    prediction = ds["ml_score_rbc"]  # same result as above
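
To illustrate why the with statement is convenient here, the sketch below shows the general context-manager pattern of registering on entry and unregistering on exit. It is a toy stand-in, not dclab's implementation: the RegisteredModel class and the REGISTRY dictionary are hypothetical names introduced only for this example.

```python
# Hypothetical registry; dclab's internal bookkeeping may differ.
REGISTRY = {}


class RegisteredModel:
    """Toy stand-in for a model that can be registered temporarily."""

    def __init__(self, output_feature):
        self.output_feature = output_feature

    def register(self):
        REGISTRY[self.output_feature] = self

    def unregister(self):
        REGISTRY.pop(self.output_feature, None)

    def __enter__(self):
        self.register()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even if the body raised, so the registry is cleaned up.
        self.unregister()
        return False  # do not swallow exceptions


model = RegisteredModel("ml_score_rbc")
with model:
    inside = "ml_score_rbc" in REGISTRY   # registered inside the block
outside = "ml_score_rbc" in REGISTRY      # removed again afterwards
print(inside, outside)
```

The benefit of this pattern is that cleanup does not depend on remembering a matching unregister() call, which matters most when several models are used in sequence.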


## The .modc file format

The .modc file format is not a reinvention of the wheel. It is merely a wrapper around other ML file formats and describes which input features (e.g. deform, area_um, image, etc.) a machine learning method maps onto which output features (e.g. ml_score_rbc). A .modc file is just a .zip file containing an index.json file that lists all models. A model may be stored in multiple file formats (e.g. as a tensorflow SavedModel and as a Frozen Graph). Alongside the models, the .modc file format also contains human-readable versions of the output features, SHA256 checksums, and the creation date:

example.modc (ZIP file contents)
├── index.json
├── model_0
│   ├── another-format
│   │   └── another_formats_file.suffix
│   └── tensorflow-SavedModel.tf
│       ├── assets
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index
└── model_1
    └── tensorflow-SavedModel.tf
        ├── assets
        ├── saved_model.pb
        └── variables
            ├── variables.data-00000-of-00001
            └── variables.index


The corresponding index.json file could look like this:

{
  "model count": 2,
  "models": [
    {
      "date": "2020-11-03 17:01",
      "formats": {
        "tensorflow-SavedModel": "tensorflow-SavedModel.tf",
        "library-OtherFormat": "another-format"
      },
      "index": 0,
      "input features": [
        "deform"
      ],
      "name": "Example Model 1",
      "output features": [
        "ml_score_low",
        "ml_score_hig"
      ],
      "output labels": [
        "Low",
        "High"
      ],
      "path": "model_0",
      "sha256": "ec11c73ae870da4551d9fa9cc73271566b8f2356f284d4c2cb02057ecb5bf6ce"
    },
    {
      "date": "2020-11-03 17:02",
      "formats": {
        "tensorflow-SavedModel": "tensorflow-SavedModel.tf"
      },
      "index": 1,
      "input features": [
        "area_um",
        "image"
      ],
      "name": "Example Model 2",
      "output features": [
        "ml_score_rbc"
      ],
      "output labels": [
        "red blood cells"
      ],
      "path": "model_1",
      "sha256": "ac43c73ae870da4551d9fa9cc73271566b8f2356f284d4c2cb02057ecb5ba812"
    }
  ]
}
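
Because a .modc file is a plain ZIP archive, its index can be inspected with the Python standard library alone. The following is a minimal sketch that assumes index.json sits at the archive root and uses the keys shown in the example above; the function name list_modc_models is hypothetical, and the demo archive is built in memory rather than taken from a real model file.

```python
import io
import json
import zipfile


def list_modc_models(modc):
    """Return (name, inputs, outputs) for each model listed in index.json.

    `modc` may be a path or a file-like object pointing to a ZIP archive.
    """
    with zipfile.ZipFile(modc) as archive:
        index = json.loads(archive.read("index.json"))
    return [
        (m["name"], m["input features"], m["output features"])
        for m in index["models"]
    ]


# Build a tiny in-memory archive that only contains an index.json
# (no actual model data), just to demonstrate the lookup.
demo_index = {
    "model count": 1,
    "models": [
        {
            "index": 0,
            "name": "Example Model 2",
            "input features": ["area_um", "image"],
            "output features": ["ml_score_rbc"],
            "path": "model_1",
        }
    ],
}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("index.json", json.dumps(demo_index))
buffer.seek(0)

models = list_modc_models(buffer)
print(models)
```

This only reads metadata. For actual inference, loading the file through dclab remains the intended route.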


The great advantage of such a file format is that users can transparently exchange machine learning methods and apply them in a reproducible manner to any RT-DC dataset using dclab or Shape-Out.

To save a machine learning model to a .modc file, you can use the dclab.ml.save_modc function:

dclab.ml.save_modc("path/to/file.modc", dc_model)


Conversely, you can load such a model at any time and use it for inference:

dc_model_loaded = dclab.ml.load_modc("path/to/file.modc")