Machine learning¶

To simplify machine-learning (ML) tasks in the context of RT-DC, dclab offers a few convenience methods. This section describes the recommended way of implementing and distributing ML models based on RT-DC data. Please make sure that you have installed dclab with the ml extra (pip install dclab[ml]).

Using models in dclab¶

For RT-DC analysis, the most common task for ML is to determine the probability for a specific event (e.g. a cell) to belong to a specific class (e.g. red blood cell). Since RT-DC data always has a very specific format, it is worthwile to standardize this regression/classification process.

In dclab, you are not directly using the bare models that you would e.g. get from tensorflow/keras. Instead, models are wrapped via a specific dclab.ml.models.BaseModel class that holds additional information about the features from which and to which a model maps. For instance, a model might have the inputs deform and area_um and make predictions regarding a defined output feature, e.g. ml_score_rbc. Output features for machine learning are always of the form ml_score_xxx where x can be any alphanumeric character (you are free to choose).

import dclab.ml
import tensorflow as tf

# do your magic
bare_model = tf.keras.Sequential(...)
bare_model.compile(...)
bare_model.fit(...)

# create a dclab model
dc_model = dclab.ml.models.TensorflowModel(
    bare_model=bare_model,
    inputs=["deform", "area_um"],
    outputs=["ml_score_rbc"],
    model_name="RBC identification",
    output_labels="red blood cells")

# once you get here, you can use your model directly for inference
ds = dclab.new_dataset("path/to/a/dataset")
# `prediction` is a dictionary with the key "ml_score_rbc" mapping
# to a 1D ndarray of length `len(ds)`, holding the probability data.
prediction = dc_model.predict(ds)["ml_score_rbc"]

For user convenience, a model can also be registered with dclab as an ancillary feature.

dc_model.register()
prediction = ds["ml_score_rbc"]  # same result as above
dc_model.unregister()  # optional

If it is inconvenient for you to call the register() and unregister methods (e.g. when you would like to perform predictions for multiple models), then you can use dc_model in combination with the with statement:

with dc_model:
    prediction = ds["ml_score_rbc"]  # same result as above

The .modc file format¶

The .modc file format is not a reinvention of the wheel. It is merely a wrapper around other ML file formats and describes which input features (e.g. deform, area_um, image, etc.) a machine learning method maps onto which output features (e.g. ml_score_rbc). A .modc file is just a .zip file containing an index.json file that lists all models. A model may be stored in multiple file formats (e.g. as a tensorflow SavedModel and as a Frozen Graph). Alongside the models, the .modc file format also contains human-readable versions of the output features, SHA256 checksums, and the creation date:

example.modc (ZIP file contents)
├── index.json
├── model_0
│         ├── another-format
│         │        └── another_formats_file.suffix
│         └── tensorflow-SavedModel.tf
│             ├── assets
│             ├── saved_model.pb
│             └── variables
│                 ├── variables.data-00000-of-00001
│                 └── variables.index
└── model_1
    └── tensorflow-SavedModel.tf
        ├── assets
        ├── saved_model.pb
        └── variables
            ├── variables.data-00000-of-00001
            └── variables.index

The corresponding index.json file could look like this:

{
  "model count": 2,
  "models": [
    {
      "date": "2020-11-03 17:01",
      "formats": {
        "tensorflow-SavedModel": "tensorflow-SavedModel.tf",
        "library-OtherFormat": "another-format"
      },
      "index": 0,
      "input features": [
        "deform"
      ],
      "name": "Example Model 1",
      "output features": [
        "ml_score_low",
        "ml_score_hig"
      ],
      "output labels": [
        "Low",
        "High"
      ],
      "path": "model_0",
      "sha256": "ec11c73ae870da4551d9fa9cc73271566b8f2356f284d4c2cb02057ecb5bf6ce"
    },
    {
      "date": "2020-11-03 17:02",
      "formats": {
        "tensorflow-SavedModel": "tensorflow-SavedModel.tf"
      },
      "index": 1,
      "input features": [
        "area_um",
        "image"
      ],
      "name": "Example Model 2",
      "output features": [
        "ml_score_rbc",
        "ml_score_sad"
      ],
      "output labels": [
        "red boold cells",
        "sad cells"
      ],
      "path": "model_1",
      "sha256": "ac43c73ae870da4551d9fa9cc73271566b8f2356f284d4c2cb02057ecb5ba812"
    }
  ]
}

The great advantage of such a file format is that users can transparently exchange machine learning methods and apply them in a reproducible manner to any RT-DC dataset using dclab or Shape-Out.

To save a machine learning model to a .modc file, you can use the dclab.ml.save_modc function:

dclab.ml.save_modc("path/to/file.modc", dc_model)

Conversely, you can load such a model at any time and use it for inference:

dc_model_loaded = dclab.ml.load_modc("path/to/file.modc")
with dc_model_loaded:
    prediction = ds["ml_score_rbc"]  # same result as above

The methods for saving and loading .modc files are described in the code reference.

Helper functions¶

If you are working with tensorflow, you might find the functions in the dclab.ml.tf_dataset submodule helpful. Please also have a look at the machine-learning examples.