Machine learning

To simplify machine-learning (ML) tasks in the context of RT-DC, dclab offers a few convenience methods. This section describes the recommended way of implementing and distributing ML models based on RT-DC data. Please make sure that you have installed dclab with the ml extra (pip install dclab[ml]).

Using models in dclab

For RT-DC analysis, the most common ML task is to determine the probability that a specific event (e.g. a cell) belongs to a specific class (e.g. red blood cell). Since RT-DC data always has a very specific format, it is worthwhile to standardize this regression/classification process.

In dclab, you do not directly use the bare models that you would get from e.g. tensorflow/keras. Instead, models are wrapped in a dclab.rtdc_dataset.feat_anc_ml.ml_model.BaseModel subclass that holds additional information about the input features a model reads and the output features it produces. For instance, a model might have the inputs deform and area_um and make predictions for a defined output feature, e.g. ml_score_rbc. Output features for machine learning are always of the form ml_score_xxx, where each x can be any alphanumeric character (you are free to choose).

import dclab
from dclab.rtdc_dataset.feat_anc_ml import hook_tensorflow
import tensorflow as tf

# do your magic
bare_model = tf.keras.Sequential(...)
bare_model.compile(...)
bare_model.fit(...)

# create a dclab model
dc_model = hook_tensorflow.TensorflowModel(
    bare_model=bare_model,
    inputs=["deform", "area_um"],
    outputs=["ml_score_rbc"],
    info={
        "description": "RBC identification",
        "output labels": ["Red Blood Cells"]
        }
    )

# once you get here, you can use your model directly for inference
ds = dclab.new_dataset("path/to/a/dataset")
# `dc_model.predict(ds)` returns a dictionary with the key "ml_score_rbc"
# mapping to a 1D ndarray of length `len(ds)`, holding the probability
# data; `prediction` is that array.
prediction = dc_model.predict(ds)["ml_score_rbc"]
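Output feature names such as ml_score_rbc follow the ml_score_xxx convention described above. The following sketch checks candidate names against that convention. It is not part of dclab: the helper is_ml_score_name is hypothetical, and the pattern assumes exactly three lowercase letters or digits after the prefix, which may differ from the validation that dclab performs internally.

```python
import re

# Assumption: exactly three lowercase letters or digits follow "ml_score_".
ML_SCORE_PATTERN = re.compile(r"ml_score_[a-z0-9]{3}")


def is_ml_score_name(name):
    """Return True if `name` looks like an ML output feature name."""
    return ML_SCORE_PATTERN.fullmatch(name) is not None


# Print the verdict for a few candidate feature names.
for candidate in ["ml_score_rbc", "ml_score_r2d", "ml_score_rb", "deform"]:
    print(candidate, is_ml_score_name(candidate))
```

Checking names before training can help avoid surprises when the model is later wrapped or registered as an ancillary feature.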

For user convenience, a model can also be registered with dclab as an ancillary feature.

dclab.MachineLearningFeature(feature_name="ml_score_rbc",
                             dc_model=dc_model)
prediction = ds["ml_score_rbc"]  # same result as above

Please have a look at this example to see dclab models in action.

The .modc file format

The .modc file format is not a reinvention of the wheel. It is merely a wrapper around other ML file formats and describes which input features (e.g. deform, area_um, image, etc.) a machine learning method maps onto which output features (e.g. ml_score_rbc). A .modc file is just a .zip file containing an index.json file that lists all models. A model may be stored in multiple file formats (e.g. as a tensorflow SavedModel and as a Frozen Graph). Alongside the models, the .modc file format also contains human-readable versions of the output features, SHA256 checksums, and the creation date:

example.modc (ZIP file contents)
├── index.json
├── model_0
│         ├── another-format
│         │        └── another_formats_file.suffix
│         └── tensorflow-SavedModel.tf
│             ├── assets
│             ├── saved_model.pb
│             └── variables
│                 ├── variables.data-00000-of-00001
│                 └── variables.index
└── model_1
    └── tensorflow-SavedModel.tf
        ├── assets
        ├── saved_model.pb
        └── variables
            ├── variables.data-00000-of-00001
            └── variables.index

The corresponding index.json file could look like this:

{
  "model count": 2,
  "models": [
    {
      "date": "2020-11-03 17:01",
      "description": "Determine sensitivity X",
      "formats": {
        "tensorflow-SavedModel": "tensorflow-SavedModel.tf",
        "library-OtherFormat": "another-format"
      },
      "index": 0,
      "input features": [
        "deform"
      ],
      "output features": [
        "ml_score_low",
        "ml_score_hig"
      ],
      "output labels": [
        "Low",
        "High"
      ],
      "path": "model_0",
      "sha256": "ec11c73ae870da4551d9fa9cc73271566b8f2356f284d4c2cb02057ecb5bf6ce"
    },
    {
      "date": "2020-11-03 17:02",
      "description": "Find RBCs and sad cells",
      "formats": {
        "tensorflow-SavedModel": "tensorflow-SavedModel.tf"
      },
      "index": 1,
      "input features": [
        "area_um",
        "image"
      ],
      "output features": [
        "ml_score_rbc",
        "ml_score_sad"
      ],
      "output labels": [
        "red blood cells",
        "sad cells"
      ],
      "path": "model_1",
      "sha256": "ac43c73ae870da4551d9fa9cc73271566b8f2356f284d4c2cb02057ecb5ba812"
    }
  ]
}
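Because a .modc file is a ZIP archive with an index.json at its top level, the index can be inspected with the Python standard library alone. The sketch below builds a small archive in memory for illustration and reads it back. The helpers read_modc_index and output_labels are hypothetical names, the index is abbreviated, and the pairing of output features with output labels by position is an assumption based on the example above. Real .modc files should be written and loaded with dclab.

```python
import io
import json
import zipfile

# Hypothetical, abbreviated index for illustration only; real .modc
# files should be written with dclab.save_modc.
index = {
    "model count": 1,
    "models": [
        {
            "index": 0,
            "input features": ["area_um", "image"],
            "output features": ["ml_score_rbc", "ml_score_sad"],
            "output labels": ["red blood cells", "sad cells"],
            "path": "model_0",
        }
    ],
}

# Build a ZIP archive in memory that contains only index.json.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("index.json", json.dumps(index))


def read_modc_index(file_or_path):
    """Return the parsed index.json of a .modc (ZIP) archive."""
    with zipfile.ZipFile(file_or_path) as archive:
        with archive.open("index.json") as fd:
            return json.load(fd)


def output_labels(model_entry):
    """Map each output feature to its label (assumed to match by position)."""
    return dict(zip(model_entry["output features"],
                    model_entry["output labels"]))


buffer.seek(0)
loaded = read_modc_index(buffer)
for entry in loaded["models"]:
    print(entry["path"], entry["input features"], output_labels(entry))
```

Passing a file path instead of the in-memory buffer to read_modc_index should work the same way, since zipfile.ZipFile accepts both.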

The great advantage of such a file format is that users can transparently exchange machine learning methods and apply them in a reproducible manner to any RT-DC dataset using dclab or Shape-Out.

To save a machine learning model to a .modc file, you can use the dclab.save_modc function:

dclab.save_modc("path/to/file.modc", dc_model)

Conversely, you can load such a model at any time and use it for inference with the dclab.load_modc function. To load the model directly as an ancillary feature, use dclab.load_ml_feature:

dclab.load_ml_feature("path/to/file.modc")
prediction = ds["ml_score_rbc"]  # same result as above

The methods for saving and loading .modc files are described in the code reference.

Helper functions

If you are working with tensorflow, you might find the functions in the dclab.rtdc_dataset.feat_anc_ml.hook_tensorflow submodule helpful. Please also have a look at the machine-learning examples.