The problem:
When you start a small Machine Learning team with a few projects, experiments are done in Jupyter Notebooks, and maybe the notebooks live on GitHub. A notebook might contain a method to download the data so the experiment can be reproduced, but it gets harder and harder to track the various experiments.
We also need to make sure models do not train on corrupted or skewed data, and that only high-quality models are pushed to production. Currently these processes are manual, not centralized, and lack a unified common tool.
How to solve it
What does a typical ML workflow look like? Facebook built FBLearner Flow, but it is an internal toolset and not available for public use. The platform covers the full workflow:
- Manage data
- Train models
- Evaluate models
- Deploy models
- Make predictions
- Monitor predictions
Google’s TFX is available as 3 components: TensorFlow Transform (data transformation; see the small sketch after this list), TensorFlow Model Analysis, and TensorFlow Serving. As the name indicates, the toolset is narrowed down to TensorFlow. The platform manages:
- Data ingestion
- Data analysis
- Data transformation
- Data validation
- Trainer
- Model evaluation and validation
- Serving
- Logging → Data ingestion (the logs feed back into ingestion)
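I have not used TFX end to end, but just to make the data transformation piece concrete: a minimal TensorFlow Transform preprocessing_fn might look roughly like this (the single numeric feature "x" is a placeholder of my own, not from the TFX docs):
# a minimal sketch of a TensorFlow Transform preprocessing_fn;
# the feature name "x" is my own placeholder
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # scale the feature to zero mean / unit variance using full-pass statistics
    return {"x_scaled": tft.scale_to_z_score(inputs["x"])}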
Uber built Michelangelo, but just like FBLearner Flow it is an internal ML-as-a-service platform and not available to the public.
Databricks, the creator of Spark, announced MLflow: an open source machine learning platform!
Documentation is located here.
On 06/05/2018 they announced it on their blog, and just 3 days later there are already 1,596 stars on GitHub.
Quick start
# ensure I have pipenv to create virtualenv
$ pip3 install pipenv --user
# install mlflow
$ pipenv install mlflow
# activate mlflow environment
$ pipenv shell
# run test experiment
$ git clone https://github.com/databricks/mlflow.git
$ python mlflow/example/quickstart/test.py
# start UI
$ mlflow ui -h 0.0.0.0
# then you should see the test experiment
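I have not read through test.py yet, but logging a run of your own takes only a few lines. Here is a minimal sketch of the tracking API (the parameter and metric names are placeholders of mine, not from the MLflow quickstart):
# log_my_run.py -- minimal tracking sketch; parameter and metric names are my own
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)   # record a hyperparameter
    mlflow.log_metric("accuracy", 0.93)       # record a result
# the run then shows up in the UI started with `mlflow ui`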
MLflow consists of 3 components:
- Tracking: For querying and recording data on experiments. Using the web UI, you can view and compare the output of multiple runs.
- Projects: Provides a simple format for packaging code so runs can be reproduced
- Models: For managing and deploying models into production (see the sketch after this list)
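The Models piece is the one I am most curious about. A minimal sketch of logging a trained model, assuming scikit-learn (the iris data and the artifact path "model" are my own choices, not from the MLflow docs):
# train_and_log.py -- minimal sketch of the Models component, assuming scikit-learn
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression().fit(X, y)

with mlflow.start_run():
    mlflow.log_metric("train_accuracy", clf.score(X, y))
    # save the model in MLflow's format so it can be reloaded or deployed later
    mlflow.sklearn.log_model(clf, "model")
The logged model then lives with the run's artifacts, where it can be reloaded or deployed from.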
It would be nice to have data components for (a) analysis, (b) transformation, and (c) validation, but I want to watch how this grows and evaluate it over time!
I should also compare it with other available tools such as Amazon SageMaker and IDSIA's Sacred.
More reporting to come.
Cheers!