
Machine Learning tools evaluation

Jarosław Kijanowski

29 Sep 2020 · 9 minutes read


Photo by @yuyeunglau

This post is a journey around the world of ML tooling, based largely on Thoughtworks’ Technology Radar, and is by no means complete. It’s meant to give you a good feel for what is available for working with ML data, models and results, plus a few discoveries along the way.

What you’ll miss, as I did, is for example a deep dive into the sheer number of new architectures being published in scientific journals, the rising quality of labeling services, or a detailed description of the areas where ML techniques are currently being adopted.

ML is becoming mainstream. Low-level libraries like Google’s TensorFlow or Facebook’s PyTorch are being hidden behind more approachable higher-level APIs like Keras and fast.ai.

Additionally, ML is entering the world of devices with limited compute and memory resources. Mobile phones as well as IoT devices no longer have to upload data to be processed by a server; they can do it themselves with TensorFlow Lite. This deep learning framework first converts a regular TensorFlow model into a compressed and serialized file, which is then loaded onto a mobile or embedded device and optimized for it.

The converted model supports only a limited subset of TensorFlow operations, though. Optimization techniques include weight pruning (trimming less important weights) and quantization (reducing the precision of numbers such as weights, activations, biases and other model parameters). All of this makes the model faster to download and reduces the time required for inference, at the cost of lower accuracy and more complex training.
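To make this more concrete, here is a minimal sketch of converting a saved TensorFlow model to TensorFlow Lite with default post-training quantization; the saved_model/ directory and the output file name are placeholders:

```python
import tensorflow as tf

# Load a regular TensorFlow model exported as a SavedModel (placeholder path).
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")

# Apply default post-training quantization to shrink the model
# and speed up inference on mobile and embedded devices.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Convert to the compressed, serialized TFLite format and write it to disk.
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```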

Long-proven software development practices worm their way into ML

Long gone are the days when engineers exchanged Python files via email. Source control systems like git and notebooks like Jupyter are common, and it is no surprise that Continuous Delivery for Machine Learning (CD4ML) is being proposed as the next step in application development. Interestingly, this technique is about more than just installing a pipeline that delivers the final bits to production.

It’s about experimenting with different data and models, leading to different branches of an application which are never meant to be merged, but rather compared with each other. CD4ML is also about auto-scaling the underlying environments (CPU, RAM, OS) in an elastic fashion, since an ML application requires far more resources, and a different kind of them, during the training phase than during regular operation.

Another topic popping up recently is the need to discover whether a model is biased. Excluding people, or treating disadvantaged or underrepresented groups unequally, can be discovered and consequently addressed with ethical bias testing. Google plans to offer an ethics service for spotting racial bias and developing ethical guidelines. Mislabeling chihuahuas as muffins or influencing a credit score based on race shows how important this topic can be. The ACM FAccT conference not only addresses this problem but also asks whether some kinds of decisions should be made by ML-driven systems at all:

Algorithmic systems are being adopted in a growing number of contexts, fueled by big data. These systems filter, sort, score, recommend, personalize, and otherwise shape human experience, increasingly making or informing decisions with major impact on access to, e.g., credit, insurance, healthcare, parole, social security, and immigration. Although these systems may bring myriad benefits, they also contain inherent risks, such as codifying and entrenching biases; reducing accountability, and hindering due process; they also increase the information asymmetry between individuals whose data feed into these systems and big players capable of inferring potentially relevant information.
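In its simplest form, ethical bias testing can be a check that a model’s positive decisions are distributed similarly across groups. The sketch below (plain NumPy, hypothetical inputs, not Google’s upcoming service) computes such a demographic parity gap:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between two groups.

    y_pred: array of 0/1 model decisions (e.g. credit granted or not).
    group:  array of 0/1 group membership for a protected attribute.
    """
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return abs(rate_a - rate_b)

# A gap close to 0 means positive outcomes are granted at similar rates;
# a large gap is a signal to investigate the data and the model further.
print(demographic_parity_gap([1, 0, 1, 1], [0, 0, 1, 1]))  # 0.5
```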

Other than that, the Technology Radar explains semi-supervised learning loops, transfer learning for NLP (the key insight being that a pre-trained model can be further trained to solve problems it was not originally trained for), data meshes (spreading domain data among different owners instead of storing it in a central place), DeepWalk (which helps extract features from data sets represented as graphs) and Google BigQuery ML, an extension to Google’s BigQuery warehouse that lets you create and execute machine learning models using standard SQL queries.

Data is key to ML, and Marquez is a governance tool for collecting, searching and applying metadata and for visualizing the lineage of ML models. Model versioning, for example, lets us monitor and track the quality of data, roll back to a better model, and prevent training runs with bad data.

Similar to how developers keep code in a version control system, data analysts and engineers have DVC, a repository for machine learning projects that supports large files, data sets, models and, obviously, code. Branches allow us to experiment with different versions of models and data. Pipelines automate publishing our projects to a staging or production environment.
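To give a feel for how this is consumed from code, here is a minimal sketch using DVC’s Python API to read a tracked data file at a specific revision; the repository URL, file path and tag are made up for illustration:

```python
import dvc.api

# Read a DVC-tracked dataset exactly as it was at the "v1.0" tag of the repo.
# The repo URL, path and revision below are placeholders, not a real project.
with dvc.api.open(
    "data/lemons.csv",
    repo="https://github.com/example/lemon-quality",
    rev="v1.0",
) as f:
    header = f.readline()
    print(header)
```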

Next to DVC and Marquez, there is another set of tools especially useful for tracking experiments in machine learning projects. Neptune.ai, Comet and MLflow provide ways to visualize, track, compare and explain experiments to teams. With all the necessary details recorded, you can reproduce every single experiment.
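As an example of what experiment tracking looks like in practice, here is a minimal MLflow sketch that records parameters, metrics and an artifact of a hypothetical training run (the values and the model file name are placeholders):

```python
import mlflow

# Each run records its parameters, metrics and artifacts, so the
# experiment can be compared with others and reproduced later.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 32)

    for epoch in range(3):
        # In a real project the metric would come from the training loop.
        mlflow.log_metric("val_accuracy", 0.80 + 0.05 * epoch, step=epoch)

    # Attach the trained model file (placeholder name) to the run.
    mlflow.log_artifact("model.h5")
```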

In regular development, code that doesn’t work (as expected) is not thrown away immediately; a debugger is a helpful tool for finding out why an algorithm misbehaves. Similarly, in machine learning, engineers struggle to identify what went wrong with a model. In such cases Manifold, a model-agnostic visual debugging tool, can help:

Manifold allows ML practitioners to look beyond overall summary metrics to detect which subset of data a model is inaccurately predicting. Manifold also explains the potential cause of poor model performance by surfacing the feature distribution difference between better and worse-performing subsets of data.

The core idea is to move from the model space to the data space. Instead of asking what went wrong with the model, the debugger helps answer which data (sub)set made the model make a mistake. Similarly, we may not figure out why the model behaved badly, but we get insight into which feature contributed to the mistake. I must admit I felt disappointed when testing the online demo with the built-in sample data: it crashed many times and did not match the descriptions in the documentation. An alternative is Google’s What-If Tool, which also highlights the importance of various data features and visualizes the behavior of models.
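The underlying idea can be approximated even without a dedicated tool. Here is a rough sketch in plain pandas (column names and data are hypothetical) that slices a validation set by one feature and reports the error rate per segment:

```python
import numpy as np
import pandas as pd

def error_rate_by_segment(df, feature, y_true, y_pred, bins=4):
    """Bin the validation set by one feature and report the error rate per bin.

    Segments with a noticeably higher error rate point to the data subsets
    on which the model performs poorly: the "data space" view of debugging.
    """
    segments = pd.qcut(df[feature], q=bins, duplicates="drop")
    errors = pd.Series(np.asarray(y_true) != np.asarray(y_pred), index=df.index)
    return errors.groupby(segments).mean().sort_values(ascending=False)

# Hypothetical usage: df holds validation features, y_true/y_pred the labels.
# print(error_rate_by_segment(df, "brightness", y_true, y_pred))
```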

Getting things done / real

Application development is where machine learning techniques are verified against real world requirements. One way to build ML solutions for mobile (iOS, Android), edge devices and the desktop (C++) as well as for web browsers is to go with MediaPipe.

Input streams of data, such as the loudness of sound in a room, its brightness and image frames captured by microphones, cameras and other sensors, allow the application to perceive the world around it.

MediaPipe ships with ready-to-use solutions like Face Detection (based on BlazeFace), Face Mesh (a face geometry solution that lets you apply AR effects like adding a mask, sunglasses or bunny ears), Hands (tracking the motion of hands in 3D) and many more.
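As a quick taste of the Python API, here is a minimal sketch running MediaPipe Face Detection on a single image; the image file name is a placeholder and details may differ between MediaPipe versions:

```python
import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection

# Load an image (placeholder file name) and convert BGR -> RGB for MediaPipe.
image = cv2.imread("room.jpg")
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

with mp_face_detection.FaceDetection(min_detection_confidence=0.5) as detector:
    results = detector.process(rgb_image)

# Each detection carries a relative bounding box and key facial landmarks.
if results.detections:
    for detection in results.detections:
        box = detection.location_data.relative_bounding_box
        print(f"face at x={box.xmin:.2f}, y={box.ymin:.2f}, "
              f"w={box.width:.2f}, h={box.height:.2f}")
```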

Promising discoveries in NLP

Natural Language Processing is about far more than just that little lonely chatbot in the corner of some website. Besides letting you (re-)order a pizza, the ML community has taken NLP algorithms further and is applying them to analyse

  • customers’ sentiments (how good this movie is, how much I like this new dress, how tasty this new salad is),
  • emails, to prevent spam messages from being delivered,
  • and, especially in healthcare, to extract meaning from health records, that is to recognize and predict diseases based solely on descriptions of symptoms.
    According to this guide to natural language processing, currently:

    NLP is battling to detect nuances in language meaning, whether due to lack of context, spelling errors or dialectal differences.

The General Language Understanding Evaluation benchmark (GLUE) is a way to test NLP systems. It comes with a manually-curated evaluation dataset for fine-grained analysis of system performance on a broad range of linguistic phenomena. Looking at the leaderboard, you’ll notice several derivatives of BERT, the state-of-the-art model from Google. Keep an eye on ERNIE, though. The Enhanced Representation through kNowledge IntEgration implementation from Baidu outperformed BERT on several GLUE tasks and on all nine Chinese NLP tasks.
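To see how little code a pre-trained model requires these days, here is a minimal sketch using the Hugging Face transformers library; the default sentiment pipeline downloads a small BERT-derived model fine-tuned for sentiment analysis, and the example sentence and scores are purely illustrative:

```python
from transformers import pipeline

# Downloads a pre-trained, fine-tuned sentiment model on first use.
classifier = pipeline("sentiment-analysis")

print(classifier("How good this movie is!"))
# Expected shape of the result (exact score will vary):
# [{'label': 'POSITIVE', 'score': 0.99}]
```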

Data augmentation

Realistic-looking but fake AI-generated human faces are not that new anymore. NVIDIA is also busy in this field, exploring the possibilities of generative adversarial networks (GANs), and in addition to artificial faces you’ll find landscapes as well.

This is the effect of playing around for 5 minutes with GauGAN, a paint-like application.

The idea behind a GAN is to create new, artificial data from a dataset consisting of real data. In one of our projects that was especially useful: we had to categorize diseases of a lemon (16 types, can you imagine?), but the training set we got was limited to a few thousand photos showing defects of this fruit. We needed additional data for the learning phase to get better end results from the model.
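For the curious, here is a bare-bones sketch of the GAN setup in Keras: a generator turning random noise into small images and a discriminator learning to tell them apart from real ones. It is a simplified illustration (image size, layer sizes and the training step are minimal), not the architecture we used for the lemon project:

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 100  # size of the random noise vector fed to the generator

def build_generator():
    # Maps a noise vector to a 32x32 RGB image with values in [-1, 1].
    return tf.keras.Sequential([
        layers.Dense(8 * 8 * 128, input_shape=(LATENT_DIM,)),
        layers.LeakyReLU(0.2),
        layers.Reshape((8, 8, 128)),
        layers.Conv2DTranspose(128, 4, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Conv2D(3, 7, padding="same", activation="tanh"),
    ])

def build_discriminator():
    # Scores a 32x32 RGB image: high logit for real, low for generated.
    return tf.keras.Sequential([
        layers.Conv2D(64, 4, strides=2, padding="same", input_shape=(32, 32, 3)),
        layers.LeakyReLU(0.2),
        layers.Conv2D(128, 4, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Flatten(),
        layers.Dense(1),
    ])

generator, discriminator = build_generator(), build_discriminator()
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images):
    noise = tf.random.normal([tf.shape(real_images)[0], LATENT_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Discriminator: push real towards 1, fake towards 0.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: fool the discriminator into scoring fakes as real.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```

After enough alternating updates, the generator’s outputs start to resemble the real photos closely enough to serve as additional training samples.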

In another project about satellite image processing we not only had to create artificial data samples, but also enhance and even fix images by inpainting cloudy areas. In this case we’ve successfully implemented a conditional flavour of GAN.

NVIDIA open-sourced an official TensorFlow implementation of its style-based GAN architecture on GitHub.

To leave you with some eye candy, take a look at the Depth-Aware video frame INterpolation (DAIN) model, based on a Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement, and what it is capable of doing with a stop-motion movie, typically made of 15 to 30 frames per second.

As you can imagine after reading the case with lemons, one of our specialities at SoftwareMill is machine learning. Here you can find out more about our machine learning services.

Check out all the tools you just read about

  • Fast.ai
  • Keras
  • TensorFlow Lite
  • Marquez
  • DVC
  • Neptune.ai
  • Comet
  • MLflow
  • MediaPipe

Wrap up of the series

How do we leverage technology trends? Radars, hype cycles, conferences and various blog posts tell us which technologies are top notch and which are already a well-established industry standard, or at least try to predict that. And indeed, most of the mentioned frameworks, tools and platforms are used by us or are at least widely known.

It's still worth giving such publications a read every couple of months. They not only prove us right, but also draw our attention to technologies and techniques that are only just rising.

Most importantly though, radars fine-tune SoftwareMill's strategy and support the decision process on where to place ourselves in terms of technology and what to invest in when it comes to supporting personal development. In the end, our capability to learn new skills is limited, and so is our time.
