
Machine Learning libraries for any project


There are many libraries out there that can be used in Machine Learning projects. Of course, some of them have gained a considerable reputation over the years. Such libraries are the go-to picks for anyone starting a new project that utilizes machine learning algorithms. However, choosing the correct set (or stack) may be quite challenging.

The Why

In this post, I would like to give you a general overview of the machine learning libraries landscape and share some of my thoughts about working with them. If you are starting your journey with machine learning, this text can give you some general knowledge of the available libraries and provide a better starting point for learning more.

The libraries described here will be divided by the role they can play in your project.

The categories are as follows:

  1. Model Creation - libraries that can be used to create machine learning models
  2. Working With Data - libraries that can be used for feature engineering, feature extraction, and all other operations that involve working with features
  3. Hyperparameters Optimization - libraries and tools that can be used for optimizing model hyperparameters
  4. Experiment Tracking - libraries and tools used for experiment tracking
  5. Problem Specific Libraries - libraries that can be used for tasks like time series forecasting, computer vision, and working with spatial data
  6. Utils - non-strictly machine learning libraries but nevertheless the ones I found useful in my projects

Model Creation

PyTorch

Developed by people from Facebook and open-sourced in 2017, it is one of the most famous machine learning libraries on the market - based on the open-source Torch package. The PyTorch ecosystem can be used for all types of machine learning problems and has a great variety of purpose-built libraries like torchvision or torchaudio.

The basic data structure of PyTorch is the Tensor object, which is used to hold the multidimensional data utilized by our model. It is similar in conception to NumPy's ndarray. PyTorch can also use computation accelerators: it supports CUDA-capable NVIDIA GPUs, ROCm, the Metal API, and TPUs.

The most important part of the core PyTorch library is the nn module, which contains the layers and tools needed to easily build complex models layer by layer.

from torch import nn


class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

An example of a simple neural network with 3 linear layers in PyTorch.
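To show how it is used, here is a minimal sketch that instantiates the model above and runs a forward pass on dummy data:

import torch

model = NeuralNetwork()
dummy_input = torch.rand(1, 28, 28)   # a fake 28x28 "image", batch size 1
logits = model(dummy_input)           # calling the model invokes forward() under the hood
print(logits.shape)                   # torch.Size([1, 10])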

Additionally, PyTorch 2.0 has already been released, and it makes PyTorch even better. Moreover, PyTorch is used by a variety of companies like Uber, Tesla, or Facebook, just to name a few.

PyTorch Lightning

It is a sort of “extension” for PyTorch which aims to greatly reduce the amount of boilerplate code needed to train and use our models.

Lightning is based on the concept of hooks - functions called on specific phases of the model train/eval loop. Such an approach allows us to pass callback functions executed at a specific time, like the end of the training step.

Lightning's Trainer automates many things that one has to take care of manually in plain PyTorch, for example, the training loop, hardware calls, or zeroing gradients.

Below are roughly equivalent code fragments of PyTorch (Left) and PyTorch Lightning (Right).

[Table: side-by-side comparison of PyTorch and PyTorch Lightning training code]

from: https://lightning.ai/docs/pytorch/latest/starter/introduction.html
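As a rough illustration (not the exact snippet from the linked page), a minimal LightningModule wrapping the NeuralNetwork class defined earlier might look like this; train_loader is an assumed, pre-existing DataLoader:

import pytorch_lightning as pl
from torch import optim
from torch.nn import functional as F


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = NeuralNetwork()    # the plain PyTorch model from the previous section

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.model(x), y)
        return loss                     # Lightning handles backward(), optimizer step, and zero_grad()

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1e-3)


# trainer = pl.Trainer(max_epochs=3)
# trainer.fit(LitClassifier(), train_loader)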

TensorFlow

A library developed by the Google Brain team, originally released in 2015 under the Apache 2.0 License; version 2.0 was released in 2019. It provides clients in Java, C++, Python, and even JavaScript.

Similar to PyTorch, it is widely adopted throughout the market and used by companies like Google (surprise), Airbnb, and Intel. TensorFlow also has quite an extensive ecosystem built around it by Google. It contains tools and libraries such as the model optimization toolkit, TensorBoard (more about it below), or recommenders. The TensorFlow ecosystem also includes a web-based sandbox to play around with your model's visualization.

Again, the tf.nn module plays the most vital part, providing all the building blocks required to build machine learning models. TensorFlow uses its own Tensor (flow ;p) object for holding the data utilized by deep learning models. It also supports all the common computation accelerators like CUDA, ROCm (community-supported), the Metal API, and TPUs.

from tensorflow.keras import layers, models


class NeuralNetwork(models.Model):
    def __init__(self):
        super().__init__()
        self.flatten = layers.Flatten()
        self.linear_relu_stack = models.Sequential([
            layers.Dense(512, activation='relu'),
            layers.Dense(512, activation='relu'),
            layers.Dense(10)
        ])
    def call(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Note that in TensorFlow and Keras, we use the Dense layer instead of the Linear layer used in PyTorch. We also use the call method instead of the forward method to define the forward pass of the model.

Keras

It is a library similar in purpose and conception to PyTorch Lightning, but for TensorFlow. It offers a more high-level interface over TensorFlow. Developed by François Chollet and released in 2015, it provides only a Python client. Keras also has its own set of problem-specific libraries, like KerasCV or KerasNLP, for more specialized use cases.

Before version 2.4, Keras supported more backends than just TensorFlow, but after that release, TensorFlow became the only supported one. As Keras is just an interface over TensorFlow, it shares the same base concepts as its underlying backend. The same holds true for the supported computation accelerators. Keras is used by companies like IBM, PayPal, and Netflix.

from tensorflow.keras import layers, models


class NeuralNetwork(models.Model):
    def __init__(self):
        super().__init__()
        self.flatten = layers.Flatten()
        self.linear_relu_stack = models.Sequential([
            layers.Dense(512, activation='relu'),
            layers.Dense(512, activation='relu'),
            layers.Dense(10)
        ])
    def call(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

As in the TensorFlow example above (the definition is identical, since Keras is the high-level API that TensorFlow itself exposes), we use the Dense layer instead of PyTorch's Linear layer, and the call method instead of the forward method to define the forward pass of the model.
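What Keras adds on top is the high-level training loop; a minimal, illustrative sketch (x_train and y_train are assumed to exist):

import tensorflow as tf

model = NeuralNetwork()
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
# model.fit(x_train, y_train, epochs=3)
# model.evaluate(x_test, y_test)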

PyTorch vs. TensorFlow

I wouldn’t be fully honest if I did not include some comparison of the two.

As you could read a moment ago, both of them are quite similar in the features they offer and the ecosystems around them. Of course, there are some minor differences and quirks in how they work or the features they provide. In my opinion, these are more or less insignificant.

The real difference between them comes from their approach to defining and executing the computational graphs of machine and deep learning models.

  • PyTorch uses dynamic computational graphs, which means that the graph is defined on-the-fly during execution. This allows for more flexibility and intuitive debugging, as developers can modify the graph at runtime and easily inspect intermediate outputs. On the other hand, this approach may be less efficient than static graphs, particularly for complex models. However PyTorch 2.0 attempts to address these issues via torch.compile and FX graphs.
  • TensorFlow uses static computational graphs, which are compiled before execution. This allows for more efficient execution, as the graph can be optimized and parallelized for the target hardware. However, it can also make debugging more difficult, as intermediate outputs are not readily accessible. The sketch after this list illustrates the difference.
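A minimal, illustrative sketch (not taken from either library's docs): PyTorch executes operations eagerly as ordinary Python runs, while TensorFlow's tf.function traces the code once into a graph that is then optimized and reused.

import torch
import tensorflow as tf

# PyTorch: dynamic graph - ordinary Python control flow, inspectable at any point
def torch_step(x):
    y = x * 2
    print("intermediate value:", y)     # runs on every call, easy to debug
    return y + 1

torch_step(torch.tensor(3.0))

# TensorFlow: static graph - the function is traced and compiled on first call
@tf.function
def tf_step(x):
    y = x * 2
    tf.print("intermediate value:", y)  # tf.print works inside the traced graph
    return y + 1

tf_step(tf.constant(3.0))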

Another noticeable difference is that PyTorch seems to be more low-level than Keras while being more high-level than pure TensorFlow. Such a setting makes PyTorch more flexible and easier to use for building tailored models with many customizations.

As a side note, I would like to add that both libraries are roughly on par in terms of market share. Additionally, despite the fact that TensorFlow uses the call method and PyTorch uses the forward method, both libraries support call semantics, so model(x) works as a shorthand in both.

Working with Data

pandas

A library that you must have heard of if you are using Python. It is probably the most famous Python library for working with data of any type. Originally released in 2008, it reached version 1.0 in 2020. It provides functions for filtering, aggregating, and transforming data, as well as merging multiple datasets.

The cornerstone of this library is the DataFrame object, which represents a two-dimensional table that can hold data of any type. The library heavily focuses on performance, with some parts written in pure C to speed things up.

Besides being performance-focused, pandas provides a lot of features related to:

  • data cleaning and preprocessing:
    • removing duplicates,
    • filling null values or nan values.
  • time series analysis:
    • resampling,
    • windowing,
    • time shift.

Additionally, it can perform a variety of input/output operations:

  • reading from/writing to .csv or .xlsx files
  • performing database queries
  • loading data from GCP BigQuery (with the help of pandas-gbq)
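To give a feel for the API, here is a small, made-up sketch combining a few of the features above (the file and column names are invented for the example):

import pandas as pd

df = pd.read_csv("measurements.csv", parse_dates=["timestamp"])

df = df.drop_duplicates()                                # data cleaning
df["value"] = df["value"].fillna(df["value"].mean())     # fill missing values

hourly = (
    df.set_index("timestamp")
      .resample("1H")["value"]                           # time series resampling
      .mean()
)

hourly.to_csv("hourly_means.csv")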

numpy

It is yet another famous library for working with data - mostly numerical data. The most famous part of numpy is the ndarray - a structure representing a multidimensional array of numbers.

Besides ndarray, numpy provides a lot of high-level mathematical functions and operations used to work with this data. It is also probably the oldest library in this set, as the first version was released in 2005. It was implemented by Travis Oliphant, based on an even older library called Numeric (released in 1996).

Numpy is extremely focused on performance, with contributors constantly optimizing the currently implemented algorithms to reduce the execution time of numpy functions even further.

Of course, as with all libraries described here, numpy is also open source and uses BSD license.

scipy

It is a library focused on supporting scientific computations. It is even older than numpy (2005), having been released in 2001. It is built atop numpy, with ndarray being the basic data structure used throughout scipy. Among other things, the library adds functions for optimization, linear algebra, signal processing, interpolation, and sparse matrix support. In general, it is more high-level than numpy and thus can provide more complex functions.
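As a small illustration of the optimization part, here is a sketch of minimizing a toy function with scipy.optimize:

import numpy as np
from scipy import optimize

def rosenbrock(p):
    # a classic test function with a known minimum at (1, 1)
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

result = optimize.minimize(rosenbrock, x0=np.array([0.0, 0.0]))
print(result.x)   # should be close to [1.0, 1.0]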

Hyperparameter Optimization

Ray-tune

It is part of the Ray toolset, a bundle of related libraries for building distributed applications with a focus on machine learning and Python. The Tune part of the library is focused on providing hyperparameter optimization features by offering a variety of search algorithms, for example, grid search, hyperband, or Bayesian optimization.

Ray-tune can work with models created in most of the programming languages and libraries available on the market. All the libraries described in the paragraph about model creation are supported by ray-tune.

The key concepts of ray-tune are listed below (a minimal usage sketch follows the list):

  • Trainables - objects passed to tune runs; they are the models we want to optimize the parameters for
  • Search space - contains all the values of hyperparameters we want to check in the current trial
  • Tuner - an object responsible for executing runs, calling tuner.fit() starts the process of searching for optimal hyperparameters set. It requires passing at least a trainable object and search space
  • Trial - each trial represents a particular run of a trainable object with an exact set of parameters from a search space. Trials are generated by the Ray Tune Tuner. As it represents the output of running the tuner, a Trial contains a ton of information:
    • config used for a particular trial,
    • trial ID,
    • many others.
  • Search algorithms - the algorithm used for a particular execution of Tuner.fit. If not provided, Ray Tune will use random search as the default
  • Schedulers - objects that are in charge of managing runs. They can pause, stop, and run trials within the run. This can result in increased efficiency and a reduced run time. If none is selected, Tune will pick FIFO as the default - trials will be executed one by one, as in a classic queue
  • Run Analyses - the object containing the results of the Tuner.fit execution in the form of a ResultGrid object. It contains all the data related to the run, like the best result among all trials or the data from all trials
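Putting these concepts together, here is a minimal sketch on a toy objective, assuming a recent Ray 2.x API:

from ray import tune

def objective(config):                          # the trainable
    x = config["x"]
    return {"score": (x - 3) ** 2}              # metric reported to Tune

tuner = tune.Tuner(
    objective,
    param_space={"x": tune.uniform(0, 10)},     # the search space
    tune_config=tune.TuneConfig(num_samples=20, metric="score", mode="min"),
)
results = tuner.fit()                           # a ResultGrid
print(results.get_best_result().config)         # best hyperparameters found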

BoTorch

A library built atop PyTorch and part of the PyTorch ecosystem. It focuses solely on providing hyperparameter optimization with the use of Bayesian optimization.

As the only library in this section designed to work with one specific model library, BoTorch may be problematic to use with libraries other than PyTorch. It is also the only library here that is currently in beta and under intensive development, so some unexpected problems may occur.

The key feature of BoTorch is its integration with PyTorch, which greatly impacts the ease of interaction between the two.
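To give a flavor of its API, here is a rough sketch of a single Bayesian optimization step on a toy objective, loosely following the pattern from the BoTorch docs (it assumes a recent BoTorch/GPyTorch version):

import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# toy data: 10 random 2D points and their objective values
train_X = torch.rand(10, 2, dtype=torch.double)
train_Y = (train_X ** 2).sum(dim=-1, keepdim=True)

gp = SingleTaskGP(train_X, train_Y)                       # Gaussian process surrogate
fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))

ei = ExpectedImprovement(gp, best_f=train_Y.max())        # acquisition function
candidate, _ = optimize_acqf(
    ei,
    bounds=torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double),
    q=1, num_restarts=5, raw_samples=32,
)
print(candidate)   # the next point (hyperparameter set) to evaluate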

Experiment Tracking

Neptune.ai

It is a web-based tool that serves both as an experiment tracking tool and a model registry. The tool is cloud-based in the classic SaaS model, but if you are determined enough, there is the possibility of using a self-hosted variant.

It provides a dashboard where you can view and compare the results of training your model. It can also be used to store the parameters used for particular runs. Additionally, you can easily version the datasets used for particular runs, along with all the metadata you think may come in handy.

Moreover, it enables easy version control for your models. Obviously, the tool is library agnostic and can host models created using any library. To make the integration possible, Neptune exposes a REST-style API with its own client. The client can be downloaded and installed via pip or any other Python dependency tool. The API is decently documented and quite easy to grasp.
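A rough sketch of what logging to Neptune can look like (the project name, token, and train_one_epoch function are placeholders, and the exact calls may differ between client versions):

import neptune

run = neptune.init_run(
    project="my-workspace/my-project",    # placeholder project name
    api_token="...",                      # placeholder token
)

run["parameters"] = {"lr": 1e-3, "batch_size": 64}

for epoch in range(10):
    loss = train_one_epoch()              # assumed to exist
    run["train/loss"].append(loss)

run.stop()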

The tool is paid and has a simple pricing plan divided into three tiers. Yet if you need it just for a personal project, or you are in a research or academic unit, you can apply to use the tool for free.

Neptune.ai is a new tool, so some features known from other experiment tracking tools may not be present. Despite that fact, neptune.ai support is keen to react to user feedback and implement missing functionalities – at least, that was the case for us, as we use Neptune.ai extensively in The Codos Project.

Weights&Biases

Also known as WandB or W&B, it is a web-based tool that exposes all the functionalities needed to be used as an experiment tracking tool and model registry. It exposes a more or less similar set of functionalities to neptune.ai.

However, Weights & Biases seems to have better visualizations and, in general, is a more mature tool than Neptune. Additionally, WandB seems to be more focused on individual projects and researchers, with less emphasis on collaboration.

It also has a simple pricing plan divided into 3 categories with a free tier for private use. Yet W&B has the same approach as Neptune to researchers and academic units - they can always use Weights and Biases for free.

Weights & Biases also exposes a REST-like API to ease the integration process. It seems to be better documented and offers more features than the one exposed by Neptune.ai. What is curious is that they also expose a client library written in Java - if, for some reason, you wrote your machine learning model in Java instead of Python.
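For comparison, a minimal sketch with the wandb Python client (the project name, config values, and train_one_epoch are made up):

import wandb

wandb.init(project="demo-project", config={"lr": 1e-3, "batch_size": 64})

for epoch in range(10):
    loss = train_one_epoch()              # assumed to exist
    wandb.log({"epoch": epoch, "train_loss": loss})

wandb.finish()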

TensorBoard

It is a dedicated visualization toolkit for the TensorFlow ecosystem. It is designed mostly to work as an experiment tracking tool with a focus on metrics visualizations. Despite being a dedicated TensorFlow tool, it can also be used with Keras (not surprisingly) and PyTorch.

Additionally, it is the only free tool of the three described in this section, and here you can host and track your experiments. However, TensorBoard is missing the functionalities responsible for a model registry, which can be quite problematic and force you to use some 3rd party tool to cover this missing feature. Anyway, there is surely a tool for this in the TensorFlow ecosystem.

As it is part of the TensorFlow ecosystem, its integration with Keras or TensorFlow is much smoother than for either of the previous two tools.
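A minimal sketch of logging metrics to TensorBoard, here from PyTorch (the log directory and the metric values are arbitrary):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment-1")

for step in range(100):
    loss = 1.0 / (step + 1)                          # dummy metric for illustration
    writer.add_scalar("train/loss", loss, global_step=step)

writer.close()
# then view it with: tensorboard --logdir runs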

Problem specific libraries

tsaug

One of the few libraries for augmenting time series. It is an open-source library created and maintained by a single person under the GitHub nick tailaiw. Released in 2019, it is currently at version 0.2.1. It provides a set of 9 augmentations, Crop, AddNoise, or TimeWarp among them. The library is reasonably well documented for such a project and easy to use from a user perspective.
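A short sketch of how augmenters can be composed, based on the usage pattern from the tsaug documentation (X here is a made-up numpy array of shape (n_series, n_timestamps)):

import numpy as np
from tsaug import AddNoise, Crop, TimeWarp

X = np.random.rand(4, 300)          # 4 dummy series, 300 timestamps each

augmenter = (
    TimeWarp()                      # random speed changes along the time axis
    + AddNoise(scale=0.02)          # add Gaussian noise
    + Crop(size=250)                # random crop to length 250
)

X_aug = augmenter.augment(X)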

Unfortunately, for reasons unknown (at least to me), the library seems to be dead and has not been updated for 3 years. There are many open issues, but they are not getting any attention. Such a situation is quite sad, in my opinion, as there are not many other libraries that provide augmentation for time series data.

However, if you are looking for a library for time series data mining or augmentation and want to use something more up to date, Tsfresh may be a good choice.

OpenCV

It is a library focused on providing functions for working with image processing and computer vision. Developed by Intel, it is now open source under the Apache 2 license.

OpenCV provides a set of functions related to image and video processing, image classification, data analysis, and tracking, alongside ready-made machine learning models for working with images and video.
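A tiny, illustrative sketch of a common OpenCV flow (the file names are placeholders):

import cv2

img = cv2.imread("photo.jpg")                        # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)         # OpenCV loads images as BGR
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, threshold1=100, threshold2=200)

cv2.imwrite("edges.jpg", edges)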

If you want to read more about OpenCV, my coworker, Kamil Rzechowski, wrote an article that describes the topic quite extensively.

Geopandas

It is a library built atop pandas that aims to provide functions for working with spatial data and structures. It allows easy reading and writing of data in GeoJSON and shapefile formats, or reading data from PostGIS systems. Besides pandas, it has a lot of other dependencies on spatial data libraries like pygeos, geopy, or shapely.

The library's basic structures are (a short usage sketch follows the list):

  • GeoSeries - a column of geospatial data, such as a series of points, lines, or polygons
  • GeoDataFrame - a tabular structure holding a set of GeoSeries.
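A minimal sketch, assuming a local GeoJSON file with region polygons (the file and column names are made up):

import geopandas as gpd

regions = gpd.read_file("regions.geojson")       # placeholder file
print(regions.crs)                               # the coordinate reference system

regions["area"] = regions.geometry.area          # a simple spatial computation
regions.plot(column="area", legend=True)         # choropleth-style plot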

Utils

matplotlib

As the name suggests, it is a library for creating various types of plots ;p. Besides basic plots like line plots or histograms, matplotlib allows us to create more complex ones - 3d shapes or polar plots. Of course, it also allows us to customize things like the colors of plots or their labels.

Despite being somewhat old (released in 2003, so 20 years old at the time of writing), it is actively maintained and developed. With around 17k stars on GitHub, it has quite a community around it and is probably the second pick for anyone needing a data visualization tool.

It is well-documented and reasonably easy to grasp for a newcomer. It is also worth noting that matplotlib is used as a base for more high-level visualization libraries.

seaborn

Speaking of such higher-level visualization libraries, seaborn is an example of one. Thus, the set of functionalities provided by seaborn is similar to the one provided by matplotlib.

However, the API is more high-level and requires less boilerplate code to achieve similar results. As for other minor differences, the color palette provided by seaborn is softer, and the design of the plots is more modern and nice looking. Additionally, seaborn integrates more easily with pandas, which may be a significant advantage.

Below you can find code used to create a heatmap in matplotlib and seaborn, alongside their output plots. Imports are common.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = np.random.rand(5, 5)

fig, ax = plt.subplots()
heatmap = ax.pcolor(data, cmap=plt.cm.Blues)

ax.set_xticks(np.arange(data.shape[0])+0.5, minor=False)
ax.set_yticks(np.arange(data.shape[1])+0.5, minor=False)

ax.set_xticklabels(np.arange(1, data.shape[0]+1), minor=False)
ax.set_yticklabels(np.arange(1, data.shape[1]+1), minor=False)

plt.title("Heatmap")
plt.xlabel("X axis")
plt.ylabel("Y axis")

cbar = plt.colorbar(heatmap)

plt.show()

[Output: matplotlib heatmap]

sns.heatmap(data, cmap="Blues", annot=True)
# Set plot title and axis labels
plt.title("Heatmap")
plt.xlabel("X axis")
plt.ylabel("Y axis")
# Show plot
plt.show()

[Output: seaborn heatmap]

hydra

In every project, sooner or later, there is a need to make something a configurable value. Of course, if you are using a tool like jupyter then the matter is pretty straightforward. You can just move a desired value to the .env file - et voila it is configurable.

However, if you are building a more standard application, things are not that simple. Here Hydra shows its ugly (but quite useful) head. It is an open source tool for managing and running configurations of Python-based applications.

It is based on the OmegaConf library and, quoting from their main page: “The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line.”

What proved quite useful for me is the ability described in the quote above – hierarchical configuration. In my case, it worked pretty well and allowed clearer separation of config files.
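A minimal sketch of a Hydra-based entry point (the config layout and field names are made up for the example):

import hydra
from omegaconf import DictConfig

# conf/config.yaml (made-up example):
# db:
#   host: localhost
#   port: 5432


@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(cfg.db.host)      # values can be overridden from the CLI, e.g. db.port=5433
    print(cfg.db.port)


if __name__ == "__main__":
    main()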

coolname

Having a unique identifier for your training runs is always a good idea. If, for various reasons, you do not like UUIDs or just want your ids to be human-readable, coolname is the answer. It generates unique, word-based identifiers of various lengths, from 2 to 4 words.

As for the number of combinations, it looks more or less like this.

  • a 4-word identifier has 10^10 combinations
  • a 3-word identifier has 10^8 combinations
  • a 2-word identifier has 10^5 combinations

This number is significantly lower than in the case of UUIDs, so the probability of a collision is also higher. However, comparing the two is not the point of this text.

The vocabulary is hand-picked by the creators. Yet they described it as positive and neutral (more about this here), so you will not see an identifier like you-ugly-unwise-human-being. Of course, the library is fully open source.
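A quick sketch of the API:

from coolname import generate_slug

run_id = generate_slug(3)    # e.g. "brave-amber-falcon" - random on every call
print(run_id)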

tqdm

This library provides progress bar functionality for your application. Although having such information displayed is maybe not the most important thing you need, it is still nice to look at and check the progress made by your application during the execution of an important task.

Tqdm also uses complex algorithms to estimate the remaining time of a particular task, which may be a game changer and help you organize your time around it. Additionally, tqdm states that it has barely noticeable performance overhead - in the range of nanoseconds.

What is more, it is totally standalone and needs only python to run. Thus it will not download half of the internet to your disk.
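Using it is essentially a one-liner around any iterable:

from time import sleep
from tqdm import tqdm

for item in tqdm(range(100), desc="processing"):
    sleep(0.01)              # stand-in for real work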

jupyter-notebook (+JupyterLab)

Notebooks are a great way to share results and work on the project. Through the concept of cells, it is easy to separate different fragments and responsibilities of your code.

Additionally, the fact that the single notebook file can contain code, images, and complex text outputs (tables) together only adds to its existing advantages.

Moreover, notebooks allow running pip install inside cells and using .env files for configuration. Such an approach moves a lot of software engineering complexity out of the way.

Summary

These are all the various libraries for machine learning that I wanted to describe for you. I aimed to provide a general overview of all the libraries alongside their possible use cases enriched by a quick note of my own experience with using them. I hope that my goal was achieved and this article will deepen your knowledge of the machine learning libraries landscape.

Nevertheless, if you need machine learning consultancy, our experts from ReasonField Lab, a part of the SoftwareMill Company group, are keen to help you. Just let us know, and we will contact you.
Thank you for your time.
