Python Packages for Data Science: Towards AutoML

by Sanjeev Kapoor 11 Aug 2020

Data Science, Machine Learning (ML) and Artificial Intelligence (AI) currently dominate the IT world. This is largely due to the unprecedented amount of data generated every day by state-of-the-art IT systems, including social networking platforms, cyber-physical systems and internet-connected devices. Modern organizations are trying to take advantage of this data to increase the automation and accuracy of their processes, as well as to improve the quality and timeliness of their decisions. ML/AI is largely about processing very large datasets in order to derive insights for business process optimization and effective managerial decision making. From a technical perspective, this entails the development of data-intensive software systems that collect, transform, model, mine and visualize large amounts of data. Many programming languages and platforms support these tasks, including R, Java, Julia, JavaScript and Python. Python is a clear winner on this front, as more and more systems are built on it. It combines simplicity with versatility, which makes it appropriate for developing a broad range of data-intensive systems.

When it comes to data science, Python comes with a host of libraries that facilitate common data processing tasks, as well as the development of machine learning models. Using Python, developers can train and apply popular ML models (e.g., Decision Trees, Random Forests, Neural Networks) with a handful of simple commands. Likewise, Python provides libraries that ease complex data transformations, as well as packages that visualize large datasets. Many of these packages have been around for well over a decade and provide the basis for complex data operations. In recent years, a wave of new packages has also emerged to facilitate the implementation of the latest ML programming paradigms, such as Automated Machine Learning (AutoML). These new packages leverage Python’s legacy capabilities to open new horizons in developers’ productivity.

Refreshing the Basics

AutoML is about automating ML pipelines while optimizing them at the same time. Nevertheless, this increased automation does not mean that developers can completely dispense with conventional data operations such as the calculation of statistics and data transformations. These operations are very commonly used inside AutoML functions. Hence, traditional data science packages remain in the foreground.

Python Pandas is one of the most prominent examples of such legacy packages. Pandas is a key enabler of most data operations, such as cleaning, transforming and analysing large datasets. The most basic Pandas operations involve placing the data into a DataFrame, i.e. the Pandas tabular structure for managing data. On top of this structure, the package provides functions for calculating statistics, producing the distribution of certain attributes and performing common cleaning operations. The latter may, for example, include removing missing values and filtering rows or columns by a combination of criteria. In conjunction with the Matplotlib library, Pandas also provides the means for visualizing datasets in the form of bar plots, line charts, histograms, bubble charts and pie charts. Most importantly, Pandas eases the task of storing the transformed (e.g., cleaned) data in appropriate output formats and destinations, such as CSV files and databases.
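
The operations described above can be illustrated with a short sketch. The dataset and its column names (city, temperature, humidity) are hypothetical and chosen only for illustration. The CSV output is written to an in-memory buffer here, although a file path would work the same way.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are made up for this example.
df = pd.DataFrame(
    {
        "city": ["Athens", "Berlin", "Athens", "Paris", "Berlin"],
        "temperature": [31.0, 24.5, None, 27.0, 22.5],
        "humidity": [40, 55, 42, 50, 60],
    }
)

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
stats = df.describe()

# Distribution of a categorical attribute.
city_counts = df["city"].value_counts()

# Cleaning: drop rows with missing values, then filter rows
# by a combination of criteria.
clean = df.dropna()
warm_and_dry = clean[(clean["temperature"] > 24) & (clean["humidity"] < 56)]

# Store the cleaned data as CSV (in memory here instead of on disk).
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

With Matplotlib installed, a call such as df["temperature"].plot(kind="hist") would typically produce the corresponding histogram.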

Python data processing is largely about working with multi-dimensional arrays. The Python NumPy package (i.e. “Numerical Python”) empowers data programmers to work with arrays. For instance, it enables them to perform mathematical and logical operations on arrays, including Fourier transforms and shape manipulation. Moreover, NumPy provides the means for performing linear algebra operations, as well as random number generation. When used in conjunction with the SciPy package (i.e. “Scientific Python”), NumPy enables a rich set of operations similar to those found in popular mathematical packages like MATLAB. Hence, developers commonly use these Python libraries in their data science programs. In most cases, such programs combine NumPy with Pandas as well.
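
As a minimal sketch of the NumPy capabilities listed above, the following example touches on element-wise operations, shape manipulation, linear algebra, Fourier transforms and random number generation. All values (the small matrices, the 5 Hz test signal, the seed) are arbitrary choices made for illustration.

```python
import numpy as np

# Mathematical and logical operations on a 2x3 array.
a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
doubled = a * 2                  # element-wise arithmetic
mask = a > 2                     # element-wise logical test

# Shape manipulation: flatten the 2x3 array into 6 elements.
flat = a.reshape(-1)

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Fourier transform: find the dominant frequency of a 5 Hz sine wave
# sampled at 64 Hz for one second.
sample_rate = 64
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(sample_rate, d=1 / sample_rate)
dominant_freq = freqs[np.argmax(spectrum)]

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)
```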

AutoML in Python

AutoML environments and tools aim at automating and optimizing the data science tasks that comprise an end-to-end Machine Learning pipeline. Hence, they support a wide array of activities, ranging from data cleaning to feature engineering and automatic selection of the optimal model for the task at hand. In recent years, several Python packages for AutoML have emerged. Some of the most notable ones are:

MLBox: MLBox enables fast reading and distributed data pre-processing for ML pipelines. It supports common data science tasks like data cleaning and formatting. Moreover, it supports the development and validation of predictive models for regression and classification, including deep learning models. Furthermore, it provides the means for interpreting some models. A typical use of MLBox involves deploying it to support end-to-end ML applications, from data ingestion to model building and evaluation. Similar to other AutoML packages, MLBox offers intelligent functionalities like the removal of drift, i.e. the automatic detection and removal of features that have very different distributions in the train and test datasets. In a typical application development task, the developer would have to detect and remove drift manually. Using MLBox, this task can be automated with a dedicated class that is part of the package.
Auto-Sklearn: This package implements AutoML functionalities based on Scikit-learn. The primary use of the library is the development of efficient and optimized machine learning models. In this direction, it builds on Scikit-learn's rich set of tools for statistical modelling, which cover classification, regression, clustering and dimensionality reduction tasks. The library automates and optimizes traditional machine learning tasks, i.e. the models supported by Scikit-learn. To this end, it applies Bayesian optimization to tune the hyperparameters of the traditional Scikit-learn models. Auto-Sklearn supports XGBoost as well.
Auto-Keras: Auto-Keras complements the functionalities of Auto-Sklearn by providing support for deep learning and deep neural networks. To this end, it supports automatic Neural Architecture Search (NAS), which refers to the comparative evaluation of different neural network architectures for a given problem. The library supports intelligent search based on the “network morphism” approach. The essence of this approach lies in preserving the network's functionality while changing its architecture. It also leverages Bayesian optimization to identify the most efficient neural network architecture.
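
To make the idea of drift removal mentioned for MLBox more concrete, here is a minimal sketch of how such a check could work in principle. It is not MLBox's implementation or API. It is a NumPy-only stand-in that scores each feature by how well its values alone separate train rows from test rows, using a rank-based AUC, and drops the features whose score exceeds a threshold. The function names, the feature names and the threshold of 0.6 are all assumptions made for illustration.

```python
import numpy as np


def drift_score(train_col, test_col):
    """Score in [0, 1]: 0 means similar distributions, 1 means fully separable."""
    values = np.concatenate([train_col, test_col])
    # Rank the pooled values (1 = smallest). Ties are ignored in this sketch.
    ranks = np.empty(len(values))
    ranks[np.argsort(values)] = np.arange(1, len(values) + 1)
    n_train, n_test = len(train_col), len(test_col)
    # Mann-Whitney U statistic of the test sample; U / (n_train * n_test)
    # is the AUC of "a larger value means the row comes from the test set".
    u_test = ranks[n_train:].sum() - n_test * (n_test + 1) / 2
    auc = u_test / (n_train * n_test)
    return 2 * abs(auc - 0.5)


def drop_drifting_features(train, test, threshold=0.6):
    """Return the kept feature names and the drift score of every feature."""
    scores = {name: drift_score(train[name], test[name]) for name in train}
    kept = [name for name, score in scores.items() if score <= threshold]
    return kept, scores


# Hypothetical data: "age" follows the same distribution in both sets,
# while "timestamp" only takes later values in the test set.
rng = np.random.default_rng(seed=0)
train = {
    "age": rng.normal(40, 10, 500),
    "timestamp": rng.uniform(0, 100, 500),
}
test = {
    "age": rng.normal(40, 10, 500),
    "timestamp": rng.uniform(100, 200, 500),
}

kept, scores = drop_drifting_features(train, test)
```

In this example the timestamp feature is perfectly separable between the two sets, so it is dropped, while age is kept. An AutoML package automates this kind of check across all features, so the developer does not have to inspect each distribution by hand.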

Overall, Python provides a rich set of libraries for automating machine learning tasks, including support for traditional machine learning models and for deep learning models. AutoML packages automate application development and bring machine learning closer to users who are not experts in data science. Specifically, they batch together multiple ML tasks that are usually performed manually, while performing intelligent statistical processing functions (e.g., drift removal) that non-experts rarely know or understand. The release and wider use of these packages reinforce Python’s popularity in the data science community. If you are already working with Python, it is certainly worth spending some time learning how to use and fully leverage these AutoML packages.
