AutoX

This is the documentation of AutoX.

AutoX is a Python package: an open-source AutoML solution for tabular data.

You can jump right into the package by looking into our Quick Start.

Installation

$ git clone https://github.com/4paradigm/autox.git
$ pip install ./autox

Contents

The following chapters will explain the AutoX package in detail:

Introduction

What is AutoX?

AutoX is an efficient AutoML tool, mainly aimed at data mining competitions with tabular data. Its features include:

  • SOTA: AutoX outperforms other solutions on many datasets (see performance improvement under different tasks).

  • Easy to use: The interface design is similar to sklearn.

  • Generic & Universal: Supports tabular data, including binary classification, multi-class classification and regression problems.

  • Auto: Fully automated pipeline without human intervention.

  • Out of the box: Provides flexible modules that can be used on their own.

  • Summary of magics: Organizes and publishes magics from competitions.

What’s Included in AutoX?

AutoML Competition

autox_competition is the AutoML solution for competitions.

Demo

data          description                                                 link
Elo           Help understand customer loyalty                            autox_elo
Rossmann      Forecast sales using store, promotion, and competitor data  autox_Rossmann
Allstate      How severe is an insurance claim?                           autox_Allstate
House Prices  Predict house sales prices                                  autox_house_price
IEEE          Can you detect fraud from customer transactions?            autox_ieee
springleaf    Forecast sales using store, promotion, and competitor data  autox_springleaf
stumbleupon   Forecast sales using store, promotion, and competitor data  autox_stumbleupon
ventilator    Forecast sales using store, promotion, and competitor data  autox_ventilator
walmart       Use historical markdown data to predict store sales         autox_walmart

Pipeline
API
feature engineer
count features
class autox.autox_competition.feature_engineer.fe_count.FeatureCount

Convert categorical features into the number of occurrences.

fit(df, degree=1, target=None, df_feature_type=None, silence_cols=[], select_all=True, max_num=None)
Parameters:
  • df – dataframe, train_test.

  • degree – int, degree equal to 1 or 2.

  • target – str, target column.

  • df_feature_type – dict, {col: type of col}.

  • silence_cols

  • select_all

  • max_num

transform(df)
Parameters:

df – dataframe, train_test.

Returns:

dataframe, count features.
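
To illustrate the idea behind count features, the sketch below replaces each categorical value with the number of times it occurs. It is a dependency-free approximation and not AutoX's implementation: the helper names fit_count and transform_count, the COUNT_ prefix and the toy rows are all hypothetical, and FeatureCount itself operates on dataframes.

```python
from collections import Counter


def fit_count(rows, columns):
    """Learn a value -> occurrence-count table for each listed column."""
    return {col: Counter(row[col] for row in rows) for col in columns}


def transform_count(rows, counts):
    """Return one dict per row holding the count feature of each column."""
    return [
        {f"COUNT_{col}": table[row[col]] for col, table in counts.items()}
        for row in rows
    ]


# Toy stand-in for the concatenated train_test table (hypothetical data).
rows = [
    {"city": "A", "device": "ios"},
    {"city": "B", "device": "ios"},
    {"city": "A", "device": "android"},
    {"city": "A", "device": "ios"},
]

counts = fit_count(rows, ["city", "device"])
features = transform_count(rows, counts)
print(features[0])  # {'COUNT_city': 3, 'COUNT_device': 3}
```

Because a Counter returns 0 for keys it has never seen, values that were absent during fitting simply receive a count of 0 in this sketch.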

cross features
class autox.autox_competition.feature_engineer.fe_cross.FeatureCross(importance_type='split')

Synthetic feature formed by multiplying (crossing) two features.

fit(X, y, objective, category_cols, top_k=10, used_cols=[])
Parameters:
  • X – {array-like, sparse matrix} of shape (n_samples, n_features). Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y – array-like of shape (n_samples,). Target vector relative to X.

  • objective – str, objective equal to ‘binary’ or ‘regression’.

  • category_cols – list, column names of categorical features.

  • top_k – int, keep the top_k importance cross features, default top_k = 10.

  • used_cols – list, columns that will be used for training the model, default used_cols = [].

transform(X)
Parameters:

X – {array-like, sparse matrix} of shape (n_samples, n_features). Training vector, where n_samples is the number of samples and n_features is the number of features.

Returns:

dataframe, cross features.
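
The crossing step can be sketched without any dependencies: each pair of categorical columns is combined into one new categorical feature whose value identifies the pair of original values. This is an illustration under assumptions, not the FeatureCross code: the function name cross_features, the "a__b" naming scheme and the integer encoding are hypothetical, and the importance-based top_k selection described above is not reproduced.

```python
from itertools import combinations


def cross_features(rows, category_cols):
    """Build one crossed categorical feature for every pair of columns."""
    crossed = [dict() for _ in rows]
    for a, b in combinations(category_cols, 2):
        codes = {}  # (value_a, value_b) -> integer code, by first appearance
        for row, out in zip(rows, crossed):
            key = (row[a], row[b])
            out[f"{a}__{b}"] = codes.setdefault(key, len(codes))
    return crossed


# Hypothetical toy data with two categorical columns.
rows = [
    {"city": "A", "device": "ios"},
    {"city": "B", "device": "ios"},
    {"city": "A", "device": "android"},
    {"city": "A", "device": "ios"},
]

crossed = cross_features(rows, ["city", "device"])
print([r["city__device"] for r in crossed])  # [0, 1, 2, 0]
```

Rows 1 and 4 share the same (city, device) pair and therefore the same code, which is the information a model cannot read off either column alone.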

cumsum features
class autox.autox_competition.feature_engineer.fe_cumsum.FeatureCumsum

cumsum feature description.

denoising autoencoder features
class autox.autox_competition.feature_engineer.fe_denoising_autoencoder.FeatureDenoisingAutoencoder

Denoising autoencoder feature description.

diff features
class autox.autox_competition.feature_engineer.fe_diff.FeatureDiff

diff feature description.

dimension reduction features
exp weighted mean features
gbdt features
image to vector features
nlp features
features from other table (one to many relationship)
rank features
rolling statistics features (for time-series data)
shift features
shift features (for time-series data)
statistics features
target encoding features
time features
feature selection
adversarial validation
class autox.autox_competition.feature_selection.adversarial_validation.AdversarialValidation

Bases: object

Remove features with inconsistent distribution between train and test.

Example:

elo_AdversarialValidation_AutoX

fit(train, test, id_, target, categorical_features=[], p=0.6)
Parameters:
  • train – dataframe, the training input samples.

  • test – dataframe, the testing input samples.

  • id_ – list, columns used as IDs.

  • target – str, target column.

  • categorical_features – list, columns with categorical type.

  • p – float, threshold. If the AUC is greater than this threshold, the algorithm repeatedly removes the most important feature.

transform(df)
Parameters:

df – dataframe, dataframe needs to be transformed.

Returns:

dataframe, transformed dataframe.
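
The intuition behind the threshold p can be shown with a deliberately simplified, per-feature check. Rows are labelled by origin (train or test), each feature on its own is scored by how well it separates the two sets (AUC), and features that separate them too well are flagged. This is not the AdversarialValidation algorithm, which fits a model and removes the most important feature while the AUC stays above p. The function names and toy columns below are hypothetical.

```python
def auc_train_vs_test(train_values, test_values):
    """AUC of one feature used as a score for 'does this row come from test?'."""
    wins = 0.0
    for t in test_values:
        for s in train_values:
            if t > s:
                wins += 1.0
            elif t == s:
                wins += 0.5  # ties count as half a win
    return wins / (len(train_values) * len(test_values))


def drifting_features(train, test, p=0.6):
    """Return the features whose train/test separability exceeds p."""
    flagged = []
    for name in train:
        auc = auc_train_vs_test(train[name], test[name])
        # An AUC far below 0.5 separates the sets just as well as one far above.
        if max(auc, 1.0 - auc) > p:
            flagged.append(name)
    return flagged


# Hypothetical columns: 'amount' has the same distribution in both sets,
# while 'timestamp' is strictly larger in test.
train = {"amount": [1, 2, 3, 4], "timestamp": [10, 11, 12, 13]}
test = {"amount": [2, 3, 1, 4], "timestamp": [20, 21, 22, 23]}

print(drifting_features(train, test))  # ['timestamp']
```

An AUC close to 0.5 means the feature cannot tell train from test, so it is kept. A value close to 0 or 1 signals a distribution shift.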

GRN feature selection

Each feature weight is output according to the feature column definition.

Example:

GRN_FeatureSelection_AutoX

Metrics
Parameters:
  • y_true – array-like of shape (n_samples,) or (n_samples, n_outputs). Ground truth (correct) target values.

  • y_pred – array-like of shape (n_samples,) or (n_samples, n_outputs). Estimated target values.

  • metric – str, one of [‘mae’, ‘mape’, ‘mse’, ‘rmse’, ‘msle’, ‘rmsle’, ‘smape’], default = ‘mape’.

Returns:

the value of the selected metric.
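
For reference, the metric names listed above correspond to the following textbook definitions. The sketch is a plain-Python approximation for 1-D inputs, not the AutoX source: the function name regression_metric is hypothetical, and the SMAPE variant shown (mean of 2|t - p| / (|t| + |p|), returned as a fraction rather than a percentage) is one common convention that AutoX may or may not follow.

```python
import math


def regression_metric(y_true, y_pred, metric="mape"):
    """Compute one regression metric for two equal-length 1-D sequences."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    if metric == "mae":
        return sum(abs(t - p) for t, p in pairs) / n
    if metric == "mape":  # undefined when a true value is 0
        return sum(abs((t - p) / t) for t, p in pairs) / n
    if metric == "mse":
        return sum((t - p) ** 2 for t, p in pairs) / n
    if metric == "rmse":
        return math.sqrt(regression_metric(y_true, y_pred, "mse"))
    if metric == "msle":  # requires values greater than -1
        return sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in pairs) / n
    if metric == "rmsle":
        return math.sqrt(regression_metric(y_true, y_pred, "msle"))
    if metric == "smape":  # assumed variant, see the note above
        return sum(2 * abs(t - p) / (abs(t) + abs(p)) for t, p in pairs) / n
    raise ValueError(f"unknown metric: {metric}")


# Hypothetical values for illustration.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

for name in ["mae", "mape", "mse", "rmse", "msle", "rmsle", "smape"]:
    print(name, regression_metric(y_true, y_pred, name))
```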

Overview of feature engineer API

operation              description
count                  Count the occurrences of some categorical features within dataset.
cumsum                 The calculation of the cumulative sum.
denoising autoencoder  Train a denoising autoencoder neural network for feature extraction. reference
dimension reduction    Use dimension reduction technology for feature extraction, such as Principal Component Analysis (PCA).
gbdt                   Generating features with Gradient Boosted Decision Trees. reference
rank                   Compute numerical data ranks.
rolling                Statistics calculation within rolling windows.
shift                  Lag feature.
diff                   First difference.
statistics             Statistics calculation.
time parse feature     Parse time feature for time column, such as year, month, day, hour, dayofweek, and so on.
cross feature          Synthetic feature formed by multiplying (crossing) two or more features. reference
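
Several of the time-series operations in this overview (cumsum, shift, diff and rolling) are easiest to grasp on a short series. The following sketch uses only the standard library. The function names and the toy sales series are hypothetical, and the AutoX classes may differ in details such as grouping by ID or handling of missing values.

```python
from itertools import accumulate


def shift(values, periods=1):
    """Lag feature: the value `periods` steps earlier (None where unavailable).

    Assumes 0 <= periods <= len(values).
    """
    return [None] * periods + values[: len(values) - periods]


def diff(values, periods=1):
    """First difference: each value minus its lag (None where the lag is missing)."""
    lagged = shift(values, periods)
    return [None if prev is None else cur - prev for cur, prev in zip(values, lagged)]


def rolling_mean(values, window):
    """Mean of the trailing `window` values (None until the window is full)."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


sales = [10, 12, 15, 11, 18]  # hypothetical daily sales

cumsum = list(accumulate(sales))
print(cumsum)                  # [10, 22, 37, 48, 66]
print(shift(sales))            # [None, 10, 12, 15, 11]
print(diff(sales))             # [None, 2, 3, -4, 7]
print(rolling_mean(sales, 3))  # first two entries are None
```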

Overview of feature selection API

operation               description
Adversarial Validation  A feature selection solution for battling overfitting. reference
GRN                     A feature selection using Gated Residual Networks (GRN) and Variable Selection Networks (VSN). reference

Overview of Metrics API

operation  description
MAE        mean absolute error
MAPE       mean absolute percentage error
MSE        mean squared error
MSLE       mean squared logarithmic error
RMSLE      root mean squared logarithmic error

AutoML Server

autox_server is the AutoML solution for development.

Demo
CASE 1: Customer Loan Risk Prediction
  • description: Given a user’s basic information, consumption behavior, repayment status, etc., build an accurate model that predicts whether the user will be overdue on repayment.

  • data download link: google cloud

  • data details: link

  • autox_server training: bank_train.ipynb

  • autox_server inference: bank_test.ipynb

Pipeline
AutoML Interpreter

autox_interpreter is the AutoML solution for machine learning interpretation.

AutoX covers the following interpretable machine learning methods:

Model-based interpretation
Global interpretation
Local interpretation
Influential interpretation
Prototypes and Criticisms

Quick Start

Install AutoX

Because the compiled autox package is hosted on the Python Package Index (PyPI), you can install it with pip:

pip install automl-x -i https://www.pypi.org/simple/

or

$ git clone https://github.com/4paradigm/autox.git
$ pip install ./autox

Before boring yourself by reading the docs in detail, you can dive right into AutoX with the following examples:

Binary classification example (Transaction Prediction)

We are provided with an anonymized dataset containing numeric feature variables, the binary target column, and a string ID_code column. The task is to predict the value of the target column in the test set.

print(train.head())
print(test.head())
print(train.shape, test.shape)

We build the AutoML pipeline with AutoX as follows:

from autox import AutoX
path = '../input/santander-customer-transaction-prediction'
autox = AutoX(target = 'target', train_name = 'train.csv', test_name = 'test.csv', id = ['ID_code'], path = path)
sub = autox.get_submit()
sub.to_csv("./autox_Santander.csv", index = False)

We get a pandas.DataFrame sub which has the same number of rows as test.

print(sub.shape, test.shape)
print(sub.head())

You can execute this example with this link: santander-autox.

Regression example (House Prices)

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, we need to predict the SalePrice of each home.

print(train.head())
print(test.head())
print(train.shape, test.shape)

We build the AutoML pipeline with AutoX as follows:

from autox import AutoX
path = '../input/house-prices-advanced-regression-techniques'
autox = AutoX(target = 'SalePrice', train_name = 'train.csv', test_name = 'test.csv', id = ['Id'], path = path)
sub = autox.get_submit()
sub.to_csv("submission.csv", index = False)

We get a pandas.DataFrame sub which has the same number of rows as test.

print(sub.shape, test.shape)
print(sub.head())

You can execute this example with this link: house_price-autox.

Community

Welcome to join our community.

Achievement

performance improvement under different tasks

data_type              compare with AutoGluon  compare with H2O
binary classification  +20.44%                 +2.98%
regression             +37.54%                 +39.66%
time-series            +28.40%                 +32.46%

results comparison

data_type              single-or-multi  data_name                             metric    AutoX          AutoGluon       H2O
binary classification  single-table     Springleaf                            auc       0.78865        0.61141         0.78186
binary classification  single-table     stumbleupon                           auc       0.87177        0.81025         0.79039
binary classification  single-table     santander                             auc       0.89196        0.64643         0.88775
binary classification  multi-table      IEEE                                  accuracy  0.920809       0.724925        0.907818
regression             single-table     ventilator                            mae       0.755          8.434           4.221
regression             single-table     Allstate Claims Severity              mae       1137.07885     1173.35917      1163.12014
regression             single-table     zhidemai                              mse       1.0034         1.9466          1.1927
regression             single-table     Tabular Playground Series - Aug 2021  rmse      7.87731        10.3944         7.8895
regression             single-table     House Prices                          rmse      0.13043        0.13104         0.13161
regression             single-table     Restaurant Revenue                    rmse      2133204.32146  31913829.59876  28958013.69639
regression             multi-table      Elo Merchant Category Recommendation  rmse      3.72228        3.80801         22.88899
regression-ts          single-table     Demand Forecasting                    smape     13.79241       25.39182        18.89678
regression-ts          multi-table      Walmart Recruiting                    wmae      4660.99174     5024.16179      5128.31622
regression-ts          multi-table      Rossmann Store Sales                  RMSPE     0.13850        0.20453         0.35757

competition

  1. First place solution for Alibaba Cloud Infrastructure Supply Chain Competition

Enterprise support

  1. 值得买

  2. 慕尚

How to contribute

We want AutoX to become a leading international AutoML solution. To achieve this goal, we need your help!

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

If you want to help, just create a pull request on our GitHub page. For new users, working with Git can sometimes be confusing and frustrating. If you are not familiar with Git, you can also contact us by email.

We are looking forward to hearing from you! =)

FAQ

  1. How can I use AutoX on Windows?

    We recommend using Anaconda. After installing it, open the Anaconda Prompt, create an environment, and set up AutoX (please be aware that AutoX uses multiprocessing, which can be problematic on Windows):

    conda create -n ENV_NAME python=VERSION
    conda activate ENV_NAME
    pip install automl-x
    

Authors

Core Development Team

Contributions

  • Hengxing Cai

  • Hengwei Dai

  • Kele Xu

Acknowledgements

The research and development of AutoX was funded in part by 4paradigm.