AutoX
This is the documentation of AutoX.
AutoX is a Python package. It’s an open source AutoML solution for tabular data.
You can jump right into the package by looking into our Quick Start.
Installation
$ git clone https://github.com/4paradigm/autox.git
$ pip install ./autox
Contents
The following chapters will explain the AutoX package in detail:
Introduction
What is AutoX?
AutoX is an efficient AutoML tool, mainly aimed at data mining competitions with tabular data. Its features include:
SOTA: AutoX outperforms other solutions on many datasets (see performance improvement under different tasks).
Easy to use: The design of interfaces is similar to sklearn.
Generic & Universal: Supports tabular data for binary classification, multi-class classification and regression problems.
Auto: Fully automated pipeline without human intervention.
Out of the box: Provides flexible modules which can be used alone.
Summary of magics: Collects and publishes the tricks ("magics") used in competitions.
What’s Included in AutoX?
AutoML Competition
autox_competition is the AutoML solution for competitions.
Demo
| data | description | link |
|---|---|---|
| Elo | Help understand customer loyalty | |
| Rossmann | Forecast sales using store, promotion, and competitor data | |
| Allstate | How severe is an insurance claim? | |
| House Prices | Predict house sales prices | |
| IEEE | Can you detect fraud from customer transactions? | |
| springleaf | Forecast sales using store, promotion, and competitor data | |
| stumbleupon | Forecast sales using store, promotion, and competitor data | |
| ventilator | Forecast sales using store, promotion, and competitor data | |
| walmart | Use historical markdown data to predict store sales | |
Pipeline

API
feature engineer
count features
- class autox.autox_competition.feature_engineer.fe_count.FeatureCount[source]
Convert categorical features into the number of occurrences.
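The idea behind count features can be sketched in a few lines of pandas. This is a minimal illustration, not the FeatureCount implementation: the helper name `add_count_features` and the `<col>_count` naming scheme are assumptions made for this example.

```python
import pandas as pd


def add_count_features(df: pd.DataFrame, category_cols: list) -> pd.DataFrame:
    """Return a copy of df with one `<col>_count` column per categorical column.

    Hypothetical helper for illustration; not part of the AutoX API.
    """
    out = df.copy()
    for col in category_cols:
        # value_counts() gives the number of rows per distinct value;
        # map() broadcasts that number back onto every row.
        out[f"{col}_count"] = out[col].map(out[col].value_counts())
    return out


df = pd.DataFrame({"city": ["a", "b", "a", "c", "a"]})
result = add_count_features(df, ["city"])
print(result)
```

Rare categories end up with small counts, which gives tree models a numeric handle on how frequent a category is.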
cross features
- class autox.autox_competition.feature_engineer.fe_cross.FeatureCross(importance_type='split')[source]
A synthetic feature formed by multiplying (crossing) two features.
- fit(X, y, objective, category_cols, top_k=10, used_cols=[])[source]
- Parameters:
X – {array-like, sparse matrix} of shape (n_samples, n_features). Training vector, where n_samples is the number of samples and n_features is the number of features.
y – array-like of shape (n_samples,). Target vector relative to X.
objective – str, objective equal to ‘binary’ or ‘regression’.
category_cols – list, column names of categorical features.
top_k – int, keep the top_k importance cross features, default top_k = 10.
used_cols – list, columns used for training the model, default used_cols = [].
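To make the notion of a cross feature concrete, here is a small pandas sketch that crosses every pair of categorical columns. It is an illustration under assumptions, not the FeatureCross implementation: the helper name `add_cross_features` and the `left__right` column naming are invented here, and the importance-based `top_k` filtering described above is left out.

```python
from itertools import combinations

import pandas as pd


def add_cross_features(df: pd.DataFrame, category_cols: list) -> pd.DataFrame:
    """Add one integer-coded column per pair of categorical columns.

    Hypothetical helper for illustration; not part of the AutoX API.
    """
    out = df.copy()
    for left, right in combinations(category_cols, 2):
        # Join the two values into one key, e.g. "s1__y".
        combined = out[left].astype(str) + "__" + out[right].astype(str)
        # factorize() assigns an integer code to each distinct key,
        # numbered in order of first appearance.
        out[f"{left}__{right}"] = pd.factorize(combined)[0]
    return out


df = pd.DataFrame({
    "store": ["s1", "s1", "s2", "s2"],
    "promo": ["y", "n", "y", "y"],
})
crossed = add_cross_features(df, ["store", "promo"])
print(crossed)
```

A full pipeline would then rank the candidate crosses with a model and keep only the most important ones, which is the role `top_k` plays in the documented `fit` method.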
cumsum features
denoising autoencoder features
diff features
dimension reduction features
exp weighted mean features
gbdt features
image to vector features
nlp features
features from other table (one to many relationship)
rank features
rolling statistics features (for time-series data)
shift features
shift features (for time-series data)
statistics features
target encoding features
time features
feature selection
adversarial validation
- class autox.autox_competition.feature_selection.adversarial_validation.AdversarialValidation[source]
Bases: object
Remove features with inconsistent distribution between train and test.
- Example::
- fit(train, test, id_, target, categorical_features=[], p=0.6)[source]
- Parameters:
train – dataframe, the training input samples.
test – dataframe, the testing input samples.
id_ – list, columns used as id.
target – str, target column.
categorical_features – list, columns with categorical type.
p – float, threshold. While the AUC of the train-versus-test classifier is greater than this threshold, the algorithm keeps removing the most important feature.
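The loop described by the `p` parameter can be sketched with scikit-learn. This is an assumption-laden illustration rather than the AdversarialValidation code: the function name `adversarial_feature_filter`, the random forest classifier, and the 3-fold out-of-fold AUC are all choices made for this example, and the `id_`, `target` and `categorical_features` arguments of the real `fit` are omitted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_feature_filter(train, test, features, p=0.6, seed=0):
    """Drop features until a classifier can no longer tell train from test.

    Hypothetical helper for illustration; not part of the AutoX API.
    Returns (kept_features, removed_features).
    """
    data = pd.concat([train[features], test[features]], ignore_index=True)
    # Label 0 = row comes from train, 1 = row comes from test.
    is_test = np.r_[np.zeros(len(train), dtype=int), np.ones(len(test), dtype=int)]
    kept, removed = list(features), []
    while kept:
        model = RandomForestClassifier(
            n_estimators=50, min_samples_leaf=20, random_state=seed
        )
        # Out-of-fold probabilities give an honest estimate of the AUC.
        proba = cross_val_predict(
            model, data[kept], is_test, cv=3, method="predict_proba"
        )[:, 1]
        if roc_auc_score(is_test, proba) <= p:
            break  # train and test are no longer distinguishable enough
        # Refit on all rows to rank features, then drop the most important one.
        model.fit(data[kept], is_test)
        worst = kept[int(np.argmax(model.feature_importances_))]
        kept.remove(worst)
        removed.append(worst)
    return kept, removed


rng = np.random.default_rng(0)
train = pd.DataFrame({
    "stable": rng.normal(size=300),
    "drifting": rng.normal(0.0, 1.0, size=300),
})
test = pd.DataFrame({
    "stable": rng.normal(size=300),
    "drifting": rng.normal(3.0, 1.0, size=300),  # shifted mean in test
})
kept, removed = adversarial_feature_filter(train, test, ["stable", "drifting"])
print("kept:", kept, "removed:", removed)
```

In this toy data the `drifting` column has a different mean in train and test, so it is the first feature the loop removes.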
GRN feature selection
Each feature weight is output according to the feature column definition.
- Example::
Metrics
- Parameters:
y_true – array-like of shape (n_samples,) or (n_samples, n_outputs). Ground truth (correct) target values.
y_pred – array-like of shape (n_samples,) or (n_samples, n_outputs). Estimated target values.
metric – str, one of [‘mae’, ‘mape’, ‘mse’, ‘rmse’, ‘msle’, ‘rmsle’, ‘smape’], default = ‘mape’.
- Returns:
The value of the selected metric.
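For reference, the listed metrics can be written out with NumPy as below. This is a sketch of the standard textbook definitions, not the AutoX source: the function name `regression_metric` is hypothetical, and reporting MAPE and SMAPE as fractions rather than percentages is an assumption that may differ from what AutoX returns.

```python
import numpy as np


def regression_metric(y_true, y_pred, metric="mape"):
    """Compute one of several regression metrics (illustrative definitions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    if metric == "mae":
        return float(np.mean(np.abs(err)))
    if metric == "mape":
        # Fraction, not percent; undefined when y_true contains zeros.
        return float(np.mean(np.abs(err / y_true)))
    if metric == "mse":
        return float(np.mean(err ** 2))
    if metric == "rmse":
        return float(np.sqrt(np.mean(err ** 2)))
    if metric in ("msle", "rmsle"):
        # log1p(x) = log(1 + x); requires values greater than -1.
        log_err = np.log1p(y_pred) - np.log1p(y_true)
        msle = float(np.mean(log_err ** 2))
        return msle if metric == "msle" else float(np.sqrt(msle))
    if metric == "smape":
        # Symmetric variant: error relative to the mean magnitude of both values.
        denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
        return float(np.mean(np.abs(err) / denom))
    raise ValueError(f"unknown metric: {metric!r}")


y_true = [100, 200, 400]
y_pred = [110, 180, 400]
for name in ["mae", "mape", "mse", "rmse", "msle", "rmsle", "smape"]:
    print(name, regression_metric(y_true, y_pred, metric=name))
```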
| operation | description |
|---|---|
| count | count the occurrences of some categorical features within dataset. |
| cumsum | the calculation of the cumulative sum. |
| denoising autoencoder | train a denoising autoencoder neural network for feature extraction. reference |
| dimension reduction | use dimension reduction technology for feature extraction, such as Principal Component Analysis (PCA). |
| gbdt | Generating Features with Gradient Boosted Decision Trees. reference |
| rank | Compute numerical data ranks. |
| rolling | statistics calculation within rolling windows. |
| shift | lag feature. |
| diff | first difference. |
| statistics | statistics calculation. |
| time parse feature | parse time feature for time column, such as year, month, day, hour, dayofweek, and so on. |
| cross feature | synthetic feature formed by multiplying (crossing) two or more features. reference |
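Several of the operations in the table (cumsum, shift, diff, rolling, rank and time parsing) map directly onto pandas calls. The sketch below shows them on a toy per-store sales table; the column names and the window size are made up for this example, and the code is not taken from the AutoX feature engineering modules.

```python
import pandas as pd

# Toy time-series table: three days of sales for two stores.
sales = pd.DataFrame({
    "store": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "units": [10, 12, 9, 4, 7, 5],
})
# Time-dependent features need rows ordered by time within each group.
sales = sales.sort_values(["store", "date"]).reset_index(drop=True)
by_store = sales.groupby("store")["units"]

sales["units_cumsum"] = by_store.cumsum()        # cumulative sum per store
sales["units_shift_1"] = by_store.shift(1)       # lag feature: previous day
sales["units_diff_1"] = by_store.diff(1)         # first difference
sales["units_roll_mean_2"] = by_store.transform(
    lambda s: s.rolling(2).mean()                # mean over a 2-day window
)
sales["units_rank"] = by_store.rank(method="average")  # rank within store

# Time parse features from the timestamp column.
sales["month"] = sales["date"].dt.month
sales["dayofweek"] = sales["date"].dt.dayofweek  # Monday = 0

print(sales)
```

Grouping by `store` before shifting or rolling keeps one store's history from leaking into another's features.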
| operation | description |
|---|---|
| Adversarial Validation | a feature selection solution for battling overfitting. reference |
| GRN | a feature selection using Gated Residual Networks (GRN) and Variable Selection Networks (VSN). reference |
| operation | description |
|---|---|
| MAE | mean absolute error |
| MAPE | mean absolute percentage error |
| MSE | mean squared error |
| MSLE | mean squared logarithmic error |
| RMSLE | root mean squared logarithmic error |
AutoML Server
autox_server is the AutoML solution for development.
Demo
CASE 1: Customer Loan Risk Prediction
description: Given the user’s basic information, consumption behavior, repayment history, etc., an accurate overdue prediction model is built to predict whether the user will be overdue on a repayment.
data download link: google cloud
data details: link
autox_server training: bank_train.ipynb
autox_server inference: bank_test.ipynb
Pipeline

AutoML Interpreter
autox_interpreter is the AutoML solution for machine learning interpretation.
AutoX covers the following interpretable machine learning methods:
- Model-based interpretation
nn model interpretation, see nn_interpret
LightGBM model interpretation, see lgb_interpret
lr model interpretation, see lr_interpret
- Global interpretation
tree-based model, see global_surrogate_tree_demo
- Local interpretation
- Influential interpretation
nn, see influential_interpretation_nn
nn_sgd, see influential_interpretation_nn_sgd
- Prototypes and Criticisms
MMD-critic, see MMD_demo
ProtoDash algorithm, see ProtodashExplainer
Quick Start
Install AutoX
As the compiled autox package is hosted on the Python Package Index (PyPI), you can easily install it with pip:
pip install automl-x -i https://www.pypi.org/simple/
or
$ git clone https://github.com/4paradigm/autox.git
$ pip install ./autox
Before boring yourself by reading the docs in detail, you can dive right into AutoX with the following examples:
Binary classification example (Transaction Prediction)
We are provided with an anonymized dataset containing numeric feature variables, the binary target column, and a string ID_code column. The task is to predict the value of target column in the test set.
print(train.head())

print(test.head())

print(train.shape, test.shape)

We build the AutoML pipeline with AutoX as follows:
from autox import AutoX
path = '../input/santander-customer-transaction-prediction'
autox = AutoX(target = 'target', train_name = 'train.csv', test_name = 'test.csv', id = ['ID_code'], path = path)
sub = autox.get_submit()
sub.to_csv("./autox_Santander.csv", index = False)
We get a pandas.DataFrame sub which has the same number of rows as test.
print(sub.shape, test.shape)

print(sub.head())

You can execute this example with this link: santander-autox.
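Before uploading a submission file it can be worth checking it against the test set programmatically. The helper below is a hypothetical sketch, not part of AutoX: the name `check_submission` is invented here, and it assumes that `sub` carries the same id column(s) as `test` in the same row order, which the text above does not state explicitly.

```python
import pandas as pd


def check_submission(sub: pd.DataFrame, test: pd.DataFrame, id_cols: list) -> None:
    """Raise ValueError if sub does not line up with test row by row.

    Hypothetical helper for illustration; not part of the AutoX API.
    """
    if len(sub) != len(test):
        raise ValueError(f"row count differs: sub={len(sub)}, test={len(test)}")
    missing = [col for col in id_cols if col not in sub.columns]
    if missing:
        raise ValueError(f"id columns missing from sub: {missing}")
    # Compare ids position by position, ignoring the index labels.
    sub_ids = sub[id_cols].reset_index(drop=True)
    test_ids = test[id_cols].reset_index(drop=True)
    if not sub_ids.equals(test_ids):
        raise ValueError("ids in sub do not match ids in test")


# Toy stand-ins for the real test set and the frame returned by get_submit().
toy_test = pd.DataFrame({"ID_code": ["test_0", "test_1"], "var_0": [1.5, 2.5]})
toy_sub = pd.DataFrame({"ID_code": ["test_0", "test_1"], "target": [0.1, 0.9]})
check_submission(toy_sub, toy_test, ["ID_code"])
print("submission looks consistent")
```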
Regression example (House Prices)
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, we need to predict the SalePrice of each home.
print(train.head())

print(test.head())

print(train.shape, test.shape)

We build the AutoML pipeline with AutoX as follows:
from autox import AutoX
path = '../input/house-prices-advanced-regression-techniques'
autox = AutoX(target = 'SalePrice', train_name = 'train.csv', test_name = 'test.csv', id = ['Id'], path = path)
sub = autox.get_submit()
sub.to_csv("submission.csv", index = False)
We get a pandas.DataFrame sub which has the same number of rows as test.
print(sub.shape, test.shape)

print(sub.head())

You can execute this example with this link: house_price-autox.
Community
Welcome to join our community.

Achievement
performance improvement under different tasks
| data_type | compare with AutoGluon | compare with H2O |
|---|---|---|
| binary classification | +20.44% | +2.98% |
| regression | +37.54% | +39.66% |
| time-series | +28.40% | +32.46% |
results comparison
| data_type | single-or-multi | data_name | metric | AutoX | AutoGluon | H2O |
|---|---|---|---|---|---|---|
| binary classification | single-table | | auc | 0.78865 | 0.61141 | 0.78186 |
| binary classification | single-table | | auc | 0.87177 | 0.81025 | 0.79039 |
| binary classification | single-table | | auc | 0.89196 | 0.64643 | 0.88775 |
| binary classification | multi-table | | accuracy | 0.920809 | 0.724925 | 0.907818 |
| regression | single-table | | mae | 0.755 | 8.434 | 4.221 |
| regression | single-table | | mae | 1137.07885 | 1173.35917 | 1163.12014 |
| regression | single-table | | mse | 1.0034 | 1.9466 | 1.1927 |
| regression | single-table | | rmse | 7.87731 | 10.3944 | 7.8895 |
| regression | single-table | | rmse | 0.13043 | 0.13104 | 0.13161 |
| regression | single-table | | rmse | 2133204.32146 | 31913829.59876 | 28958013.69639 |
| regression | multi-table | | rmse | 3.72228 | 3.80801 | 22.88899 |
| regression-ts | single-table | | smape | 13.79241 | 25.39182 | 18.89678 |
| regression-ts | multi-table | | wmae | 4660.99174 | 5024.16179 | 5128.31622 |
| regression-ts | multi-table | | RMSPE | 0.13850 | 0.20453 | 0.35757 |
competition
Enterprise support
值得买
慕尚
How to contribute
We want AutoX to become a leading international AutoML solution. To achieve this goal, we need your help!
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
If you want to help, just create a pull request on our GitHub page. For new users, working with Git can be confusing and frustrating. If you are not familiar with Git, you can also contact us by email.
We are looking forward to hearing from you! =)
FAQ
How can I use AutoX on Windows?
We recommend using Anaconda. After installing it, open the Anaconda Prompt, create an environment, and set up AutoX. Be aware that AutoX uses multiprocessing, which can be problematic on Windows:
conda create -n ENV_NAME python=VERSION
activate ENV_NAME
pip install autox
Acknowledgements
The research and development of AutoX was funded in part by 4paradigm.