We provide a comprehensive benchmark framework for decision forest model inference.
The framework covers three algorithms: RandomForest, XGBoost, and LightGBM.
The framework supports most of the popular decision forest inference platforms, including Scikit-Learn, XGBoost, LightGBM, ONNX, HummingBird, TreeLite, lleaves, TensorFlow TFDF, cuML, and FIL.
The framework also supports multiple well-known workloads, including Higgs, Airline, TPCx-AI fraud detection, Fraud, Year, Bosch, Epsilon, and Criteo.
We used AWS EC2 r4.2xlarge instances for CPU platforms and g4dn.2xlarge instances for GPU platforms in our benchmark; both run Ubuntu 20.04. Our code should run well on any Ubuntu machine, but results from other types of machines should not be directly compared with the results in our paper.
We used PostgreSQL to manage data for non-netsDB platforms. Please refer to here to install it. Our code uses the default username and password; either keep the defaults, or modify the username and password in the config.json file.
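A minimal config.json sketch, assuming the credentials are stored under `username` and `password` keys (the exact field names may differ in your copy of the repo):

```json
{
  "username": "postgres",
  "password": "postgres"
}
```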
We use ConnectorX to connect to PostgreSQL. ConnectorX is a state-of-the-art PostgreSQL connector that converts relational data to dataframes for data science applications. Please refer to here to install it.
Some datasets are downloaded from Kaggle, so you need to create a Kaggle account and download the Kaggle API credentials (see here for details).
It is recommended to use conda to manage your environment because one of the required packages, TVM, is much easier to install using conda. TVM also recommends Python 3.7.x or 3.8.x, so we recommend creating a conda virtual environment with Python 3.7 or 3.8.
conda create -n [env-name] python=3.8
Then activate the virtual environment.
conda activate [env-name]
Install some useful tools.
sudo apt-get update
sudo apt-get install -y python3 python3-dev python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev
It is important to install TVM first, because it might uninstall some other packages. We provide an easy way to install TVM in the following code block, which is tested in our environment. For other installation methods and other details, please refer to here.
(Note: If you choose to build TVM with CMake, you may hit the error "collect2: fatal error: cannot find 'ld'". Try changing the linker, e.g., change '-fuse-ld=lld' to '-fuse-ld=gold' in ./CMakeFiles/tvm_runtime.dir/link.txt, ./CMakeFiles/tvm.dir/link.txt, and ./CMakeFiles/tvm_allvisible.dir/link.txt. Remember to run 'make install' from the build directory after successfully compiling TVM into shared libraries.)
git clone --recursive https://github.com/apache/tvm tvm
# Update the current conda environment with the dependencies specified by the yaml
conda env update --file conda/build-environment.yaml
# Build TVM
conda build --output-folder=conda/pkg conda/recipe
# Run conda/build_cuda.sh to build with cuda enabled
conda install tvm -c ./conda/pkg
Install Nvidia cuML (see here for instructions) to support Nvidia FIL.
Install Python packages. The command below installs all packages for our benchmark, but feel free to select a subset if you only want to run some of the frameworks.
pip install scikit-learn xgboost lightgbm pandas onnxruntime onnxruntime-gpu skl2onnx onnxmltools torch tensorflow tensorflow_decision_forests hummingbird-ml[extra] treelite treelite_runtime connectorx lleaves catboost py-xgboost-gpu pyyaml psycopg2-binary plotly
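If you installed only a subset, a quick stdlib-only sketch can report which optional packages are still missing (note the names below are Python import names, which differ from some pip package names, e.g. scikit-learn imports as sklearn and hummingbird-ml as hummingbird):

```python
import importlib.util

# Import names for a few of the benchmark's optional packages
packages = ["sklearn", "xgboost", "lightgbm", "onnxruntime", "treelite", "hummingbird"]
# find_spec returns None when a module cannot be located
missing = [p for p in packages if importlib.util.find_spec(p) is None]
print("missing:", missing)
```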
See here for installation of netsDB.
Please use the script to download most of the datasets that are supported by our benchmark framework, including: Epsilon, Fraud, Year, Bosch, Higgs, Criteo, and Airline.
The statistics of these datasets are summarized in the below table:
Dataset | NumRows | NumFeatures |
---|---|---|
Epsilon | 100K | 2000 |
Fraud | 285K | 28 |
Year | 515K | 90 |
Bosch | 1.184M | 968 |
Higgs | 11M | 28 |
Criteo | 51M | 1M |
Airline | 115M | 13 |
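For scripting convenience, the statistics above can be captured as a small lookup table (a sketch; the lowercase keys match the `-d` flag used by the scripts below):

```python
# (rows, features) per dataset, from the table above
DATASETS = {
    "epsilon": (100_000, 2_000),
    "fraud":   (285_000, 28),
    "year":    (515_000, 90),
    "bosch":   (1_184_000, 968),
    "higgs":   (11_000_000, 28),
    "criteo":  (51_000_000, 1_000_000),
    "airline": (115_000_000, 13),
}
print(DATASETS["higgs"])  # (11000000, 28)
```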
In addition, we support TPCx-AI (SF=30), which involves 131M samples with 7 features each. To prepare the TPCx-AI dataset, you need to follow the instructions below:
- Tool Download Link: TPCxAI Tool
- Documentation Link: TPCxAI Documentation
- Once downloaded, open the file setenv.sh in the root folder and find the environment variable TPCxAI_SCALE_FACTOR.
- Based on the required size, change the value of the Scale Factor. This value determines the size of the generated datasets across all 10 use cases that TPCx-AI supports (for more details on the use cases, check the Documentation).

Scale Factor | Size |
---|---|
1 | 1GB |
3 | 3GB |
10 | 10GB |
30 | 30GB |
100 | 100GB |
... | ... |
10,000 | 10TB |

TPCx-AI supports Scale Factors of the form (1|3)*10^x up to 10,000 (i.e., 1, 3, 10, 30, 100, 300, ..., 10,000).
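The valid scale factors can be enumerated with a short sketch:

```python
def valid_scale_factors(limit=10_000):
    """Enumerate the TPCx-AI scale factors of the form (1|3)*10^x up to `limit`."""
    factors, power = [], 1
    while power <= limit:
        for base in (1, 3):
            if base * power <= limit:
                factors.append(base * power)
        power *= 10
    return factors

print(valid_scale_factors())  # [1, 3, 10, 30, 100, 300, 1000, 3000, 10000]
```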
- Once the value is set, save and close the file.
- Run the file TPCx-AI_Benchmarkrun.sh. It takes a while depending on the Scale Factor.
- Once done, the generated datasets should be available at [tool_root_dir]/output/data/
Datasets: higgs, bosch, etc. See details here.
Models: xgboost, randomforest, lightgbm
Frameworks:
- Sklearn
- ONNXCPU
- TreeLite
- HummingbirdPytorchCPU
- HummingbirdTorchScriptCPU
- HummingbirdTVMCPU
- LightGBM
- TFDF
- Lleaves
- HummingbirdPytorchGPU
- HummingbirdTorchScriptGPU
- ONNXGPU
- HummingbirdTVMGPU
- NvidiaFILGPU
- XGBoostGPU
To run a certain experiment:
python data_processing.py -d [dataset]
python train_model.py -d [dataset] -m [model] -t [max-num-trees] -D [max-tree-depth]
python convert_trained_model_to_framework.py -d [dataset] -m [model] -f [frameworks-separated-by-comma] -t [max-num-trees] -D [max-tree-depth]
python test_model.py -d [dataset] -m [model] -f [framework] --batch_size [batch-size] --query_size [query-size] -t [max-num-trees] -D [max-tree-depth] --threads [num-of-threads]
Some arguments are optional. Their default values are:
- -t: 10
- -D: 8
- --threads: -1 (use all threads/cores)

Except for TF-DF, all other platforms should have batch-size equal to query-size.
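The --threads convention can be illustrated with a small helper (resolve_threads is our illustration, not part of the benchmark code):

```python
import os

def resolve_threads(threads=-1):
    # --threads -1 means "use all available threads/cores"
    return os.cpu_count() if threads == -1 else threads
```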
Here is an example to run xgboost on higgs:
python data_processing.py -d higgs
python train_model.py -d higgs -m xgboost
python convert_trained_model_to_framework.py -d higgs -m xgboost -f onnx,treelite,lleaves,netsdb
python test_model.py -d higgs -m xgboost -f ONNXCPU --batch_size 100000 --query_size 100000
python test_model.py -d higgs -m xgboost -f TreeLite --batch_size 100000 --query_size 100000
or modify and run run_test.sh
nohup bash run.sh &> ./results/test_output.txt &
Add the --threads argument to python test_model.py, for example:
python test_model.py -d higgs -m xgboost -f TreeLite --batch_size 100000 --query_size 100000 --num_trees 10 --threads 1
To run Yggdrasil, which implements the QuickScorer algorithm, first download the binaries from https://github.com/google/yggdrasil-decision-forests/releases to a separate directory and unzip them. Next, put the dataset and model in the right place. Yggdrasil requires the dataset to have a header to generate metadata; you don't need to care much about its contents, but you should manually add a header as the first line of the dataset. For example, to add a header to the fraud dataset, run:
sed -i '1i feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,feature_20,feature_21,feature_22,feature_23,feature_24,feature_25,feature_26,feature_27,label' datasets/creditcard_test.csv
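Instead of typing the 29-column header by hand, the same line can be generated, e.g.:

```python
# Build the header line for the 28-feature fraud dataset (creditcard_test.csv)
header = ",".join([f"feature_{i}" for i in range(28)] + ["label"])
print(header)
```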
Then run the benchmark:
./benchmark_inference --dataset=csv:datasets/creditcard_test.csv --model=models/fraud_xgboost_500_6_tfdf/assets/ --generic=false --num_runs=1 --batch_size=56962
First, compile the code by running scons libDFTest.
Make sure to start the cluster:
./scripts/cleanupNode.sh
./scripts/startPseudoCluster.py [num-of-threads] [shared-memory-size]
Then run UDF-centric:
NETSDB_ROOT=[root-path-for-model-and-data]
bin/testDecisionForest Y [row-number] [column-number] [block-size] [label-column-index] F A [page-size] [num-of-partition] [dataset-path] [netsdb-model-path] [model] [missing] [task-type]
bin/testDecisionForest N [row-number] [column-number] [block-size] [label-column-index] F A [page-size] [num-of-partition] [dataset-path] [netsdb-model-path] [model] [missing] [task-type]
Or run Rel-centric:
NETSDB_ROOT=[root-path-for-model-and-data]
bin/testDecisionForestWithCrossProduct Y [row-number] [column-number] [block-size] [label-column-index] [page-size] [num-of-partitions] [dataset-path] [netsdb-model-path] [model] [tree-number] [missing] [task-type]
bin/testDecisionForestWithCrossProduct N [row-number] [column-number] [block-size] [label-column-index] [page-size] [num-of-partitions] [dataset-path] [netsdb-model-path] [model] [tree-number] [missing] [task-type]
Our configurations to run netsDB experiments are shown in run_netsdb.sh.
Here is an example. To run LightGBM model on the Epsilon dataset:
NETSDB_ROOT='..'
./scripts/cleanupNode.sh
./scripts/startPseudoCluster.py 8 30000
bin/testDecisionForest Y 100000 2000 5000 0 F A 42 1 $NETSDB_ROOT/dataset/epsilon_test.csv $NETSDB_ROOT/models/epsilon_lightgbm_10_8_netsdb LightGBM withoutMissing classification
bin/testDecisionForest N 100000 2000 5000 0 F A 42 1 $NETSDB_ROOT/dataset/epsilon_test.csv $NETSDB_ROOT/models/epsilon_lightgbm_10_8_netsdb LightGBM withoutMissing classification
bin/testDecisionForestWithCrossProduct Y 100000 2000 5000 0 42 1 $NETSDB_ROOT/dataset/epsilon_test.csv $NETSDB_ROOT/models/epsilon_lightgbm_10_8_netsdb LightGBM 10 withoutMissing classification
bin/testDecisionForestWithCrossProduct N 100000 2000 5000 0 42 1 $NETSDB_ROOT/dataset/epsilon_test.csv $NETSDB_ROOT/models/epsilon_lightgbm_10_8_netsdb LightGBM 10 withoutMissing classification
echo "CPU Usage: $((100 - $(vmstat 1 2 | tail -1 | awk '{print $15}')))%"