This recipe shows how to use the XGBoost library with Apache Spark on AWS Elastic MapReduce (EMR) 5.10.
First, SSH into an EMR EC2 instance, then follow the steps below.
Do not install cmake from yum, as the version in the yum repository is out of date (2.8 as opposed to 3.3+). Instead, build it from source:
wget https://cmake.org/files/v3.10/cmake-3.10.0.tar.gz
tar xzf cmake-3.10.0.tar.gz
cd cmake-3.10.0
./bootstrap --prefix=/usr
make
sudo make install
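Before moving on, it can help to confirm that the freshly built cmake meets the 3.3+ requirement. The following is a minimal sketch, not part of the original recipe: `version_ge` and `installed_cmake_version` are hypothetical helper names, and the comparison assumes `sort -V` from GNU coreutils is available.

```shell
#!/usr/bin/env bash
# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Assumes GNU coreutils `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: reads the first line of `cmake --version`,
# which has the form "cmake version 3.10.0", and prints the third field.
installed_cmake_version() {
    cmake --version 2>/dev/null | head -n 1 | awk '{print $3}'
}

required="3.3"
current="$(installed_cmake_version)"
if [ -z "$current" ]; then
    echo "cmake was not found on PATH"
elif version_ge "$current" "$required"; then
    echo "cmake $current satisfies the $required+ requirement"
else
    echo "cmake $current is older than $required"
fi
```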
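Next, install Maven. Writing to /etc/yum.repos.d requires root, so run these commands with sudo:
sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
sudo sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo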
sudo yum install -y apache-maven
mvn --version
sudo yum -y install git
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j4
Set the JAVA_HOME environment variable:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk.x86_64
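Since xgboost4j is a JNI binding, the Maven build presumably needs the JDK headers under JAVA_HOME. A quick sanity check could look like the sketch below. It is an illustration only: `check_java_home` is a hypothetical helper, and it assumes that finding `include/jni.h` is enough to show JAVA_HOME points at a full JDK rather than a JRE.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the given directory looks like a JDK,
# i.e. it is non-empty and contains the JNI header include/jni.h.
check_java_home() {
    local home="$1"
    [ -n "$home" ] && [ -f "$home/include/jni.h" ]
}

if check_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks like a JDK: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no include/jni.h: '${JAVA_HOME:-}'"
fi
```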
Build the JVM packages with Maven:
cd jvm-packages
mvn package -DskipTests
After the previous steps, there should be 8 jars built. List them all with:
find . -name "*.jar"
You will need xgboost4j-0.7-jar-with-dependencies.jar and xgboost4j-spark-0.7-jar-with-dependencies.jar.
Upload both jars to S3 with the aws s3 cp command.
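The upload step could be scripted roughly as follows. This is a sketch under stated assumptions: the bucket `s3://my-bucket/jars` is a placeholder, `upload_jar` is a hypothetical helper, and the script only prints the `aws s3 cp` commands unless `DRY_RUN=0` is set. Run it from the xgboost/jvm-packages directory.

```shell
#!/usr/bin/env bash
# Assumption: BUCKET is a placeholder S3 location; replace it with your own.
BUCKET="${BUCKET:-s3://my-bucket/jars}"
# By default the commands are only printed. Set DRY_RUN=0 to really upload.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: upload (or print the upload command for) one jar.
upload_jar() {
    local jar="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "aws s3 cp $jar $BUCKET/"
    else
        aws s3 cp "$jar" "$BUCKET/"
    fi
}

# Select only the two jars named in this recipe from the build output.
find . \( -name "xgboost4j-0.7-jar-with-dependencies.jar" \
       -o -name "xgboost4j-spark-0.7-jar-with-dependencies.jar" \) |
while read -r jar; do
    upload_jar "$jar"
done
```

Keeping the dry run as the default makes it easy to review the exact commands before anything is written to S3.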
To use the jars in an sbt project, first copy the 2 jars into *project*/lib.
Then add the following lines to the build.sbt file:
val xgboostSparkPath = "file://" + new File(".").getAbsolutePath + "/lib/xgboost4j-spark-0.7-jar-with-dependencies.jar"
val xgboostPath = "file://" + new File(".").getAbsolutePath + "/lib/xgboost4j-0.7-jar-with-dependencies.jar"
retrieveManaged := true
libraryDependencies ++= Seq(
"ml.dmlc" % "xgboost4j" % "0.7" % "provided" from xgboostPath,
"ml.dmlc" % "xgboost4j-spark" % "0.7" % "provided" from xgboostSparkPath
)
After successfully executing all the previous steps, you can use XGBoost on an EMR Spark cluster.
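Because the sbt dependencies above are marked "provided", they are not bundled into the application jar, so the two XGBoost jars would likely need to be supplied at submit time. The sketch below prints one possible `spark-submit` invocation using the `--jars` option. The main class `com.example.TrainJob`, the application jar path, and the `build_submit_cmd` helper are all hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Assumption: the two XGBoost jars sit in ./lib, as in the sbt setup above.
LIB_DIR="${LIB_DIR:-$PWD/lib}"
XGB_JARS="$LIB_DIR/xgboost4j-0.7-jar-with-dependencies.jar,$LIB_DIR/xgboost4j-spark-0.7-jar-with-dependencies.jar"

# Hypothetical helper: print a spark-submit command for the given
# main class ($1) and application jar ($2). Nothing is executed.
build_submit_cmd() {
    echo spark-submit --master yarn --deploy-mode cluster \
        --jars "$XGB_JARS" --class "$1" "$2"
}

# Placeholder class and jar names; replace them with your project's values.
build_submit_cmd com.example.TrainJob target/scala-2.11/my-app_2.11-0.1.jar
```

Printing the command rather than running it keeps the sketch safe to try anywhere. Copy the printed line to the EMR master node to launch the job.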