How to Install and Use Apache PredictionIO for Machine Learning on CentOS

Traditional approaches to data analysis become impractical once datasets reach a certain size. A modern alternative for analyzing huge data sets is machine learning, which can produce accurate results when a fast and efficient algorithm is used.

Apache PredictionIO is an open source machine learning server used to create predictive engines for any machine learning task. It shortens the time it takes to bring a machine learning application from lab to production by providing customizable engine templates that can be built and deployed quickly. It supplies the data collection and serving components and abstracts the underlying technology behind an API, allowing developers to focus on the transformation components. Once a PredictionIO engine server is deployed as a web service, it can respond to dynamic queries in real time.

Apache PredictionIO consists of the following components:

  • PredictionIO Platform: An open source machine learning stack built on top of state-of-the-art open source applications such as Apache Spark, Apache Hadoop, Apache HBase and Elasticsearch.
  • Event Server: This continuously gathers data from your web server or mobile application server in real-time mode or batch mode. The gathered data can be used to train the engine or to provide a unified view for data analysis. The event server uses Apache HBase to store the data.
  • Engine Server: The engine server is responsible for making the actual predictions. It reads the training data from the data store and uses one or more machine learning algorithms to build the predictive models. Once deployed as a web service, an engine responds to queries made by a web or mobile app through a REST API or SDK (see the sketch after this list).
  • Template Gallery: This gallery offers various types of pre-built engine templates. You can choose a template which is similar to your use case and modify it according to your requirements.
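
To make the interaction between these components concrete, the following is a minimal sketch of how an application would talk to the two servers using the PredictionIO Python SDK, which is installed later in this tutorial. The access key, event name and query fields are illustrative assumptions based on the E-Commerce Recommendation template used below.

import predictionio

# Event server (default port 7070): record user activity.
event_client = predictionio.EventClient(
    access_key="YOUR_ACCESS_KEY",  # generated later with `pio app new`
    url="http://localhost:7070"
)
event_client.create_event(
    event="view",                  # illustrative event name
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="i1"
)

# Engine server (default port 8000): query the deployed engine.
engine_client = predictionio.EngineClient(url="http://localhost:8000")
print(engine_client.send_query({"user": "u1", "num": 5}))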

Prerequisites

  • A Rcs CentOS 7 server instance with at least 8GB of RAM. For testing and development purposes, you can choose an instance with 4GB of RAM and 4GB of swap memory.
  • A sudo user.

In this tutorial, we will use 192.0.2.1 as the public IP address of the server. Replace all occurrences of 192.0.2.1 with your Rcs public IP address.

Update your base system using the guide How to Update CentOS 7. Once your system has been updated, proceed to install Java.

Install Java

Many of PredictionIO's components require JDK (Java Development Kit) version 8 to work. PredictionIO supports both OpenJDK and Oracle Java. In this tutorial, we will install OpenJDK version 8.

OpenJDK can be easily installed, as the package is available in the default YUM repository.

sudo yum -y install java-1.8.0-openjdk-devel

Verify Java's version to ensure it was installed correctly.

java -version

You will get a similar output.

[user@vultr ~]$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

Before we can proceed further, we need to set up the JAVA_HOME and JRE_HOME environment variables. Find the absolute path of the java executable on your system.

readlink -f $(which java)

You will see a similar output.

[user@vultr ~]$ readlink -f $(which java)
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64/jre/bin/java

Now, set the JAVA_HOME and JRE_HOME environment variables according to the path of the Java directory.

echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64" >> ~/.bash_profile
echo "export JRE_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64/jre" >> ~/.bash_profile

Source the bash_profile file so the variables take effect in the current session.

source ~/.bash_profile

Now you can run the echo $JAVA_HOME command to check if the environment variable is set.

[user@vultr ~]$ echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64

Install PredictionIO

Apache provides PredictionIO source files which can be downloaded and compiled locally. Create a new temporary directory in which to download and compile the source files.

mkdir /tmp/pio_sourcefiles && cd /tmp/pio_sourcefiles

Download the PredictionIO source file archive from any Apache mirror site.

wget http://apache.mirror.vexxhost.com/incubator/predictionio/0.12.0-incubating/apache-predictionio-0.12.0-incubating.tar.gz

Extract the archive and compile the source to create a distribution of PredictionIO.

tar xf apache-predictionio-0.12.0-incubating.tar.gz
./make-distribution.sh

The distribution will be built against the default versions of the dependencies: Scala 2.11.8, Spark 2.1.1, Hadoop 2.7.3 and Elasticsearch 5.5.2. Wait for the build to finish; it will take around ten minutes, depending on your system's performance.

Note: You are free to use the latest supported versions of the dependencies, but you may see some warnings during the build as some functions might be deprecated. Run ./make-distribution.sh -Dscala.version=2.11.11 -Dspark.version=2.1.2 -Dhadoop.version=2.7.4 -Delasticsearch.version=5.5.3, replacing the version numbers according to your choice.

Once the build successfully finishes, you will see the following message at the end.

...
PredictionIO-0.12.0-incubating/python/pypio/__init__.py
PredictionIO-0.12.0-incubating/python/pypio/utils.py
PredictionIO-0.12.0-incubating/python/pypio/shell.py
PredictionIO binary distribution created at PredictionIO-0.12.0-incubating.tar.gz

The PredictionIO binary files will be saved in the PredictionIO-0.12.0-incubating.tar.gz archive. Extract the archive into the /opt directory and give ownership to the current user.

sudo tar xf PredictionIO-0.12.0-incubating.tar.gz -C /opt/
sudo chown -R $USER:$USER /opt/PredictionIO-0.12.0-incubating

Set the PIO_HOME environment variable.

echo "export PIO_HOME=/opt/PredictionIO-0.12.0-incubating" >> ~/.bash_profile
source ~/.bash_profile

Install Required Dependencies

Create a new directory in which to install PredictionIO's dependencies, such as HBase, Spark and Elasticsearch.

mkdir /opt/PredictionIO-0.12.0-incubating/vendors

Download Scala version 2.11.8 and extract it into the vendors directory.

wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
tar xf scala-2.11.8.tgz -C /opt/PredictionIO-0.12.0-incubating/vendors

Download Apache Hadoop version 2.7.3 and extract it into the vendors directory.

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xf hadoop-2.7.3.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors

Apache Spark is the default processing engine for PredictionIO. Download Spark version 2.1.1 and extract it into the vendors directory.

wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
tar xf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/PredictionIO-0.12.0-incubating/vendors

Download Elasticsearch version 5.5.2 and extract it into the vendors directory.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.2.tar.gz
tar xf elasticsearch-5.5.2.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors

Finally, download HBase version 1.2.6 and extract it into the vendors directory.

wget https://archive.apache.org/dist/hbase/stable/hbase-1.2.6-bin.tar.gz
tar xf hbase-1.2.6-bin.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors

Open the hbase-site.xml configuration file to configure HBase to work in a standalone environment.

nano /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/conf/hbase-site.xml

Find the empty configuration block and replace it with the following configuration.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/zookeeper</value>
  </property>
</configuration>

The data directory will be created automatically by HBase. Edit the HBase environment file to set the JAVA_HOME path.

nano /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/conf/hbase-env.sh

Uncomment line number 27 and set JAVA_HOME to the jre directory of your Java installation. You can find the path to the java executable using the readlink -f $(which java) command.

# The java implementation to use.  Java 1.7+ required.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64/jre

Also, comment out line numbers 46 and 47, as they are not required for Java 8.

# Configure PermSize. Only needed in JDK7. You can safely remove it for JDK8+
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"

Configure the PredictionIO Environment

The default configuration in the PredictionIO environment file pio-env.sh assumes that we are using PostgreSQL or MySQL. Since we are using HBase and Elasticsearch, we will need to modify nearly every configuration in the file. It's best to back up the existing file and create a new one.

mv /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh.bak

Now create a new file for PredictionIO environment configuration.

nano /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh

Populate the file with the following configuration.

# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
SPARK_HOME=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.7

# POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
# MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

# ES_CONF_DIR: You must configure this if you have advanced configuration for
#              your Elasticsearch setup.
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-5.5.2/config

# HADOOP_CONF_DIR: You must configure this if you intend to run PredictionIO
#                  with Hadoop 2.
HADOOP_CONF_DIR=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.7/conf

# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO
#                 with HBase on a remote cluster.
HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.2.6/conf

# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
#
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
#
# For more information on storage configuration please refer to
# http://predictionio.incubator.apache.org/system/anotherdatastore/

# Storage Repositories

# The default is PostgreSQL; this setup uses Elasticsearch, HBase and the local filesystem instead
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

# Storage Data Sources

# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

# MySQL Example
# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio

# Elasticsearch Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=pio
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.5.2

# Optional basic HTTP auth
# PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
# Elasticsearch 1.x Example
# PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
# PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=<elasticsearch_cluster_name>
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-1.7.6

# Local File System Example
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models

# HBase Example
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase-1.2.6

# AWS S3 Example
# PIO_STORAGE_SOURCES_S3_TYPE=s3
# PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio_bucket
# PIO_STORAGE_SOURCES_S3_BASE_PATH=pio_model

Save the file and exit from the editor.

Open the Elasticsearch configuration file.

nano /opt/PredictionIO-0.12.0-incubating/vendors/elasticsearch-5.5.2/config/elasticsearch.yml

Uncomment the cluster.name line and set it to exactly the same value as in the PredictionIO environment file, which was pio in the configuration above.

# Use a descriptive name for your cluster:
#
cluster.name: pio

Now add the $PIO_HOME/bin directory to the PATH variable so that the PredictionIO executables can be run directly.

echo "export PATH=$PATH:$PIO_HOME/bin" >> ~/.bash_profile
source ~/.bash_profile

At this point, PredictionIO is successfully installed on your server.

Starting PredictionIO

You can start all of the PredictionIO services, such as Elasticsearch, HBase and the event server, using a single command.

pio-start-all

You will see the following output.

[user@vultr ~]$ pio-start-all
Starting Elasticsearch...
Starting HBase...
starting master, logging to /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/bin/../logs/hbase-user-master-vultr.guest.out
Waiting 10 seconds for Storage Repositories to fully initialize...
Starting PredictionIO Event Server...

Use the following command to check the status of the PredictionIO server.

pio status

You will see the following output.

[user@vultr ~]$ pio status
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.0-incubating is installed at /opt/PredictionIO-0.12.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at /opt/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.7
[INFO] [Management$] Apache Spark 2.1.1 detected (meets minimum requirement of 1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: LOCALFS)...
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[INFO] [Storage$] Test writing to Event Store (App Id 0)...
[INFO] [HBLEvents] The namespace pio_event doesn't exist yet. Creating now...
[INFO] [HBLEvents] The table pio_event:events_0 doesn't exist yet. Creating now...
[INFO] [HBLEvents] Removing table pio_event:events_0...
[INFO] [Management$] Your system is all ready to go.

As the messages above indicate, our system is ready for implementing an engine template and predicting data.
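
In addition to pio status, you can spot-check the running services directly over HTTP. The following is a minimal sketch, assuming the Python requests library is available (it can be installed with sudo pip install requests).

import requests

# Elasticsearch listens on port 9200; the response includes the
# cluster_name ("pio") configured in elasticsearch.yml.
print(requests.get("http://localhost:9200").json())

# The PredictionIO event server listens on port 7070 and reports
# {"status": "alive"} when it is healthy.
print(requests.get("http://localhost:7070").json())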

Implementing an Engine Template

Several ready-to-use engine templates are available in the PredictionIO Template Gallery and can be easily installed on the PredictionIO server. You are free to browse the list of engine templates to find one that is close to your requirements, or you can write your own engine.

In this tutorial, we will implement the E-Commerce Recommendation engine template to demonstrate the functionality of the PredictionIO server using some sample data. This engine template provides personalized recommendations to the users of an e-commerce website. By default, it can exclude out-of-stock items and provide recommendations to users who sign up after the model is trained. Also by default, the engine template takes a user's view and buy events, items with categories and properties, and a list of unavailable items as input. Once the engine has been trained and deployed, you can send a query with a user id and the number of items to recommend. The generated output will be a ranked list of recommended item ids.

Install Git, as it will be used to clone the repository.

cd ~    
sudo yum -y install git

Clone the E-Commerce Recommender engine template on your system.

git clone https://github.com/apache/incubator-predictionio-template-ecom-recommender.git MyEComRecomm  

Create a new application for the E-Commerce Recommendation engine template. Each application in PredictionIO stores the data for a separate website; if you have multiple websites, create a separate app for each so their data is kept apart. You are free to choose any name for your application.

cd MyEComRecomm/
pio app new myecom

You will see the following output.

[user@vultr MyEComRecomm]$ pio app new myecom
[INFO] [HBLEvents] The table pio_event:events_1 doesn't exist yet. Creating now...
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [Pio$] Created a new app:
[INFO] [Pio$]       Name: myecom
[INFO] [Pio$]         ID: 1
[INFO] [Pio$] Access Key: a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t

The output above also contains the access key, which will be used to authenticate the application when it sends input data to the event server.

You can always find the access key, along with the list of available applications, by running the following command.

pio app list

You will see the following output containing a list of applications and the access key.

[user@vultr MyEComRecomm]$ pio app list
[INFO] [Pio$]                 Name |   ID |                                                       Access Key | Allowed Event(s)
[INFO] [Pio$]               myecom |    1 | a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t | (all)
[INFO] [Pio$] Finished listing 1 app(s).

Now that we have created a new application, we will add some data to it. In a production environment, you would send data to the event server automatically by integrating the event server API into your application. To learn how PredictionIO works, we will import some sample data instead. The engine template provides a Python script which can be used to import the sample data into the event server.

Install Python pip.

sudo yum -y install python-pip
sudo pip install --upgrade pip

Install the PredictionIO Python SDK using pip.

sudo pip install predictionio
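
The same SDK is what you would embed in your own application to send live events to the event server, as mentioned above. The following is a minimal sketch, assuming the event server is running on its default port 7070; replace the access key placeholder with the key generated for your app. The item id and category names are illustrative.

import predictionio

# Connect to the event server using the app's access key.
client = predictionio.EventClient(
    access_key="YOUR_ACCESS_KEY",
    url="http://localhost:7070"
)

# Register an item and its categories using a special $set event.
client.create_event(
    event="$set",
    entity_type="item",
    entity_id="i100",
    properties={"categories": ["c1", "c2"]}
)

# Record that user u1 viewed item i100.
client.create_event(
    event="view",
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="i100"
)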

Run the Python script to add the sample data to the event server.

python data/import_eventserver.py --access_key a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t

Make sure to replace the access key with your actual access key. You will see a similar output.

[user@vultr MyEComRecomm]$ python data/import_eventserver.py --access_key a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t
Namespace(access_key='a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t', url='http://localhost:7070')
{u'status': u'alive'}
Importing data...
('Set user', 'u1')
('Set user', 'u2')

...

('User', 'u10', 'buys item', 'i30')
('User', 'u10', 'views item', 'i40')
('User', 'u10', 'buys item', 'i40')
204 events are imported.

The script imports 10 users, 50 items in 6 categories, and some random view and buy events. To check whether the events were imported, run the following query.

curl -i -X GET "http://localhost:7070/events.json?accessKey=a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t"

The output will show you the list of all the imported events in JSON format.

Now, open the engine.json file in an editor. This file contains the configuration of the engine.

nano engine.json

Find both occurrences of appName and replace their values with the actual name of the app you created earlier.

{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.example.ecommercerecommendation.ECommerceRecommendationEngine",
  "datasource": {
    "params" : {
      "appName": "myecom"
    }
  },
  "algorithms": [
    {
      "name": "ecomm",
      "params": {
        "appName": "myecom",
        "unseenOnly": true,
        "seenEvents": ["buy", "view"],
        "similarEvents": ["view"],
        "rank": 10,
        "numIterations" : 20,
        "lambda": 0.01,
        "seed": 3
      }
    }
  ]
}

Build the application.

pio build --verbose

If you do not want to see the log messages, remove the --verbose option. Building the engine template for the first time will take a few minutes. You will see similar output when the build finishes successfully.

[user@vultr MyEComRecomm]$ pio build --verbose
[INFO] [Engine$] Using command '/opt/PredictionIO-0.12.0-incubating/sbt/sbt' at /home/user/MyEComRecomm to build.

...

[INFO] [Engine$] Build finished successfully.
[INFO] [Pio$] Your engine is ready for training.

Now train the engine. During training, the engine analyzes the data set and trains itself according to the provided algorithm.

pio train

Before we deploy the application, we need to open port 8000 so that the status of the application can be viewed in a web browser. The websites and applications using the engine will also send and receive their queries through this port.

sudo firewall-cmd --zone=public --permanent --add-port=8000/tcp
sudo firewall-cmd --reload

Now you can deploy the PredictionIO engine.

pio deploy

The above command will deploy the engine and the built-in web server on port 8000 to respond to the queries from the e-commerce websites and applications. You will see the following output at the end once the engine is successfully deployed.

[INFO] [HttpListener] Bound to /0.0.0.0:8000
[INFO] [MasterActor] Engine is deployed and running. Engine API is live at http://0.0.0.0:8000.

You can verify the status of the engine by going to http://192.0.2.1:8000 using any modern browser. Make sure that you replace 192.0.2.1 with your actual Rcs IP address.

This signifies that the E-Commerce Recommendation engine template is deployed and running successfully. You can query the engine to fetch five recommendations for user u5 by running the following query in a new terminal session.

curl -H "Content-Type: application/json" \
-d '{ "user": "u5", "num": 5 }' \
http://localhost:8000/queries.json

You will see the generated recommendations for user u5.

[user@vultr ~]$ curl -H "Content-Type: application/json" \
> -d '{ "user": "u5", "num": 5 }' \
> http://localhost:8000/queries.json
{"itemScores":[{"item":"i25","score":0.9985169366745619},{"item":"i10","score":0.996613946803819},{"item":"i27","score":0.996613946803819},{"item":"i17","score":0.9962796867639341},{"item":"i8","score":0.9955868705972656}]}

Wrapping Up

Congratulations, Apache PredictionIO has been successfully deployed on your server. You can now use the event server API to import data into the engine and query it for predicted recommendations. If you want, you can try other templates from the Template Gallery. Be sure to check out the Universal Recommender engine template, which can be used in almost all use cases, including e-commerce, news and video.
