Architecture & DesignMerging Database Tables with RapidMiner Tutorial

Merging Database Tables with RapidMiner Tutorial

RapidMiner is an open source data science software for enterprise. It provides a Graphical User Interface and offers a broad range of features including support for processing Big Data on Apache Hadoop with the Radoop client software. It is one of the most commonly used data science tools, according to the Rexer Analytics annual Data Miner Survey. RapidMiner offers all the data science features required by a data scientist, but we shall discuss only one of the features: merging database tables.

A common challenge in data preparation is that team members collect data sets separately and data sets have to be merged before data may be analyzed. Before data is merged, it may reside in multiple database tables. In this tutorial, we will discuss how data in two different database tables may be merged into a single database table using RapidMiner. We will use MySQL database tables as an example in this tutorial. The underlying concepts discussed in this tutorial apply to any type of database supported by Rapid Miner. This article has the following sections:

Setting the Environment

Download and install RapidMiner Studio 7.3.x (or a later version) from https://my.RapidMiner.com/nexus/account/index.html#downloads. We shall use MySQL database for example tables to merge. Download and install MySQL database from https://dev.mysql.com/downloads/.

RapidMiner Process Operators for Merging Database Tables

RapidMiner is based on using RapidMiner operators in a RapidMiner “process.” Some of the operators are specialized for specific types of input and output. For a database, the following (Table 1) operators are provided to read, write, and update a database table.

Operator Description
Read Database Reads an ExampleSet from SQL relational database. Stores data in memory.
Write Database Writes an ExampleSet to a SQL database.
Stream Database Reads an ExampleSet from SQL database but does not read the data into memory. Instead, keeps the data in the database, which allows data of varying sizes implying large database tables to be read.
Update Database Updates an existing SQL database table.

Table 1: RapidMiner Process Operators for Databases

To merge or combine data, which could be from another data source and not just an ExampleSet from a SQL database, RapidMiner provides several operators, as discussed in Table 2.

Operator Description
Join Joins two ExampleSets. Supports inner join, left outer join and right outer join, and full outer join.
Append Appends or merges two ExampleSets into a single ExampleSet. The ExampleSets must be compatible, which implies that they should have the same number of attributes and attribute names. If one ExampleSet has a different number of attributes or different attribute names than the other, the Append operator does not merge the ExampleSets.
Union Merges two ExampleSets to create a resultant ExampleSet. The ExampleSets do not need to have the same number attributes or attribute names. The resultant ExampleSet has attributes from both ExampleSets.
Merge Merges attribute values from a single ExampleSet. Merge is not the operator we need because we shall be merging two different ExampleSets obtained from two different database tables.

Table 2: RapidMiner Process Operators for Databases

Creating the MySQL Database Tables to Merge

Next, create two MySQL database tables to merge. First, log in to the MySQL client interface (CLI), as shown in Figure 1.

Logging into MySQL CLI
Figure 1: Logging into MySQL CLI

List the databases in the MySQL database with the show databases command, as shown in Figure 2.

Listing Database with show databases
Figure 2: Listing Database with show databases

Set the database to use as “mysql,” as shown in Figure 3.

The example database tables are for log data. Some other example database tables may also be used, but the procedure to merge shall be the same. Create database table WLSLOG1.

CREATE TABLE WLSLOG1(LOGID VARCHAR(255) PRIMARY KEY,
   CODE VARCHAR(255),MSG VARCHAR(255));

Add log data to WLSLOG1.

INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log1",
  "AdminServer", "BEA-000365","Server state changed to STANDBY" );
INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log2",
  "AdminServer", "BEA-000365","Server state changed to STARTING");
INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log3",
  "AdminServer", "BEA-000365","Server state changed to ADMIN");

Table WLSLOG1 gets created, as shown in Figure 3.

Creating table WLSLOG1
Figure 3: Creating table WLSLOG1

Similarly, create database table WLSLOG2.

CREATE TABLE WLSLOG2(LOGID VARCHAR(255) PRIMARY KEY,
   CODE VARCHAR(255), MSG VARCHAR(255));

Add example data to WLSLOG2.

INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG)
   values("log4","AdminServer","BEA-000331",
   "Server state changed to RESUMING");
INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG)
   values("log5","AdminServer","BEA-000365",
   "Started WebLogic AdminServer");
INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG)
   values("log6","AdminServer","BEA-000365",
   "Server state changed to RUNNING");
INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG)
   values("log7","AdminServer","BEA-000360",
   "Server started in RUNNING mode");

MySQL database table WLSLOG2 gets created, as shown in Figure 4.

Creating table WLSLOG2
Figure 4: Creating table WLSLOG2

Create a third table, WLSLOG, for the merged output of the other two database tables.

CREATE TABLE WLSLOG(LOGID VARCHAR(255) PRIMARY KEY,
   CODE VARCHAR(255),MSG VARCHAR(255));

The table for the merged data gets created, as shown in Figure 5.

Table for merged data
Figure 5: Table for merged data

A table for the merged table is not required because a provision for creating a new database table for the merged table data is also provided. We shall discuss both options, creating a new table and using an existing table.

Creating a RapidMiner Process

Before we are able to use the RapidMiner operators, we need to create a RapidMiner process. Select File>New Process, as shown in Figure 6, in RapidMiner.

File>New Process
Figure 6: File>New Process

In Choose a template to start from, select Blank, as shown in Figure 7.

New Process>Blank
Figure 7: New Process>Blank

A new Process gets created, as shown in Figure 8.

New Process
Figure 8: New Process

Initially, a Blank Process is blank, as shown in Figure 9.

New Process is Blank
Figure 9: New Process is Blank

Creating a MySQL Database Connection

RapidMiner provides database drivers for several SQL databases. To find if a specific database provider is supported, select Connections>Manage Database Drivers, as shown in Figure 10.

Connections>Manage Database Drivers
Figure 10: Connections>Manage Database Drivers

The supported database drivers are listed in Manage Database Drivers, as shown in Figure 11. MySQL is already supported and so are the other commonly used SQL databases, such as PostgreSQL, Oracle, and SQL Server. But, if a database is not listed, click Add to add a configuration for the database driver. Select the MySQL database driver to display the settings for MySQL database driver. Click Save to exit the Manage Database Drivers.

Manage Database Drivers
Figure 11: Manage Database Drivers

To create a database connection, click Connections>Manage Database Connections, as shown in Figure 12.

Connections>Manage Database Connections
Figure 12: Connections>Manage Database Connections

In Connection Details, specify a Name (MySQLDB), and select Database system as MySQL, as shown in Figure 13. Specify Host as “localhost” and Port as 3306. Specify Database scheme as “mysql”. Specify User as “root” and specify the Password for the root user as configured while installing MySQL database. Click Test to test the connection configuration. If the connection gets established, a “Connection ok” message gets displayed.

Test
Figure 13: Test

Click OK to complete the connection configuration, as shown in Figure 14.

Configuring database connection for MySQL
Figure 14: Configuring database connection for MySQL

Merging Database Tables with the Union Operator

Next, we shall demonstrate merging the two example MySQL database tables with the Union operator. First, we need to create two ExampleSets from the two database tables, WLSLOG1 and WLSLOG2. The Read Database operator is used to create an ExampleSet from a database table. Right-click in the Process region and select Insert Operator>Data Access>Database>Read Database, as shown in Figure 15.

Insert Operator>Data Access>Database>Read Database
Figure 15: Insert Operator>Data Access>Database>Read Database

A diagram for the Read Database operator gets added to the process, as shown in Figure 16. To add a 2nd Read Database operator for the 2nd database table, right-click in the Process region and select Insert Operator>Data Access>Database>Read Database again.

Adding a 2nd Read Database
Figure 16: Adding a 2nd Read Database

A 2nd Read Database diagram gets added. Next, we shall configure the Read Database operators. Click Read Database for the first database table. In the tab Parameters for Read Database, select predefined for define connection. In the connection field, select MySQLDB. For define query, select table name, as shown in Figure 17.

Setting Read Database parameters
Figure 17: Setting Read Database parameters

Select table name as wlslog1, as shown in Figure 18. Keep the default setting for datamanagement.

Setting table name
Figure 18: Setting table name

Similarly, for the Read Database (2) operator, select in the Parameters tab predefined for define connection. In the connection field, select MySQLDB. For define query, select table name. In table name, select the wlslog2 table, as shown in Figure 19.

Setting Read Database (2) parameters
Figure 19: Setting Read Database (2) parameters

Next, add the Union operator. Right-click in Process and select Insert Operator>Blending>Table>Joins>Union, as shown in Figure 20.

Process>Insert Operator>Blending>Table>Joins>Union
Figure 20: Process>Insert Operator>Blending>Table>Joins>Union

A diagram widget for the Union operator also gets added to the Process, as shown in Figure 21.

Union operator
Figure 21: Union operator

Next, we need to connect the operators. To connect one operator with another, click a handle provided for the operator and drag to a handle on the operator to connect to, as shown in Figure 22.

Connecting Process Operators
Figure 22: Connecting Process Operators

The connection link is color coded to indicate the status of a connection. A connection that is complete is indicated with the light blue color, as shown for the Read Database to Union connection in Figure 23.

Connection links are color coded to indicate status: light blue for completed connection
Figure 23: Connection links are color coded to indicate status: light blue for completed connection

The two Read Database operators connected to the Union operator are shown in the Process in Figure 24.

Two Read Database operators
Figure 24: Two Read Database operators

Next, add the Write Database operator to output a merged (or, more accurately, a union) database table. Right-click in Process and select Insert Operator>Data Access>Database>Write Database, as shown in Figure 25.

Process>Insert Operator>Data Access>Database>Write Database
Figure 25: Process>Insert Operator>Data Access>Database>Write Database

A widget for Write Database gets added, as shown in Figure 26.

Write Database
Figure 26: Write Database

Configure the parameters for Write Database, as shown in Figure 27. Select the define connection field as predefined, connection as MySQLDB, and table name as wlslog.

Parameters for Write Database
Figure 27: Parameters for Write Database

Connect the output of the Union operator with the input of Write Database, as shown in Figure 28.

Process Operators configured and connected
Figure 28: Process Operators configured and connected

To arrange the operators added, select Process>Process Layout>Arrange Operators, as shown in Figure 29.

Arranging Operators
Figure 29: Arranging Operators

The operators get arranged, as shown in Figure 30. Arranging operators does not have any effect other than to provide a better graphical layout.

Arranged Operators
Figure 30: Arranged Operators

The table for output of Write Database is created if not already created. By default, the Write Database operator creates a new table. Because we already created a database table, we need to set a parameter to overwrite the existing database table wlslog. Select Show advanced parameters, as shown in Figure 31.

Show advanced parameters
Figure 31: Show advanced parameters

Select overwrite mode as overwrite first, append then as shown in Figure 32, which would first overwrite an existing table’s column values and subsequently append. The other options are overwrite and append.

The overwrite mode options
Figure 32: The overwrite mode options

As we shall demonstrate later, the overwrite mode is required to be set for an existing table, as shown in Figure 33.

Setting overwrite mode is required for an existing table
Figure 33: Setting overwrite mode is required for an existing table

After all the process operators have been configured and the connections between them set, select Process and enable Synchronize Meta Data with Real Data and Validate Automatically. To run the process, click Process>Run Process Locally.

Process>Run Process Locally
Figure 34: Process>Run Process Locally

The process gets started. First, the Read Database operator is run on the first database table, as indicated by the in-progress graphic in Figure 35. Subsequently, an ExampleSet is created from the 2nd Read Database operator. The Union operator creates a union of the two ExampleSets. And, the merged output data is added to the wlslog table with the Write Database operator.

Merge process started
Figure 35: Merge process started

The process runs to completion, as shown in Figure 36.

Process runs to completion
Figure 36: Process runs to completion

Run a SELECT query for the WLSLOG table in the MySQL CLI. The merged database table data from WLSLOG1 and WLSLOG2 gets listed for the WLSLOG table, as shown in Figure 37.

Merged database table WLSLOG
Figure 37: Merged database table WLSLOG

To demonstrate that the overwrite mode must be set to other than the default none, keep the default setting of none and run the process. A database error gets generated, indicating that the wlsog table already exists, as shown in Figure 38.

Database error
Figure 38: Database error

Conclusion

In this article, we discussed merging database tables with RapidMiner. We used MySQL database tables as an example, but any of the supported SQL databases may be used. In a subsequent article, “Using RapidMiner with MySQL Tables,” we shall discuss another process operator, APPEND, to append database tables. We shall also discuss the difference between Union and APPEND. We shall also discuss the UPDATE operator for updating a database table.

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Latest Posts

Related Stories