In an earlier article, “Merging Database Tables with RapidMiner,” we started with introducing the different process operators for merging database tables. We created the database tables to merge and created a RapidMiner process. Subsequently, we created a connection with MySQL database and used the UNION operator to merge two database tables. In this article, we shall discuss two other process operators: APPEND and UPDATE. This article has the following sections:
- Appending Database Tables with the APPEND Operator
- Differences Between APPEND and UNION
- Updating Database Tables with the UPDATE Operator
- Conclusion
Appending Database Tables with the APPEND Operator
Next, we shall use the Append operator to append ExampleSet generated from one database table with ExampleSet generated from the 2nd ExampleSet and output the resultant ExampleSet to the wlslog table. For the Append example, we shall create the database table “wlslog” automatically when the Process is run. Delete the wlslog table created for the previous article, as shown in Figure 1.
Figure 1: Deleting the wlslog table
Create another RapidMiner Process or use the same one, but to start with, only the Read Database and Read Database (2) operators should be included in the Process. Next, add the Append operator. Right-click the Process and select Insert Operator>Blending>Table>Joins>Append, as shown in Figure 2.
Figure 2: Insert Operator>Blending>Table>Joins>Append
The Append operator gets added, as shown in Figure 3.
Figure 3: Append Operator added
Arrange the operators with Process>Arrange Operators and add connections to connect the two Read Database operators with the Append operator, as shown in Figure 4.
Figure 4: Connecting Read Database and Append Operators
Next, add a Write Database operator. Right-click the Process and select Insert Operator>Data Access>Database>Write Database, as shown in Figure 5.
Figure 5: Insert Operator>Data Access>Database>Write Database
Connect the output from the Append to the input of the Write Database, as shown in Figure 6.
Figure 6: Connecting output from the Append to Write Database
Configure the Write Database as before except that you need to specify (instead of selecting a pre-existing table) the “table name” as wlslog and select “overwrite mode” as “none,” as shown in Figure 7.
Figure 7: Configuring the Write Database
Optionally, specify a batch size other than the default setting of 1, as shown in Figure 8.
Figure 8: Specifying batch size
Run the Process with Process>Run Process Locally, as shown in Figure 9.
Figure 9: Process>Run Process Locally
The Process runs to completion, as shown in Figure 10.
Figure 10: Process runs to completion
The wlslog table, which we had deleted, gets created and a SQL query lists the merged table data, as shown in Figure 11.
Figure 11: Merged table
Differences Between APPEND and UNION
Next, we shall demonstrate the difference between Append and Union. As discussed before, Append requires the ExampleSets that are to be combined have the same number of attributes and the same attribute names. The Union operator does not have any such requirement. Drop the wlslog table because we shall be creating the database table for the combined table data automatically when the Process is run, as shown in Figure 12.
Figure 12: Dropping the wlslog table
Modify the table definition for the wlslog1 table to include another attribute. First, drop the wlslog1 table and subsequently create the wlslog1 table with an additional attribute.
CREATE TABLE WLSLOG1(LOGID VARCHAR(255) PRIMARY KEY, SERVERNAME VARCHAR(255),CODE VARCHAR(255), MSG VARCHAR(255));
The WLSLOG1 table gets created, as shown in Figure 13.
Figure 13: Creating wlslog1 table
Add table data to WLSLOG1.
INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log1","BEA-000365","Server state changed to STANDBY" ); INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log2","BEA-000365","Server state changed to STARTING"); INSERT INTO WLSLOG1(LOGID,SERVERNAME,CODE, MSG) values("log3","BEA-000365","Server state changed to ADMIN");
Table data gets added to WLSLOG1, as shown in Figure 14.
Figure 14: Adding data to WLSLOG1
We shall use the same Process as was used for the previous section, with some modifications. The Read Database operator keeps data in memory and, if the underlying table is changed, we need to replace the operator, too. To replace the Read Database operator, right-click Read Database and select Replace Operator>Data Access>Database>Read Database, as shown in Figure 15.
Figure 15: Read Database>Replace Operator>Data Access>Database>Read Database
Similarly, replace the Append operator with a new Append operator, as shown in Figure 16.
Figure 16: Replacing the Append Operator
Set the table name for Read Database as wlslog1, as shown in Figure 17.
Figure 17: Setting Table name for Read Database Operator
The Metadata for Read Database should display an additional attribute, SERVERNAME, as shown in Figure 18.
Figure 18: Setting Table name for Read Database Operator
Run the process with Process>Run Process Locally, as shown in Figure 19.
Figure 19: Process>Run Process Locally
An error message gets displayed, “Example sets are not compatible,” as shown in Figure 20. Because we had only added the additional column SERVERNAME to wlslog1 and not to wlslog2, the number of columns in the two database tables are different and, as a result, the number of attributes in the two ExampleSets are different.
Figure 20: Example sets are not compatible
The Union operator would have combined the two ExampleSets with a different number of attributes. To demonstrate, replace the Append operator with the Union operator, as shown in Figure 21.
Figure 21: Replacing the Append operator with the Union operator
Set the table name for the Write Database as wlslog, as shown in Figure 22.
Figure 22: Setting Table Name of Write Database Operator
Run the process with Process>Run Process Locally, as shown in Figure 23.
Figure 23: Process>Run Process Locally
The process runs to completion, as shown in Figure 24.
Figure 24: Process runs to completion
Run a SQL query in the wlslog table, as shown in Figure 25. Because we had only added the additional column SERVERNAME to wlslog1 and not to wlslog2, the ExampleSet generated from the wlslog table does not include the SERVERNAME attribute. As a result, the wlslog table generated from combining the ExampleSets does not include the SERVERNAME column in some of the table rows, the table rows generated from the wlslog2 table data. Although Append won’t combine the two ExampleSets with different number of attributes, Union does and adds a NULL value for the missing column value.
Figure 25: UNION adds NULL values for missing data
To add data to the columns with a NULL value, drop the WLSLOG2 table and create the table again with the SERVERNAME column included.
CREATE TABLE WLSLOG2(LOGID VARCHAR(255) PRIMARY KEY, SERVERNAME VARCHAR(255),CODE VARCHAR(255), MSG VARCHAR(255));
The WLSLOG2 table gets created, which you can see in Figure 26.
Figure 26: Creating WLSLOG2 table
Add data to the WLSLOG2 table. The log4 id data has two of the column data switched. The column data has been switched to demonstrate the use of another operator, the Update Database operator, in the next section.
INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log4","BEA-000331","AdminServer", "Server state changed to RESUMING"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log5","AdminServer","BEA-000365", "Started WebLogic AdminServer"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log6","AdminServer","BEA-000365", "Server state changed to RUNNING"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log7","AdminServer","BEA-000360", "Server started in RUNNING mode");
The WLSLOG2 table data includes data for the additional column SERVERNAME, as shown in Figure 27.
Figure 27: Adding data to WLSLOG2 table
Select the overwrite mode as “overwrite” to overwrite the WLSLOG2 data, as shown in Figure 28.
Figure 28: Setting overwrite mode
The Process gets configured for another run, as shown in Figure 29.
Figure 29: Process configured
Run the Process with Process>Run Process Locally, as shown in Figure 30.
Figure 30: Process>Run Process Locally
The process runs to completion, as shown in Figure 31.
Figure 31: Process runs to completion
List the WLSLOG data with a SQL query and the previously NULL column values are replaced with data values, as shown in Figure 32.
Figure 32: NULL values replaced with data
The Append operator should be used if the number of columns and column names in two database tables to be combined are the same. The Union operator should be used if the number of columns or/and the column names are different.
Updating Database Tables with the UPDATE Operator
In the previous section, the WLSLOG2 table had data switched in two of the columns; this resulted in a merged data that was also switched in two columns. To fix the data, we don’t need to drop all tables and run the process again with data fixed in the WLSLOG1 and WLSLOG2 tables. The Update Database operator is meant for just such a data fix issue in which only a subset of the data needs to be updated.
First, we need to drop the WLSLOG2 table and replace with a table in which the data is not switched in two columns for the log4 data.
INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log4","AdminServer","BEA-000331", "Server state changed to RESUMING"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log5","AdminServer","BEA-000365", "Started WebLogic AdminServer"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log6","AdminServer","BEA-000365", "Server state changed to RUNNING"); INSERT INTO WLSLOG2(LOGID,SERVERNAME,CODE, MSG) values("log7","AdminServer","BEA-000360", "Server started in RUNNING mode");
The WLSLOG2 table gets created with data not switched in two columns, as shown in Figure 33.
Figure 33: WLSLOG2 table
A SQL query on WLSLOG2 table lists the table data as fixed with no column data switched, as shown in Figure 34.
Figure 34: SQL Query on WLSLOG2 table
Next, modify the same Process to replace some of the operators. Replace the Write Database operator with the Update Database operator and also replace the Read Database (2) operator because the underlying database table WLSLOG2 has been replaced. An alternative to replacing an operator is to delete an operator and add another.
Right-click the Process and select Insert Operator>Data Access>Database>Update Database, as shown in Figure 35.
Figure 35: Process>Insert Operator>Data Access>Database>Update Database
The Update Database operator gets added, as shown in Figure 36.
Figure 36: Update Database Operator
Delete the Write Database operator and add a connection from Union to Update Database, as shown in Figure 37.
Figure 37: Adding a Connection from Union to Update Database
The Read Database (2) operator needs to be configured slightly differently. We don’t need all the data from the WLSLOG2 table. We need only log4 row data for which we shall define a query instead of selecting a table name, which would select all the data in the WLSLOG2 table. Click Read Database (2) and select a query in the define query parameter, as shown in Figure 38.
Figure 38: Defining a query
Click Build SQL Query, as shown in Figure 39.
Figure 39: Build SQL Query
A Build SQL Query GUI gets displayed, as shown in Figure 40.
Figure 40: Build SQL Query GUI
Specify the following SQL query.
SELECT LOGID,SERVERNAME,CODE,MSG from WLSLOG2 WHERE LOGID="log4"
Click OK, as shown in Figure 41.
Figure 41: Specifying a SQL query
A query over a subset gets defined for Read Database (2); parameters are shown in Figure 42.
Figure 42: Read Database (2) parameters
Before running the process, get an ExampleSet for Read Database (2) with right-click>Show ExampleSet Result, and it should include only a single row of data, as shown in Figure 43.
Figure 43: ExampleSet for Read Database (2)
An ExampleSet for the Read Database operator lists three rows of data, as shown in Figure 44.
Figure 44: ExampleSet for Read Database
Next, configure the Update Database operator. Select the define connection as predefined and select the connection as MySQLDB. Select the scheme name as “mysql” and the table name as “wlslog”. Select the attribute filter type as “single” and select the attribute as “LOGID,” as shown in Figure 45.
Figure 45: ExampleSet for Read Database
Before running a Process with Update Database, display an ExampleSet Result, as shown in Figure 46.
Figure 46: Show ExampleSet Result
The ExampleSet consists of four rows of data which do not include any error in the data, as shown in Figure 47.
Figure 47: ExampleSet result
Select Process>Run Process Locally, as shown in Figure 48.
Figure 48: Process>Run Process Locally
The process completes successfully, as shown in Figure 49.
Figure 49: Process completes successfully
Run a SQL query on the wlslog table created and all the data should be error-free. For comparison, the SQL query result for the wlslog table before the update is also listed, as shown in Figure 50.
Figure 50: The wlslog table before and after Update
The “attr filter type” must not be “all” for the Update Database operator, as shown in Figure 51.
Figure 51: attr filter type>all
The attribute id values for all the attributes in the attribute filter must be the same in a row of data in the database table to update for the row to get updated. If all the id attribute values do not match, a new row gets added instead. To demonstrate, select the attribute filter type as “subset,” as shown in Figure 52, and click Select Attributes.
Figure 52: Selecting attribute filter type as “subset”
Select the CODE, LOGID, and SERVERNAME attributes. Click Apply, as shown in Figure 53.
Figure 53: Selecting attributes
Select Process>Run Process Locally, as shown in Figure 54.
Figure 54: Process>Run Process Locally
The process runs to completion successfully, as shown in Figure 55.
Figure 55: Process runs to completion
But, because the log4 id row in the ExampleSet generated from Read Database (2) has the accurate CODE and SERVERNAME attributes, whereas the CODE and SERVERNAME values in the wlslog table, which is to be updated, are not accurate, not all the selected attributes LOGID, CODE, and SERVERNAME match and a row gets added instead of updating an existing row with LOGID log4, as shown in Figure 56.
Figure 56: Table Row added instead of updating
Conclusion
In this article, we discussed appending and updating database tables with RapidMiner. We also discussed the difference between the UNION and APPEND operators.