The Statistica Platform is ranked in the top five data science platforms by Gartner’s new report for 2017, “Magic Quadrant for Data Science Platforms” (https://www.gartner.com/doc/3606026/magic-quadrant-data-science-platforms), previously called “Advanced Analytics Platforms” in 2016. The broad range of features and a Graphical User Interface (GUI) provided by Statistica make it one of the most commonly used data science tools.
Statistica data files are called Spreadsheets, which have rows and columns of data. Rows of data are called cases and column headers for data are called variables. A common issue in data preparation is that different team members are developing or collecting data sets separately and the data sets have to be merged before the spreadsheet may be used. The data could be in multiple data files. We shall discuss how data in two different data files may be merged into a single data file with Statistica.
Statistica supports different types of merge modes for two data files, and these are:
- Concatenate: When two data files are concatenated, one data file is taken and added (or concatenated) at the right-side of the other data file.
- Cartesian: Creates a cross product of two data files.
- Match Casenames: Merges the cases (rows) of one file with the cases of the other files by matching the case names.
- Match Variables: Merges the rows of one data file with the rows of the other data file by matching the variable names.
We shall start by discussing the Concatenate merge. This tutorial has the following sections:
Setting the Environment
Download and install Statistica Platform. Statistica data files are called Spreadsheets (stored with the .sta suffix). We shall create some Statistica data files in this tutorial. A data file is created with File>New. In Create New Document, select Spreadsheet, as shown in Figure 1.
Figure 1: Selecting New Spreadsheet to create
To save a data file, select File>Save As, as shown in Figure 2.
Figure 2: File>Save As
Concatenating Data Files
First, create the two data files that are to be merged. The data files to be merged would typically have the same number of rows and same or different number of columns. Because data is to be concatenated, the column names would typically be different. None of this is a requirement; two data files could have different number of rows and we shall discuss how to merge such a set of data files also. The objective is to merge the data in one data file with the other so that the 2nd data file is added at the right side of the 1st data file. As an example, create a data file (called wlslog1.sta) with column headers (variables) timestamp, category, and type and the following data (example log data).
4-8-2014-7:06:16,Notice,WebLogicServer 4-8-2014-7:06:17,Notice,WebLogicServer 4-8-2014-7:06:18,Notice,WebLogicServer 4-8-2014-7:06:20,Notice,WebLogicServer 4-8-2014-7:06:21,Notice,WebLogicServer 4-8-2014-7:06:22,Notice,WebLogicServer
The wlslog1.sta data file is shown in Statistica in Figure 3.
Figure 3: Data File wlslog1.sta
Create another data file (wlslog2.sta) with column headers servername, code, and msg, and add the following data (also example log data).
AdminServer,BEA-000365,STANDBY AdminServer,BEA-000365,RESUMING AdminServer,BEA-000365,ADMIN AdminServer,BEA-000331,STARTING AdminServer,BEA-000365,STARTED AdminServer,BEA-000360,RUNNING
The wlslog2.sta file is shown in Figure 4. To merge the two data files, wlslog1.sta and wlslog2.sta, click the Data tab and select Merge, as shown in Figure 4.
Figure 4: Data File wlslog2.sta
A Merge Options dialog gets displayed, as shown in Figure 5. The Variables tab is selected by default. Select Mode as Concatenate. Click the File 1 button to select the 1st file to merge.
Figure 5: Merge Options
Select the wlslog1.sta file in the Select Spreadsheet dialog (see Figure 6). Click OK. The wlslog1.sta file gets added to the File 1 field. Similarly, select the 2nd file wlslog2.sta.
Figure 6: Selecting a Spreadsheet to Merge
No other configuration is required. By default, an output spreadsheet gets generated and it may be configured with the Options tab, as shown in Figure 7. Keep the default settings for the output spreadsheet.
Figure 7: Options Tab
The two files to be merged get added to the File 1 and File 2 fields, as shown in Figure 8. The default setting for Unmatched Cases fills the data files with the missing values, which implies that empty data is stored for the section of a merged row (case) that does not match from one data file to another. Click OK.
Figure 8: Data Files to Merge
The two data files get concatenated, as shown in Figure 9. The resulting spreadsheet has 6 columns and 6 rows.
Figure 9: Resulting Spreadsheet after a Merge
If one spreadsheet were to have more rows than the other, the two spreadsheets would get concatenated just the same. As an example, add an extra row in the 1st spreadsheet (wlslog1.sta) to make 7 rows, as shown in Figure 10.
Figure 10: Extra Row in wlslog1.sta
When concatenated with the 2nd spreadsheet (wlslog2.sta), the resultant spreadsheet has an extra row with missing data for the columns from the 2nd spreadsheet (see Figure 11).
Figure 11: Merged Spreadsheet
Conclusion
In this tutorial, we introduced merging data files (also called spreadsheets) in Statistica Platform for data science. We discussed one of the merge modes: Concatenating merge. In a subsequent tutorial, we shall discuss merging by matching casenames and by matching variables.