
Use Spring Batch's 'Chunk' Processing for Large Data Sets


January 15, 2010

Spring Batch is an amazing tool for efficiently processing large amounts of data. Sometimes data sets are too large to process in-memory all at once, so the JVM runs out of memory and buckles under the pressure. A better approach is to use Spring Batch's "chunk" processing, which takes a chunk of data, processes just that chunk, and continues doing so until it has processed all of the data.

This article explains how to create a simple Spring Batch program that fixes an error in a large data set. (Click here to download the source code.) Specifically, the large data set holds employee records for an organization, with columns for the employee's ID, name, and department ID. When the data was created, however, the department ID was accidentally omitted. Thankfully, the department ID is the first two digits of the employee ID, so we can use the employee ID to fill in the department ID column. But the data set consists of nearly 100,000 employees (I know, a bit unrealistic, but you get the point).

We'll generate our own test data using Java.

Project Requirements

You should be working on a Linux box and have MySQL installed. You may need to do some initial configuration, such as creating a root user. Consult the documentation for your distribution, as well as the documentation on the MySQL website. Optionally, you could try out the example in this article with the H2 database.

You need a complete understanding of dependency injection and how the Spring core works. You also need to know some SQL. (But that's easy for an enterprise developer such as yourself. :-)

The following are the Java dependencies for this project:

  • The Spring 2.5 core (download the 'with-dependencies' version)
  • XStream
  • Spring Batch (again, get the 'with-dependencies' version)
  • MySQL Connector/J

After grabbing all of your dependencies, be sure to add the various JARs to your project classpath.

A Batch Briefing

A Spring Batch project uses a Job, a JobLauncher, Steps, and a JobRepository. A Job is a container for Steps. Each Step may contain a Tasklet, which is nothing more than an object holding the custom logic that the Step executes. Each Job is started by a JobLauncher, and its runs are recorded in a JobRepository, which requires a data source. If a JobLauncher is not specified, a default one is instantiated.
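The "chunk" idea itself needs no Spring at all: read a fixed number of items, process and write them, commit, and repeat until the data is exhausted. Here's a plain-Java sketch of the slicing (the helper class is hypothetical, not part of Spring Batch):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkDemo {
    // Split a list into fixed-size chunks, the way a commit interval
    // slices a large data set into manageable transactions.
    public static List<List<Integer>> chunk(List<Integer> items, int size) {
        List<List<Integer>> chunks = new ArrayList<List<Integer>>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        List<List<Integer>> chunks = chunk(ids, 3);
        System.out.println(chunks.size());   // prints 3
        System.out.println(chunks.get(2));   // prints [7]
    }
}
```

Spring Batch does this reading and writing for us; we only supply the reader, the writer, and the chunk size.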

Setting Up the Database

Create a MySQL database called "badEmployeeData" and then add a new table with the following definition:

CREATE TABLE EMPLOYEE(
    ID INTEGER,   
    DEPARTMENT_ID INTEGER,
    NAME VARCHAR(255)
);


A single table is more than enough for this demonstration. But be aware that we're going to generate tons of test data for this.

When your database is set up, you'll need to create your Java project structure. I used the following directory tree:

com
   theCompany
      beans
      dao
      jdbcDao
      utils
      resources


Inside the beans package, add the classes Employee and EmployeeUpdatePreparedStatementSetter. Employee is a POJO with getters and setters for the fields ID, departmentId, and name. The EmployeeUpdatePreparedStatementSetter class implements the interface ItemPreparedStatementSetter. We're using it as a stub for the item writer, which requires an implementation of said interface. The following listings show the code for both classes:

The EmployeeUpdatePreparedStatementSetter Class

package com.theCompany.beans;

import org.springframework.batch.item.database.ItemPreparedStatementSetter;

import java.sql.PreparedStatement;
import java.sql.SQLException;

public class EmployeeUpdatePreparedStatementSetter implements ItemPreparedStatementSetter<Employee> {
    // Intentionally left empty: the update query in this example contains no
    // '?' placeholders, so there are no parameters to set.
    public void setValues(Employee employee, PreparedStatement preparedStatement) throws SQLException {
    }
}


The Employee Class

package com.theCompany.beans;

public class Employee {
    private int id;
    private String name;
    private int departmentId;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getDepartmentId() {
        return departmentId;
    }

    public void setDepartmentId(int departmentId) {
        this.departmentId = departmentId;
    }

    public String toString() {
        return Integer.toString(this.id) + ":" + this.name;
    }
}


Next, we'll need to add a data access layer for writing test data to the database. But before we get into the Java code needed for the DAO (Data Access Object), we'll need to configure our Spring application context.

The Application Context

Listing 1 is the full application context needed to run this batch job. In the code, first we declare XML namespaces; any Spring bean falls under the namespace 'beans'. The first bean we configure is the data source, which is an instance of org.apache.commons.dbcp.BasicDataSource. We set the properties for driver, the connection URL, user name, and password, and then we create a bean for the transaction manager that will handle the transactions for the dataSource bean.
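Listing 1 ships with the download; the two beans described above have roughly this shape (the driver class, URL, and credentials here are illustrative placeholders, so substitute your own):

```xml
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource">
    <property name="driverClassName" value="com.mysql.jdbc.Driver"/>
    <property name="url" value="jdbc:mysql://localhost:3306/badEmployeeData"/>
    <property name="username" value="root"/>
    <property name="password" value="secret"/>
</bean>

<bean id="transactionManager"
      class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
    <property name="dataSource" ref="dataSource"/>
</bean>
```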

We use dependency injection for the jdbcEmployeeDao bean, an instance of com.theCompany.jdbcDao.JdbcEmployeeDao, which implements the EmployeeDao interface shown in the listing below. The EmployeeDao interface requires only one method, generateTestData(). Of course, if you want to add any other data access methods, feel free to add them here.

The EmployeeDao Interface

package com.theCompany.dao;

public interface EmployeeDao {
    void generateTestData();
}


The JdbcEmployeeDao class (shown in the listing below) extends JdbcDaoSupport, which gives us a free setDataSource() method, and inserts the test records one at a time through a plain JdbcTemplate. Later, when you run it, notice how the number of records printed per second drops as execution continues. When you run the example with batch processing, you'll notice there's no such slowdown.

The JdbcEmployeeDao Definition

package com.theCompany.jdbcDao;

import com.theCompany.dao.EmployeeDao;
import org.springframework.jdbc.core.support.JdbcDaoSupport;

// by inheriting from JdbcDaoSupport we get a free setDataSource() method
public class JdbcEmployeeDao extends JdbcDaoSupport implements EmployeeDao {
    public void generateTestData(){
        // IDs run from 1001 to 100000; the first two digits form the department ID
        for(long i = 1001; i <= 100000; i++){
            getJdbcTemplate().execute("INSERT INTO EMPLOYEE VALUES(" + i + ",0,'blah name')");
            System.out.println("record " + i + " inserted");
        }
    }
}


Now, create a main method and execute with the code below in order to generate the test data.

package com.theCompany.utils;

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
import com.theCompany.dao.EmployeeDao;

public class GenerateData {
    public static void main(String[] a) throws Exception {
        ApplicationContext context = new ClassPathXmlApplicationContext("resources/application-context.xml");
        EmployeeDao dao = (EmployeeDao) context.getBean("employeeDao");

        dao.generateTestData(); 
    }
}


The Batch Job

Now that we have our basic data connection stuff set up and we have some test data to play with, we can proceed with configuring our batch job. We'll create a simple job, which uses the default job repository and contains only a single step.

From Listing 1, you can see that the tasklet does nothing more than process chunks of data using the item reader (an instance of org.springframework.batch.item.database.JdbcCursorItemReader) and the item writer. The item writer requires an ItemPreparedStatementSetter, and the item reader requires a row mapper, which is shown in the listing below. The item writer updates the data with the query UPDATE EMPLOYEE SET DEPARTMENT_ID = SUBSTRING(ID,1,2), which grabs the first two characters of ID and inserts them into the DEPARTMENT_ID column.
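In Spring Batch's XML namespace, the job-and-step wiring described above looks roughly like this (the bean ids and the commit-interval are illustrative; the actual wiring is in Listing 1 of the download):

```xml
<job id="fixEmployeeData" xmlns="http://www.springframework.org/schema/batch">
    <step id="fixDepartmentIds">
        <tasklet transaction-manager="transactionManager">
            <chunk reader="employeeItemReader" writer="employeeItemWriter"
                   commit-interval="500"/>
        </tasklet>
    </step>
</job>
```

The commit-interval is the chunk size: the reader pulls that many rows, the writer flushes them in one transaction, and the cycle repeats.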

The Row Mapper

package com.theCompany.jdbcDao;

import java.sql.ResultSet;
import java.sql.SQLException;

import org.springframework.jdbc.core.RowMapper;
import com.theCompany.beans.Employee;


public class EmployeeRowMapper implements RowMapper {

    public static final String ID_COLUMN = "id";
    public static final String NAME_COLUMN = "name";

    // DEPARTMENT_ID is the column we're fixing, so we map only ID and NAME
    public Object mapRow(ResultSet resultSet, int i) throws SQLException {
        Employee employee = new Employee();
        
        employee.setId(resultSet.getInt(ID_COLUMN));
        employee.setName(resultSet.getString(NAME_COLUMN));

        return employee;
    }
}
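For reference, wiring the row mapper into the cursor-based item reader looks roughly like this (the bean id and the SQL are illustrative):

```xml
<bean id="employeeItemReader"
      class="org.springframework.batch.item.database.JdbcCursorItemReader">
    <property name="dataSource" ref="dataSource"/>
    <property name="sql" value="SELECT ID, NAME FROM EMPLOYEE"/>
    <property name="rowMapper">
        <bean class="com.theCompany.jdbcDao.EmployeeRowMapper"/>
    </property>
</bean>
```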


Add the class in Listing 2 to the package com.theCompany; it will be the main class for our program. Unlike before, when we generated our test data, if you run this main with writer.write(emp) commented out, the data is written to standard output without slowing down. This is one of the big advantages of using batch processing.

Now, uncomment writer.write(emp) and run again. The execution speed slows down significantly. So, what's the advantage? Under an enormous load, the database itself slows down, but here the execution speed stays roughly constant, independent of the load.

Chunk Processing Is More Efficient

From the example presented here, you can see how it's more efficient to process chunks of data as opposed to trying to hold everything in memory. If the data set is too large, the in-memory route is impossible anyway. With a little extra configuration, you can save a lot of processing time. Even so, I've really only scratched the surface of what can be done with Spring Batch.

Code Download

  • SpringBatchEmployeeExample.zip
