
Use Spring Batch's 'Chunk' Processing for Large Data Sets

  • January 15, 2010
  • By Cesar Otero

Spring Batch is an amazing tool for efficiently processing large amounts of data. Sometimes a data set is too large to process in memory all at once: the JVM runs out of memory and buckles under the pressure. A better approach is Spring Batch's "chunk" processing, which reads a chunk of data, processes just that chunk, writes it out, and repeats until all of the data has been processed.
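
To make the idea concrete before any Spring configuration appears, here is a minimal sketch of a chunk loop in plain Java (8 or later). It is not Spring Batch's implementation; the class name ChunkLoopSketch and the method processInChunks are made up for illustration. The point is only that at most one chunk of items is held in memory at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustrative only: a hand-rolled chunk loop, not Spring Batch's own code.
public class ChunkLoopSketch {

    /**
     * Reads items one at a time, transforms each one, and hands the results
     * to the writer in groups of at most chunkSize.
     * Returns the number of chunks that were written.
     */
    static <I, O> int processInChunks(Iterator<I> reader,
                                      Function<I, O> processor,
                                      Consumer<List<O>> writer,
                                      int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = 0;
        List<O> buffer = new ArrayList<>(chunkSize);
        while (reader.hasNext()) {
            buffer.add(processor.apply(reader.next()));
            if (buffer.size() == chunkSize) {
                // Hand over a copy so the buffer can be reused safely.
                writer.accept(new ArrayList<>(buffer));
                buffer.clear();
                chunks++;
            }
        }
        if (!buffer.isEmpty()) {
            // Final, possibly smaller, chunk.
            writer.accept(new ArrayList<>(buffer));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            ids.add(i);
        }
        int chunks = processInChunks(ids.iterator(),
                id -> id * 2,
                chunk -> System.out.println("writing " + chunk),
                4);
        System.out.println(chunks + " chunks");
    }
}
```

With ten items and a chunk size of four, the writer is called three times: twice with four items and once with the remaining two.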

This article explains how to create a simple Spring Batch program that fixes an error in a large data set. Specifically, the data set holds employee records for an organization, with columns for the employee's ID, name, and department ID. When the data was created, however, the department ID was accidentally omitted. Thankfully, the department ID is the first two digits of the employee ID, so we can use the employee ID to fill in the department ID column. The catch is size: the data set consists of 20,000 employees (I know, a bit unrealistic, but you get the point).
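
The repair rule itself is tiny. As a rough sketch, assuming employee IDs are positive integers with at least two digits (the article does not say how long an ID is), the department ID can be read off the decimal representation. The class and method names here are hypothetical.

```java
// Illustrative only: derives a department ID from an employee ID.
public class DepartmentIdSketch {

    /**
     * Returns the first two decimal digits of employeeId.
     * Assumption: IDs are positive and have at least two digits.
     */
    static int departmentIdFrom(int employeeId) {
        String digits = Integer.toString(employeeId);
        if (employeeId < 0 || digits.length() < 2) {
            throw new IllegalArgumentException(
                    "expected a positive ID with at least two digits: " + employeeId);
        }
        return Integer.parseInt(digits.substring(0, 2));
    }

    public static void main(String[] args) {
        // 421337 is a made-up employee ID; its department ID is 42.
        System.out.println(departmentIdFrom(421337));
    }
}
```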

We'll generate our own test data using Java.

Project Requirements

You should be working on a Linux box and have MySQL installed. You may need to do some initial configuration, such as creating a root user. Consult the documentation for your distribution, as well as the documentation on the MySQL website. Optionally, you could try out the example in this article with the H2 database.

You need a complete understanding of dependency injection and how the Spring core works. You also need to know some SQL. (But that's easy for an enterprise developer such as yourself. :-)

The following are the Java dependencies for this project:

  • The Spring 2.5 core (download the 'with-dependencies' version)
  • XStream
  • Spring Batch (again, get the 'with-dependencies' version)
  • MySQL Connector/J

After grabbing all of your dependencies, be sure to add the various JARs to your project classpath.

A Batch Briefing

A Spring Batch project uses a Job, JobLauncher, Step, and JobRepository. A Job is a container for Steps. Each Step may contain a Tasklet, which is nothing more than an object holding the custom logic that the Step runs. Each Job is started by a JobLauncher, and its execution metadata is stored in a JobRepository. Each JobRepository requires a data source. If a JobLauncher is not specified, a default object is instantiated.
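
If the relationships are hard to picture, the following toy model may help. These are plain Java stand-ins written for this illustration, not Spring Batch's own classes: every Toy* type, and the job and step names, are hypothetical. It only shows who contains what and who calls whom.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: toy stand-ins, NOT the Spring Batch API.
public class BatchModelSketch {

    /** Custom logic executed by a step (stand-in for the Tasklet idea). */
    interface ToyTasklet {
        void execute();
    }

    /** A named unit of work that delegates to one tasklet. */
    static class ToyStep {
        final String name;
        final ToyTasklet tasklet;

        ToyStep(String name, ToyTasklet tasklet) {
            this.name = name;
            this.tasklet = tasklet;
        }
    }

    /** A job is an ordered container of steps. */
    static class ToyJob {
        final String name;
        final List<ToyStep> steps = new ArrayList<>();

        ToyJob(String name) {
            this.name = name;
        }
    }

    /** Keeps a record of what ran; here just an in-memory list. */
    static class ToyRepository {
        final List<String> log = new ArrayList<>();
    }

    /** Starts a job, runs its steps in order, and reports to the repository. */
    static class ToyLauncher {
        final ToyRepository repository;

        ToyLauncher(ToyRepository repository) {
            this.repository = repository;
        }

        void run(ToyJob job) {
            repository.log.add("job started: " + job.name);
            for (ToyStep step : job.steps) {
                step.tasklet.execute();
                repository.log.add("step completed: " + step.name);
            }
            repository.log.add("job completed: " + job.name);
        }
    }

    public static void main(String[] args) {
        ToyJob job = new ToyJob("fixDepartmentIds");
        job.steps.add(new ToyStep("updateEmployees",
                () -> System.out.println("custom logic runs here")));

        ToyRepository repository = new ToyRepository();
        new ToyLauncher(repository).run(job);
        repository.log.forEach(System.out::println);
    }
}
```

The toy repository keeps its log in memory; the real JobRepository is the piece that needs a data source.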

Setting Up the Database

Create a MySQL database called "badEmployeeData" and then add a new table with the following definition:

CREATE TABLE EMPLOYEE(
    ID INTEGER,   
    DEPARTMENT_ID INTEGER,
    NAME VARCHAR(255)
);


A single table is more than enough for this demonstration. But be aware that we're going to generate tons of test data for this.

When your database is set up, you'll need to create your Java project structure. I used the following directory tree:

com
   theCompany
      beans
      dao
      jdbcDao
      utils
      resources


Inside the beans package, add the classes Employee and EmployeeUpdatePreparedStatementSetter. Employee is a POJO with getters and setters for the fields id, departmentId, and name. The EmployeeUpdatePreparedStatementSetter class implements the ItemPreparedStatementSetter interface. Its setValues method is intentionally left empty: we're using the class only as a stub for the item writer, which requires an implementation of that interface. The following listings show the code for both classes:

The EmployeeUpdatePreparedStatementSetter Class

package com.theCompany.beans;

import org.springframework.batch.item.database.ItemPreparedStatementSetter;

import java.sql.PreparedStatement;
import java.sql.SQLException;

// Stub required by the item writer; it binds no statement parameters.
public class EmployeeUpdatePreparedStatementSetter implements ItemPreparedStatementSetter<Employee> {
    @Override
    public void setValues(Employee employee, PreparedStatement preparedStatement) throws SQLException {
        // Intentionally empty.
    }
}
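
For comparison, here is a sketch of what a non-stub version of this binding could look like if the writer used positional parameters. The article's setter stays empty, so treat everything here as an assumption: the UPDATE statement, the class name, and the bind method are all hypothetical, and only standard java.sql calls are used so the example compiles without Spring Batch on the classpath.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustrative only: the SQL text and this helper are assumptions,
// not code from the article's project.
public class EmployeeUpdateBindingSketch {

    // Parameter 1 is the department ID, parameter 2 is the employee ID.
    static final String UPDATE_SQL =
            "UPDATE EMPLOYEE SET DEPARTMENT_ID = ? WHERE ID = ?";

    /** Binds one employee's values to the positional parameters of UPDATE_SQL. */
    static void bind(int employeeId, int departmentId, PreparedStatement ps)
            throws SQLException {
        ps.setInt(1, departmentId);
        ps.setInt(2, employeeId);
    }
}
```

A setValues implementation would do the same two setInt calls, reading the values from the Employee it receives.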


The Employee Class

package com.theCompany.beans;

public class Employee {
    private int id;
    private String name;
    private int departmentId;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getDepartmentId() {
        return departmentId;
    }

    public void setDepartmentId(int departmentId) {
        this.departmentId = departmentId;
    }

    @Override
    public String toString() {
        return id + ":" + name;
    }
}


Next, we'll need to add a data access layer for writing test data to the database. But before we get into the Java code needed for the DAO (Data Access Object), we'll need to configure our Spring application context.


Tags: Java, Spring, data


