http://www.developer.com/

Back to article

Reading Excel Files with Apache POI HSSF


December 22, 2008

In one of my former positions, I had the delightful task of cataloging potential solutions to new requirements and then rating them (yes, I'm the kind of masochist who enjoys that sort of thing). A requirement came along to create a fully portable version of Business Intelligence Portal Application that could be run from a non-technical user's laptop and have the data updated real-time through Excel. The business case for this was that a technology executive had been presenting static HTML demonstrations of BI portal applications he was proposing and found that the resistance he received was mostly on the data in the demo being out-of-date, distracting his audience from the benefits of the application itself.

The technical solutions needed were a web application that could be run from the tech-exec's laptop with no knowledge of application servers and the need to read Excel files quickly as a data source, preferably with run-time updates. The first part was solved with an application I had built in my spare time (see Open Source-Based Portal-Lite). This left the reading of the Excel files. I found both open-source and commercial solutions to this. The commercial solutions immediately went to the bottom of the list because this was an application for selling applications, which meant the budget was limited. Of the open source applications, the hands-down, stand-out winner of the evaluation was the Apache POI HSSF APIs.

The Apache POI project is a set of Java APIs to read and write Microsoft Office files. J2EE web applications are primarily used to create a browser interface to databases. No matter what your opinion of Microsoft Office may be, it is ubiquitous in the workplace and there is a huge amount of data stored in Excel files on any network or small business PC.

Creating a Java Object from the Excel File

The first step is to take the Excel file itself and turn it into something that Java will work with. This is accomplished by creating a POIFSFileSystem object and an HSSFWorkbook object. The HSSFWorkbook object is the root object from which you will access the part of the Excel file you are interested in later on.

The example class adapted for this article is from an application where Excel acted as the database, so it uses a file path as an argument to a static constructor:

public static ExcelReader getInstance(String sDocPath)
   throws IOException
{
   return new    ' ExcelReader(sDocPath);
}

Which in turns calls a private constructor to initialize the POIFSFileSystem and HSSFWorkbook objects:

private ExcelReader(String sDocPath) throws IOException
{
   this.instance = new ExcelReader();
   this.docPath  = sDocPath;
   this.document = new POIFSFileSystem(new FileInputStream(docPath));
   this.workbook = new HSSFWorkbook(document);
}

In a web application where the Excel file is uploaded, you would create a java.io.InputStream with whatever mechanism your web application has for handling uploads. For example, a Struts 1.x application would use the getInputStream() method from the org.apache.struts.upload.FormFile interface to provide the parameter for the POIFSFileSystem constructor.

Either way, it is handy to abstract the huge number of HSSF APIs with your own class that focuses on just the key methods you will need for your application. For instance, the preceding constructor creates both the POIFSFileSystem and HSSFWorkbook objects because clients of this class are only interested reading values from the Excel file where the POIFSFileSystem class can be used for both reading and writing Excel files.

Identifying the Parts of the Workbook

Most people are aware that an Excel file contains Worksheets, which contain cells arranged in rows and columns. Even if you are aware that the Excel file is a Workbook, you probably seldom think about it, which is why the example implementation here abstracts the Workbook concept away from the client. Although the HSSF libraries provide methods to do just about anything in Java with a Workbook that you would do in Excel, what you are interested in here is simply reading the data from the file so you can use it however you wish. This means you will want your Java class to do exactly what you as users in Excel would do, in the same sequence, only much, much faster. These steps are:

  • Go to a Worksheet
  • Identify the header row
  • Read the data by rows
  • Read the data by columns

Reading Values

Worksheets can be derived from HSSFWorkbook objects either by name or index. Generally, the client application should be more interested in the name of the worksheet if there are multiple worksheets, and the index if there is supposed to be a single worksheet in Excel file where your data is located. The client application should only be interested in the values in the worksheet, so your reader class does not even need to expose the methods for retrieving the HSSFSheet object, only methods for returning values in a useful manner.

For most purposes, assuming the first row in a spreadsheet provided as a data source that contains the field names is safe. From a practical standpoint, data submitted as input should have a defined format. Even though it is possible to come up with logic for discovering format from any layout the user provides, it would be extremely complex. So, your first convenience method will be to return the values of the first row of a given worksheet:

public ArrayList<String> getColNames(int sheetIndex)
{
   ArrayList<String> colNames = new ArrayList<String>();
   HSSFSheet sheet = this.workbook.getSheetAt(sheetIndex);
   HSSFRow   row   = sheet.getRow(0);
   HSSFCell  cell  = null;
   int       cols  = 0;
   if (row != null)
   {
      cols = row.getPhysicalNumberOfCells();
      for (int i = 0; i < cols; i++)
      {
         cell = row.getCell(i);
         if (cell != null && cell.getCellType() == stringCell)
         {
            colNames.add(cell.getRichStringCellValue().getString());
         }
      }
   }
   return colNames;
}

Now, overload the method to work with the name of a worksheet for the multi-worksheet inputs:

public ArrayList<String> getColNames(String sheetName)
{
   return getColNames(this.workbook.getSheetIndex(sheetName));
}

To represent values from a spreadsheet, I find a Map of ArrayLists handy for the rows of data, with the key being the column name, and then dropping these Maps into an ArrayList. This makes for simple code to read the values for multiple purposes without having to create new objects or expose the HSSF APIs to client applications. This is definitely not the only approach, just the one I found most useful the first time I implemented the HSSF APIs (see the User-Friendly Updates section of "Open Source-Based Portal-Lite").

/**
 * @param sheetIndex
 * @return ArrayList<Map> where the key is the field names.
 * Assumes first row contains field names
 */
public ArrayList<Map> getMappedValues(int sheetIndex)
{
   ArrayList<String>    colNames    = null;
   ArrayList<Map>       mapArray    = null;
   HSSFRow              row         = null;
   HSSFSheet            sheet       = null;
   int                  sheetRows   = 0;
   int                  rowCols     = 0;
   Map<String, Object>  rowMap      = null;

   sheet     = this.workbook.getSheetAt(sheetIndex);
   sheetRows = sheet.getPhysicalNumberOfRows();
   mapArray  = new ArrayList<Map>(sheetRows - 1);
   colNames  = getColNames(sheetIndex);

   colNames.trimToSize();

   rowCols = colNames.size();

   for (int i = 1; i < sheetRows; i++)
   {
      row    = sheet.getRow(i);
      rowMap = new HashMap<String, Object>(rowCols);
      for (int c = 0; c < rowCols; c++)
      {
         rowMap.put(colNames.get(c), getCellValue(row.getCell(c)));
      }
      mapArray.add(rowMap);
   }
   return mapArray;
}

Again, you can overload to provide the convenience of a sheet name instead of an index:

public ArrayList<Map> getMappedValues(String sheetName)
{
   return getMappedValues(this.workbook.getSheetIndex(sheetName));
}

Using What You Read

The following example prints out my collection of series recordings to the console:

public static void main(String args[])
{
   ArrayList<String>                colNames      = null;
   ArrayList<Map<String, Object>>   columnMaps    = null;
   ExcelReader                      excelReader   = null;
   Iterator<String>                 colNamesIt    = null;
   Iterator<Map<String, Object>>    columnMapsIt  = null;
   Map<String, Object>              columnMap     = null;
   String                           colKey        = null;
   String[]                         myRecordings  = new
      String[]{"NCIS", "DoctorWho"};

   try
   {
      excelReader = getInstance("demo_data.xls");

      for(int i=0; i < myRecordings.length; i++)
      {
         colNames     = excelReader.getColNames(myRecordings[i]);
         colNamesIt   = colNames.iterator();
         columnMaps   = excelReader.getMappedValues(myRecordings[i]);
         columnMapsIt = columnMaps.iterator();

         while(columnMapsIt.hasNext())
         {
            columnMap     = columnMapsIt.next();

            while(colNamesIt.hasNext())
            {
               colKey = colNamesIt.next();

               System.out.println(colKey+"\t"+
                  ((columnMap.get(colKey)!=null)?
                    columnMap.get(colKey):""));
            }
            System.out.println("---------------------------------");
            colNamesIt = colNames.iterator();
         }
      }
   }
   catch (Exception any)
   {
      any.printStackTrace();
   }
}

With some minor changes, you can present this as an HTML table:

<tr><% while(colNamesIt.hasNext()){%>
   <th><%=colNamesIt.next() %></th>
   <%}colNamesIt  = colNames.iterator();%></tr>
<% while(columnMapsIt.hasNext()){ %><tr>
<% columnMap = columnMapsIt.next(); while(colNamesIt.hasNext()){
   colKey = colNamesIt.next();
   isNeed = colKey.equals("Status")
            && columnMap.get(colKey)==null;%>
<td<%if(isNeed){ %> bgcolor="red"<%} %>>
   <%=((columnMap.get(colKey)!=null)?columnMap.get(colKey):"Need") %>
</td>
<%}colNamesIt  = colNames.iterator();%></tr>
<%} %></table></td><%}%></tr></table><%}catch
   (Exception any){any.printStackTrace();}%>



Click here for a larger image.

Figure 1: Web Table from HSSF

Or, you could take the field names and values and create a SQL insert or update statement.

Conclusion

Providing non-technical users the ability to submit data in the form of Microsoft Excel files is a win-win solution for both developers and users. The POI project makes this a practical approach for any J2EE web project with their HSSF and XSSF APIs.

For the curious, HSSF stands for "Horrible Spread Sheet Format", and XSSF for "XML Spread Sheet Format". The former is for Excel 2007 and earlier, whereas the latter is for Excel 2007 and forward. Because the majority of users still use the older Excel format, this article contained only HSSF examples. That, and no one has offered me a free copy of the MS Office upgrade to try out the newer APIs.

About the Author

Scott Nelson provides optimization services designing, developing, and maintaining web-based applications for manufacturing, pharmaceutical, financial services, non-profits, and real estate agencies for use by employees, customers, vendors, franchisees, executive management, and others who use a browser. For information on how he can help with your web applications, please visit http://www.fywservices.com/. He also blogs all of the humorous emails forwarded to him at Frequently Unasked Questions.

Sitemap | Contact Us

Thanks for your registration, follow us on our social networks to keep up-to-date