
Java, RDF, and the “Virtual Web” Part three: Content aggregation

Part 2 of this article appeared recently: Java, RDF, and the "Virtual Web" Part two

Content aggregation is defined as a process of collecting data from heterogeneous sources. In today’s online world, this notion gets associated with giant navigational hubs like Yahoo!, Excite, and NetCenter. In practice, content aggregation takes place on different levels in many other Internet and Intranet applications. Such applications either serve content that was aggregated offline, or use custom server-side and browser-side code to perform dynamic aggregation. The power of RDF is in using metadata to control both offline and dynamic aggregation without having to implement custom code.

Content Management

An early solution to the problem of aggregation came in the form of content management systems that supplied content providers with tools for collecting distributed heterogeneous data, converting it to a single representation, and, most commonly, storing it in a relational database. Successful tools tapped into existing workflows to minimize pains from changing old procedures for collecting and maintaining data. Once information was safely in the database, application developers could get at it using server-interpreted scripts (Tcl, Perl, etc.) to generate dynamic pages. The scripts were often referenced through proprietary tags within HTML templates. Template processors were hooked up to HTTP servers via CGI, native server APIs, or, later, the Java Servlet API.

Traditional content management systems work best for brand new applications that have the luxury of establishing their own data maintenance procedures. Replacing or redesigning an existing application often means incompatibility with archived data, which implies massive conversions of data and redundant maintenance for the duration. At the very least, the transition is likely to be a mess, but it can be even worse if existing data maintenance procedures cannot be redesigned.

For example, consider a company that has been maintaining its business policies in MS Word. Employees who were perfectly content editing Word documents may have all kinds of difficulties using database tools to store and edit content, and it would be outright dangerous to get them to edit HTML templates containing content retrieval scripts. The company is likely to end up with a much more complex and expensive process in which the old group of domain experts keeps maintaining policies in their original format. Additional groups would be responsible for converting policies to a new format that is acceptable to database loading tools, as well as for modifying the database and maintaining presentation templates.

RDF and Legacy Data

Although life may be beautiful once all archived content is safely in the database, once efficient tools and procedures are in place, and once employees are re-trained, getting there is incredibly hard. Supporting seamless access to heterogeneous data is not an alternative but a complement to content management solutions. It reduces the pain of transition and makes it possible to deploy new Internet applications without being forced to revamp data maintenance procedures. The trick is to provide seamless access to both new and existing data while implementing a minimal amount of custom code. Using RDF, we can associate virtual resources with units of content (e.g., sections of a Word document), publish the virtual resources, and utilize information in these resources to process run-time HTTP requests. Virtual resources may contain metadata referencing Java components that are responsible for dynamic retrieval and customization of content, as well as metadata serving as input to these components.
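As a rough illustration of that last point, the sketch below shows one way metadata could name a Java component and also serve as its input. It is not the Metaphoria API: the `ContentProcessor` interface, the `EchoProcessor` stand-in, the `VirtualResourceDispatcher` class, and the metadata keys (`processor`, `location`, `sectionid`) are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical contract for a component named in a virtual resource's metadata. */
interface ContentProcessor {
    /** Retrieves the unit of content described by the metadata and returns it as XML. */
    String retrieve(Map<String, String> metadata);
}

/** Stand-in processor for this demonstration only; it performs no real retrieval. */
class EchoProcessor implements ContentProcessor {
    @Override
    public String retrieve(Map<String, String> metadata) {
        // Attribute values are not XML-escaped here; a real component would need to do that.
        return "<section source=\"" + metadata.get("location")
                + "\" id=\"" + metadata.get("sectionid") + "\"/>";
    }
}

class VirtualResourceDispatcher {
    /** Instantiates the class named under "processor" and passes it the metadata as input. */
    static String dispatch(Map<String, String> metadata) {
        String className = metadata.get("processor");
        try {
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            if (!(instance instanceof ContentProcessor)) {
                throw new IllegalArgumentException(className + " is not a ContentProcessor");
            }
            return ((ContentProcessor) instance).retrieve(metadata);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load processor " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("processor", EchoProcessor.class.getName());
        metadata.put("location", "http://www.fakename.com/IT/ITSecurity.doc");
        metadata.put("sectionid", "1.2");
        System.out.println(dispatch(metadata));
    }
}
```

Because the component is looked up by class name at run time, supporting a new content format would mean adding a class and pointing metadata at it, with no change to the dispatching code.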

Sample Specification

Aggregation goes hand-in-hand with syndication, which was discussed in the first two parts of this article.

RDF version of the RSS specification

by Leon Shklar

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rss="http://my.netscape.com/publish/formats/RDF/rss091#">

  <rss:channel about="http://myrdfspecs.com/servlet/rss/channel-9876"
      rss:title="W3C"
      rss:link="http://www.w3.org"
      rss:description="Your source for Web news, events,
                       and technologies!"
      rss:language="en-us">

    <rss:image about="http://www.w3.org/Icons/WWW/w3c_main"
      rss:title="W3C Logo"
      rss:link="http://www.w3.org"
      rss:width="88"
      rss:height="31"
      rss:description="Leading the Web to its full potential ..."/>

    <rss:item
      rss:title="Technical Reports"
      rss:link="http://www.w3.org/TR"
      rss:description="W3C Specifications, Working Drafts and Notes"/>

    <rss:item
      rss:title="Press Information"
      rss:link="http://www.w3.org/Press"
      rss:description="News, Press Releases and more ..."/>

  </rss:channel>

</rdf:RDF>

Here, we consider a slightly more complicated example of content aggregation:

RDF specification for aggregating corporate policies from heterogeneous sources

by Leon Shklar

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:mac="http://www.metaphoria.net/RDF/formats/access#">

  <rdf:Description about="http://www.myserver.com/servlet/rdf/datasource1"
      mac:type="datasource"
      mac:location="http://www.fakename.com/IT/ITSecurity.doc"
      mac:processor="net.metaphoria.dts.filters.WordXML"
      mac:profile="http://myrdfspecs.com/analysis/word-897.rdf"/>

  <rdf:Description about="http://www.myserver.com/servlet/rdf/datasource2"
      mac:type="datasource"
      mac:location="http://www.fakename2.com/building-security.pdf"
      mac:processor="net.metaphoria.dts.filters.AdobeXML"
      mac:profile="http://myrdfspecs.com/analysis/pdf-351.rdf"/>

  <rdf:Description about="http://www.myserver.com/servlet/rdf/datasource3"
      mac:type="datasource"
      mac:location="jdbc:oracle:thin:@dbserver.fake.com:1521:WG73"
      mac:processor="net.metaphoria.dts.access.jdbcWrapper"
      mac:profile="http://myrdfspecs.com/access/query-section.sql"/>

  <rdf:Description about="http://www.fakename.com/servlet/rdf/policy-1234"
      mac:type="resource"
      mac:title="Security"
      mac:description="IT Security Policies"
      mac:language="en-us">

    <mac:item
      mac:title="General Security"
      mac:datasource="http://www.myserver.com/servlet/rdf/datasource3"
      mac:sectionid="q359"
      mac:description="Corporate Security Policies - Introduction"/>

    <mac:item
      mac:title="IT Security - Mainframes"
      mac:datasource="http://www.myserver.com/servlet/rdf/datasource1"
      mac:sectionid="1.2"
      mac:description="Corporate Security Policies - IT Security for Mainframes"/>

    <mac:item
      mac:title="IT Security - Unix"
      mac:datasource="http://www.myserver.com/servlet/rdf/datasource1"
      mac:sectionid="1.5"
      mac:description="Corporate Security Policies - IT Security for Unix Systems"/>

    <mac:item
      mac:title="Building Security"
      mac:datasource="http://www.myserver.com/servlet/rdf/datasource2"
      mac:sectionid="1"
      mac:description="Corporate Security Policies - Building Security - Overview"/>

  </rdf:Description>

</rdf:RDF>

The first part of the specification defines data sources that encapsulate information in a Word document, a PDF file, and a relational database, respectively. Data source specifications contain references to Java classes that are responsible for retrieving content and converting it to XML according to profiles that are also referenced in the specifications. Data sources are, in turn, referenced in the second part of the specification, which defines the composition of a virtual document. This document is composed of sections that come from heterogeneous sources, only one of which, the database, is maintained by a content management system. Once either the PDF file or the Word document is converted into XML and migrated into the database, it is enough to update the data source reference in the specification; the virtual document itself does not change.
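The indirection described above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the product's implementation: `DataSourceSpec`, `SectionItem`, and `VirtualDocumentPlanner` are hypothetical names, the URLs and class names are copied from the sample specification, and the post-migration section key `q360` is invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One data source entry: where the content lives and which class converts it. */
class DataSourceSpec {
    final String location;
    final String processor;

    DataSourceSpec(String location, String processor) {
        this.location = location;
        this.processor = processor;
    }
}

/** One section of the virtual document; it names a data source, not a file. */
class SectionItem {
    final String title;
    final String datasource;
    final String sectionId;

    SectionItem(String title, String datasource, String sectionId) {
        this.title = title;
        this.datasource = datasource;
        this.sectionId = sectionId;
    }
}

class VirtualDocumentPlanner {
    /** Lists, in document order, which processor would serve each section. */
    static List<String> plan(Map<String, DataSourceSpec> sources, List<SectionItem> items) {
        List<String> steps = new ArrayList<>();
        for (SectionItem item : items) {
            DataSourceSpec source = sources.get(item.datasource);
            if (source == null) {
                throw new IllegalArgumentException("Unknown data source: " + item.datasource);
            }
            steps.add(item.title + " <- " + source.processor + " [" + item.sectionId + "]");
        }
        return steps;
    }

    public static void main(String[] args) {
        String base = "http://www.myserver.com/servlet/rdf/";
        Map<String, DataSourceSpec> sources = new LinkedHashMap<>();
        sources.put(base + "datasource1", new DataSourceSpec(
                "http://www.fakename.com/IT/ITSecurity.doc",
                "net.metaphoria.dts.filters.WordXML"));
        sources.put(base + "datasource3", new DataSourceSpec(
                "jdbc:oracle:thin:@dbserver.fake.com:1521:WG73",
                "net.metaphoria.dts.access.jdbcWrapper"));

        List<SectionItem> before = new ArrayList<>();
        before.add(new SectionItem("General Security", base + "datasource3", "q359"));
        before.add(new SectionItem("IT Security - Unix", base + "datasource1", "1.5"));

        // After a hypothetical migration of the Word section into the database,
        // only the item's data source reference and section key change ("q360" is made up).
        List<SectionItem> after = new ArrayList<>();
        after.add(new SectionItem("General Security", base + "datasource3", "q359"));
        after.add(new SectionItem("IT Security - Unix", base + "datasource3", "q360"));

        for (String step : plan(sources, before)) System.out.println(step);
        for (String step : plan(sources, after)) System.out.println(step);
    }
}
```

In both runs the section titles and their order are identical; only the processor behind the second section differs, which is the sense in which the virtual document is shielded from the migration.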

Summary

Content management systems are instrumental in constructing content aggregation solutions. However, there are problems with the "one size fits all" approach that they enforce. What we need is an opportunity to analyze existing data and maintenance procedures, and to selectively migrate some data components to content management systems while retaining the rest in their original formats. RDF applications can help achieve this objective by establishing an abstraction layer that enables the selective migration of data and provides the flexibility to postpone the migration until it becomes appropriate. Moreover, RDF may be successfully applied to modeling workflows, which makes it an attractive platform for next-generation content management systems.

Conclusion

RDF holds great promise for the future of the Web. Even though content aggregation, syndication and personalization applications are most likely to benefit in the very short term, RDF is by no means limited to these areas. As the technology matures, it will become increasingly simple to create new applications by generating RDF models. Such models may be generated based on very high-level interactive specifications and data analysis. Ultimately, this should provide non-programmers with the ability to build sophisticated Internet applications.

About the author

Leon Shklar holds a Ph.D. in Computer Science from Rutgers University, New Brunswick, N.J. He is the director of R&D at Information Architects Corp. (IA), Hoboken, N.J. IA’s Metaphoria Virtual Web Server is the first commercial product that employs RDF models to construct sophisticated content aggregation and syndication solutions for the Internet.


