Last year, noted blogger and technical evangelist Robert Scoble made headlines by explaining how he manages to read a staggering 622 RSS (Rich Site Summary) feeds each and every morning. Although attempting to digest this much information on a regular basis is probably overkill for most, it’s a testament to the efficiency boost gained from subscribing to RSS feeds in lieu of navigating from one web site to the next.
But what exactly is an RSS feed, and how does one go about consuming one? In this tutorial, I’ll show you how to use Ruby to retrieve and parse RSS feeds from your favorite web sites. You can use what you learn here to do something as simple as including a favorite RSS feed on your web site, or as the basis for building your own custom RSS aggregator!
RSS Internals
RSS was created almost a decade ago by Netscape for use on their My Netscape portal, which made it possible for users to customize their home pages with a variety of custom data (at the time a cutting-edge development). This XML-based format made it possible for content publishers to distribute information in a format-agnostic manner, allowing others to integrate this content into their web sites with relative ease. That is, easy if you understand RSS’ XML dialect.
If you open any RSS feed within a text editor, you’ll see it contains a bunch of slightly confusing tags that delimit data identified as titles, URLs, dates, creators, and descriptions, among others. For example, here’s a snippet from my blog’s RSS feed:
    <item>
      <title>Adding Multiple Markers with YM4R</title>
      <link>http://www.wjgilmore.com/?p=38</link>
      <comments>http://www.wjgilmore.com/?p=38#comments</comments>
      <pubDate>Thu, 06 Mar 2008 14:39:29 +0000</pubDate>
      <dc:creator>wjgilmore</dc:creator>
      <guid isPermaLink="false">http://www.wjgilmore.com/?p=38</guid>
      <description><![CDATA[In this post you'll learn how to add multiple markers...]]></description>
      <content:encoded><![CDATA[In this post you'll learn how to add multiple markers...]]></content:encoded>
      <wfw:commentRss>http://www.wjgilmore.com/?feed=rss2&p=38</wfw:commentRss>
    </item>
Therefore, to parse and format a feed, you need to iterate over the tags found in the document, and understand the context of the content found within. Writing these sorts of capabilities from scratch can be a real chore; however, with Ruby much of the work has already been done for you!
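To get a feel for the chore being described, here is a minimal sketch of walking a feed by hand with REXML, a general-purpose XML library that is distributed with Ruby. The trimmed-down feed string and the entries array are assumptions made for illustration; a real feed carries many more tags, plus namespaces that would need their own handling.

```ruby
require 'rexml/document'

# A trimmed-down, hypothetical feed containing a single item
xml = <<~XML
  <rss version="2.0">
    <channel>
      <title>Example Feed</title>
      <item>
        <title>Adding Multiple Markers with YM4R</title>
        <link>http://www.wjgilmore.com/?p=38</link>
        <pubDate>Thu, 06 Mar 2008 14:39:29 +0000</pubDate>
      </item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(xml)

# Visit every <item> and pull out the child tags we care about
entries = []
doc.elements.each("rss/channel/item") do |item|
  entries << {
    title: item.elements["title"].text,
    link:  item.elements["link"].text,
    date:  item.elements["pubDate"].text
  }
end

entries.each { |entry| puts "#{entry[:title]} (#{entry[:link]})" }
```

Note that the date is still a plain string at this point, and nothing here knows which tags are optional. That bookkeeping is the part a dedicated RSS parser handles for you.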
Using Ruby to Consume RSS Feeds
RSS parsing has become so commonplace that the capability ships with Ruby’s standard library (recent Ruby versions distribute it as a bundled gem). Once required in your script, the rss library will take care of all of the heavy lifting involved in parsing the feed, in the end providing you with an object from which you can access the various RSS elements. The following script does exactly this, retrieving my blog’s RSS feed and outputting some information about the feed:
    # Provides RSS parsing capabilities
    require 'rss'

    # Allows URI.open to access remote files
    require 'open-uri'

    # What feed are we parsing?
    rss_feed = "http://feeds.feedburner.com/WJasonGilmore"

    # Variable for storing feed content
    rss_content = ""

    # Read the feed into rss_content
    # (URI.open is used because a bare open no longer accepts URLs in current Ruby versions)
    URI.open(rss_feed) do |f|
      rss_content = f.read
    end

    # Parse the feed, dumping its contents to rss
    rss = RSS::Parser.parse(rss_content, false)

    # Output the feed title, website URL, and number of entries
    puts "Title: #{rss.channel.title}"
    puts "RSS URL: #{rss.channel.link}"
    puts "Total entries: #{rss.items.size}"
Save this file as parserss.rb and execute it from the command line:
    %>ruby parserss.rb
    Title: W. Jason Gilmore
    RSS URL: http://www.wjgilmore.com
    Total entries: 10
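If you would like to experiment without a network connection, the same parser call can be pointed at a string instead of a downloaded feed. The following is a minimal sketch under that assumption; the SAMPLE_FEED contents and the example.com URLs are made up for illustration and are not part of my blog’s feed.

```ruby
require 'rss'

# Hypothetical two-item RSS 2.0 feed, kept inline so no download is needed
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example Feed</title>
      <link>http://www.example.com</link>
      <description>A made-up feed used for illustration</description>
      <item>
        <title>Second Post</title>
        <link>http://www.example.com/?p=2</link>
        <pubDate>Thu, 06 Mar 2008 14:39:29 +0000</pubDate>
        <description>The newer of the two posts.</description>
      </item>
      <item>
        <title>First Post</title>
        <link>http://www.example.com/?p=1</link>
        <pubDate>Wed, 05 Mar 2008 14:52:28 +0000</pubDate>
        <description>The older of the two posts.</description>
      </item>
    </channel>
  </rss>
XML

# The second argument (false) skips strict validation, as in the script above
feed = RSS::Parser.parse(SAMPLE_FEED, false)

puts "Title: #{feed.channel.title}"
puts "Website: #{feed.channel.link}"
puts "Total entries: #{feed.items.size}"

# Each item exposes its tags as methods; date comes back as a Time object
first = feed.items.first
puts "#{first.title} (#{first.date.year})"
```

Working from an inline string like this is also a convenient way to try out formatting changes before running them against a live feed.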
Retrieving and displaying the various posts is just as easy. To see this capability in action, add the following snippet to the end of the file:
    rss.items.each do |item|
      puts "<a href='#{item.link}'>#{item.title}</a>"
      puts "Published on: #{item.date}"
      puts "#{item.description}"
    end
Executing parserss.rb anew will produce the same output as before, in addition to output that looks like this:
    <a href='http://www.wjgilmore.com/?p=39'>Great Rails Documentation Interface</a>
    Published on: Thu, 06 Mar 2008 11:09:50 -0500
    I met local Rails guru Josh Schairbaum for lunch yesterday, and [...]
    <a href='http://www.wjgilmore.com/?p=38'>Adding Multiple Markers with YM4R</a>
    Published on: Thu, 06 Mar 2008 09:39:29 -0500
    Continuing the mini-series on Rails/YM4R and Google Maps, one of the most common problems [...]
    <a href='http://www.wjgilmore.com/?p=37'>Changing the default Google Maps API Icon</a>
    Published on: Wed, 05 Mar 2008 09:52:28 -0500
    I’m using Bill Eisenhauer’s awesome YM4R Rails [...]
Of course, you’re free to apply liberal amounts of Ruby to the data before it’s output. For instance, the default date format is not very user-friendly. One way to improve it is by using strftime, like so:
    puts "Published on: #{item.date.strftime("%B %d, %Y")}"
This will result in the publication dates being formatted like this: March 05, 2008.
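To try a few format strings in isolation, here is a small sketch that works on a standalone Time object rather than a feed item. The timestamp is borrowed from the sample output above; the two extra format strings are illustrative choices only.

```ruby
require 'time'

# Parse an RFC 2822 date string, the format used by RSS pubDate values
published = Time.rfc2822("Wed, 05 Mar 2008 09:52:28 -0500")

long_form  = published.strftime("%B %d, %Y")   # full month name, zero-padded day
short_form = published.strftime("%b %-d, %Y")  # abbreviated month, day without padding
iso_form   = published.strftime("%Y-%m-%d")    # sortable year-month-day

puts long_form   # March 05, 2008
puts short_form  # Mar 5, 2008
puts iso_form    # 2008-03-05
```

The %-d directive drops the leading zero, which tends to read more naturally in running text.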
Dumping the Transformed RSS to an HTML File
Viewing the transformed RSS in a terminal window doesn’t exactly improve your situation. Instead, you’ll want to view the HTML in a browser. Fortunately, dumping the HTML to a file is easy. Just modify the previous code to look like this:
    rss_html = ""

    rss.items.each do |item|
      rss_html << "<p><a href='#{item.link}'>#{item.title}</a><br />"
      rss_html << "Published on: #{item.date.strftime("%B %d, %Y")} <br />"
      rss_html << "#{item.description}</p>"
    end

    File.open("wjgilmore.html", "w") do |f|
      f.write rss_html
    end
Execute this revised script and then load the file that has been created (wjgilmore.html) into your browser. You’ll see output similar to that shown in the following screenshot.
Figure 1: Converting the RSS feed to HTML.
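One thing to watch for when generating HTML from feed data is that titles and links may contain characters such as & or <, which can break the resulting markup. The following is a minimal sketch of one way to handle that, using ERB::Util.html_escape from Ruby’s standard library. The Post struct, the sample post, the h helper, and the feed_preview.html filename are all assumptions made for illustration, standing in for the items returned by the parser.

```ruby
require 'erb'
require 'tmpdir'

# Hypothetical stand-in for the objects returned by rss.items
Post = Struct.new(:title, :link, :description)

posts = [
  Post.new("Tips & Tricks for <item> Tags",
           "http://www.example.com/?p=1&ref=rss",
           "A short summary.")
]

# Small helper (illustrative name) that escapes &, <, >, and quotes
def h(text)
  ERB::Util.html_escape(text)
end

# Build one paragraph per post, escaping every interpolated value
body = posts.map do |post|
  "<p><a href='#{h(post.link)}'>#{h(post.title)}</a><br />\n" \
  "#{h(post.description)}</p>"
end.join("\n")

# Wrap the paragraphs in a complete HTML document
page = <<~HTML
  <!DOCTYPE html>
  <html>
  <head><meta charset="utf-8"><title>Feed Preview</title></head>
  <body>
  #{body}
  </body>
  </html>
HTML

# Written to the system temp directory here; any writable path would do
output_path = File.join(Dir.tmpdir, "feed_preview.html")
File.write(output_path, page)
puts "Wrote #{output_path}"
```

Whether to escape the description is a judgment call: many feeds deliberately place HTML in that field, in which case you may prefer to pass it through untouched.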
Sorting Feeds According to Date
Most RSS feeds are sorted in descending date order (newest on top); however, occasionally you’ll encounter a feed that doesn’t comply, and you’ll need to account for it within your script. Or perhaps you would rather read the posts oldest first. Using Ruby’s sort! method, sorting the items in ascending date order (oldest first) is very simple:
rss.items.sort! {|a,b| a.date <=> b.date}
If you want them sorted newest first instead, just add the following line after the above:
rss.items.reverse!
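Comparing dates with <=> assumes every item actually has one, and some feeds omit the publication date on individual entries. Here is a minimal sketch of one way to cope, using plain Ruby objects in place of feed items. The Entry struct and its sample data are hypothetical, and placing undated entries last is just one possible policy.

```ruby
# Hypothetical stand-in for feed items; one entry has no date at all
Entry = Struct.new(:title, :date)

entries = [
  Entry.new("Second post",  Time.utc(2008, 3, 6, 14, 39, 29)),
  Entry.new("Undated post", nil),
  Entry.new("First post",   Time.utc(2008, 3, 5, 14, 52, 28)),
  Entry.new("Third post",   Time.utc(2008, 3, 6, 16, 9, 50))
]

# Separate entries that can be compared from those that cannot
dated, undated = entries.partition { |entry| entry.date }

# sort_by returns a new array, leaving the original order untouched
oldest_first = dated.sort_by { |entry| entry.date }

# Reverse for newest first, then append the undated entries at the end
newest_first = oldest_first.reverse + undated

newest_first.each do |entry|
  stamp = entry.date ? entry.date.strftime("%Y-%m-%d %H:%M") : "no date"
  puts "#{stamp}  #{entry.title}"
end
```

Unlike sort! and reverse!, this version builds new arrays, which is handy if you still need the feed’s original ordering elsewhere in your script.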
Where to From Here?
With great web-based aggregators such as Google Reader at your disposal, there’s little reason to re-invent the wheel and create your own. However, there remain countless possibilities for integrating feeds into your own web site, or creating a new application that helps users filter information more efficiently than ever before. Hopefully, this tutorial will serve as a catalyst for creating the next RSS-driven solution!
About the Author
W. Jason Gilmore is a freelance web developer, consultant, and technical writer. He’s the author of several books, including the best-selling “Beginning PHP and MySQL 5: Novice to Professional, Second Edition” (Apress, 2006. 913pp.).