
Using the Twitter API to Create an Early Alert System

More so today than ever before, rapid and decisive responses to new information can be critical to success.  Whether we are seeking to profit from fluctuations in the stock market, to prepare for or respond to a crisis that is beginning to unfold, or to do anything else where quick thinking is key, being alerted to changes in the status quo as early as possible can be a huge advantage.  Luckily, data streams that constantly report breaking news and shifts in people’s opinions have become readily accessible for data collection.  One such example is the widely adopted Twitter platform, on which individuals can rapidly microblog about ongoing events in posts of up to 140 characters.

This article details the development of an early alert system that uses the Twitter search API to monitor the number of new Tweets on a topic of interest that appear each hour and record them in a database.  The idea behind the methodology is that if the hourly occurrences of new Tweets are monitored for an extended period of time, a baseline can be established that represents the normal amount of interest in that topic.  When breaking news about the topic occurs, the number of Tweets pertaining to it is hypothesized to exceed the baseline values.

The example application considers a computer security scenario in which the security team for a corporation wants to stay on top of emerging computer security threats, but the concept and code can be readily adapted to any type of event or news topic.  In this scenario the security team is concerned with threats pertaining to SQL injection, XSS, rootkits, botnets and DDoS.

The Twitter Search API Requests

The Twitter search API is used by making GET requests to the URL http://search.twitter.com/search.format, where format is replaced by either json or atom, depending on the type of results the developer wants returned.  The example application in this article uses the ATOM format, resulting in a base URL of http://search.twitter.com/search.atom.  The parameters of the search are then appended to this base URL.  A comprehensive listing of the Twitter Search API parameters can be found at https://dev.twitter.com/docs/api/1/get/search.  The only required parameter is “q”, which specifies the search query.  Thus a basic search for the word malware could be performed by making a request to http://search.twitter.com/search.atom?q=malware.

Our example application makes use of several other parameters as well.  The “p” parameter controls the results page number (starting with 1), while the “rpp” parameter controls how many results appear on each page, with a maximum of 100 results per page.  The search API allows you to retrieve up to 1500 results for each search, so the example application processes 15 pages of results with 100 results on each page.  The final parameter used by the example application is “result_type”, which has three options.

a) recent – returns the most recent tweets on a topic

b) popular – returns the most popular tweets on a topic

c) mixed – returns a combination of recent and popular tweets. 

Since the early alert system is based on occurrences over time, this parameter was set to “recent” to ensure that the 1500 most recent tweets are returned by the search for processing by the application. 
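The request URLs described above can be assembled along the following lines.  This is only a minimal sketch that uses core Perl; the helper names encode_value and build_search_url are hypothetical and are not part of the Twitter API or of the article's script.

```perl
use strict;
use warnings;

# Base URL for ATOM-formatted results, as given in the article.
my $urlbase = 'http://search.twitter.com/search.atom';

# Hypothetical helper: percent-encodes a value for a query string
# (spaces become '+'). Only handles single-byte characters.
sub encode_value {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-_.~ ])/sprintf('%%%02X', ord($1))/ge;
    $value =~ tr/ /+/;
    return $value;
}

# Hypothetical helper: builds the URL for one page of recent results.
sub build_search_url {
    my ($query, $page) = @_;
    my @params = (
        'q'           => $query,    # the search query (the only required parameter)
        'p'           => $page,     # results page number, starting with 1
        'rpp'         => 100,       # results per page (maximum of 100)
        'result_type' => 'recent',  # most recent tweets rather than popular or mixed
    );
    my @pairs;
    while (my ($name, $value) = splice(@params, 0, 2)) {
        push @pairs, $name . '=' . encode_value($value);
    }
    return $urlbase . '?' . join('&', @pairs);
}

# 15 pages of 100 results cover the 1500-result limit.
my @urls = map { build_search_url('sql injection', $_) } 1 .. 15;
print "$urls[0]\n";
```

Keeping the parameters in one list makes it easy to change the page size or result type later without touching the string concatenation.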

Twitter Search API Results

ATOM formatted Twitter Search API results are returned with all Tweet information contained between a set of “<entry>” tags as demonstrated in Listing 1:

Listing 1: A sample Tweet result as displayed in the ATOM format. 




<entry>
   <link type="text/html" href="http://twitter.com/nanoquetz9l/statuses/156550318447009793" rel="alternate"/>
   <title>@eXploitSD site registrations open. All levels welcome #exploitation #vulnerability #malware #reverse #ruby #Metasploit http://t.co/IZh5BRCL</title>
   <content type="html">@&lt;a class=" " href="http://twitter.com/eXploitSD"&gt;eXploitSD&lt;/a&gt; site registrations open. All levels welcome &lt;a href="http://search.twitter.com/search?q=%23exploitation" title="#exploitation" class=" "&gt;#exploitation&lt;/a&gt; &lt;a href="http://search.twitter.com/search?q=%23vulnerability" title="#vulnerability" class=" "&gt;#vulnerability&lt;/a&gt; #&lt;em&gt;malware&lt;/em&gt; &lt;a href="http://search.twitter.com/search?q=%23reverse" title="#reverse" class=" "&gt;#reverse&lt;/a&gt; &lt;a href="http://search.twitter.com/search?q=%23ruby" title="#ruby" class=" "&gt;#ruby&lt;/a&gt; &lt;a href="http://search.twitter.com/search?q=%23Metasploit" title="#Metasploit" class=" "&gt;#Metasploit&lt;/a&gt; &lt;a href="http://t.co/IZh5BRCL"&gt;http://t.co/IZh5BRCL&lt;/a&gt;</content>
   <link type="image/png" href="http://a3.twimg.com/profile_images/1503924389/dry_normal.gif" rel="image"/>
   <twitter:source>&lt;a href="http://twitter.com/"&gt;web&lt;/a&gt;</twitter:source>
   <author>
      <name>nanoquetz9l (Rick Flores)</name>
   </author>
</entry>




Within the set of <entry> tags are additional child elements that contain the information that comprises the Tweet, such as the <published> tags, which list the time the Tweet was published in GMT, the <content> tags, which contain the Tweet itself, and the <author> tags and their corresponding children, which contain information about the author. The ATOM data in Listing 1 is representative of the Tweet displayed in Figure 1.  For our application we will be extracting the data from the <published> and <content> tags that comprise each search result/entry. 

Figure 1: The Tweet that corresponds to the ATOM data demonstrated in Listing 1
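To illustrate how the <published> and <content> values can be pulled out of each entry, the following minimal sketch parses a trimmed-down feed with XML::LibXML.  The feed text, including its timestamp, tweet text and author name, is made-up sample data and not an actual API response.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed-down ATOM feed; the timestamp, tweet text and author are made-up sample values.
my $atom = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <published>2012-01-10T01:02:03Z</published>
    <content type="html">example tweet about #malware</content>
    <author><name>example_user</name></author>
  </entry>
</feed>
XML

my $domtree = XML::LibXML->new->parse_string($atom);

my @tweets;
foreach my $entry ($domtree->getElementsByTagName('entry')) {
    # In list context getChildrenByTagName returns the matching child nodes.
    my ($published) = $entry->getChildrenByTagName('published');
    my ($content)   = $entry->getChildrenByTagName('content');
    next unless defined $published && defined $content;
    push @tweets, {
        published => $published->textContent,   # time of the tweet in GMT
        content   => $content->textContent,     # the tweet itself
    };
}

print "$_->{published}  $_->{content}\n" for @tweets;
```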

The Early Alert Application

Before we begin looking at the code for an early alert application, we need to construct a database to hold all of the Twitter data that we collect for later analysis.  For this particular project SQLite3 is used as the database of choice, since its self-contained and serverless nature makes it easy for those who wish to try the sample code to set up a test environment.  SQLite databases can be set up from the command line, but for those who prefer a GUI, the SQLite Database Browser provides a nice front end for dealing with SQLite databases graphically.  The SQLite Database Browser was also used to create the screen captures of the database contained in this article.  The early alert application demoed in this article makes use of a database that consists of two tables (Figure 2).  The Counts table consists of “keyword” and “count” fields and is used to store the number of new tweets that appeared within the last hour for each keyword the application searches for.  The Tweets table consists of a “keyword” field, a “time” field, and an “entry” field, which contain the keyword that yielded the tweet, the time of the tweet, and the content of the tweet respectively. 

Figure 2: The schema of the database used by the application
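For readers who prefer to create the two tables from Perl rather than from a GUI, a minimal sketch using DBI follows.  The table and field names come from the description above, but the column types are assumptions, and an in-memory database is used here only to keep the example self-contained (the article's script connects to the file TwitterSec.db).  The inserted row is made-up sample data.

```perl
use strict;
use warnings;
use DBI;

# An in-memory database keeps the sketch self-contained.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });

# Column types are assumptions; only the table and field names come from the article.
$dbh->do('CREATE TABLE Counts (keyword TEXT, count INTEGER)');
$dbh->do('CREATE TABLE Tweets (keyword TEXT, time TEXT, entry TEXT)');

# Placeholders (?) let DBI quote the values safely. The row is made-up sample data.
$dbh->do('INSERT INTO Counts (keyword, count) VALUES (?, ?)', undef, 'rootkit', 12);

my ($count) = $dbh->selectrow_array(
    'SELECT count FROM Counts WHERE keyword = ?', undef, 'rootkit');
print "rootkit: $count\n";
```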

Now that the database is set up, the application that will search Twitter, process the Tweets, and store the data in the database can be examined.  The code for the application is written in Perl and contained in Listing 2.

Listing 2: The early alert Perl code.

use strict;
use warnings;
use LWP;
use DBI;
use XML::LibXML;
use Time::Local;

#terms to search Twitter for
my @queries=('sql+injection','XSS','rootkit','botnet','DDoS');

#opens connection to DB to store data
my $dbh = DBI->connect("dbi:SQLite:dbname=TwitterSec.db","","");

my $ua = LWP::UserAgent->new;

#start of twitter URL - specifies results in ATOM format
my $urlbase='http://search.twitter.com/search.atom?q=';

#used to parse ATOM results
my $parser=XML::LibXML->new;

for(my $j=0;$j<=3;$j++){ #controls how many hours monitoring will occur for
  my $currenttime=time;
  foreach my $query (@queries){ #loops through search terms
    my $count=0;
    for(my $i=1;$i<=15;$i++){ #checks all 15 result pages
      #builds URL to receive 100 of the most recent results per page
      my $url=$urlbase . $query . "&p=$i&rpp=100&result_type=recent";
      my $response=$ua->get($url);
      next unless $response->is_success; #skips pages that failed to download
      my $domtree=$parser->parse_string($response->content);
      my @entries=$domtree->getElementsByTagName("entry");
      foreach my $entry (@entries){
        my $time=$entry->getChildrenByTagName("published")->to_literal; #extracts time
        my $content=$entry->getChildrenByTagName("content")->to_literal; #extracts tweet
        #breaks time up into components, skipping entries with an unexpected format
        next unless $time=~/(\d+)-(\d+)-(\d+)T(\d+):(\d+):(\d+)/;
        my ($year,$month,$day,$hour,$min,$sec)=($1,$2-1,$3,$4,$5,$6); #months are 0-11
        my $tweettime=timegm($sec,$min,$hour,$day,$month,$year); #converts time to unix time
        if($tweettime+3600>=$currenttime){ #finds tweets less than 1 hour old
          my $ltime=localtime($tweettime);
          open STDERR, '>>', 'InsertError'; #appends insert errors to a file
          #placeholders keep quotes inside a tweet from breaking the SQL statement
          $dbh->do("INSERT INTO Tweets(keyword,time,entry) VALUES (?,?,?)",undef,$query,$ltime,$content);
          close STDERR;
          $count++; #counts tweets from the last hour
        }
      }
    }
    #stores the hourly count for this search term
    $dbh->do("INSERT INTO Counts(keyword,count) VALUES (?,?)",undef,$query,$count);
  }
  sleep(3600); #waits an hour before reprocessing the query terms
}

The code demonstrated in Listing 2 makes use of several Perl modules.  The LWP module is used to make the GET requests and receive the responses.  The DBI module is used to interface with the SQLite database, and the XML::LibXML module is used to parse the ATOM formatted results.  The Time::Local module is used to convert the Tweet times into Unix time (seconds since 1970) to allow for consistent time handling.  The code listing then defines the keywords that will be used for each query of the API and stores the terms in the @queries array.  Next a connection to the database is established. 

The script is set up to process search results on an hourly basis, and thus the first loop encountered controls how many hours the script will perform the monitoring.  Once inside this loop, the current time is established in Unix notation, and an inner loop processes each of the query terms listed in the @queries array, setting a counter ($count) to 0 for each term.  For each term the 1500 most recent tweets are retrieved from the search API and the XML parser is used to extract the time and content that pertain to each tweet.  The time data is separated into its constituent parts (e.g. month, day, hour, etc.) and converted to Unix time.  Note that the timegm function numbers months 0-11, which is the rationale for subtracting one from the month value.  Once the tweet time in Unix notation is obtained, it is compared to the current time to evaluate whether the tweet occurred within the last hour (3600 seconds).  If it did, the tweet information is inserted into the database (Figure 3) and the $count variable is incremented by one. Any errors in the insert process are redirected to the file InsertError.  After each query term is processed, the $count information is inserted into the database (Figure 4) and the process is repeated with the next term.  After an hour has elapsed, as timed by the sleep function, the query terms are reprocessed.  The hourly processing of query terms continues until the maximum number of iterations specified in the outermost loop is reached. 

Figure 3: Tweets information inserted into the database

Figure 4: Hourly Tweet counts displayed by keyword 
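The time handling described above can be isolated into two small functions, which makes it easier to check on its own.  This is a minimal sketch; the function names atom_to_unix and within_last_hour are hypothetical, and the sample timestamp is made up.

```perl
use strict;
use warnings;
use Time::Local;

# Converts an ATOM timestamp such as 2012-01-10T01:02:03Z (GMT) to Unix time.
# Returns undef when the string does not match the expected layout.
sub atom_to_unix {
    my ($stamp) = @_;
    my ($year, $month, $day, $hour, $min, $sec) =
        $stamp =~ /(\d+)-(\d+)-(\d+)T(\d+):(\d+):(\d+)/
        or return;
    # timegm numbers months 0-11, hence the subtraction.
    return timegm($sec, $min, $hour, $day, $month - 1, $year);
}

# True when the tweet was published no more than 3600 seconds before $now.
sub within_last_hour {
    my ($tweettime, $now) = @_;
    return $tweettime + 3600 >= $now;
}

my $tweettime = atom_to_unix('2012-01-10T01:02:03Z');   # made-up sample timestamp
print "Unix time: $tweettime\n";
print "Recent\n" if within_last_hour($tweettime, $tweettime + 1800);
```

Returning undef for an unparseable timestamp lets the caller skip a malformed entry instead of converting stale values left over from an earlier match.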

Application Usage

In order to make effective use of this script as an early alert system, the script must initially be run for long enough to establish a firm baseline of what “normal” Twitter traffic is for the topic of interest.  Once these baseline measurements are established, the results for each hourly period can be compared to them.  Any current Tweet traffic that significantly exceeds your baseline measure is an indicator that some type of breaking news or event pertaining to that topic is occurring, and thus serves as an early alert of something to be on the lookout for.  For example, in the computer security scenario presented in this article, a sudden spike in Tweets pertaining to rootkits may be an indication of a new type of malware or backdoor appearing in the wild, and could alert the security team to proactively take steps to detect the presence of such a threat and mitigate any damage it could do. 
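The article does not prescribe how to decide that traffic “significantly exceeds” the baseline.  One possible rule, shown here purely as an assumption, is to flag any hourly count that lies more than a chosen number of standard deviations above the baseline mean.  The function names, the multiplier of 3 and the hourly counts below are all made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Mean and sample standard deviation of a list of hourly counts
# (needs at least two counts).
sub baseline {
    my @counts = @_;
    my $mean = sum(@counts) / @counts;
    my $var  = sum(map { ($_ - $mean) ** 2 } @counts) / (@counts - 1);
    return ($mean, sqrt($var));
}

# Flags a count more than $k standard deviations above the baseline mean.
sub is_spike {
    my ($current, $mean, $sd, $k) = @_;
    return $current > $mean + $k * $sd;
}

my @history = (10, 12, 9, 11, 13, 10, 12, 11);   # made-up hourly counts for one keyword
my ($mean, $sd) = baseline(@history);

for my $current (12, 40) {
    printf "%d tweets: %s\n", $current,
        is_spike($current, $mean, $sd, 3) ? 'ALERT' : 'normal';
}
```

In practice the history would be read from the Counts table for each keyword, and the multiplier tuned to trade off missed events against false alarms.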
