Solving Complex SQL Problems with Full-Text Indexing
I've always believed that the best way to teach a new technique or technology is to define a real-world problem and then illustrate how new technique or technology helped to solve the problem. Therefore, in today's article, I am not only going to write about SQL Server's Full-Text Index (what it is, how to configure it and how to use it), but I'm also going to do so by presenting a very real problem and showing how this SQL Server feature helped me to solve it.
I believe that most code-oriented developer sites make the same mistake of "pigeon holing" their articles into one category or another when organizing their sites. For example, one article might be defined as a "Visual C++ Listview" article and placed into an appropriate folder. Another article might be defined as a "Visual Basic dialog" article and so on. Sites that organize their articles this way typically then provide a table of contents into each folder and provide links on their home page to each folder. It is then up to the site's users to figure out which folder contains the article they're in search of. At first glance, this might appear to be a very straightforward, logical means of organizing a site.
However, this approach is becoming more and more difficult to maintain due to a myriad reasons. One obvious reason is that the reader might not care what language is being used in the article's sample code. In the case of an article detailing how to use MSMQ, the language of the code snippets isn't as important as what the author has to say about what MSMQ is, what advantages it provides and how to incorporate it into your system's overall design. Another frequent problem with is that in today's distributed applications environment frequently multiple technologies are used together. For example, if someone writes an article explaining how to combine the powers of XML, XSL and ASP to write a "guest book" application, in which category does this article belong? If you put the article in only one of those three folders, then you run the risk of someone not finding the article. On the other hand, if you place the article in one folder and links to it from each of the other folders, you run into maintenance headaches. I could go into a dozen more reasons, but the fact is that we had to do something in order to not only make the site easier to maintain, but we needed to make it easier for readers to find the articles they needed to do their jobs.
The answer was to go with an approach where all articles are basically created equal and the database would be used in order to store keywords that would form the backbone of the search engine. Now instead of doing a brain-dead text search through all of our site's content we could offer the readers a search dialog where they could specifically state what search criteria they wanted to search by (e.g., language, keywords, date, author, etc.)
The first thing to do was to simply create two tables: one for Articles and another for Keywords. Since the cardinality between the Articles and Keywords table would be "many-to-many", I followed very basic normalization rules, and created an "intersection table" (Articles2Keywords) that would contain the primary keys for each table. This relationship is shown via a SQL Server 7.0 Diagram in Figure 1.
Now this table layout works great if you want to simply associate the article records to their respective keyword records. However, some problems become quite evident when you start thinking about how to do more than just list an article's keywords. In other words, the code is quite simple now (via a very basic SQL join) to read an article's record and then retrieve all of its keywords. You can also do the opposite just as easily and print a list of keywords along with a list of articles that use that keyword.
However, what isn't so easy is the following. Let's say that I have the following Article table records:
|1||VC++, MSMQ article|
|2||VB, MSMQ article|
I would then have the following records in my Keywords table:
So far, so good. Now the intersection table, Articles2Keywords, would look like this.
Now comes the hard part! How can I write an SQL query against these tables where I stipulate things like return all rows where an article has the keyword VC++ and MSMQ? The simple answer to this question is that you can't write an SQL statement that would allow you to do this and at the same time not result in a Cartesian product-producing, index-ignoring, performance nightmare. The problem is that most of time when you write a query, the specified "pattern matching" is done on a row-by-row basis. In other words, a simple query might have you searching for all articles that were published after a given date. In that case, each Articles table record that meets the specified selection criteria is returned. However, in the case of a many-to-many relationship like the one explained in this article where the search criteria would actually need to be applied to multiple records (e.g., give me all Article records where those records have specific associated multiple records in Keywords table), this becomes impossible to write with a simple query or stored procedure. This is where the combining of a little imagination and the power of SQL Server's Full-Text Indexing come in.
Introducing Full-Text Indexing
The imagination part is that we create a forth, non-normalized table called articleKeywords that contains only two columns: lArticleSeq (the Article table's primary key) and sKeywords (a comma-delimited string of keywords that are used for the article. Now, all we have to do is define a trigger that calls a stored procedure whenever a record is created, updated or deleted in the Articles2Keywords table. This stored procedure will then recreate the articleKeywords record for the affected article. Using our data set described earlier, our articleKeywords table would contain the following data.
Now we just need a simple way to search through this table when the reader performs a search for a given article. To do this, we enlist the help of a relatively underutilized feature of SQL Server called full-text indexing. Full-text indexing is not very complex to use, but does have an absolutely huge amount of flexibility and features associated with it. To fully cover every aspect of this wonderful feature would be a full article in itself. Therefore, for purposes of this article, I'll briefly go over some of the more advanced usages of full-text indexing, but I'll try and keep this description short and salient to the context of solving this article's stated problem.
Full-text indexing was invented to solve just the sort of problem we're faced with here. That is, we need an efficient means of searching through a given column (sKeywords) of a table (articleKeywords) for different combinations of text. For example, we might need to search for each record that contains the text "VC++" and "MSMQ". These two text strings might or might not be adjacent to one another in the sKeywords column.
What full-text indexing does is it provides the ability to search for a set of words or phrases, or even a complete sentence within a text column. It does this through the SQL CONTAINS predicate. This predicate allows you to specify boolean operators, proximity terms (e.g., NEAR), generation terms and even weighted terms in your search criteria. Here are some examples of each of these (notice that the first parameter of the CONTAINS predicate is the column name to be searched):
- Boolean operators such as AND and OR can be used.
SELECT lArticleSeq from articleKeywords WHERE CONTAINS(sKeywords, 'VC++ AND MSMQ')
- Proximity terms enable you to search for a word within a specified distance (in words) of another word. For example, the following SQL would search for records where the keyword MSMQ is near the keyword VC++. Obviously, our particular problem is not suited for this type of search, but this ability does come in handy when searching through very large text items (whole articles, for example).
SELECT lArticleSeq from articleKeywords WHERE CONTAINS(sKeywords, 'VC++ NEAR MSMQ')
- Generation terms are extremely useful when you are searching for words that may be represented in different forms (as opposed to technical terms). For example, the following SQL would return columns containing any form of the base word found (e.g., find, finding, found, etc.):
- Weighted terms are used when you want to give a certain value, or weight, to the different elements being searched for. This is the type of search performed by advanced Web site search engines when they return with a list of weighted search results:
SELECT articleTitle, articleDescription FROM Articles WHERE CONTAINS(articleDescription, 'ISABOUT (VC++ weight (.8), MSMQ weight (.4), MTS weight (.2) )' )
SELECT someField FROM someTable WHERE CONTAINS(searchField, ' FORMSOF (INFLECTIONAL, find) ')
As you can see, you can do some pretty powerful searches when full-text indexing is involved. Now, let's see how you configure this feature.
Configuring Full-Text Indexing
SQL Server 7.0 provides a wizard to aid in this process. However, I am documenting the slightly more manual approach so that SQL Server 6.5 users can follow along as well.
The following steps include SQL and stored procedures that you need to call. If you're running SQL Server 7.0, you can perform these steps using the Query Analyzer (remember to set the DB combo box in the Query Analyer's toolbar). If you're running SQL Server 6.5, you can perform these steps using the Query window of the desired database.
- The first step is to turn on full-text indexing for the desired database. To do this, simply call the system-supplied sp_fulltext_database stored procedure.
- Next, you have to create a full-text catalog. This catalog is what will be used by the full-text index when you perform searches against your indexed table. As you can see, I created a catalog called articleKeywordsCatalog since this catalog will be used for the articleKeywords table. While not mandatory, it is a good idea to adopt a similar naming convention:
- The next thing you need to do is to tell SQL Server which tables you want indexed via the sp_fulltext_table stored procedure:
sp_fulltext_table 'articleKeywords', 'create', 'articleKeywordsCatalog', 'PK_articleKeywords'
There are two things to notice here. First, notice that it is here that we associate the table with its indexing catalog. The second thing to notice is the last parameter of this stored procedure which is where you specify a unique index to be used. If you haven't explicitly created an index for the table, but have specified a unique primary key then the name of the index to use will in the form PK_[table name].
- Once you've done that, call the sp_fulltext_column stored procedure for each column that you want indexed. By now, you're probably beginning to realize why it's cool that SQL Server 7.0 provides a wizard to automate this!
- Now, activate the index by calling the sp_fulltext_table stored procedure as follows.
- The last step (and one that you'll have to do even if you use the SQL Server 7.0 Full-Text Index wizard) is to tell SQL Server to populate the catalog.
sp_fulltext_catalog 'articleKeywordsCatalog', 'start_full'
sp_fulltext_catalog 'articleKeywordsCatalog', 'create'
sp_fulltext_column 'articleKeywords', 'sKeywords', 'add'
As you can see from the format, the first argument is the table name, the second is the column name and the third argument is simply used to specify whether you are adding or removing a column from indexing.
sp_fulltext_table 'articleKeywords', 'activate'
You have now seen how to configure full-text indexing for a given table/column. The next section just briefly shows the last few steps we needed to perform to finish solving our initial problem.
Now that all the tables are defined and indexing is configured, we need only to populate the Articles2Keywords table. In our case, we did this by simply creating a stored procedure that joined the Articles, Keywords and Articles2Keywords tables and wrote a single "keywords" record to the articleKeywords table for each article. We then defined a trigger so that any time a record in the Articles2Keywords table was created, updated or deleted, a stored procedure was automatically run to executed that article's record. We could do this because in our environment, the most we'll worry about posting per day is one or two articles. Obviously, if your situation involves tables with very large data sets that are updated much more frequently, your approach might be to run this stored procedure as part of "end of day" (which would obviously mean that your "summary" table would always be one day behind production).
In this article, my approach to showing you how to configure and use full-text indexing was done by first presenting a real-life problem and illustrating how full-text indexing can be used to solve it. Hopefully, in the process of reading this article, you've seen not only how to use full-text indexing, but have thought of some other ways in which this very useful feature of SQL Server can be used to solve your own problem domain-specific challenges.
About the author:Tom Archer runs the CodeGuru Web site as well as a brand new site dedicated to Windows DNA called sourceDNA. In addition, he also writes books and magazine articles and occasionally speaks at major Visual C++ conferences. When Tom isn't doing all that, he somehow tries to find time for his two favorite hobbies: traveling and meeting new people.