Saturday, April 28, 2007

Understanding SQL Server Full-Text Indexing

Understanding SQL Server Full-Text Indexing


Microsoft SQL Server supports T-SQL, an implementation of ANSI standard SQL. T-SQL is designed to (among other things) search for matches in your data. For example, if you've created a table with a column named Notes you could construct these queries:
SELECT * FROM MyTable WHERE Notes = 'Deliver Tuesday'
SELECT * FROM MyTable WHERE Notes LIKE '%caution%'
But what if you're not looking for an exact match, either to the full text of the column or a part of the column? That's when you need to go beyond the standard SQL predicates and use SQL Server's full-text search capabilities. With full-text searching, you can perform many other types of search:

Two words near each other
Any word derived from a particular root (for example run, ran, or running)
Multiple words with distinct weightings
A word or phrase close to the search word or phrase

In this article, I'll show you how to set up and use full-text searching in SQL Server 2000, and give you a sneak peek of the changes that are coming in this area when SQL Server 2005 ships next year.

Full-Text Indexing Architecture
You might be a bit surprised to learn that SQL Server doesn't handle its own full-text indexing tasks. Any version of Windows that SQL Server will run on includes an operating system component named the Microsoft Search Service. This service provides indexing and searching capabilities to a variety of applications, including SQL Server, Exchange, and SharePoint.
SQL Server uses an interface component, the SQL Server Handler, to communicate with the Microsoft Search Service. The Handler extracts data from SQL Server tables that have been enabled for full-text searching and passes it to the search service for indexing. Another component, the full-text OLE DB provider, gets invoked by SQL Server when you actually perform a full-text search. The provider takes the portion of the search that needs to be satisfied by the full-text index and passes it off to the Search Service for evaluation.
You need to be aware of one consequence of this architecture: because the full-text indexes are not in your SQL Server database, they can't be backed up from within SQL Server. Instead, you need to backup the disk files created by the Search Service. You'll find these files located under Program Files\Microsoft SQL Server\MSSQL\FTDATA.
Enabling Full-Text Indexing
As you can probably guess, there's a certain amount of overhead involved in passing data back and forth between SQL Server and the Search Service. To speed things up, SQL Server doesn't pass any data to the Search Service unless you explicitly tell it to do so. After all, you might never want to do any full-text searches, in which case it would be silly to spend time indexing your data for them.
To get started, you need to add a full-text catalog to your database. The easiest way to do this is to open SQL Server Enterprise Manager and expand the node for your database to find the Full-Text Catalogs node (if that node isn't present, check to make sure that the Microsoft Search Service is installed on the server). Right-click on the node and select New Full-Text Catalog. SQL Server will prompt you for a name and location for the catalog (and it will supply a default location). Name the catalog anything you like and click OK to create it.
Next you need to tell SQL Server what data to include in the catalog. Again, you can do this in Enterprise Manager. Right-click on a table and select Full-Text Index Table, Define Full-Text Indexing on a Table. This will launch the SQL Server Full-Text Indexing Wizard. You need to make these choices to complete the wizard:
Select a unique index on the table
Select the columns to index. You can optionally specify a language to use for word breaking.
Select the catalog to contain the index, or create a new catalog.
Create a schedule to repopulate the index on a regular basis (this is also optional).
When you finish the wizard, it will create the index for the table. But the index won't have any entries in it yet. Right-click on the table again anfd select Full-Text Index Table, Start Full Population to build the actual index
Performing a Full-Text Search
Now you're ready to actually do some searches. For these examples, I added a full-text index to the ProductName column in the Northwind Products table. Four T-SQL predicates are involved in full-text searching:
FREETEXT
FREETEXTTABLE
CONTAINS
CONTAINSTABLE
FREETEXT is the easiest of these to work with; it lets you specify a search term but then tries to look at the meaning rather than the exact term when finding matches. For instance, here's a query using FREETEXT together with its results:
SELECT ProductName
FROM Products
WHERE FREETEXT (ProductName, 'spread' )
ProductName
----------------------------------------
Grandma's Boysenberry Spread
Vegie-spread
(2 row(s) affected)
As you can see, FREETEXT finds the word or words you give it anywhere in the search column. FREETEXTTABLE works like FREETEXT except that it returns its results in a Table object.
CONTAINS (and CONTAINSTABLE, which works the same but delivers results in a table) offers a much more complex syntax for using a full-text indexed column:
CONTAINS
( { column * } , '<>'
)
<> ::=
{ <>
<>
<>
<>
<>
}
{ ( <> )
{ AND AND NOT OR } <> [ ...n ]
}
<> ::=
word " phrase "
<> ::=
{ "word * " "phrase * " }
<> ::=
FORMSOF ( INFLECTIONAL , <> [ ,...n ] )
<> ::=
{ <> <> }
{ { NEAR ~ } { <> <> } } [ ...n ]
<> ::=
ISABOUT
( { {
<>
<>
<>
<>
}
[ WEIGHT ( weight_value ) ]
} [ ,...n ]
)
For instance, you can search for one word "near" another this way:
SELECT ProductName
FROM Products
WHERE CONTAINS(ProductName, '"laugh*" NEAR lager')
ProductName
----------------------------------------
Laughing Lumberjack Lager
(1 row(s) affected)
Note the use of "laugh*" to match any word starting with "laugh." You can also supply a weighted list of terms to CONTAINS, and it will prefer matches with a higher weight:
SELECT ProductName
FROM Products
WHERE CONTAINS(ProductName, 'ISABOUT (stout weight (.8),
ale weight (.4), lager weight (.2) )' )
ProductName
----------------------------------------
Laughing Lumberjack Lager
Steeleye Stout
Sasquatch Ale
Outback Lager
(4 row(s) affected)

Looking Forward to SQL Server 2005
SQL Server 2005 features quite a number of changes and improvements in full-text searching:
A dedicated indexing service that works directly with SQL Serrver. This speeds up full-text operations and isolates SQL Server from changes to the search service made by other applications.
Data definition language (DDL) statements for creating and altering full-text catalogs and indexes.
Full-text queries against linked servers.
Full-text queries against arbitrary sets of columns (instead of just one column or all columns).
Specification of the language to be used for word-breaking in an index.
Integrated backup and restore for full-text catalogs.
Full-text indexing for XML data.
Integration with SQL Profiler and logging of index operations.
If you were interested in full-text searching in SQL Server 2000 but ran into brick walls, take another look when the new version comes out. Microsoft's substantial work in this area means that full-text indexing and searching will be better than ever.

Full-Text to the Rescue
Many SQL Server problems can be solved without ever looking at full-text search. But it comes in very handy in one key scenario: when human beings are supplying search terms from their own head, instead of from a list. You may need to work at providing a good user interface for this facility, but if you have people searching through a large corpus of text, you should definitely consider full-text searching. The end result is likely to be a better application and happier users.