Saturday, April 28, 2007

The SQL Server 2000 - FullText Search Service: How to set up and Query with Indexing Service

The SQL Server 2000 FullText Search Service: How to set up and Query with Indexing Service

If you have a discussion forum or other database - enabled service on your site that accumulates significant amounts of textual data, and you want to provide users with a powerful way to search this content based on words or sentences, then the SQL Server Fulltext query engine could be just the thing for you.
Now most developers are familiar with Indexing Service (formerly "Index Server") and how it is used to build a "catalog" of the files on a webserver or file server. You may even be familiar with the ixsso.util and ixsso.Query index query object s and how they can be used to build an ASP - based search page.
It's surprising to find how many developers are unaware of how Indexing Service can also index SQL Server data and provide these search and ranking functions in the same way - through the SQL Server fulltext index facility. In this short article, I'll review how to create and populate a full-text catalog on a SQL Server table, and illustrate some of the syntax from the Transact SQL extensions that allow you to perform powerful full-text search and ranking functions on your data.
SQL Server keeps any fulltext indexes you create in external "catalogs" - the same type of catalog that Indexing Service uses for your webserver or fileserver filesystem. SQL just stores them in a different place. The regular Indexing Service catalogs you are probably familiar with are maintained through the Indexing Services snap-in:
The catalogs for SQL Server Full-Text indexes, however, are managed through the SQL Server Enterprise Manager.
Each Full-Text catalog in Indexing Service stores the index data for one or more SQL Server tables of a single database. Unlike as with regular table indexes, SQL Server does not update these indexes immediately as data is changed, added or deleted. Rather, they are populated and updated according to a schedule that you can conveniently set from within the SQL Server Enterprise manager, or even programmatically with SQL DMO.
Now let's go ahead and set up a catalog on a fictitious "Hotlinks" service table we have on our website. This is a service where "hotlinks" of URLS that are in the news or are of interest to our particular clientele are added on a daily basis by category and which have a potentially large nText type "Description" column in the table. We want our users to be able to search not only by category or keyword but also in the fulltext of this description, including the ability to have exclusionary searches, e.g., "Everything with "maragarita" near the word "party" but NOT "hangover".
Now I don't know about you, but personally my experience tells me the above search is not possible. However, if it were, the stuff we'll do next would enable us to easily handle it!
To create a fulltext index on a SQL Server table, expand a table subtree in Enterprise Manager, right-click the Full-Text Index Table context menu item, and choose "Define Full-Text Indexing on a Table":

Note that you must have at least one unique index field in the table, preferably one with a small numeric integer type such as the identity field. If you don't have such a field on the table, now is a good time to add one. The Wizard will allow you to choose the unique index field you want to use. Next, you'll be asked which table column you want to do the full-text indexing on. Next, the Wizard will suggest an existing Indexing Service catalog to store your new index in, and gives you the option to create a new catalog :
As you can see above, here we have elected to start a new catalog "Hotlinks" for our table, as we expect to accumulate a large amount of data. Finally, the Wizard asks how you want to update the catalog:
Here we have selected to update the full catalog using an incremental population (faster) once every day at midnight, since we have a lot of new data added each day. This completes the wizard, and your catalog and / or population will be created. Finally (and don't forget this step!) populate the index from the "Start full Population" menu choice on the "Hotlinks" table context menu under "Full Text Index Table" ,and we are now ready to actually use our new full-text index.
Now that we're in the full-text search business, let's take a look at some sample SQL code to perform some queries on our full-text search index data, using the Transact SQL extensions that are designed especially for this purpose:
Using the CONTAINS and FREETEXT Functions
CONTAINS:Is a predicate used to search Full-Text indexed columns containing character-based data types for precise or fuzzy (less precise) matches to single words and phrases, the proximity of words within a certain distance of one another, or weighted matches. CONTAINS can search for:
A word or phrase.
The prefix of a word or phrase.
A word near another word.
A word inflectionally generated from another (for example, the word "drive" is the inflectional stem of drives, drove, driving, and driven).
A word that has a higher designated weighting than another word.
Syntax
An exmple of using "CONTAINS"
SELECT Link_Description, Link_URLFROM HOTLINKSWHERE CONTAINS(Link_Description, ' "ASP" OR "XML" ')
The above SQL might be constructed dynamically from form-field choices that a user made in searching out hotlinks archive for a hotlink whose link_description column (that's our full-text indexed column) contained either the word "ASP" or the word "XML".
CONTAINS can also be used with prefix terms (webserv*), proximity terms ("ASP" NEAR "DHTML"), generation (word form) terms, and weighted terms [Link_Description, ISABOUT("ASP" weight (.8), DHTML weight(.4)]
FREETEXT:Is a predicate used to search columns containing character-based data types for values that match the meaning and not the exact wording of the words in the search condition. When FREETEXT is used, the full-text query engine internally "word-breaks" the freetext_string into a number of search terms and assigns each term a weight and then finds the matches.
An example of using "FREETEXT" in a query might be:
SELECT Link_URL, Link_Description, Link_date FROM HOTLINKS WHERE FREETEXT(Link_Description, @SearchWord)
The CONTAINSTABLE and FREETEXTTABLE functions are used to specify full-text queries that return relevance rankings for each row. These functions are very similar but used differently from the full-text predicates, CONTAINS and FREETEXT.
Although both the full-text predicates described above and the full-text rowset-valued functions are used for full-text queries, and the Transact-SQL statement used to specify the full-text search condition is the same in both the predicates and the functions, there are major differences in the way that these are used:
CONTAINS and FREETEXT both return a TRUE or FALSE value, so they are typically specified in the WHERE clause of a SELECT statement, as in the examples shown here.
CONTAINSTABLE and FREETEXTTABLE both return a table of zero, one, or more rows, so they must always be specified in the FROM clause.
CONTAINS and FREETEXT can only be used to specify selection criteria, which Microsoft® SQL Server™ uses to determine the membership of the result set.
CONTAINSTABLE and FREETEXTTABLE are also used to specify selection criteria. The table returned has a column named KEY that contains full-text key values. Each full-text registered table has a column whose values are guaranteed to be unique. The values returned in the KEY column of CONTAINSTABLE or FREETEXTTABLE are the unique values, from the full-text registered table, of the rows that match the selection criteria specified in the full-text search condition.
The table produced by CONTAINSTABLE and FREETEXTTABLE also has a column named RANK, which contains values from 0 through 1000. These values are used to rank the rows returned according to how well they met the selection criteria. You have probably seen these ranks shown in search pages on the web.
Queries that use the CONTAINSTABLE and FREETEXTTABLE functions are more complex than those that use the CONTAINS and FREETEXT predicates because qualifying rows returned by the functions must be explicitly joined with the rows in the original SQL Server table.
This example returns the description and category name of all Hotlinks for which the Link_Description column contains the phrase "ASP HOSTING" near either the word "IIS" or the word "NT" All rows with a Link_TOPIC name "XML" are disregarded. Only rows with a rank value of 2 or higher are returned.
SELECT FT_TBL.Description, FT_TBL.CategoryName, KEY_TBL.RANKFROM HOTLNKS AS FT_TBL INNER JOINCONTAINSTABLE (HOTLNKS, Link_Description, '("ASP HOSTING" NEAR IIS) OR("ASP HOSTING" NEAR NT)') AS KEY_TBLON FT_TBL.Hotlink_ID = KEY_TBL.[KEY]WHERE KEY_TBL.RANK > 2AND FT_TBL.Link_Topic<> 'XML'ORDER BY KEY_TBL.RANK DESC
Note the use of square brackets around the KEY column name in the JOIN statment above. This is because KEY is a SQL Server keyword, and we must enclose these in brackets if we use them for a column or table name. Also note that phrases ("ASP HOSTING") are enclosed in double quotes.
There are a number of other features that can be used by CONTAINSTABLE to make your searches more powerful. An hour or so spent studying SQL Server Books Online will be well worth the effort.
You can avoid the complexity of using CONTAINSTABLE and FREETEXTTABLE by writing stored procedures that accept a few facts about the query and then create and execute the appropriate query. A simplified procedure that submits a FREETEXTTABLE query follows. The table shows the procedure parameters (all input).
procedure parameters
Parameter
Required
Description
@additional_predicates
Optional
If there are any, these get added with AND after the FREETEXT predicate. KEY_TBL.RANK can be used within expressions.
@freetext_column
Yes
@freetext_search
Yes
Search condition.
@from_table
Yes
@order_by_list
Optional
KEY_TBL.RANK can be one of the columns specified.
@select_list
Yes
KEY_TBL.RANK can be one of the columns specified.
The code for the procedure is:
CREATE PROCEDURE freetext_rank_proc@select_list nvarchar(1000),@from_table nvarchar(517),@freetext_column sysname,@freetext_search nvarchar(1000),@additional_predicates nvarchar(500) = '',@order_by_list nvarchar(500) = ''AS BEGINDECLARE @table_id integer,@unique_key_col_name sysname,@add_pred_var nvarchar(510),@order_by_var nvarchar(510)
-- Get the name of the unique key column for this table.SET @table_id = Object_Id(@from_table)SET @unique_key_col_name = Col_Name( @table_id, ObjectProperty(@table_id, 'TableFullTextKeyColumn') )
-- If there is an additional_predicate, put AND() around it.IF @additional_predicates <> ''SET @add_pred_var = 'AND (' + @additional_predicates + ')'ELSESET @add_pred_var = ''
-- Insert ORDER BY, if needed.IF @order_by_list <> ''SET @order_by_var = 'ORDER BY ' + @order_by_listELSESET @order_by_var = ''
-- Execute the SELECT statement.EXECUTE ( 'SELECT ' + @select_list+ ' FROM '+ @from_table+ ' AS FT_TBL, FreetextTable('+ @from_table+ ','+ @freetext_column+ ','''+ @freetext_search+ ''') AS KEY_TBL '+ 'WHERE FT_TBL.'+ @unique_key_col_name+ ' = KEY_TBL.[KEY] ' + @add_pred_var+ ' '+ @order_by_var)END
This procedure can be used to submit the query:
EXECUTE freetext_rank_proc 'Description, KEY_TBL.RANK', -- Select list'Categories', -- From'Description', -- Column'How can I get free ASP web hosting?', -- Freetext search'KEY_TBL.RANK >= 10', -- Additional predicate'KEY_TBL.RANK DESC' -- Order by

There is much more that you can do with Full-Text; space simply does not permit a more complete treatment here. SQL Server Books Online has much more information. There are also several good books that deal with the subject, most notably "Professional SQL Server 2000 Programming" by Robert Vieira (WROX).
One last consideration: Full-Text runs as a separate service on your SQL Server box. It takes resources and memory. When you implement full-text, you should consider the load that the population process will place on your machine, and try to balance that against how quickly you will need changes to be reflected in your search results. At the least, you should try to schedule population for non-peak hours, and also it would be wise to increase the size of your virtual memory pagefile to as large as three times your RAM.
With this ammunition you have just about everything you need to start building a robust full-text search facility that scales to match the needs of your database structure, your business, and your customers.