Saturday, April 28, 2007

SQL Server Full Text Search Optimization

SQL Server Full Text Search Optimization


by Tony Bain (visit his website)
Full text indexing is a great feature that solves a long running database problem, the searching of textual data columns for specific words and phrases in SQL Server databases.
However, as the full-text search engine is a separate component of SQL Server (it uses the Microsoft Search Service) this can be a potential performance problem due to the way full-text search interacts with SQL Server.
Full-text search works really well when you are searching a text column and you are interested in all the rows that match your highly selective search criteria. However, full-text search is commonly used with other SQL Server predicates, such as those included in the WHERE clause.
For example, let's say you wish to search the description of properties for sale for the word "bath". However, you are only interested in flats with baths. The statistics of the table you will be searching are:
Total Properties
Properties with bath's
Houses
800,000
330,000
Flats
20,000
2,000
Apartments
15,000
1,200
To perform this query you could write it as follows:
SELECT *FROM properties p INNER JOIN containstable(properties, description, 'bath') tON p.PropertyID = t.[key]WHERE p.type = 'flat'

Now this is where full-text starts to under-perform. As I said above, this is due to the interaction between the Microsoft Search Service and SQL Server. Essentially what happens is the the Search Service is asked to search the catalog for rows which contain the word 'bath'. The matching key rows are returned to SQL Server, and then this result set is filtered down to the only show the top ten when the type is "flat".The performance hit in the above example would be when the Search Service is returning the key rows to the Query Optimizer. In the example we are only interested in the top 10 results, but the search service returns 333,200 key rows to the Query Optimizer. The Optimizer filters these results to find those which are the type 'flat'.There is currently no ability for the Search Service to restrict the number of rows it returns to SQL Server, other than the textual search condition and the top_n_by_rank parameter.
The top_n_by_rank parameter of the containstable function limits the number of rows returned to the number you specify in order of descending rank. For example:
SELECT *FROM properties pINNER JOIN containstable( properties, description, 'bath',10) tON p.PropertyID = t.[key]WHERE p.type = 'flat'
While at first this appears to meet our needs, in fact it may result in no rows been returned. This is because the top_n_by_rank is evaluated by the Search Service before being returning the key rows to SQL Server. The top 10 rows that are returned to the Query Optimizer from the Search Service may not be of type='flat' and therefore be filtered out of the result set.
To be honest, there is no elegant way to improve the performance. However, in certain situations you can dramatically improve the performance of a full text query by embedding textual codes within the text column. This will allow you to search both on your embedded textual code and the required search condition. This can reduce the number of rows returned to SQL greatly and change the performance of the full-text query by an order of magnitude.
For example if your data is:
PropertyID
Type
Description
1
HOUSE
Big, Nice, Tidy, Shower, Kitchen
2
FLAT
Small, Tidy, Shower, Bath, Gas
3
HOUSE
Medium, Average Quality, Bath
......


update it to:
PropertyID
Type
Description
1
HOUSE
TYPEHOUSE Big, Nice, Tidy, Shower, Kitchen
2
FLAT
TYPEFLAT Small, Tidy, Shower, Bath, Gas
3
HOUSE
TYPEHOUSE Medium, Average Quality, Bath
......


now you can rewrite your query to the following:
SELECT top 10 *FROM properties pINNER JOIN containstable(properties,'"TYPEFLAT" and "bath"') tON p.PropertyID = t.[key]WHERE p.type = 'flat'
For this example, only 2,000 results will be returned to the Query Optimizer, and from this the top 10 rows will be returned to the user. This may be acceptable performance, however, if we run the following query:
SELECT top 10 *FROM properties pINNER JOIN containstable(properties,'"TYPEHOUSE" and "bath"') ON p.PropertyID = t.[key]WHERE p.type = 'HOUSE'
In this example, 330,000 rows will be returned to the Query Optimizer, which means performance will still be poor. However, now that we are filtering the results in the Search Service before they are returned to the Query Optimizer, we can specify the number of results we need by using the top_n_by_rank parameter of the containstable function. So rewritten, this query would look like:
SELECT *FROM properties pINNER JOIN containstable(properties,'"TYPEFLAT" and "bath"',10) tON p.PropertyID = t.[key]WHERE p.type = 'flat'
Obviously, you don't want to return the "TYPEFLAT" text in the description column to your database applications, so the final query should look like:
SELECT PropertyID, Type, SubString(description, 9, 9 - LEN(description)) AS descriptionFROM properties pINNER JOIN containstable(properties,'"TYPEFLAT" and "bath"',10) tON p.PropertyID = t.[key]WHERE p.type = 'flat'
In this example only 10 rows will be returned the Query Optimizer, so performance should ROCK!
Obviously you will need to maintain the text code in the full-text column by using triggers, however the overhead in doing so should be minimal when compared to the speed performance gained.