Sunday, August 26, 2007

Full text search (Wikipedia)

...

Improving the performance of full text searching

The deficiencies of free text searching have been addressed in two ways: by providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.

Improved querying tools

  • Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
  • Field-restricted search. Some search engines enable users to limit free text searches to a particular field within a stored data record, such as "Title" or "Author."
  • Boolean queries. Searches that use Boolean operators (for example, "encyclopedia" AND "online" NOT "Encarta") can dramatically increase the precision of a free text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the query retrieves too few documents, the OR operator can be used to increase recall; consider, for example, "encyclopedia" AND ("online" OR "Internet") NOT "Encarta". This search will also retrieve documents about online encyclopedias that use the term "Internet" instead of "online." The increase in precision gained from AND and NOT is often counterproductive, however, since it usually comes with a dramatic loss of recall. [2]
  • Phrase search. A phrase search matches only those documents that contain a specified phrase, such as "Wikipedia, the free encyclopedia."
  • Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.
  • Proximity search. A proximity search matches only those documents that contain two or more words that are separated by no more than a specified number of words; a search for "Wikipedia" WITHIN2 "free" would retrieve only those documents in which the words "Wikipedia" and "free" occur within two words of each other.
  • Regular expression. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
  • Wildcard search. A search that substitutes a wildcard character, such as an asterisk, for one or more characters in a search query. For example, in the search function in Microsoft Word, using the asterisk in the search query "s*n" will find "sin", "son", "sun", etc. in a text.
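
Several of the querying tools listed above (Boolean, phrase, proximity, and wildcard search) can be illustrated with a small positional inverted index. The following is only a minimal sketch in Python, not how any particular search engine works: the corpus `DOCS` and the helper names `term`, `phrase`, `within`, and `wildcard` are hypothetical, and "within N words" is assumed here to mean that the two word positions differ by at most N.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

# Hypothetical toy corpus: document id -> text.
DOCS = {
    1: "Wikipedia is the free online encyclopedia",
    2: "Encarta was an online encyclopedia from Microsoft",
    3: "An Internet encyclopedia can be edited by anyone",
    4: "The sun rose over the free city",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

INDEX = build_index(DOCS)

def term(word):
    """Ids of documents that contain the word."""
    return set(INDEX.get(word.lower(), {}))

def phrase(words):
    """Ids of documents in which the words occur consecutively, in order."""
    words = [w.lower() for w in words]
    hits = set()
    for doc_id in set.intersection(*(term(w) for w in words)):
        starts = INDEX[words[0]][doc_id]
        if any(all(s + i in INDEX[w][doc_id] for i, w in enumerate(words))
               for s in starts):
            hits.add(doc_id)
    return hits

def within(a, b, max_distance):
    """Ids of documents where a and b occur at most max_distance positions apart.

    Assumption: distance is the absolute difference of the word positions.
    """
    a, b = a.lower(), b.lower()
    hits = set()
    for doc_id in term(a) & term(b):
        if any(abs(pa - pb) <= max_distance
               for pa in INDEX[a][doc_id] for pb in INDEX[b][doc_id]):
            hits.add(doc_id)
    return hits

def wildcard(pattern):
    """Ids of documents containing any word that matches a pattern such as 's*n'."""
    hits = set()
    for word in INDEX:
        if fnmatchcase(word, pattern.lower()):
            hits |= term(word)
    return hits

# "encyclopedia" AND "online" NOT "Encarta"
print(sorted((term("encyclopedia") & term("online")) - term("encarta")))  # [1]
# "encyclopedia" AND ("online" OR "Internet") NOT "Encarta"
print(sorted((term("encyclopedia") & (term("online") | term("internet")))
             - term("encarta")))                                         # [1, 3]
print(sorted(phrase(["free", "online", "encyclopedia"])))                # [1]
print(sorted(within("Wikipedia", "free", 2)))                            # []
print(sorted(within("Wikipedia", "free", 3)))                            # [1]
print(sorted(wildcard("s*n")))                                           # [4]
```

Set operations stand in for the Boolean operators: intersection for AND, union for OR, and difference for NOT. Adding the OR term raises recall from one document to two, which mirrors the trade-off between precision and recall described in the list.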

Improved search algorithms

Technological advances have greatly improved the performance of free text searching. For example, Google's PageRank algorithm gives more prominence to documents to which other Web pages have linked. This algorithm dramatically improves users' perception of search precision, a fact that explains its popularity among Internet users. See search engine for additional examples.
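
The idea of giving more prominence to pages that other pages link to can be sketched with the simplified, textbook form of PageRank computed by power iteration. This is an illustration only, not Google's production algorithm: the link graph `LINKS`, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: pages "a", "b", and "d" all link to "c".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(LINKS)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 3))
```

In this graph, page "c" ends up with the highest rank because three other pages link to it, and page "a" comes second because its single inbound link comes from the highly ranked "c".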