Sunday, August 26, 2007

Inverted index (Wikipedia)

Inverted index (Wikipedia)
Inverted index

An inverted index (also referred to as postings file or inverted file) is an index structure storing a mapping from words to their locations in a document or a set of documents, allowing full text search. It is the most popular data structure used in document retrieval systems.

There are two main variants of inverted indexes: A record level inverted index (or inverted file index) contains a list of references to documents for each word. A word level inverted index (or full inverted index) additionally contains the positions of each word within a document. The latter form offers more functionality (like phrase searches), but needs more time and space to be created.

Example

Given the texts T0 = "it is what it is", T1 = "what is it" and T2 = "it is a banana", we have the following inverted file index (where the integers in the set notation brackets refer to the subscripts of the text symbols, T0, T1 etc.):

"a":      {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

A term search for the terms "what", "is" and "it" would give the set \{0,1\} \cap \{0,1,2\} \cap \{0,1,2\} = \{0,1\}.

With the same texts, we get the following full inverted index, where the pairs are document numbers and local word numbers. Like the document numbers, local word numbers also begin with zero. So, "banana": {(2, 3)} means the word "banana" is in the third document (T2), and it is the fourth word in that document (position 3).

"a":      {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}

If we run a phrase search for "what is it" we get hits for all the words in both document 0 and 1. But the terms occur only consecutively in document 1.

Applications

The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and to each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document in the forward index, the inverted index data structure is developed which lists the documents per word.

With the inverted index created, the query can now be resolved by jumping to the word id (via random access) in the inverted index. Random access is generally regarded as being faster than sequential access.

See also

References

  • Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Pages 560–563 of section 6.5: Retrieval on Secondary Keys.
  • Justin Zobel, Alistair Moffat and Kotagiri Ramamohanarao, Inverted files versus signature files for text indexing. ACM Transactions on Database Systems (TODS), Volume 23, Issue 4 (December 1998), Pages: 453 - 490.
  • Justin Zobel and Alistair Moffat, Inverted Files for Text Search Engines . ACM Computing Surveys 38(2).

External links