Thursday, August 30, 2007
Layer 2 VPN Architectures: Understanding Any Transport over MPLS
read more | digg story
Which Routing Protocol Should My Network Use?
http://www.informit.com/article.aspx?p=417092
Which Routing Protocol Should My Network Use?
By Russ White, Alvaro Retana, Don Slice
Date: Sep 30, 2005
Sample Chapter is provided courtesy of Cisco Press.
One of the toughest questions facing network engineers is which routing protocol to use. Each has its strengths, and each works better with some network designs than with others. This chapter will help you decide which routing protocol is best for your particular network.
Among all the thorny questions that network engineers are asked on a regular basis, probably among the hardest is this one:
- My network currently runs Enhanced Interior Gateway Routing Protocol (EIGRP). Would I be better off if I switched to Open Shortest Path First (OSPF)?
You can replace the two protocols mentioned in this sentence with any pair of protocols among the advanced interior gateway protocols (OSPF, Intermediate System-to-Intermediate System [IS-IS] and EIGRP), and you have described a question that routing protocol engineers are asked probably thousands of times a year. Of course, convergence is always faster on the other side of the autonomous system boundary, so to speak, so it is always tempting to jump to another protocol as soon as a problem crops up with the one you are running.
How do you answer this question in real life? You could try the standard, "It depends," but does this really answer the question? The tactic in the Routing Protocols Escalation Team was to ask them questions until they went away, but none of these answers really helps the network operator or designer really answer the question, "How do you decide which protocol is the best?"
Three questions are embedded within this question, really, and it is easier to think about them independently:
- Is one protocol, in absolute terms, "better" than all the other protocols, in all situations?
- If the answer to this first question is "No," does each routing protocol exhibit some set of characteristics that indicate it would fit some situations (specifically, network topologies) better than others?
- After you have laid out the basics, what is the tradeoff in living with what you currently have versus switching to another routing protocol? What factors do you need to consider when doing the cost/benefit analysis involved in switching from one routing protocol to another?
This appendix takes you through each of these three questions. This might be the first and last time that you hear a network engineer actually answer the question, "Which routing protocol should I use?" so get ready for a whirlwind tour through the world of routing.
Is One Protocol "Better" Than the Others?
The first thing you need to do with this sort of question is to qualify it: "What do you mean by better?" Some protocols are easier to configure and manage, others are easier to troubleshoot, some are more flexible, and so on. Which one are you going to look at?
This appendix examines ease of troubleshooting and convergence time. You could choose any number of other measures, including these:
- Ease of management—What do the Management Information Bases (MIBs) of the protocol cover? What sorts of show commands are available for taking a network baseline?
- Ease of configuration—How many commands will the average configuration require in your network configuration? Is it possible to configure several routers in your network with the same configuration?
- On-the-wire efficiency—How much bandwidth does the routing protocol take up while in steady state, and how much could it take up, at most, when converging in response to a major network event?
Ease of Troubleshooting
The average uptime (or reliability) of a network is affected by two elements:
- How often does the network fail?
- How long does it take to recover from a failure?
The network design and your choice of equipment (not just the vendor and operating system, but also putting the right piece of equipment into each role and making certain that each device has enough memory, and so on) play heavily into the first element. The design of the network also plays into the second element. The piece often forgotten about when considering the reliability of a network is how long it takes to find and fix, or troubleshoot, the network when it fails.
Ease of management plays a role in the ease of troubleshooting, of course; if it is hard to take a baseline of what the network is supposed to look like, you will not do so on a regular basis, and you will have a dated picture to troubleshoot from. The tools available for troubleshooting are also important. Of course, this is going to vary between the implementations of the protocols; here, implementations in Cisco IOS Software illustrate the concepts. Table G-1 outlines some of the troubleshooting tools that are available in EIGRP, OSPF, and IS-IS, in Cisco IOS Software.
Table G-1 Cisco IOS Software Troubleshooting Tools for EIGRP, OSPF, and IS-IS
EIGRP
OSPF
IS-IS
Debug Neighbors
Neighbor formation state; hello packets.
Neighbor formation state; hello packets.
Packets exchanged during neighbor formation.
Log Neighbor State
Yes.
Yes.
No.
Debug Database Exchange and Packets
Packets exchanged (updates, replies, and so on), with filters per neighbor or for a specific route.
Packets flooded, with filters for specific routing information. Packets retransmitted.
Packets flooded.
Debug Interactions with the Routing Table
Yes.
No.
No.
Debug Route Selection Process
Yes (DUAL1 FSM2 events).
Yes (SPF3 events).
Yes (SPF events).
Show Database
Yes, by specific route and route state.
Yes, by LSA4 type and advertising router.
Yes, by LSP5 ID or type of route.
Event Log
Yes; understandable if you comprehend DUAL and its associated terminology.
Yes; only understandable if you have access to the source code.
No.
1 DUAL = Diffusing Update Algorithm
2 FSM = finite state machine
3 SPF = shortest path first
4 LSA = link-state advertisement
5 LSP = link-state packet
From this chart, you can see that EIGRP generally provides the most tools for finding a problem in the network quickly, with OSPF running a close second.
Which Protocol Converges Faster?
I was once challenged with the statement, "There is no way that a distance vector protocol can ever converge faster than a link-state protocol!" An hour and a half later, I think the conversation tapered off into, "Well, in some situations, I suppose a distance vector protocol could converge as fast as a link-state protocol," said without a lot of conviction.
In fact, just about every network engineer can point to reasons why he thinks a specific routing protocol will always converge faster than some other protocol, but the reality is that all routing protocols can converge quickly or slowly, depending on a lot of factors strictly related to network design, without even considering the hardware, types of links, and other random factors that play into convergence speed in different ways with each protocol. As a specific example, look at the small network illustrated in Figure G-1 and consider the various options and factors that might play into convergence speed in this network.
Figure G-1 Simple Network
This figure purposefully has no labels showing anything concerning routing protocols configuration or design; instead, this section covers several possible routing configurations and examines how the same protocol could converge more or less quickly even on a network this small through just minor configuration changes.
Start with EIGRP as an example:
- The Router A to C link has a bandwidth of 64 kbps.
- The Router A to B link has a bandwidth of 10 Mbps.
- The Router B to D and Router C to D links have equal bandwidths.
With this information in hand, you can determine that Router D is going to mark the path to 10.1.1.0/24 through Router B as the best path (the successor in EIGRP terms). The path through Router C will not be marked as a feasible successor, because the differential in the metrics is too great between the two paths. To the EIGRP process running on Router D, the path through Router C cannot be proven based on the metrics advertised by Routers B and C, so the path through Router C will not be installed as a possible backup route.
This means that if the Router B to D link fails, Router D is forced to mark 10.1.1.0/24 as active and send a query to Router C. The convergence time is bounded by the amount of time it takes for the following tasks:
- Router D to examine its local topology table and determine that no other known loop-free paths exist.
- Router D to build and transmit a query toward Router C.
- Router C to receive and process the query, including examining its local EIGRP topology table, and find it still has an alternate path.
- Router C to build a reply to the query and transmit it.
- Router D to receive the reply and process it, including route installation time and the time required to change the information in the forwarding tables on the router.
Many factors are contained in these steps; any one of them could take a long time. In the real world, the total time to complete the steps in this network is less than two or three seconds.
Now change the assumptions just slightly and see what the impact is:
- The Router A to C link and A to B links have equal bandwidth.
- The Router B to D link has a bandwidth of 64 kbps.
- The Router B to C link has a bandwidth of 10 Mbps.
As you can tell, the network conditions have been changed only slightly, but the results are altered dramatically. In this case, the path to 10.1.1.0/24 through Router C is chosen as the best path. EIGRP then examines the path through Router B and finds that it is a loop-free path, based on the information embedded in EIGRP metrics. What happens if the Router B to C link fails?
The process has exactly one step: Router D examines its local EIGRP topology table and finds that an alternate loop-free path is available. Router D installs this alternate route in the local routing table and alters the forwarding information as needed. This processing takes on the order of 150 milliseconds or less.
Using the same network, examine the various reactions of OSPF to link failures. Begin with these:
- The Router B to D link has a cost of 20.
- All other links in the network have a cost of 10.
- All routes are internal OSPF routes.
What happens if the Router B to C link fails?
- Router B and C detect the link failure and wait some period of time, called the link-state advertisement (LSA) generation time. Then they flood modified router LSAs with this information.
- The remaining routers in the network receive this new LSA and place it in their local link-state databases. The routers wait some period of time, called the shortest path first (SPF) wait time, and then run SPF.
- In the process of running SPF, or after SPF has finished running (depending on the implementation), OSPF will install new routing information in the routing table.
With the default timers, it could take up to one second (or longer, in some situations) to detect the link failure and then about three and a half seconds to flood the new information. Finally, it could take up to two and a half seconds before the receiving routers will run SPF and install the new routing information. With faster times and various sorts of tuning, you can decrease these numbers to about one second or even in the 300-millisecond range in some specific deployments.
Making Router D an area border router (ABR) dramatically impacts the convergence time from the Router E perspective because Router D has to perform all the preceding steps to start convergence. After Router D has calculated the new correct routing information, it must generate and flood a new summary LSA to Router E, and Router E has to recalculate SPF and install new routes.
Redistributing 10.1.1.0/24 into the network and making the area that contains Routers A, B, C, and D into a not-so-stubby area (NSSA) throws another set of timers into the problem. Router D now has to translate the Type 7 external LSA into an external Type 5 LSA before it can flood the new routing information to Router E.
These conditions do not even include the impact of multiple routes on the convergence process. EIGRP, for instance, can switch from its best path to a known loop-free path for 10,000 routes just about as fast as it can switch 1 route under similar conditions. OSPF performance is adversely impacted by the addition of 10,000 routes into the network, possibly doubling convergence time.
You can see, then, that it is not so simple to say, "EIGRP will always converge faster than OSPF," "IS-IS will always converge faster than EIGRP," or any other combination you can find. Some people say that OSPF always converges faster than EIGRP, for instance, but they are generally considering only intrarea convergence and not the impact of interarea operations, the impact of various timers, the complexity of the SPF tree, and other factors. Some people say that EIGRP always converges faster than any link-state protocol, but that depends on the number of routers involved in the convergence event. The shorter the query path, the faster the network converges.
If you align all the protocol convergence times based on the preceding examination, you generally find the convergence times in this order, from shortest to longest:
- EIGRP with feasible successors.
- Intrarea OSPF or IS-IS with fast or tuned timers.
-
- EIGRP without feasible successors.
- Intrarea OSPF or IS-IS with standard timers.
- Interarea OSPF or IS-IS.
The last three are highly variable, in reality. In any particular network, OSPF, IS-IS, and EIGRP without feasible successors might swap positions on the list. The network design, configuration, and a multitude of other factors impact the convergence time more than the routing protocol does. You get the best convergence time out of a routing protocol if you play the network design to the strengths of the protocol.
Which Designs Play to the Strength of Each Protocol?
The natural question, after you have decided that network design plays into the suitability of the protocol (you have seen this to be the case for convergence speed, but the same is also true of any other factor you might consider for a given routing protocol, including management, troubleshooting, configuration, and so on) is this:
- What sorts of network designs play into the strengths of any given routing protocol?
This is not an easy question to answer because of the numerous ways to design a network that works. Two- and three-layer network designs, switched cores versus routed cores, switched user access versus routed user access—the design possibilities appear to be endless. To try to put a rope around this problem, the sections that follow examine only a few common topological elements to illustrate how to analyze a specific topology and design and try to determine how a routing protocol will react when running on it.
The specific types of network topologies considered here are as follows:
- Hub-and-spoke designs
- Full mesh designs
- Highly redundant designs
After you consider each of these specific topology elements, you learn the general concepts of hierarchical network design and how each protocol plays against them.
Hub-and-Spoke Topologies
Hub-and-spoke network designs tend to be simple in theory and much harder in implementation. Scaling tends to be the big problem for hub-and-spoke topologies. The primary focus here is the capability of a routing protocol to maintain a multitude of routing neighbors and to converge to massive network events in an acceptable amount of time. Assume, throughout this section, that you are always dealing with dual-homed hub-and-spoke networks, as Figure G-2 illustrates.
Figure G-2 Dual-Homed Hub-and-Spoke Network
Start by considering the following simple question:
- How many spokes or remote routers does it take to really start stressing any routing protocol that is running over a hub-and-spoke network design?
The answer to this question always depends on various factors, including link speed and stability, router processing speed and packet switching speeds, and other factors. However, general experience shows that a high-speed router (in terms of processing power) with reasonably good design supports at least 100 remote sites with any modern routing protocol.
When considering network designs in which hundreds of remote sites are available, however, you need to use special techniques with each protocol to scale the number of remote sites attached to a single pair of hub routers. Look at each protocol to see what types of problems you might encounter and what types of tools are available to resolve those problems:
- OSPF floods topology information to each router within an area and summaries of reachability information into the area. You can place all the remote site routers into one or more OSPF stub areas, which cuts down on the amount of information flooded out to each remote site. Any change on a remote site is still flooded to every other remote site within the same area. For that reason, the design becomes a tradeoff between the number of areas that you want to manage and that the hub routers support and the amount of information that you can flood through the low-speed links connecting the remote stub sites.
- IS-IS also floods information to each router within an area. It does not, by default, flood information from the core of the network (the L2 routing domain) into each area. Again, you still face the tradeoff of how many level 1 routing domains you want to support at the hub routers versus how much information you can flood toward each remote router.
- The primary factor in determining scaling and convergence time in an EIGRP hub-and-spoke network is the number of queries the hub router needs to generate or process when the network changes, and the number of updates the hub router needs to generate toward the remote. Normally, if a hub loses several routes, for instance, it needs to generate queries for each of those routes to each of the remote sites. The remote sites then query the other hub router, which must process and reply to each of the queries. If the number of routes is high, this can be a processor- and memory-intensive task, causing the network to converge slowly, especially if the links between the remote sites and the hub routers are low speed. In this situation, you can summarize routers at the core toward the remote routers and block the routing information transmitted up toward the core routers. You can also cut down on the query range into the hub-and-spoke network dramatically. EIGRP, however, also provides a special operational mode for the remote sites; you can configure the remote sites as stubs, which indicates to the hub routers that the remote sites are never used for transiting traffic. If the remote sites are configured as stub routers, the hub router never queries them for lost routes, and the scaling properties change dramatically.
EIGRP, in theory, scales much better in a hub-and-spoke topology—and this is true in real networks, too. You often find EIGRP hub-and-spoke networks that have more than 500 remote sites attached to a pair of hub routers, over low bandwidth links, in the wild. In contrast, you tend to see OSPF and IS-IS hub-and-spoke networks top out at around 200 remote sites, even if higher bandwidth links are involved.
Full Mesh Topologies
Full mesh topologies are a less common design element in networks, but they are worth considering because the scaling properties of a routing protocol in a full mesh design indicate, to some degree, the scaling properties of the same protocol in a partial mesh design. You can think of a full mesh topology as a special case of a partial mesh topology. Again, look at the challenges and tools that are available for each protocol. Use the network illustrated in Figure G-3 throughout this discussion.
Figure G-3 Full Mesh Network
- Each OSPF router sends topology information to each adjacent neighbor within an area (flooding domain). If Router A receives a new link-state advertisement (LSA), Router D receives three copies of this new LSA: one from Router A, one from Router B, and one from Router C. The Cisco IOS Software implementation of OSPF does have an option to control the flooding through a full mesh network, using the database filter-out command.
- IS-IS is similar to OSPF; each router sends topology information to each adjacent neighbor. Cisco IOS Software enables you to control flooding through mesh groups.
- Each router in an EIGRP network sends each of the routes it is using to forward traffic to each neighbor. In this network, Router D is going to receive three copies of any new routing information that Router A receives, one copy from Router A, one from Router B, and one from Router C. These three copies of the routing information might be the same, but they indicate reachability through three different next hops (or neighbors). Reducing the information propagated through the mesh is difficult, at best. You can filter these routing updates through some paths within the mesh to decrease the amount of information flooded through the mesh, but that also reduces the number of paths usable through the mesh for any specific destination.
OSPF and IS-IS flood extra information through a mesh topology by default, but you can use tools to reduce the amount of flooding in highly meshed topologies. EIGRP sends updates through each router in the mesh, but it is difficult to reduce the number of these updates unless you want to decrease the number of paths that the network actually uses through the mesh.
In the real world, OSPF and IS-IS scale better in highly meshed environments, especially if you implement flooding reduction techniques. This is a matter of scale, of course; networks that have a mesh network of 20 or 30 routers work fine with any of the three routing protocols. However, when the mesh starts surpassing this number of routers, the special techniques that OSPF and IS-IS offer to scale further can make a difference.
Interaction with Hierarchical Designs
Traditional network design is based on layers, either two or three, that abstract the network details into "black boxes" and divide functionality vertically through the network to make management and design easier:
- The two-layer model has aggregation and core layers, or areas, within the network.
- The three-layer model has access, distribution, and core layers.
How do these layered network designs interact with each protocol? Consider each protocol in turn.
OSPF splits flooding domains into areas that are separated by ABRs. Because every router within an area must share the same link-state database to calculate loop-free paths through the network, the only place that route aggregation can be performed is at an ABR. ABRs actually aggregate two types of information:
- Information about the topology of an area that is hidden from other areas at these border edges
- Aggregation of reachability information that can be configured at these border edges
This combination of route aggregation points and flooding domain boundaries in the network implies several things:
- In all three-layer network designs with OSPF, you should place the ABR in the distribution layer of the network.
- In all two-layer network designs with OSPF, you should place the ABR at the aggregation to core layer edge of the network.
- The most aggregation points that you can cross when passing from one edge of the network to the opposite edge of the network is two.
These topological limitations might not be major in smaller networks, but in networks that have thousands of routers, they could impose severe restrictions on the network design. Network designers and operators normally break up OSPF networks at this size into multiple administrative domains, connecting the separate domains through BGP or some other mechanism.
IS-IS is similar to OSPF in its restrictions, except that IS-IS allows the core and outlying flooding domains to overlap. This introduces a degree of flexibility that OSPF does not provide, but you can still only aggregate routing information at the edges where two flooding domain meet, and you cannot build more than two levels of routing into the network.
EIGRP, as a distance vector protocol, does not divide the concepts of topology summarization and routing aggregation; topology beyond one hop away is hidden by the natural operation of the protocol. Figure G-4 illustrates the conceptual difference among EIGRP, OSPF/IS-IS, and RIP in terms of topology information propagated through the network.
Figure G-4 Topological Awareness in Routing Protocols
If you examine the scope through which routing information is transmitted (or known) within a network, you find the following:
- The Bellman-Ford algorithm, used by the Routing Information Protocol (RIP) and the Interior Gateway Routing Protocol (IGRP), uses only information about the local cost to reach a given destination. If Router B is running RIP, it considers only the total cost of the path to reach a destination at Router E when deciding on the best (loop-free) path.
- Diffusing Update Algorithm (DUAL), used by EIGRP, considers the local cost to reach a given destination and the cost of each neighbor to reach the same destination when calculating which available paths are loop free. EIGRP uses an awareness of the topology that is one hop away from the calculating router.
- OSPF and IS-IS, which are link-state protocols, do not use information about the metrics of a neighbor; rather, they count on being aware of the entire topology when calculating a loop-free path. At a flooding domain border, OSPF and IS-IS act much like distance vector protocols. Router A does not know about the topology behind Router B; it only knows the cost of Router B to reach destinations that are attached to Router E.
Because topology information is hidden in the natural processing of EIGRP routing updates, EIGRP is not restricted in where it can aggregate routing information within the network. This provides a great deal of flexibility to network designers who are running EIGRP. Multiple layers of aggregation can be configured in the network. This means that moving from one edge of the network to the opposite edge of the network could mean encountering many more than two aggregation points.
The practical result of the EIGRP capability to aggregate routing information anywhere in the network is that many existing large-scale (2000 router and larger) networks run within a single EIGRP process or administrative domain. The feasibility of building networks this large is based on the capability to use route aggregation to divide the network into multiple layers, or sections, each acting fairly independently of the other. Although it is possible to build an OSPF or IS-IS network this large, designing and managing this network is more difficult because of the restrictions that link-state protocols place on aggregation points.
In general, up to some relative size, the protocols are relatively equal in their capability to work with hierarchical network designs. OSPF and IS-IS tend to be less flexible about where route aggregation can be placed in the network, making it more difficult, in some situations, to fit the network design and the protocol design together. EIGRP excels at fitting into hierarchical network design.
Topological Rules of Thumb
After examining these various network topologies and how each routing protocol tends to react, you can see that when a network does not reach the edge of a specific protocol capability on any given topology, any of the routing protocols is fine. If your network has a specific predominant topology type, however, such as large-scale hub-and-spoke or large-scale full mesh topologies, choosing a protocol to fit those topologies makes sense. You can always compromise in complex areas of your network design by making effective and stable topological design areas in which the routing protocol is really stretched to the edge of its capabilities.
What Are the Tradeoffs?
In many networks, the final decision of which routing protocol is "best" comes down to these issues:
- Convergence speed—How important is convergence speed? How much flexibility do you have in the design of your network around convergence speeds?
- Predominant topologies—Does your network design have one dominant type of topology? Would a full mesh or large-scale hub-and-spoke topology benefit from running one protocol over another?
- Scaling strategy—Does your scaling strategy call for dividing the network into multiple pieces, or does it call for a single IGP domain, with the network broken up into pieces through route aggregation and other techniques?
- Maintenance and management—Which routing protocol fits the network management style of your day-to-day operations? Which one seems easier to troubleshoot and manage in your environment?
Beyond the technical factors are some nontechnical ones. For instance, if you decide to switch protocols, what is the cost for the long term? You need to consider training costs, the cost of revised procedures, design effort, and possible downtime while you convert the network from one protocol to another.
In some situations, this might not be an issue. For instance, if two networks are being merged because of a corporate merger, and each runs a different protocol, the decision might be more open to consideration. If you are going to need to convert one half of the network or the other, you can more carefully consider the technical considerations and make a decision based on those considerations alone. However, if your network is stable today, you should think twice about switching protocols unless a change in the business environment or some major shift in the way the network is built indicates it is an important move to make to meet the needs of the enterprise.
Which Routing Protocol Should My Network Use?
read more | digg story
Wednesday, August 29, 2007
Lifehacker Code: Better Gmail (Firefox extension)
read more | digg story
Tuesday, August 28, 2007
Sunday, August 26, 2007
Damerau-Levenshtein distance - Wikipedia, the free encyclopedia
Damerau-Levenshtein distance - Wikipedia, the free encyclopedia
Damerau-Levenshtein distance is a string metric for measuring edit distance in information theory and computer science. Like Levenshtein distance, it finds the difference between two strings by giving the minimum number of operations needed to transform one string into the other where an operation is an insertion, deletion, or substitution of a single character. Unlike Levenshtein distance, it counts transposition as a single edit operation, rather than two. Damerau-Levenshtein distance is therefore equal to the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Damerau in his seminal paper[1] not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. It is worth noting that Damerau concentrated on single-character misspellings. Edit distance was introduced by Levenshtein,[2] who generalized this concept with multiple edit operations, but did not include transpositions in the set of basic operations.
Contents |
The algorithm
Adding transpositions sounds simple, but in reality there is a serious complication. Firstly, let us consider a direct extension of the formula used to calculate Levenshtein distance. Below is pseudocode for a function DamerauLevenshteinDistance that takes two strings, str1 of length lenStr1, and str2 of length lenStr2, and computes the Damerau-Levenshtein distance between them:
int DamerauLevenshteinDistance(char str1[1..lenStr1], char str2[1..lenStr2])
// d is a table with lenStr1+1 rows and lenStr2+1 columns
declare int d[0..lenStr1, 0..lenStr2]
// i and j are used to iterate over str1 and str2
declare int i, j, cost
for i from 0 to lenStr1
d[i, 0] := i
for j from 1 to lenStr2
d[0, j] := j
for i from 1 to lenStr1
for j from 1 to lenStr2
if str1[i] = str2[j] then cost := 0
else cost := 1
d[i, j] := minimum(
d[i-1, j ] + 1, // deletion
d[i , j-1] + 1, // insertion
d[i-1, j-1] + cost // substitution
)
if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then
d[i, j] := minimum(
d[i, j],
d[i-2, j-2] + cost // transposition
)
return d[lenStr1, lenStr2]
Basically this is the algorithm to compute Levenshtein distance with one additional recurrence:
if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then
d[i, j] := minimum(
d[i, j],
d[i-2, j-2] + cost // transposition
)
Let us calculate pair-wise distances between the strings TO, OT and OST using this algorithm. The distance between TO and OT is 1. The same for OT vs. OST. But the distance between TO and OST is 3, even though the strings can be made equal using one deletion and one transposition. Clearly, the algorithm does not compute precisely the value we want. Furthermore, the triangle inequality does not hold.
In reality this algorithm calculates the cost of the so-called optimal string alignment, which does not always equal the edit distance. It is also easy to see that the cost of the optimal string alignment is the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once. We will also call this value a restricted edit distance. As noted by G. Navarro,[3] in the general case, i.e. when a set of elementary edition operations includes substitutions of arbitrary length strings, unrestricted edit distance is hardly computable. However, the goal is achievable in the simpler case of Damerau-Levenshtein distance. It is also possible (see[4] for details) to compute unrestricted distance treating reversals of arbitrary, not obligatory adjacent characters as a single edit operation.
To devise a proper algorithm to calculate unrestricted Damerau-Levenshtein distance note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity: , where M and N are string lengths. Using the ideas of Lowrance and Wagner,[5] this naive algorithm can be improved to be in the worst case.
It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of[6] for an example of such an adaptation.
Algorithm discussion
The above-described pseudo-code calculates only restricted edit distance. Damerau-Levenshtein distance plays an important role in natural language processing. It is worth noting that in natural languages, strings are short and the number of errors (misspellings) rarely exceeds 2. In such circumstances, restricted and real edit distance differ very rarely. That is why this limitation is not very important. However, one must remember that restricted edit distance does not always satisfy the triangle inequality and, thus, cannot be used with metric trees. An extension of the edit distance algorithm, that does satisfy the triangle inequality is described in the paper F.J. Damerau. A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964.
See also
References
- ^ F.J. Damerau. A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964.
- ^ V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966.
- ^ R. Lowrance, R. Wagner. An Extension of the String-to-String Correction Problem, JACM, 1975.
- ^ G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive 33(1), pp 31-88, 2001.
External links
- ^ Site devoted to fuzzy searching and information retrieval
- Project Dedupe http://dedupe.sourceforge.net
Technorati : Damerau-Levenshtein distance
Del.icio.us : Damerau-Levenshtein distance
Index (search engine) (Wikipedia)
Index (search engine) (Wikipedia)
Search engine indexing entails how data is collected, parsed, and stored to facilitate fast and accurate retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process is Web indexing, within the context of search engines designed to find web pages on the Internet.
Popular engines focus on the full-text indexing of online, natural language documents[1], yet there are other searchable media types such as video, audio[2], and graphics[3][4]. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full text indices, partial text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined interval due to the required time and processing costs, whereas agent-based search engines index in real time.
Indexing
The goal of storing an index is to optimize the speed and performance of finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would take a considerable amount of time and computing power. For example, an index of 10,000 documents can be queried within milliseconds, where a sequential scan of every word in 10,000 large documents could take hours. No search engine user would be comfortable waiting several hours to get search results. The trade off for the time saved during retrieval is that additional storage is required to store the index and that it takes a considerable amount of time to update.
Index Design Factors
Major factors in designing a search engine's architecture include:
- Merge factors
- How data enters the index, or how words or subject features are added to the index during corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.[5]
- Storage techniques
- How to store the index data - whether information should be compressed or filtered
- Index size
- How much computer storage is required to support the index
- Lookup speed
- How quickly a word can be found in the inverted index. How quickly an entry in a data structure can be found, versus how quickly it can be updated or removed, is a central focus of computer science
- Maintenance
- Maintaining the index over time[6]
- Fault tolerance
- How important it is for the service to be reliable, how to deal with index corruption, whether bad data can be treated in isolation, dealing with bad hardware, partitioning schemes such as hash-based or composite partitioning[7], replication
Index Data Structures
Search engine architectures vary in how indexing is performed and in index storage to meet the various design factors. Types of indices include:
- Suffix trees
- Figuratively structured like a tree, supports linear time lookup. Built by storing the suffixes of words. Used for searching for patterns in DNA sequences and clustering. A major drawback is that the storage of a word in the tree may require more storage than storing the word itself.[8] An alternate representation is a suffix array, which is considered to require less memory and supports compression like BWT.
- Trees
- An ordered tree data structure that is used to store an associative array where the keys are strings. Regarded as faster than a hash table, but are less space efficient. The suffix tree is a type of trie. Tries support extendible hashing, which is important for search engine indexing.[9]
- Inverted indices
- Stores a list of occurrences of each atomic search criterion[10], typically in the form of a hash table or binary tree[11][12].
- Citation indices
- Stores the existence of citations or hyperlinks between documents to support citation analysis, a subject of Bibliometrics.
- Ngram indices
- For storing sequences of n length of data to support other types of retrieval or text mining.[13]
- Term document matrices
- Used in latent semantic analysis, stores the occurrences of words in documents in a two dimensional sparse matrix.
Challenges in Parallelism
A major challenge in the design of search engines is the management of parallel processes. There are many opportunities for race conditions and coherence faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.[14]
Inverted indices
Many search engines incorporate an inverted index when evaluating a search query to quickly locate the documents which contain the words in a query and rank these documents by relevance. The inverted index stores a list of the documents for each word. The search engine can retrieve the matching documents quickly using direct access to find the documents for a word. The following is a simplified illustration of the inverted index:
Word | Documents |
---|---|
the | Document 1, Document 3, Document 4, Document 5 |
cow | Document 2, Document 3, Document 4 |
says | Document 5 |
moo | Document 7 |
The above figure is a simplified form of a Boolean index. Such an index would only serve to determine whether a document matches a query, but would not contribute to ranking matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of the word in each document.[15] With position, the search algorithm can identify word proximity to support searching for phrases. Frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.
The inverted index is a sparse matrix given that words are not present in each document. It is stored differently than a two dimensional array to reduce memory requirements. The index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically distributed.[16]
Inverted indices can be programmed in several computer programming languages.[17][18]
Index Merging
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing[19], where a merge involves identifying the document or documents to add into or update in the index and parsing each document into words. For technical accuracy, a merge involves the unison of newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.
After parsing, the indexer adds the containing document to the document list for the appropriate words. The process of finding each word in the inverted index in order to denote that it occurred within a document may be too time consuming when designing a larger search engine, and so this process is commonly split up into the development of a forward index and the process of sorting the contents of the forward index for entry into the inverted index. The inverted index is named inverted because it is an inversion of the forward index.
The Forward Index
The forward index stores a list of words for each document. The following is a simplified form of the forward index:
Document | Words |
---|---|
Document 1 | the,cow,says,moo |
Document 2 | the,cat,and,the,hat |
Document 3 | the,dish,ran,away,with,the,spoon |
The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous processing, which partially circumvents the inverted index update bottleneck.[20] The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
Compression
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk.[21] Consider the following scenario for a full text, Internet, search engine.
- An estimated 2,000,000,000 different web pages exist as of the year 2000[22]
- A fictitious estimate of 250 words per webpage on average, based on the assumption of being similar to the pages of a novel.[23]
- It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character[24][25]
- The average number of characters in any given word on a page can be estimated at 5 (Wikipedia:Size comparisons)
- The average personal computer comes with about 20 gigabytes of usable space[26]
Given these estimates, generating an uncompressed index (assuming a non-conflated, simple, index) for 2 billion web pages would need to store 5 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone, more than the average size a personal computer's free disk space. This space is further increased in the case of a distributed storage architecture that is fault-tolerant. Using compression, the index size can be reduced to a portion of its size, depending on which compression techniques are chosen. The trade off is the time and processing power required to perform compression and decompression.
Notably, large scale search engine designs incorporate the cost of storage, and the costs of electricity to power the storage. Compression, in this regard, is a measure of cost as well.
Document Parsing
Document parsing involves breaking apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. For example, if the full contents of a document consisted of the sentence "Hello World", there would typically be two words found, the token "Hello" and the token "World". In the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization, and sometimes word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, Speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.
Natural language processing, as of 2006, is the subject of continuous research and technological improvement. There are a host of challenges in tokenization, in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which are commonly kept as corporate secrets.
Challenges in Natural Language Processing
Word Boundary Ambiguity - native English speakers can at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the text of other languages such as Chinese, Japanese or Arabic represent a greater challenge as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
Language Ambiguity - to assist with properly ranking matching documents, many search engines collect additional information about words, such as its language or lexical category (part of speech). These techniques are language-dependent as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
Diverse File Formats - in order to correctly identify what bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
Faulty Storage - the quality of the natural language data is not always assumed to be perfect. An unspecified amount of documents, particular on the Internet, do not always closely obey proper file protocol. binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.
Tokenization
Unlike literate human adults, computers are not inherently aware of the structure of a natural language document and do not instantly recognize words and sentences. To a computer, a document is only a big sequence of bytes. Computers do not know that a space character between two sequences of characters means that there are two separate words in the document. Instead, a computer program is developed by humans which trains the computer, or instructs the computer, how to identify what constitutes an individual or distinct word, referred to as a token. This program is commonly referred to as a tokenizer or parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC OR Lex.
During tokenization, the parser identifies sequences of characters, which typically represent words. Commonly recognized tokens include punctuation, sequences of numerical characters, alphabetical characters, alphanumerical characters, binary characters (backspace, null, print, and other antiquated print commands), whitespace (space, tab, carriage return, line feed), and entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
Language Recognition
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, given that many of the later steps are language dependent (such as stemming and part of speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belongs to may involve the use of a language recognition chart.
Format Analysis
Depending on whether the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain, in addition to textual content, formatting information. For example, HTML documents contain HTML tags, which specify formatting information, like whether to start a new line, or display a word in bold, or change the font size or family. If the search engine were to ignore the difference between content and markup, the segments would also be included in the index, leading to poor search results. Format analysis involves the identification and handling of formatting content embedded within documents which control how the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, or text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary and very little information is disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:
- Microsoft Word
- Microsoft Excel
- Microsoft Powerpoint
- IBM Lotus Notes
- HTML
- ASCII text files (a text document without any formatting)
- Adobe's Portable Document Format (PDF)
- PostScript (PS)
- LaTex
- The UseNet archive (NNTP) and other deprecated bulletin board formats
- XML and derivatives like RSS
- SGML (this is more of a general protocol)
- Multimedia meta data formats like ID3
Techniques for dealing with various formats include:
- Using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format
- Writing a custom parser
Some search engines support inspection of files that are stored in a compressed, or encrypted, file format. If working with a compressed format, then the indexer first decompresses the document, which may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:
- ZIP - Zip File
- RAR - Archive File
- CAB - Microsoft Windows Cabinet File
- Gzip - Gzip file
- BZIP - Bzip file
- TAR, GZ, and TAR.GZ - Unix Gzip'ped Archives
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:
- Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. hidden "div" tag in HTML, which may incorporate the use of CSS or Javascript to do so).
- Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
Section Recognition
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web contain erroneous content and side-sections which do not contain primary material, that which the document is about, such as newsletters and corporate reports. For example, this article may display a side menu with words inside links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear in the raw source content sequentially are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, a dilemma ensues where the quality of the index is degraded and search quality is degraded due to the mixed content and improper word proximity. Two primary problems are noted:
- Content in different sections is treated as related in the index, when in reality it is not
- Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents, assuming the goal is to go after the meaning of each document, a sub-goal of providing quality search results.
Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via Javascript. Viewers of web pages in web browsers see this content. If the search engine does not render the page and evaluate the Javascript within the page, it would not 'see' this content in the same way, and index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via Javascript or use the Noscript tag to ensure that the web page is indexed properly. At the same time, this fact is also exploited to cause the search engine indexer to 'see' different content than the viewer.
Meta Tag Indexing
Specific documents offer embedded meta information such as the author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. During earlier growth periods in the Internet and search engine technology (more so, the hardware on which it runs) would only index the keywords in the meta tags for the forward index (and still applying techniques such as stemming and stop words). The full document would not be parsed. At this time, full-text indexing was not as well established, nor was the hardware to support such technology. The design of the HTML markup language initially included support for meta tags for this very purpose of being properly and easily indexed, without requiring tokenization.[27]
As the Internet grew (the number of users capable of browsing the web and the number of websites increased and the technology for making websites and hosting websites improved), many brick-and-mortar corporations went 'online' in the mid 1990s and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive keywords to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively-specified was leading to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.
In the context of Desktop search, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is under the control of the user and the changes in that context only serve to help, unlike Internet search engines which must focus more on the full text index.
See also
- Information retrieval
- Document Retrieval
- Information extraction
- Search engine
- Vertical search
- Desktop search
- List of search engines
- Text Retrieval
- Latent Semantic Indexing
- Keyword In Context Indexing
- Concordance
- Controlled vocabulary
- Natural language processing
- Text mining
- Content analysis
Further reading
- R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 173-189, 1972.
- Donald E. Knuth. The art of computer programming, volume 1 (3rd ed.): fundamental algorithms, Addison Wesley Longman Publishing Co. Redwood City, CA, 1997.
- Donald E. Knuth. The art of computer programming, volume 3: (2nd ed.) sorting and searching, Addison Wesley Longman Publishing Co. Redwood City, CA, 1998.
- Gerald Salton. Automatic text processing, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1988.
- Gerard Salton. Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986.
- Gerard Salton. Lesk, M.E.: Computer evaluation of indexing and text processing. Journal of the ACM. January 1968.
- Gerard Salton. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall Inc., Englewood Cliffs, 1971.
- Gerard Salton. The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989.
- Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Chapter 8. ACM Press 1999.
- G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
- Adelson-Velskii, G.M., Landis, E. M.: An information organization algorithm. DANSSSR, 146, 263-266 (1962).
- Edward H. Sussenguth, Jr., Use of tree structures for processing files, Communications of the ACM, v.6 n.5, p.272-279, May 1963
- Harman, D.K., et al: Inverted files. In Information Retrieval: Data Structures and Algorithms, Prentice-Hall, pp 28-43, 1992.
- Lim, L., et al: Characterizing Web Document Change, LNCS 2118, 133-146, 2001.
- Lim, L., et al: Dynamic Maintenance of Web Indexes Using Landmarks. Proc. of the 12th W3 Conference, 2003.
- Moffat, A., Zobel, J.: Self-Indexing Inverted Files for Fast Text Retrieval. ACM TIS, 349-379, October 1996, Volume 14, Number 4.
- Mehlhorn, K.: Data Structures and Efficient Algorithms, Springer Verlag, EATCS Monographs, 1984.
- Mehlhorn, K., Overmars, M.H.: Optimal Dynamization of Decomposable Searching Problems. IPL 12, 93-98, 1981.
- Mehlhorn, K.: Lower Bounds on the Efficiency of Transforming Static Data Structures into Dynamic Data Structures. Math. Systems Theory 15, 1-16, 1981.
- Koster, M.: ALIWEB: Archie-Like indexing in the Web. Computer Networks and ISDN Systems, Vol. 27, No. 2 (1994) 175-182 (also see Proc. First Int'l World Wide Web Conf., Elsevier Science, Amsterdam, 1994, pp. 175-182)
- Serge Abiteboul and Victor Vianu. Queries and Computation on the Web. Proceedings of the International Conference on Database Theory. Delphi, Greece 1997.
- Ian H Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
- A. Emtage and P. Deutsch, "Archie--An Electronic Directory Service for the Internet." Proc. Usenix Winter 1992 Tech. Conf., Usenix Assoc., Berkeley, Calif., 1992, pp. 93-110.
- M. Gray, World Wide Web Wanderer.
- D. Cutting and J. Pedersen. "Optimizations for Dynamic Inverted Index Maintenance." Proceedings of the 13th International Conference on Research and Development in Information Retrieval, pp. 405-411, September 1990.
- Searching semantic information
References
- ^ Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.
- ^ Stephen V. Rice, Stephen M. Bailey. Searching for Sounds. Comparisonics Corporation. May 2004. Verified Dec 2006
- ^ Charles E. Jacobs, Adam Finkelstein, David H. Salesin. Fast Multiresolution Image Querying. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006
- ^ Lee, James. Software Learns to Tag Photos. MIT Technology Review. November 09, 2006. Pg 1-2. Verified Dec 2006. Commercial external link
- ^ Brown, E.W.: Execution Performance Issues in Full-Text Information Retrieval. Computer Science Department, University of Massachusetts at Amherst, Technical Report 95-81, October 1995.
- ^ Cutting, D., Pedersen, J.: Optimizations for dynamic inverted index maintenance. Proceedings of SIGIR, 405-411, 1990.
- ^ Linear Hash Partitioning. MySQL 5.1 Reference Manual. Verified Dec 2006
- ^ Gusfield, Dan [1997] (1999). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8. .
- ^ trie, Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology.
- ^ Black, Paul E., inverted index, Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology Oct 2006. Verified Dec 2006.
- ^ C. C. Foster, Information retrieval: information storage and retrieval using AVL trees, Proceedings of the 1965 20th national conference, p.192-205, August 24-26, 1965, Cleveland, Ohio, United States
- ^ Landauer, W. I.: The balanced tree and its utilization in information retrieval. IEEE Trans. on Electronic Computers, Vol. EC-12, No. 6, December 1963.
- ^ Google Ngram Datasets for sale at LDC Catalog
- ^ Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. OSDI. 2004.
- ^ Grossman, Frieder, Goharian. IR Basics of Inverted Index. 2002. Verified Dec 2006.
- ^ Tang, Hunqiang. Dwarkadas, Sandhya. "Hybrid Global Local Indexing for Efficient Peer to Peer Information Retrieval". University of Rochester. Pg 1. http://www.cs.rochester.edu/u/sarrmor/publications/eSearch-NSDI.ps
- ^ [1] - inverted index written in Haskell
- ^ [2] - inverted index written in Lisp
- ^ Tomasic, A., et al: Incremental Updates of Inverted Lists for Text Document Retrieval. Short Version of Stanford University Computer Science Technical Note STAN-CS-TN-93-1, December, 1993.
- ^ Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Stanford University. 1998. Verified Dec 2006.
- ^ H.S. Heaps. Storage analysis of a compression coding for a document database. 1NFOR, I0(i):47-61, February 1972.
- ^ Murray, Brian H. Sizing the Internet. Cyveillance, Inc. Pg 2. July 2000. Verified Dec 2006.
- ^ Blair Bancroft. Word Count:A Highly Personal-and Probably Controversial-Essay on Counting Words. Personal Website. Verified Dec 2006.
- ^ The Unicode Standard - Frequently Asked Questions. Verified Dec 2006.
- ^ Storage estimates. Verified Dec 2006.
- ^ PC Technology Guide: Guides/Storage/Hard Disks. PC Technology Guide. 2003. Verified Dec 2006.
- ^ Berners-Lee, T., "Hypertext Markup Language - 2.0", RFC 1866, Network Working Group, November 1995.
- ^ Krishna Nareddy. Indexing with Microsoft Index Server. MSDN Library. Microsoft Corporation. January 30, 1998. Verified Dec 2006. Note that this is a commercial, external link.
Technorati : Search engine indexing