Searching for Facts in all the Wrong Places
DARREL RAYMOND
Consultant, The Gateway Group
Waterloo, Ontario, Canada
With the explosion of information on the World Wide Web and corporate intranets, the need to search for information has never been more important. What’s more, searching is increasingly an activity for technical professionals of every stripe, not just researchers and librarians.
Not surprisingly, the various search technologies each have pros and cons. Web search engines and product-data-management packages may label simple search techniques (such as keyword search) with more impressive names (such as concept searching). Familiarity with the differences among these methods, if nothing else, helps avoid frustration during searches.
Some search technologies are useful for expanding the set of documents in the solution; some are useful for restricting it. No one search technology is appropriate for every need.
More complex technologies are not necessarily better. Natural language understanding and concept searching can potentially simplify a user’s life, but both promise more than they often deliver. In addition, users of these methods don’t have a simple, clear model of how the system is finding documents, and so can’t tell if the system is operating correctly or whether it is indeed retrieving all relevant documents.
Similarly, full-text and phrase-searching systems might appear to be best, because “they index everything, and it seems that you can’t do better than that.” But text systems index only the text of the document, ignoring additional information that might be provided by categorization. Full-text searching also places a burden on users, who must think of all possible word and phrase variants to locate all the relevant documents.
Keyword-based systems are simple and cheap, but to work well require a consistent indexing strategy. This may involve the use of humans to categorize documents, or may require authors to provide keywords, descriptive titles, and good abstracts.
Traditional databases are highly structured and focus on numbers, which have an exact meaning. Document-searching technologies, on the other hand, must focus on words. But words derive much of their power from being elusive, ambiguous, and open to reinterpretation. It should not be surprising, then, that document searching is inherently an incomplete and approximate enterprise.
Basics of Searching
All document-search techniques have the same basic structure. A user specifies a search query, a description of an ideal document that would satisfy the need for information. The database contains document descriptors for each indexed document; the actual search compares the query description against the document descriptors, collecting those that match.
Searching technologies differ in three main ways: the kind of information used to describe documents; the rules that decide when a query description matches a document descriptor; and the speed with which matching and updating of the database can take place.
Keyword Searching   
In keyword searching, a set of keywords describes each document, and the user enters a query that consists of keywords. The search engine records a match if a document descriptor contains these words.
Keyword systems work best when the user specifies words that are highly selective — they occur infrequently in the whole collection of documents, but occur frequently in the documents the user is interested in. Words of low selectivity are often called stop words. Examples include “is,” “to,” “and,” “the,” or other frequently used words. Even fairly specific terms can be stop words. For example, in the documents of a steel company, “steel” would be a stop word because it shows up frequently (and hence has low selectivity).
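Keyword matching and selectivity can be illustrated with a few lines of Python. This is only a sketch: the `descriptors` collection and both function names are invented for the example, and selectivity is approximated here by the fraction of documents that contain a word.

```python
# Hypothetical toy collection for a steel company: each document is
# described by a small set of keywords.
descriptors = {
    "doc1": {"steel", "furnace", "safety"},
    "doc2": {"steel", "rolling", "mill"},
    "doc3": {"steel", "furnace", "emissions"},
}


def keyword_search(query_words, descriptors):
    """Return the documents whose descriptor contains every query word."""
    query = set(query_words)
    return sorted(doc for doc, words in descriptors.items() if query <= words)


def document_frequency(word, descriptors):
    """Fraction of documents whose descriptor contains the word.

    A value near 1.0 means low selectivity (a stop-word candidate).
    """
    hits = sum(1 for words in descriptors.values() if word in words)
    return hits / len(descriptors)


print(keyword_search(["furnace"], descriptors))      # ['doc1', 'doc3']
print(document_frequency("steel", descriptors))      # 1.0
```

In this collection "steel" appears in every descriptor, so querying on it narrows nothing, while "furnace" removes a third of the documents.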
Some keyword-based systems employ a restricted or controlled vocabulary; they describe documents using words from that vocabulary, and users can consult the vocabulary to find words for querying. Other keyword systems approximate this by using only words found in document titles and abstracts.
Concept searching is an enhancement of keyword searching. Concept searching engines use a thesaurus to expand the set of search terms the user provides, trying to find more potentially relevant documents. Some concept searching engines also make use of morphological or grammatical knowledge to search for plurals and grammatical variants.
Concept searching is useful for situations in which your information needs are somewhat vague, or when you have run out of ideas for search terms. The basic problem with concept searching is that people generally have different interpretations of a given concept, and their interpretations change depending on their information need.
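Thesaurus expansion of this kind might look roughly like the following sketch. The `THESAURUS` table, the sample descriptors, and the function names are all hypothetical; commercial engines use far larger vocabularies and may add grammatical variants as well.

```python
# Hypothetical thesaurus: maps a term to related terms.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "fast": {"quick", "rapid"},
}

descriptors = {
    "doc1": {"automobile", "repair"},
    "doc2": {"bicycle", "repair"},
}


def expand_query(terms, thesaurus):
    """Return the original terms plus any thesaurus entries for them."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded


def concept_search(terms, descriptors, thesaurus):
    """Match documents containing ANY of the expanded terms."""
    expanded = expand_query(terms, thesaurus)
    return sorted(doc for doc, words in descriptors.items() if expanded & words)


print(concept_search(["car"], descriptors, THESAURUS))  # ['doc1']
```

A plain keyword search for "car" would miss doc1 entirely; the expansion finds it, but only because whoever built the thesaurus shared the searcher's idea of what "car" means.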
As a searching technology, keyword-based retrieval is relatively well understood and can be efficient. It is not hard to update a keyword-based index. Keyword-based searching is effective if each document descriptor has enough keywords.
Boolean Searching   
Some keyword searching systems permit Boolean searching, which lets users require that several words all appear, offer some words as alternatives, and specify that some words should not be in the document descriptor. The three Boolean operators are AND, OR, and NOT. As an example, “bodkin AND uncle AND Denmark” might select the play Hamlet, while “bodkin OR uncle” would also find other plays that have uncles in them.
Boolean searching seems simple, but people often misuse it. Statistical analysis shows that AND is too powerful at reducing matches. Human factors research shows that most people cannot properly pose a query involving NOT. OR is relatively safe to use. Another problem with Boolean querying is that it can be relatively complicated. This is particularly so for posing a Boolean query that searches for documents matching only a subset of keywords.
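The three operators map directly onto set operations, which the sketch below illustrates. The play descriptors are invented for the example, and `at_least` is a hypothetical helper showing why a "match any two of these three words" query becomes awkward in pure Boolean form: it has to OR together every AND combination.

```python
from itertools import combinations

# Hypothetical descriptors for three plays.
descriptors = {
    "Hamlet": {"bodkin", "uncle", "denmark", "ghost"},
    "Richard III": {"uncle", "tower", "crown"},
    "Macbeth": {"dagger", "witches", "crown"},
}
all_docs = set(descriptors)


def docs_with(word):
    """Set of documents whose descriptor contains the word."""
    return {doc for doc, words in descriptors.items() if word in words}


# AND is intersection, OR is union, NOT is difference from the collection.
and_result = docs_with("bodkin") & docs_with("uncle") & docs_with("denmark")
or_result = docs_with("bodkin") | docs_with("uncle")
not_result = docs_with("crown") & (all_docs - docs_with("uncle"))


def at_least(k, words):
    """Documents matching at least k of the words: an OR over every k-way AND."""
    result = set()
    for combo in combinations(words, k):
        result |= set.intersection(*(docs_with(w) for w in combo))
    return result


print(sorted(and_result))   # ['Hamlet']
print(sorted(or_result))    # ['Hamlet', 'Richard III']
print(sorted(not_result))   # ['Macbeth']
print(sorted(at_least(2, ["bodkin", "uncle", "crown"])))  # ['Hamlet', 'Richard III']
```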
Weighted Searching   
Weighted searching is the main alternative to Boolean searching. Instead of specifying that a document contain “Hamlet AND uncle AND bodkin,” you assign weights to the search terms, as in: “Hamlet 0.95, uncle 0.5, bodkin 0.78.” The search engine uses the weights to determine the relative importance of the query words.
Some search systems also weight the words used as document representatives. Weights given may be based on a word’s selectivity, its frequency in the document, or other properties.
The basic problem with weighted searching is that it’s hard to understand what the weights really mean. We know what it means for a document to contain the word “Hamlet.” But what does it mean for it to contain “Hamlet” 0.75? Furthermore, it is possible that varying the weights only slightly may lead to a completely different solution.
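One simple way to read the weights is as points: a document scores the sum of the weights of the query words it contains, and results are ranked by score. The sketch below assumes that scoring rule, which is only one of many in use, and the three sample descriptors are invented. It also shows how a modest change to one weight can reorder the results.

```python
# Hypothetical descriptors; the query weights come from the example in the text.
descriptors = {
    "doc1": {"hamlet", "uncle", "bodkin"},
    "doc2": {"hamlet", "uncle"},
    "doc3": {"uncle", "bodkin"},
}


def score(words, weights):
    """Sum the weights of the query terms present in a descriptor."""
    return sum(weight for term, weight in weights.items() if term in words)


def weighted_search(weights, descriptors):
    """Rank documents by descending score, dropping those that score zero."""
    scores = {doc: score(words, weights) for doc, words in descriptors.items()}
    ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return [doc for doc in ranked if scores[doc] > 0]


original = {"hamlet": 0.95, "uncle": 0.5, "bodkin": 0.78}
adjusted = {"hamlet": 0.75, "uncle": 0.5, "bodkin": 0.78}

print(weighted_search(original, descriptors))  # ['doc1', 'doc2', 'doc3']
print(weighted_search(adjusted, descriptors))  # ['doc1', 'doc3', 'doc2']
```

Lowering the weight of "hamlet" from 0.95 to 0.75 swaps the second and third results, even though the query words themselves did not change.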
Similarity-based searching and fuzzy match retrieval are variants of weighted searching.
Full-text Searching
A keyword-based system indexes a few words or representatives of a document. A full-text retrieval system indexes the whole text. An important virtue of a full-text index is that you need not worry about what words to search on, because the whole document is indexed. Another advantage is that a full-text system may index many fragments of text that a keyword-based system would not (such as numbers, dates, prices, and punctuation).
Full-text systems have disadvantages. Indexing is slower, because they must process significantly more text. There is generally a need for file format converters to extract text from different formats. The size of the index is large, maybe even larger than the documents themselves. From the standpoint of maintenance, the updating of the index is often a costly activity as well.
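A full-text index is commonly built as an inverted index, a table from each word to the documents that contain it. The sketch below assumes that structure and a deliberately crude tokenizer; the two sample memos are invented. It shows why the index grows with the text: every distinct token, including numbers, gets an entry.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Hypothetical documents.
documents = {
    "memo1": "Invoice 1047 was paid on 12 March.",
    "memo2": "Invoice 2210 is still outstanding.",
}
index = build_index(documents)

print(sorted(index["invoice"]))  # ['memo1', 'memo2']
print(sorted(index["1047"]))     # ['memo1']
```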
Some full-text systems support phrase searching, where you can look for phrases in addition to individual words. Besides searching for “Hamlet” or “bodkin,” for example, you could also search for the phrase “shuffled off this mortal coil.”
The basic advantage of phrase searching is its greater degree of selectivity. Many words that are not particularly selective by themselves become extremely selective when combined as a phrase. The basic problem with phrase searching is that it is even more restrictive than a Boolean AND query: the words must not only all be present but also appear together in the given order.
Phrase searching is complicated to implement because phrases overlap in a text, whereas words do not. Phrase-searching indexes are generally larger than full-text indexes. Phrase searching engines cannot discard stop words, because these words gain significance in a phrase. The phrase “to be or not to be,” for example, consists completely of stop words, but is highly significant.
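One common way to support phrases is to record the position of every word, stop words included, and then check for consecutive positions. The sketch below assumes that approach; the function names and the two sample texts are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def build_positional_index(documents):
    """Map token -> {doc_id: [positions]}; stop words are kept on purpose."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, token in enumerate(tokenize(text)):
            index[token][doc_id].append(pos)
    return index


def phrase_search(phrase, index):
    """Return documents in which the phrase's words occur consecutively."""
    words = tokenize(phrase)
    if not words:
        return []
    matches = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                matches.append(doc_id)
                break
    return sorted(matches)


# Hypothetical documents.
documents = {
    "hamlet": "To be, or not to be, that is the question",
    "other": "It is not to be or to seem",
}
index = build_positional_index(documents)
print(phrase_search("to be or not to be", index))  # ['hamlet']
```

Both texts contain every word of the phrase, so a Boolean AND would return both; only the positional check separates them.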
Proximity searching is a kind of fuzzy phrase searching. A garden-variety phrase search matches exactly the words you specify, in exactly that order. In contrast, a proximity search specifies two or more words that should be close to each other. An example of a proximity search is “shuffle NEAR coil.” Some proximity searching systems let you specify the width of the range in characters or words.
Proximity searching is a bit like a weighted search on word positions. Searching for “government corruption” (a phrase search) will retrieve some documents, but searching for “government NEAR corruption” will find more. Proximity searching works well when phrases tend to have many variants, or when words of moderate selectivity tend to sit close to one another. Unfortunately, proximity searching is expensive in computer time.
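The difference between a phrase match and a NEAR match can be sketched as follows. The window of five words, the function names, and the three sample sentences are assumptions made for the example; real systems choose their own defaults and units.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def positions(word, tokens):
    return [i for i, token in enumerate(tokens) if token == word]


def is_phrase(word_a, word_b, text):
    """True if word_b immediately follows word_a somewhere in the text."""
    tokens = tokenize(text)
    return any(b - a == 1
               for a in positions(word_a, tokens)
               for b in positions(word_b, tokens))


def near(word_a, word_b, text, window=5):
    """True if the two words occur within `window` words, in either order."""
    tokens = tokenize(text)
    return any(abs(a - b) <= window
               for a in positions(word_a, tokens)
               for b in positions(word_b, tokens))


# Hypothetical documents.
documents = {
    "d1": "Government corruption dominated the hearings.",
    "d2": "Corruption within the local government was alleged.",
    "d3": "The government published its budget. Years later, "
          "an unrelated inquiry into corruption opened.",
}
phrase_hits = sorted(d for d, t in documents.items()
                     if is_phrase("government", "corruption", t))
near_hits = sorted(d for d, t in documents.items()
                   if near("government", "corruption", t))
print(phrase_hits)  # ['d1']
print(near_hits)    # ['d1', 'd2']
```

The nested comparison of every position pair also hints at where the computing cost comes from.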
Range Searching
Range searching is possible when documents are represented by values that can be ordered. An example is “documents published between January and June of last year.” Any values chosen from an ordered domain (including time, money, revisions, and dimensions) can be the subject of a range search.
Range searching is generally expensive to implement in document managers, though it is a staple of relational database systems. In addition, proximity searching can be implemented as a kind of range search.
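When the values are kept in sorted order, a range query reduces to two binary searches. The sketch below assumes a small in-memory list of publication dates; the document ids and dates are invented, and a database would use an ordered index on disk instead.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical (publication date, document id) pairs, kept sorted by date.
by_date = sorted([
    (date(2023, 2, 14), "spec-a"),
    (date(2023, 5, 30), "report-b"),
    (date(2023, 9, 1), "memo-c"),
    (date(2024, 1, 10), "spec-d"),
])
dates = [published for published, _ in by_date]


def range_search(low, high):
    """Return ids of documents with low <= publication date <= high."""
    start = bisect_left(dates, low)    # first entry not before `low`
    end = bisect_right(dates, high)    # one past the last entry not after `high`
    return [doc_id for _, doc_id in by_date[start:end]]


print(range_search(date(2023, 1, 1), date(2023, 6, 30)))  # ['spec-a', 'report-b']
```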
Bibliographic Coupling
Two documents are said to be bibliographically coupled if both reference the same third document. The earliest use of bibliographic coupling was for indexing academic papers through their bibliographies, hence the name. In scientific fields, the fact that two publications jointly reference a third is evidence that both are related by academic “pedigree.”
Bibliographic coupling is uncommon in document management systems, but is beginning to appear in systems for searching the Web.
Some research tools for searching the Web also offer cocitation, a companion to bibliographic coupling. Two documents are considered related by cocitation if a third document references both of them. As with bibliographic coupling, cocitation is an information retrieval technique originating with academic papers and bibliographies. The basic idea is that if two papers are referenced from a third, there is some evidence to believe that the two are related (otherwise, the author would not have referenced them both).
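Both link-based relations reduce to simple lookups in a table of which document references which. The sketch below assumes a hypothetical `references` table; one function lists the documents that a given pair both reference, the other lists the documents that reference both members of the pair.

```python
# Hypothetical link table: document -> set of documents it references.
references = {
    "A": {"X", "Y"},
    "B": {"X", "Z"},
    "C": {"A", "B"},
    "D": {"A", "B", "X"},
}


def shared_references(a, b):
    """Documents that both a and b reference."""
    return references.get(a, set()) & references.get(b, set())


def shared_citers(a, b):
    """Documents that reference both a and b."""
    return {doc for doc, refs in references.items() if a in refs and b in refs}


print(sorted(shared_references("A", "B")))  # ['X']
print(sorted(shared_citers("A", "B")))      # ['C', 'D']
```

The size of either result can serve as a rough measure of how strongly two documents are related.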
Relevance Feedback
One recurring problem with search systems is in getting users to specify the right words or other document representatives. Relevance feedback tries to address this by using documents themselves as queries.
Relevance feedback is sometimes called “query by example.” Users browse a database until they find one or more documents that seem appropriate. Then they pose a query to the system that says, in effect, “find more documents like this.” The search engine then extracts from the document its keywords, title, full text, or other representatives, and treats these as the input for more searching. Systems based on relevance feedback may use Boolean, weighted, full text, proximity, or range searching.
The main advantage of relevance feedback is simplicity for users. The main disadvantage is that the search mechanism is a black box; you really have no idea how the system picks documents similar to the one you gave as an example. Consequently, relevance feedback techniques are useful mainly when you are out of ideas for locating more relevant information.
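A bare-bones version of “find more documents like this” can be sketched as follows. It assumes the simplest possible mechanism: the example document's own words, minus an illustrative stop-word list, become the query, and other documents are ranked by how many of those words they share. The documents and names are invented, and real systems may use very different similarity measures.

```python
import re

# Illustrative stop-word list; a real system would derive one from the collection.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}


def terms(text):
    """Distinct non-stop words of a text."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}


def more_like_this(example_id, documents):
    """Rank the other documents by the number of terms shared with the example."""
    query = terms(documents[example_id])
    scored = []
    for doc_id, text in documents.items():
        if doc_id == example_id:
            continue
        overlap = len(query & terms(text))
        if overlap:
            scored.append((overlap, doc_id))
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical documents.
documents = {
    "d1": "Corrosion of steel pipes in marine environments",
    "d2": "Preventing corrosion in steel bridges",
    "d3": "Marine biology field guide",
    "d4": "Quarterly sales forecast",
}
print(more_like_this("d1", documents))  # ['d2', 'd3']
```

Note that d3 is returned only because it shares the word "marine"; the user sees the result but not the reason, which is the black-box problem in miniature.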
Natural Language
The goal of natural language retrieval is the ability to pose questions in natural language to a computer, just as one would pose them to a human (such as a researcher or librarian). The obvious virtue of the scheme is a completely natural user interface. The difficulty is that the riddle of fully understanding natural language remains largely unsolved. Simple ambiguities confuse computers, and relatively little progress has taken place on understanding language.
Most systems claiming to do natural language searches are essentially extracting keywords out of a natural language query, then using these for a weighted search.
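The keyword extraction step can be as plain as the sketch below, which assumes an illustrative list of question words to discard and gives every surviving word the same weight. Both the word list and the equal weighting are assumptions made for the example, not a description of any particular product.

```python
import re

# Illustrative list of words to discard from a question.
QUESTION_WORDS = {
    "what", "which", "who", "how", "do", "does", "we", "i", "have", "has",
    "is", "are", "the", "a", "an", "of", "in", "on", "to", "for", "about",
    "find", "documents",
}


def extract_keywords(question):
    """Keep the words of a question that are not in the discard list."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return [t for t in tokens if t not in QUESTION_WORDS]


def to_weighted_query(question):
    """Turn a question into a weighted query with equal weights."""
    return {keyword: 1.0 for keyword in extract_keywords(question)}


question = "What documents do we have about corrosion in steel pipes?"
print(extract_keywords(question))  # ['corrosion', 'steel', 'pipes']
```

Nothing in this process understands the question; rephrasing it as “Which pipes are not made of steel?” would produce much the same keywords.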