I must admit I found the chapter on Indexing and Searching in our text Modern Information Retrieval to be very vexing. I knew I was in trouble when I read the following just a little over three pages into this 60 page chapter, “Throughout this chapter we assume that the reader is familiar with basic data structures, such as sorted arrays, binary search trees, B-trees, hash tables, and tries.” I went into the chapter knowing nothing of these data structures, and I can’t say that I learned much from the chapter. Fortunately, during my search for articles on indexing and searching, I definitely feel I learned a lot. Below is a summary of two articles I selected on the topics of indexing and searching.
Article One, Inverted indexes: Types and techniques, summary:
Inverted indexes are the primary tool used by search engines to execute queries. We learned in Chapter 9 of our text (Modern Information Retrieval) that there are two elements in the inverted index structure: the vocabulary and the occurrences. An inverted index stores a list of documents for each word in which that word appears instead of the convention method of a document being stored as a list of words (thus the “inverted” comes into play!).
Although inverted indexes are easy to build, they typically use a very large amount of memory, storage space, and/or CPU time. Documents are typically “preprocessed” prior to being indexed using methods such as “lexing”, “stemming”, and eliminating “stop words”. I found what I thought to be a very good presentation on text processing in information retrieval that discussed these terms at: http://www.csee.umbc.edu/~ian/irF02/lectures/03Text-Processing.pdf.
“Lexing” is the process of converting byte stream into tokens, which are each a single alphanumeric word. These are converted into lower case. Decisions must be made regarding how to handle hyphens and other types of punctuation. Numbers are typically not included in the list of tokens because they take up a lot of space in the index and don’t add significant value.
“Stemming” involves removing any prefixes and suffixes from a word. For example, “connected”, “connecting”, and “connection” might all be indexed as “connect”.
“Stop words” are often eliminated because they appear too frequently among almost all documents. Examples of stop words include “a”, “the”, “of”, and “to”.
The trends of using stemming and stop word removal are used less frequently by modern search engines. They prefer having a somewhat larger index and thus having the capability to search for various combinations of words to the possibility of being put at a competitive disadvantage.
There are many different ways to search for information, including:
- Normal queries – typically one word (ex. = magic)
- Boolean queries – uses Boolean logic such as AND, OR, and NOT with multiple words (ex. = magic AND Houdini)
- Phrase queries – a series of words contained within quotation marks “to be or not to be”
- Proximity queries – uses logic term NEAR
- Wildcard queries – a form of inexact matching where a “*” is used to designate letters, words, or a combination of words (ex. = Paris is the * capital of the world would return documents that contain phrases such as “Paris is the fashion capitol of the world” and Paris is the culinary capitol of the world”)
Results of queries are typically ranked in some order of relevance. A common method is the weighing scheme TF-IDF (term frequency – inverse document frequency). This combines proportionally how often a term is found within a document with how often a term is found within all documents being searched.
The authors of this article lament the lack of published research on the topic of query evaluation optimization. They feel this is in large part due to the fact that large companies don’t wish to share any findings they discover because it would potentially aid their competitors.
Mahapatra, A.K., & Biswa, S. (2011). Inverted indexes: Types and techniques. International Journal of Computer Science Issues, 8(4), 384-392.
Article 2, A Two-Tier Distributed Full-Text Indexing System, summary:
Index structure is of paramount importance to search engines. The better the index structure, the quicker and more accurate the search. This can be a challenge in the Internet environment, due to its dynamic nature.
This article focuses on the Internet connection among clusters. It discusses two types of distributed inverted file partitioning schemes: document partitioning and term partitioning.
Attributes of document partitioning include:
- Easier to implement
- Higher parallelism
- Good expansibility
- Better load balancing
- Index built by partitioning documents according to subjects
- Queries related to one subject forwarded directly to a specific index server
- High network burden because of use of different index servers (the more index servers, the slower the response time)
Attributes of term partitioning include:
- Index built on one server then allocated to each server index
- Whole index must be rebuilt when web document is updated
- Network overhead and resource consuming is low since only one server
- Better search efficiency
- Bad load balance
The authors propose a two-tiered distributed full text indexing system that combines attributes of both of these types of document partitioning and provides a balance between search efficiency, resource consuming, and load balance. The scheme uses document partitioning among the clusters and term partitioning inside each cluster.
Zhang, W., Chen, H., He, H. & Chen, G. (2013). A two-tiered distributed full-text indexing system. Applied Mathematics & Information Sciences, 8(1), 321-326. DOI: 10.12785/amis/080139