Stopwords are words that occur frequently in a document but carry little meaning for InformationRetrieval. It is often claimed that stopwords do not contribute to the context or information content of documents, and that an IR system should remove them both during indexing and before querying. However, using a single fixed stopword list across different document collections can be detrimental to retrieval effectiveness.
A term-based random sampling method for automatically deriving a stopword list for a given collection is presented in:
T. Lo, B. He and I. Ounis. Automatically Building a Stopword List for an Information Retrieval System. To appear in the Journal on Digital Information Management: special issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR'05).
Term-based random sampling is based on the Kullback-Leibler divergence measure. The approach determines how informative a term is, which makes it possible to derive a stopword list automatically. The generated stopword list consists of the least informative words extracted from samples of the collection.
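
The idea can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the function names, the parameters (n_samples, per_sample, list_size), the normalisation by the largest weight in each sample and the merging of samples by averaging are all assumptions made for illustration. Each sample is the set of documents containing a randomly chosen term, and a term t is weighted by P_sample(t) * log2(P_sample(t) / P_coll(t)), its contribution to the Kullback-Leibler divergence between the sample and the whole collection. Terms with the lowest weights are treated as the least informative.

```python
import math
import random
from collections import Counter, defaultdict


def kl_weights(sample_docs, coll_tf, coll_len):
    """Weight each term in the sample by its contribution to the
    KL divergence between the sample and the whole collection."""
    sample_tf = Counter(t for doc in sample_docs for t in doc)
    sample_len = sum(sample_tf.values())
    weights = {}
    for term, tf in sample_tf.items():
        p_sample = tf / sample_len            # term probability in the sample
        p_coll = coll_tf[term] / coll_len     # term probability in the collection
        weights[term] = p_sample * math.log2(p_sample / p_coll)
    return weights


def build_stopword_list(docs, n_samples=50, per_sample=5, list_size=5, seed=0):
    """Hypothetical helper: docs is a list of tokenised documents.
    All parameter names and defaults are illustrative assumptions."""
    rng = random.Random(seed)
    coll_tf = Counter(t for doc in docs for t in doc)
    coll_len = sum(coll_tf.values())
    vocab = sorted(coll_tf)
    collected = defaultdict(list)

    for _ in range(n_samples):
        # Sample = every document that contains a randomly chosen term.
        seed_term = rng.choice(vocab)
        sample = [doc for doc in docs if seed_term in doc]
        weights = kl_weights(sample, coll_tf, coll_len)
        top = max(weights.values())
        if top <= 0:
            # The sample does not diverge from the collection; skip it.
            continue
        # Keep the per_sample least informative terms, normalised by the
        # largest weight in this sample (an assumption of this sketch).
        ranked = sorted(weights.items(), key=lambda kv: kv[1])[:per_sample]
        for term, weight in ranked:
            collected[term].append(weight / top)

    # Merge the samples by averaging, then keep the lowest-weighted terms.
    average = {t: sum(ws) / len(ws) for t, ws in collected.items()}
    return sorted(average, key=average.get)[:list_size]


if __name__ == "__main__":
    docs = [["the", "cat"], ["the", "dog"]]
    print(build_stopword_list(docs, per_sample=1, list_size=1))  # ['the']
```

In this toy collection "the" is as frequent in every sample as in the collection as a whole, so its weight is zero and it ranks as the least informative term. A real collection would need far more samples and a much larger list size.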