Automatically ranking documents is one of the key problems that has been, and is still being, tackled in Information Retrieval. The ultimate goal of any IR system is to retrieve and rank documents relevant to a user's query.
Putting retrieval to one side, we wish to rank the documents in a collection fairly, without biasing our results based upon the attributes of a given document. To that end we have devised, and extensively developed, many mechanisms for rewarding and penalising document ranks based upon their attributes.
One such attribute on which we have to normalise documents is their length; left uncorrected, it gives longer documents an obvious advantage over shorter ones. Singhal et al., in their 1996 paper Pivoted Document Length Normalization, give a concise definition of length normalisation:
"Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths."
Let us consider a system which ranks documents that have been retrieved, but does not take into account the length of these documents. We encounter two major flaws in this system's rankings:
1. Term frequencies: in longer documents, terms tend to occur more often (a document about cars that is three pages long is likely to contain more occurrences of the word car than a three-paragraph document that is also about cars). This leads to longer documents being ranked above shorter ones, even when they are less relevant.
2. Term appearances: longer documents contain more noise; that is, they are more likely to contain terms that appear simply because of the document's length rather than because they contribute to what the document is about. This leads to longer documents matching a larger set of queries, and being ranked highly where they may not be relevant.
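The two flaws above can be seen in a minimal sketch. The scoring function raw_tf_score and both example documents are hypothetical, invented purely for illustration; the function simply sums raw term counts and ignores length altogether.

```python
from collections import Counter


def raw_tf_score(query, document):
    """Score a document by summing the raw counts of each query term.

    Hypothetical scorer for illustration: it ignores document length.
    """
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query.lower().split())


# A short document that is clearly about cars.
short_doc = "car engines power every car"

# A longer document that only mentions a car in passing.
long_doc = (
    "the weather was mild and the town was quiet so the family took the car "
    "to the coast then the car broke down and the car was towed home"
)

# Flaw 1: the longer document wins on raw term frequency.
short_score = raw_tf_score("car", short_doc)
long_score = raw_tf_score("car", long_doc)
print(short_score, long_score)

# Flaw 2: the longer document also matches an unrelated query.
print(raw_tf_score("weather", short_doc), raw_tf_score("weather", long_doc))
```

Here the passing mentions in the longer document outnumber the occurrences in the short, on-topic one, and the longer document also picks up a match for a query it is not really about.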
Fixing through normalisation
We need to be able to counter these two problems. We do this by normalising for document length when we rank. This lets us (1) give a more balanced rating, based on how often a term appears in a document relative to that document's length, and (2) reduce the influence of noise terms that are unlikely to reflect the document's actual content.
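As a rough sketch of the idea, one of the simplest possible normalisations divides the term count by the number of tokens in the document. This is an assumption made for illustration only and is not the pivoted scheme of Singhal et al.; the function normalised_tf_score and the example documents are likewise hypothetical.

```python
from collections import Counter


def normalised_tf_score(query, document):
    """Sum the query-term counts, then divide by the document's token count.

    Hypothetical scorer for illustration; real systems use more refined
    normalisation functions than a plain division by length.
    """
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # an empty document cannot match anything
    counts = Counter(tokens)
    matches = sum(counts[term] for term in query.lower().split())
    return matches / len(tokens)


# A short document that is clearly about cars (5 tokens, 2 matches).
short_doc = "car engines power every car"

# A longer document that mentions a car in passing (29 tokens, 3 matches).
long_doc = (
    "the weather was mild and the town was quiet so the family took the car "
    "to the coast then the car broke down and the car was towed home"
)

short_norm = normalised_tf_score("car", short_doc)
long_norm = normalised_tf_score("car", long_doc)
print(short_norm > long_norm)
```

With the count taken relative to length, the short, focused document now outranks the longer one for the query car, and a single passing match in a long document contributes only a small score.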