The Divergence from Randomness (DFR) paradigm is a generalisation of one of the very first models of Information Retrieval, Harter's 2-Poisson indexing model. The 2-Poisson model is based on the hypothesis that the level of treatment of informative words is witnessed by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the documents.
On the other hand, there are words that do not possess elite documents, and thus their frequency follows a random distribution, that is, the single-Poisson model. Harter's model was first explored as a retrieval model by Robertson, Van Rijsbergen and Porter. Subsequently it was combined with the standard probabilistic model by Robertson and Walker, giving birth to the family of BM IR models (among them the well-known BM25, which is at the basis of the Okapi system).
DFR models are obtained by instantiating the three components of the framework: selecting a basic randomness model, applying the first normalisation and normalising the term frequencies.
Basic Randomness Models
The Divergence from Randomness models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d." In other words, the term weight is inversely related to the probability of the term-frequency within the document d obtained by a model M of randomness:
weight(t|d) is proportional to -log(Prob_M(t in d | Collection)) (Formula 1)
where the subscript M stands for the type of model of randomness employed to compute the probability. In order to choose the appropriate model M of randomness, we can use different urn models. IR is thus seen as a probabilistic process, which uses random drawings from urn models, or equivalently random placement of coloured balls into urns. Instead of urns we have documents, and instead of different colours we have different terms, where each term occurs with some multiplicity in the urns as any one of a number of related words or phrases, which are called tokens of that term. There are many ways to choose M, each of which provides a basic DFR model. Some of the basic models are the following:
D: Divergence approximation of the binomial
P: Poisson approximation of the binomial
BE: Bose-Einstein distribution
G: Geometric approximation of the Bose-Einstein
I(n): Inverse Document Frequency model
I(F): Inverse Term Frequency model
I(ne): Inverse Expected Document Frequency model
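As an illustration of Formula 1, the I(n) basic model approximates -log of the randomness probability using document frequency; the following sketch assumes base-2 logarithms and the common I(n) form tf * log2((N + 1) / (n_t + 0.5)), with function and variable names of my own choosing:

```python
import math

def basic_model_In(tf, n_t, N):
    """I(n) basic randomness model: approximates
    -log2(Prob_M(t in d | Collection)) as
    tf * log2((N + 1) / (n_t + 0.5)), where tf is the term's
    frequency in the document, n_t the number of documents
    containing the term, and N the collection size."""
    return tf * math.log2((N + 1) / (n_t + 0.5))

# A term occurring 3 times in a document and appearing in only
# 10 of 1000 documents diverges more from randomness (and so
# weighs more) than one appearing in 500 of them:
print(basic_model_In(3, 10, 1000) > basic_model_In(3, 500, 1000))  # True
```

The +1 and +0.5 corrections avoid degenerate values when a term occurs in all, or almost all, documents.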
When a rare term does not occur in a document, it has almost zero probability of being informative for the document. On the contrary, if a rare term has many occurrences in a document, it has a very high probability (almost certainty) of being informative for the topic described by the document. Similarly to Ponte and Croft's language model, a risk component is included in the DFR models. If the term-frequency in the document is high, then the risk of the term not being informative is minimal. In such a case Formula 1 gives a high value, but a minimal risk also has the negative effect of providing a small information gain. Therefore, instead of using the full weight provided by Formula 1, we tune or smooth the weight of Formula 1 by considering only the portion of it that is the amount of information gained with the term.
The more the term occurs in the elite set, the less its term-frequency is due to randomness, and thus the smaller the associated risk is. We use two models for computing the information gain with a term within a document: the Laplace model L and the ratio of two Bernoulli processes B.
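The two first-normalisation factors can be sketched as follows; these are the forms that appear in models such as PL2 (Laplace) and BB2 (Bernoulli), with illustrative function names of my own. Each factor multiplies the weight obtained from Formula 1:

```python
def first_norm_L(tfn):
    """Laplace first normalisation: the factor 1 / (tfn + 1).
    It shrinks as the normalised term frequency tfn grows,
    reflecting the smaller risk of a frequent term not being
    informative."""
    return 1.0 / (tfn + 1.0)

def first_norm_B(tfn, F, n_t):
    """Bernoulli first normalisation: (F + 1) / (n_t * (tfn + 1)),
    where F is the term's frequency in the whole collection and
    n_t the number of documents containing it."""
    return (F + 1.0) / (n_t * (tfn + 1.0))
```

Note that both factors decrease as tfn increases: a high within-document frequency carries little risk, so only a small fraction of Formula 1's (already large) weight is retained as gain.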
Term Frequency Normalisation
Before using the within-document frequency tf of a term, the document length dl is normalised to a standard length sl. Consequently, the term frequencies tf are recomputed with respect to the standard document length, that is:
tfn = tf * log(1+ sl/dl) (normalisation 1)
where tfn is the normalised term frequency. A refined version of the normalisation formula is the following:
tfn = tf * log(1 + c*(sl/dl)) (normalisation 2)
where c is a parameter. The parameter c can be set automatically, as described by He and Ounis, "Term Frequency Normalisation Tuning for BM25 and DFR Models", in Proceedings of ECIR'05, 2005.
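Assuming base-2 logarithms (as is common in DFR implementations) and taking the standard length sl to be, for example, the average document length in the collection, the two normalisations can be sketched as:

```python
import math

def tfn_norm1(tf, dl, sl):
    """Normalisation 1: tfn = tf * log2(1 + sl / dl)."""
    return tf * math.log2(1.0 + sl / dl)

def tfn_norm2(tf, dl, sl, c=1.0):
    """Normalisation 2: tfn = tf * log2(1 + c * sl / dl);
    with c = 1 it reduces to Normalisation 1."""
    return tf * math.log2(1.0 + c * sl / dl)

# A document of exactly standard length leaves tf unchanged
# under Normalisation 2 with c = 1:
print(tfn_norm2(4, 1000, 1000))  # 4 * log2(2) = 4.0
```

Documents longer than the standard length have their term frequencies scaled down, and shorter ones scaled up, so that frequencies are comparable across documents of different lengths.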
For more details about the Divergence from Randomness framework, you may refer to the PhD thesis of Gianni Amati, or to Amati and Van Rijsbergen's paper "Probabilistic models of information retrieval based on measuring divergence from randomness", ACM TOIS 20(4):357-389, 2002.
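Putting the three components together, an InL2-style model (I(n) basic model, Laplace first normalisation, Normalisation 2) could be composed as in the following sketch; this is an illustrative composition under the assumptions stated above, not a definitive implementation:

```python
import math

def inl2_score(tf, dl, avg_dl, n_t, N, c=1.0):
    """Sketch of an InL2-style DFR weight for one term in one document.

    tf     : raw term frequency in the document
    dl     : document length; avg_dl plays the role of sl
    n_t    : number of documents containing the term
    N      : number of documents in the collection
    c      : Normalisation 2 hyper-parameter
    """
    tfn = tf * math.log2(1.0 + c * avg_dl / dl)       # Normalisation 2
    inf1 = tfn * math.log2((N + 1.0) / (n_t + 0.5))   # I(n) basic model
    gain = 1.0 / (tfn + 1.0)                          # First Normalisation L
    return gain * inf1
```

A document's score for a query is then the sum of these per-term weights over the query terms, optionally weighted by the query term frequencies.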