In all of the DivergenceFromRandomness (DFR) models implemented in Terrier, the relevance score of a document d for a query Q is given by the following formula.
The LaTeX source code of the formula is as follows:
\begin{equation}\label{eDFRFramework}
score(d, Q)=\sum_{t\in Q}qtw\cdot w(t,d)
\end{equation}
where t is a query term in Q, qtw is the query term weight, given by qtf/qtf_max, qtf is the query term frequency, and qtf_max is the maximum qtf among all the query terms. w(t,d) is the weight of document d for a query term t, and is given by one of the DFR models described below.
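The aggregation above can be sketched in a few lines of Python. This is only an illustration, not Terrier's implementation: the function names and the `term_weight` callable (standing in for w(t,d) of one fixed document) are hypothetical, and the sum is assumed to run over the distinct terms of the query.

```python
from collections import Counter


def query_term_weights(query_terms):
    """Return qtw = qtf / qtf_max for each distinct query term."""
    qtf = Counter(query_terms)
    qtf_max = max(qtf.values())
    return {t: f / qtf_max for t, f in qtf.items()}


def score(query_terms, term_weight):
    """Sum qtw * w(t, d) over the distinct terms of the query.

    term_weight is a hypothetical callable that returns w(t, d)
    for one fixed document d.
    """
    qtw = query_term_weights(query_terms)
    return sum(weight * term_weight(t) for t, weight in qtw.items())


# Toy query in which "terrier" occurs twice and "dfr" once.
toy_w = {"terrier": 2.0, "dfr": 3.0}  # made-up w(t, d) values
print(score(["terrier", "dfr", "terrier"], toy_w.get))  # 1.0*2.0 + 0.5*3.0 = 3.5
```

Any of the DFR models below could be plugged in as `term_weight`.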
DLH
The LaTeX source code of the DLH HypergeometricModel is as follows:
\begin{equation}\label{eDLH}
\frac{1}{tf+0.5}\cdot\bigg(\log_2(\frac{tf\cdot avg\_l}{l}\cdot\frac{N}{F})+(l-tf)\log_2(1-f)+0.5\log_2\big(2\pi tf(1-f)\big)\bigg)
\end{equation}
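As a rough sketch, the DLH expression can be transcribed into Python exactly as it is printed above. The function name is hypothetical, and the symbol f is taken here to be the relative within-document frequency tf/l; that reading is an assumption, since the notation list does not define f for DLH. The formula is only defined for 0 < tf < l.

```python
import math


def dlh(tf, l, avg_l, N, F):
    """DLH weight, transcribed term by term from the formula above.

    Assumption: f = tf / l (relative within-document frequency).
    Requires 0 < tf < l so that every logarithm is defined.
    """
    f = tf / l
    return (
        math.log2((tf * avg_l / l) * (N / F))
        + (l - tf) * math.log2(1 - f)
        + 0.5 * math.log2(2 * math.pi * tf * (1 - f))
    ) / (tf + 0.5)


# Made-up statistics: term occurs 3 times in a 100-token document.
print(dlh(tf=3, l=100, avg_l=120, N=10000, F=50))
```

Note that DLH has no free parameter c, unlike the models that rely on normalisation 2.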
BB2
The LaTeX source code of the BB2 model is as follows:
\begin{equation}\label{eBB2}
\frac{F+1}{n_t\cdot(tfn+1)}\big(-\log_2(N-1)-\log_2(e)+f(N+F-1,N+F-tfn-2)-f(F,F-tfn)\big)
\end{equation}
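A minimal Python sketch of BB2 follows. It assumes the normalised term frequency tfn has already been computed, and it uses the relation f from the Notations section, read as f(n,m) = (m+0.5)·log2(n/m) + (n-m)·log2(n). The function names are illustrative only.

```python
import math


def stirling_f(n, m):
    """Relation f from the Notations section (Stirling formula)."""
    return (m + 0.5) * math.log2(n / m) + (n - m) * math.log2(n)


def bb2(tfn, N, F, n_t):
    """BB2 weight for a normalised term frequency tfn.

    Requires F > tfn so that f(F, F - tfn) is defined.
    """
    return (F + 1) / (n_t * (tfn + 1)) * (
        -math.log2(N - 1)
        - math.log2(math.e)
        + stirling_f(N + F - 1, N + F - tfn - 2)
        - stirling_f(F, F - tfn)
    )


# Made-up collection statistics for illustration.
print(bb2(tfn=3.0, N=10000, F=50, n_t=30))
```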
PL2
The LaTeX source code of the PL2 model is as follows:
\begin{equation}\label{ePL2}
\frac{1}{tfn+1}\big(tfn\cdot\log_2\frac{tfn}{\lambda}+(\lambda-tfn)\cdot\log_2e+0.5\cdot\log_2(2\pi\cdot tfn)\big)
\end{equation}
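The PL2 formula translates directly into code. The sketch below takes tfn and λ as inputs, with λ = F/N as stated in the Notations section; the function name and the sample statistics are made up for illustration.

```python
import math


def pl2(tfn, lam):
    """PL2 weight for normalised term frequency tfn and Poisson mean lam."""
    return (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    ) / (tfn + 1)


# lam = F / N for a term with F = 50 occurrences in N = 10000 documents.
print(pl2(tfn=3.0, lam=50 / 10000))
```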
I(n)L2
The LaTeX source code of the I(n)L2 model is as follows:
\begin{equation}\label{eInL2}
\frac{1}{tfn+1}\big(tfn\cdot\log_2\frac{N+1}{n_t+0.5}\big)
\end{equation}
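I(n)L2 is the simplest of these models to sketch in code. The function name below is hypothetical, and tfn is again assumed to be computed beforehand. The example values are chosen so that the logarithm comes out as a round number.

```python
import math


def inl2(tfn, N, n_t):
    """I(n)L2 weight: tfn / (tfn + 1) * log2((N + 1) / (n_t + 0.5))."""
    return tfn / (tfn + 1) * math.log2((N + 1) / (n_t + 0.5))


# (1535 + 1) / (1 + 0.5) = 1024, so the logarithm is 10 and the weight is 0.5 * 10.
print(inl2(tfn=1.0, N=1535, n_t=1))

# A term found in fewer documents receives a higher weight.
print(inl2(tfn=3.0, N=10000, n_t=10) > inl2(tfn=3.0, N=10000, n_t=1000))
```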
I(F)B2
The LaTeX source code of the I(F)B2 model is as follows:
\begin{equation}\label{eIFB2}
\frac{F+1}{n_t\cdot(tfn+1)}\big(tfn\cdot\log_2\frac{N+1}{F+0.5}\big)
\end{equation}
In(exp)B2
The LaTeX source code of the In(exp)B2 model is as follows:
\begin{equation}\label{eInexpB2}
\frac{F+1}{n_t\cdot(tfn+1)}\big(tfn\cdot\log_2\frac{N+1}{n_e+0.5}\big)
\end{equation}
In(exp)C2
The LaTeX source code of the In(exp)C2 model is as follows:
\begin{equation}\label{eInexpC2}
\frac{F+1}{n_t\cdot(tfn_e+1)}\big(tfn_e\cdot\log_2\frac{N+1}{n_e+0.5}\big)
\end{equation}
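In(exp)C2 combines two quantities from the Notations section: n_e and the natural-logarithm normalisation tfn_e. The sketch below reads n_e as N·(1-(1-n_t/N)^F), with F as an exponent, which is an interpretation of the notation given there. All function names are hypothetical.

```python
import math


def expected_doc_freq(N, n_t, F):
    """n_e, read as N * (1 - (1 - n_t / N) ** F)."""
    return N * (1 - (1 - n_t / N) ** F)


def normalisation2e(tf, l, avg_l, c=1.0):
    """tfn_e: normalisation 2 with the natural logarithm."""
    return tf * math.log(1 + c * avg_l / l)


def inexp_c2(tf, l, avg_l, N, F, n_t, c=1.0):
    """In(exp)C2 weight as given by the formula above."""
    n_e = expected_doc_freq(N, n_t, F)
    tfn_e = normalisation2e(tf, l, avg_l, c)
    return (F + 1) / (n_t * (tfn_e + 1)) * (
        tfn_e * math.log2((N + 1) / (n_e + 0.5))
    )


# Made-up statistics; the document has exactly average length.
print(inexp_c2(tf=3, l=100, avg_l=100, N=10000, F=50, n_t=30))
```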
Notations
tf is the within-document frequency of t in d.
avg_l is the average document length in the collection.
l is the document length of d, which is the number of tokens in d.
N is the number of documents in the whole collection.
F is the term frequency of t in the whole collection.
n_t is the document frequency of t.
tfn is the normalised term frequency. It is given by normalisation 2:
The LaTeX source code of normalisation 2 is as follows:
\begin{equation}\label{eNormalisation2}
tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})
\end{equation}
where c is a free parameter.
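The effect of normalisation 2 is easiest to see numerically. In the sketch below (the function name and the default c = 1 are choices made for this example only), a document of exactly average length keeps tfn equal to tf when c = 1, because log2(1 + 1) = 1. Longer documents have their term frequency scaled down and shorter ones scaled up.

```python
import math


def normalisation2(tf, l, avg_l, c=1.0):
    """tfn = tf * log2(1 + c * avg_l / l)."""
    return tf * math.log2(1 + c * avg_l / l)


print(normalisation2(tf=3, l=100, avg_l=100))  # 3.0: average-length document
print(normalisation2(tf=3, l=400, avg_l=100))  # below 3: long document
print(normalisation2(tf=3, l=50, avg_l=100))   # above 3: short document
```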
tfn_e is also a normalised term frequency. It is given by a modified version of normalisation 2 that uses the natural logarithm:
The LaTeX source code of the modified normalisation 2 is as follows:
\begin{equation}\label{eNormalisation2e}
tfn_e=tf\cdot\log_e(1+c\cdot\frac{avg\_l}{l})
\end{equation}
where c is a free parameter.
λ is the mean and variance of a Poisson distribution. It is given by F/N, where F is assumed to be much smaller than N.
n_e is given by:
\begin{equation}\label{eNe}
n_e=N\cdot\big(1-(1-\frac{n_t}{N})^F\big)
\end{equation}
The relation f is given by the Stirling formula:
\begin{equation}\label{eStirling}
f(n,m)=(m+0.5)\cdot\log_2\frac{n}{m}+(n-m)\cdot\log_2 n
\end{equation}
