QuerySimulation

Our query simulation method [He & Ounis, ECIR2005] is inspired by query-based sampling (Callan & Connell, 2001). The two approaches differ in how they use the top-ranked documents: our method applies a term weighting model to them to extract the most informative terms and formulate a query, whereas query-based sampling uses them to obtain various samples of the collection. Our query simulation method proceeds as follows:

  1. Randomly choose a seed-term from the vocabulary.

  2. Rank the documents containing the seed-term using a specific document weighting function, e.g. PL2 or BM25.

  3. Extract the X-1 most informative terms from the Y top-ranked documents using a specific term weighting model. X is the number of terms in the simulated query (see Step 6), and Y is a parameter of the query simulation method. Any term weighting model from the literature can be used at this stage, e.g. the Bo1 DFR term weighting model.

  4. To avoid using a junk term as the seed-term, take the most informative of the terms extracted in Step 3 as the new seed-term. Note that the original seed-term is discarded at this stage.

  5. Repeat Steps 2 and 3 with the new seed-term: rank the documents containing it, then extract the X-1 most informative terms from the Y top-ranked documents.

  6. The simulated query consists of the new seed-term and the X-1 terms extracted in Step 5.
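
The six steps above can be sketched on a small in-memory collection as follows. This is only an illustrative sketch, not the implementation used in the cited work. The function names (rank_documents, informative_terms, simulate_query) are hypothetical, relative term frequency stands in for a document weighting function such as PL2 or BM25, and a TF-IDF-like score stands in for a term weighting model such as Bo1. Excluding the current seed-term from the extracted terms, and falling back to the original seed-term when nothing can be extracted, are also assumptions of this sketch.

```python
import math
import random
from collections import Counter


def rank_documents(docs, term):
    """Rank the documents containing `term`.

    Stand-in weighting (assumption): relative term frequency,
    used here in place of PL2 or BM25.
    """
    scored = []
    for doc_id, tokens in enumerate(docs):
        tf = tokens.count(term)
        if tf:
            scored.append((tf / len(tokens), doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def informative_terms(docs, top_ids, exclude, n):
    """Return up to `n` terms from the top-ranked documents.

    Stand-in term weighting (assumption): a TF-IDF-like score,
    used here in place of a model such as Bo1.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    tf = Counter()
    for doc_id in top_ids:
        tf.update(docs[doc_id])
    scores = {
        term: freq * math.log(1 + len(docs) / df[term])
        for term, freq in tf.items()
        if term != exclude  # assumption: the seed-term itself is not extracted
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


def simulate_query(docs, x, y, rng):
    """Simulate a query of up to `x` terms using the `y` top-ranked documents."""
    vocabulary = sorted({t for tokens in docs for t in tokens})
    seed = rng.choice(vocabulary)                            # Step 1
    top = rank_documents(docs, seed)[:y]                     # Step 2
    candidates = informative_terms(docs, top, seed, x - 1)   # Step 3
    if not candidates:
        return [seed]  # assumption: fall back when nothing is extracted
    new_seed = candidates[0]                                 # Step 4
    top = rank_documents(docs, new_seed)[:y]                 # Step 5
    extra = informative_terms(docs, top, new_seed, x - 1)
    return [new_seed] + extra                                # Step 6


if __name__ == "__main__":
    toy_docs = [
        "information retrieval ranks documents for a query".split(),
        "query simulation builds a query from informative terms".split(),
        "term weighting models score informative terms in documents".split(),
    ]
    print(simulate_query(toy_docs, x=3, y=2, rng=random.Random(0)))
```

Passing the random number generator explicitly makes a run repeatable: the same collection, X, Y and generator seed always produce the same simulated query.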