Cross Language Information Retrieval (CLIR) retrieves information across languages using traditional IR methods. Information is retrieved either from a single crosslingual collection (centralised CLIR) or from a variety of crosslingual sources (distributed CLIR). In either case, the results are merged into a single multilingual list. Several merging strategies exist: raw score merging selects documents according to their similarity scores, normalised merging first normalises the similarity scores and then merges the documents accordingly, and so on. Strict merging schemes prefer results common to both routes but, in the absence of a common result, accept the terms from both routes.
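
The difference between raw score merging and normalised merging can be illustrated with a minimal sketch. It assumes each per-language run is a list of (document id, similarity score) pairs, and it uses min-max scaling as the normalisation step; the text above does not say which normalisation is meant, so that choice, like the function names, is an assumption made for illustration only.

```python
from typing import List, Tuple

# A retrieval hit: (document id, similarity score). Hypothetical representation.
Result = Tuple[str, float]


def raw_score_merge(result_lists: List[List[Result]]) -> List[Result]:
    """Pool all hits and sort by their original similarity scores."""
    merged = [hit for results in result_lists for hit in results]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)


def min_max_normalise(results: List[Result]) -> List[Result]:
    """Rescale the scores of one result list to [0, 1] (assumed normalisation)."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal, so every hit gets the same normalised score.
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]


def normalised_merge(result_lists: List[List[Result]]) -> List[Result]:
    """Normalise each list separately, then merge on the normalised scores."""
    return raw_score_merge([min_max_normalise(r) for r in result_lists])


# Toy runs whose scores live on different scales (invented data).
english_run = [("en-1", 12.0), ("en-2", 9.0), ("en-3", 3.0)]
french_run = [("fr-1", 0.9), ("fr-2", 0.6), ("fr-3", 0.1)]

print([doc for doc, _ in raw_score_merge([english_run, french_run])])
print([doc for doc, _ in normalised_merge([english_run, french_run])])
```

With these invented scores, raw score merging ranks every English hit above every French hit simply because the English scores are on a larger scale, whereas the normalised merge interleaves the two runs.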

Various CLIR approaches have been developed, which translate the queries, the documents, or both. Translation can be direct (from the source language to the target language), indirect (from the source language to a pivot language and then to the target language), or triangulated (translation to and from three languages, followed by fusion of the results). It is more economical to translate only the queries, in which case the text can be searched in the target language. However, combining query translation with document translation yields the best results.
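
As a rough illustration of direct versus indirect (pivot) query translation, the sketch below translates query terms by simple dictionary lookup. The lexica are tiny invented French, English, and German word lists, and the function names are hypothetical; a real system would use proper translation resources and some form of disambiguation.

```python
from typing import Dict, List

# A bilingual lexicon: source term -> list of candidate translations (toy data).
Lexicon = Dict[str, List[str]]


def translate_terms(terms: List[str], lexicon: Lexicon) -> List[str]:
    """Replace each term by all of its candidate translations.

    Terms missing from the lexicon are passed through unchanged.
    """
    translated: List[str] = []
    for term in terms:
        translated.extend(lexicon.get(term, [term]))
    return translated


def direct_translation(query: List[str], source_to_target: Lexicon) -> List[str]:
    """Source language -> target language in a single step."""
    return translate_terms(query, source_to_target)


def pivot_translation(
    query: List[str], source_to_pivot: Lexicon, pivot_to_target: Lexicon
) -> List[str]:
    """Source language -> pivot language -> target language."""
    pivot_terms = translate_terms(query, source_to_pivot)
    return translate_terms(pivot_terms, pivot_to_target)


# Invented toy lexica: French -> German directly, and via English as the pivot.
fr_de = {"livre": ["Buch"], "ancien": ["alt"]}
fr_en = {"livre": ["book", "pound"], "ancien": ["old", "former"]}
en_de = {"book": ["Buch"], "pound": ["Pfund"], "old": ["alt"], "former": ["ehemalig"]}

query = ["livre", "ancien"]
print(direct_translation(query, fr_de))
print(pivot_translation(query, fr_en, en_de))
```

In this toy setting the pivot route returns more candidate terms than the direct route, because every ambiguous pivot term is expanded again in the second step.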

Machine Translation (MT) systems, Machine-Readable Dictionaries (MRDs), parallel or aligned multilingual corpora, and statistical lexica are some of the resources used to translate the selected input. Unfortunately, the availability of these resources depends on the popularity of the language in question. Generally, minority languages are seriously underrepresented in terms of language resources. The performance of CLIR applications may be significantly enhanced by the use of Natural Language Processing (NLP) tools, such as language identification applications, tokenisers/chunkers, morphological analysers, normalisers/guessers, stemmers/lemmatisers, Part-Of-Speech disambiguators, shallow/deep parsers, terminology extractors, Noun Phrase (NP) extractors, idiom recognisers, semantic disambiguators, contextual dictionary lookup tools, comprehension assistants, translation/authoring assistants, summarisers, and so on. It should be noted, however, that CLIR applications need not pursue exact translations or parses. Unlike MT technology, where the aim is to generate the best possible translation, CLIR technology needs to retrieve documents according to their relevance. Translation in CLIR is a means towards the best possible retrieval.

Although most CLIR research has focused on written text, the field also covers speech, sign language, multimedia, and other modalities.

The state of the art in CLIR research and development is presented and discussed annually at the Cross Language Evaluation Forum (CLEF).


last edited 2007-03-26 10:56:15 by IadhOunis