Language identifiers are programs that guess the language a document is written in.
N-Gram-Based Text Categorization (1994) - William B. Cavnar, John M. Trenkle - http://citeseer.ist.psu.edu/68861.html
Language Identification in Web Pages - Bruno Martins, Mário J. Silva - http://xldb.di.fc.ul.pt/index.php?page=Publications
TextCat is a Perl program for identifying the language of a document, based on the "N-Gram-Based Text Categorization" paper: http://odur.let.rug.nl/~vannoord/TextCat/
For identification, it computes a score for every language against the document (a lower score is a closer match), and then selects every language whose score is within 5% of the best one. For example, with these scores:
scots 62261 english 63167 frisian 65575 middle_frisian 65638 swedish 65857 afrikaans 66352 dutch 66563 estonian 66918 italian 67060 latin 67232 irish 67321 manx 67442 rumantsch 67577
62261 * 1.05 = 65374 (scots: 62261 < english: 63167 < boundary: 65374 < frisian: 65575 < ...), so scots and english are selected, but frisian is not.
Example for a document that's only English:
english 103768 catalan 118208 french 118922 latin 119014
103768*1.05 = 108956, so only english is selected.
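The selection rule in the two examples above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not TextCat's actual code: the function name select_candidates, the threshold parameter, and the choice to include a score exactly on the boundary are all assumptions made for this sketch.

```python
def select_candidates(scores, threshold=1.05):
    """Return the languages whose score is within `threshold` times the
    best (lowest) score, ordered from best to worst.

    Hypothetical helper for illustration; lower scores mean a closer match.
    """
    if not scores:
        return []
    best = min(scores.values())
    cutoff = best * threshold  # e.g. 62261 * 1.05 = 65374.05
    ranked = sorted(scores.items(), key=lambda item: item[1])
    # Assumption: a score exactly equal to the cutoff is still accepted.
    return [lang for lang, score in ranked if score <= cutoff]


# Scores taken from the two examples above (first few languages only).
mixed = {
    "scots": 62261,
    "english": 63167,
    "frisian": 65575,
    "middle_frisian": 65638,
    "swedish": 65857,
}
english_only = {
    "english": 103768,
    "catalan": 118208,
    "french": 118922,
    "latin": 119014,
}

print(select_candidates(mixed))         # ['scots', 'english']
print(select_candidates(english_only))  # ['english']
```

A larger threshold makes the identifier more willing to report several candidate languages; a threshold of 1.0 keeps only the languages tied for the best score.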
http://tcatng.sourceforge.net/ The TCatNG Toolkit is a Java package that is also based on the "N-Gram-Based Text Categorization" paper, but adds improvements: "This package also implements some extensions to the original proposal. Among other things, the software offers support for Good-Turing smoothing and new fingerprint comparison methods based on the similarity metrics proposed by Lin in "An information-theoretic definition of similarity" and Jiang & Conrath in "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy". Other classification methods besides nearest neighbour are also implemented, such as Support Vector Machines or Bayesian Logistic Regression."