Description
Language identifiers are programs that guess which language a document is written in.
Papers
N-Gram-Based Text Categorization (1994) - William B. Cavnar, John M. Trenkle - http://citeseer.ist.psu.edu/68861.html
Language Identification in Web Pages - Bruno Martins, Mário J. Silva - http://xldb.di.fc.ul.pt/index.php?page=Publications
Programs
TextCat
TextCat is a Perl program for classifying documents by language, based on the "N-Gram-Based Text Categorization" paper.
http://odur.let.rug.nl/~vannoord/TextCat/
For identification, it computes a score for each language against the document, and then selects every language whose score is within 5% of the best one. In the examples below, the best score is the lowest one.
For example:
scots 62261
english 63167
frisian 65575
middle_frisian 65638
swedish 65857
afrikaans 66352
dutch 66563
estonian 66918
italian 67060
latin 67232
irish 67321
manx 67442
rumantsch 67577
The cutoff is 62261 * 1.05 = 65374 (scots: 62261 < english: 63167 < cutoff: 65374 < frisian: 65575 < ...), so scots and english are selected, but frisian is not.
Example for a document that's only English:
english 103768
catalan 118208
french 118922
latin 119014
The cutoff is 103768 * 1.05 = 108956, so only english is selected.
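The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not TextCat's actual Perl code: the function name select_languages and the threshold parameter are hypothetical, and treating a score exactly equal to the cutoff as selected is an assumption.

```python
def select_languages(scores, threshold=1.05):
    """Return the languages whose score is within `threshold` times the best score.

    `scores` maps a language name to its score; lower is better, as in the
    examples above. The name and the `threshold` parameter are illustrative
    assumptions, not part of TextCat itself.
    """
    best = min(scores.values())
    cutoff = best * threshold
    # Sort by score so the result is ordered from best to worst match.
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [lang for lang, score in ranked if score <= cutoff]


# Scores taken from the two examples above (first list truncated).
mixed = {
    "scots": 62261,
    "english": 63167,
    "frisian": 65575,
    "middle_frisian": 65638,
    "swedish": 65857,
}
english_only = {
    "english": 103768,
    "catalan": 118208,
    "french": 118922,
    "latin": 119014,
}

print(select_languages(mixed))         # ['scots', 'english']
print(select_languages(english_only))  # ['english']
```

Raising the threshold makes the selection more permissive: with threshold=1.06 the first example would also include frisian, middle_frisian and swedish, since the cutoff becomes roughly 65997.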
TCatNG
http://tcatng.sourceforge.net/
The TCatNG Toolkit is a Java package based on the "N-Gram-Based Text Categorization" paper, with several improvements: "This package also implements some extensions to the original proposal. Among other things, the software offers support for Good-Turing smoothing and new fingerprint comparison methods based on the similarity metrics proposed by Lin in "An information-theoretic definition of similarity" and Jiang & Conrath in "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy". Other classification methods besides nearest neighbour are also implemented, such as Support Vector Machines or Bayesian Logistic Regression."
BSD License.