Language Identifiers are programs that guess the language that a document is written in.




TextCat is a Perl program for language identification, based on the paper "N-Gram-Based Text Categorization" by Cavnar and Trenkle.

To identify the language of a document, it computes a score for every language it knows (a lower score means a closer match) and then selects every language whose score is within 5% of the best one.
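
The scoring step can be illustrated with a rough sketch of the rank-based "out-of-place" measure described in the paper. This is not TextCat's actual Perl code: the function names, the underscore padding, the n-gram length limit and the profile size (max_n, top_k) are all assumptions chosen for illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Map the top_k most frequent character n-grams to their rank (0 = most frequent).

    max_n and top_k are illustrative defaults, not TextCat's settings.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # assumed padding to mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_penalty=None):
    """Sum of rank differences between two profiles; lower means more similar.

    N-grams missing from the language profile cost max_penalty each.
    """
    if max_penalty is None:
        max_penalty = len(lang_profile)
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total
```

Under these assumptions, a document would be scored against one stored profile per language, giving a list of scores of the kind shown in the examples on this page.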

For example:

scots    62261
english  63167
frisian  65575
middle_frisian   65638
swedish  65857
afrikaans        66352
dutch    66563
estonian         66918
italian  67060
latin    67232
irish    67321
manx     67442
rumantsch        67577

The cutoff is 62261 * 1.05 = 65374 (scots: 62261 < english: 63167 < cutoff: 65374 < frisian: 65575 < ...). So scots and english are selected, but frisian is not.

Example for a document that's only English:

english  103768
catalan  118208
french   118922
latin    119014

103768 * 1.05 = 108956, and the runner-up (catalan: 118208) is above that cutoff, so only english is selected.
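
The 5% rule from both examples can be written as a small sketch. The function name select_languages and the margin parameter are made up for illustration, and whether TextCat treats a score exactly on the cutoff as selected is an assumption here.

```python
def select_languages(scores, margin=1.05):
    """Return the languages whose score is within `margin` of the best (lowest) score."""
    cutoff = min(scores.values()) * margin
    ranked = sorted(scores.items(), key=lambda item: item[1])
    # Assumption: a score exactly equal to the cutoff still counts as selected.
    return [lang for lang, score in ranked if score <= cutoff]

# Scores copied from the two examples above (first list shortened).
mixed = {"scots": 62261, "english": 63167, "frisian": 65575,
         "middle_frisian": 65638, "swedish": 65857}
english_only = {"english": 103768, "catalan": 118208,
                "french": 118922, "latin": 119014}

print(select_languages(mixed))         # ['scots', 'english']
print(select_languages(english_only))  # ['english']
```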


The TCatNG Toolkit is a Java package that is also based on the "N-Gram-Based Text Categorization" paper, but adds some improvements. From its description: "This package also implements some extensions to the original proposal. Among other things, the software offers support for Good-Turing smoothing and new fingerprint comparison methods based on the similarity metrics proposed by Lin in 'An information-theoretic definition of similarity' and Jiang & Conrath in 'Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy'. Other classification methods besides nearest neighbour are also implemented, such as Support Vector Machines or Bayesian Logistic Regression."

TCatNG is released under the BSD License.

last edited 2007-03-27 11:34:10 by ErikGraf