langstats

76749e6 · May 11, 2019

Name	Last commit message	Last commit date
czech	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
finnish	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
french	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
german	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
greek	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
hungarian	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
polish	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
russian	Add files via upload	May 11, 2019
spanish	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
swedish	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
turkish	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
ukrainian	Add files via upload	May 11, 2019
Makefile.am	typo in Makefile.am	Jan 19, 2015
mkcharstats.cpp	add more languages detection functions from https://bitbucket.org/med…	Jan 19, 2015
mkpairmodel.py	Add files via upload	May 11, 2019
readme.txt	fix info about 'sort' parameters	May 11, 2019

readme.txt

Programs and data to determine bigram frequencies, for extending
Mozilla libcharsetdetect to other languages (for the "Two-Char Sequence
Distribution Method").
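
To make "bigram frequencies" concrete: in a single-byte charset every
character is one byte, so a bigram is simply a pair of adjacent byte values.
The sketch below is only an illustration of that idea, not code from this
directory; the function name bigram_counts and the default charset are
assumptions made for the example.

```python
from collections import Counter


def bigram_counts(text: str, charset: str = "cp1252") -> Counter:
    """Count adjacent byte pairs after encoding text to a single-byte charset.

    Hypothetical helper for illustration; it is not part of langstats.
    """
    data = text.encode(charset)
    # zip(data, data[1:]) yields every pair of consecutive byte values.
    return Counter(zip(data, data[1:]))


if __name__ == "__main__":
    counts = bigram_counts("le chat et le chien")
    for (first, second), n in counts.most_common(3):
        print(f"0x{first:02x} 0x{second:02x} {n}")
```

Pairs that are frequent in one language/charset combination and rare in
others are what make this kind of statistic useful for detection.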

Steps:
 - Choose a language/charset pair (e.g. french/cp1252)

 - Assemble a big chunk of text in the appropriate language and charset
   (fetch from ebooks, wikipedia, whatever, use iconv as needed)

 - Produce a character frequency table by running mkcharstats on the chunk, as:
   mkcharstats french/french_cp1252.txt | sort -nr +2 > \
         french/charstats_french_cp1252.txt
   or (for versions of sort that reject the obsolete +2 key syntax)
   mkcharstats french/french_cp1252.txt | sort -nr -k3 > \
         french/charstats_french_cp1252.txt

 - Edit the resulting file: just remove the few lines that break the
   following step (the first one, the last one, and the one for space, 0x20)

 - Run mkpairmodel.py to produce the C++ language model. It works in two
   phases: first it produces a correspondence table from code point to order
   in the frequency list, then a 64x64 table listing the pair frequencies for
   the 64 most common characters:
   
   mkpairmodel.py french/charstats_french_cp1252.txt \
                  french/french_cp1252.txt             > LangFrenchModel.cpp

 - Integrate with the library's C++ code (3 files to change to resize the
   array and declare/define the tables: nsSBCharSetProber.h,
   nsSBCSGroupProber.cpp, nsSBCSGroupProber.h)
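
The two phases described for mkpairmodel.py can be sketched as follows. This
is a minimal illustration under stated assumptions, not the actual script: it
recomputes the frequency ranking from the raw text instead of reading the
charstats file (whose format is not described here), it keeps raw counts, and
the names build_order_map, build_pair_table and SAMPLE_SIZE are hypothetical.

```python
from collections import Counter

SAMPLE_SIZE = 64  # the "64 most common characters" from the steps above


def build_order_map(data: bytes, skip: bytes = b" ") -> dict:
    """Phase 1: map each byte value to its frequency rank (0 = most common).

    Bytes listed in `skip` are ignored, mirroring the manual removal of the
    space (0x20) line from the charstats file.
    """
    freq = Counter(b for b in data if b not in skip)
    ranked = [b for b, _ in freq.most_common()]
    return {b: rank for rank, b in enumerate(ranked)}


def build_pair_table(data: bytes, order: dict) -> list:
    """Phase 2: count adjacent pairs whose bytes both rank below SAMPLE_SIZE."""
    table = [[0] * SAMPLE_SIZE for _ in range(SAMPLE_SIZE)]
    for first, second in zip(data, data[1:]):
        # Bytes missing from the order map are treated as out of range.
        r1 = order.get(first, SAMPLE_SIZE)
        r2 = order.get(second, SAMPLE_SIZE)
        if r1 < SAMPLE_SIZE and r2 < SAMPLE_SIZE:
            table[r1][r2] += 1
    return table


if __name__ == "__main__":
    sample = "le chat et le chien sont dans le jardin".encode("cp1252")
    order = build_order_map(sample)
    table = build_pair_table(sample, order)
    print(sum(map(sum, table)), "pairs counted among the top characters")
```

Indexing the table by rank rather than by byte value is what keeps it at
64x64 regardless of which code points the charset assigns to the language's
common letters.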
