Computers & Writing Systems
Unicode Character Count utility
Updated 2005-11-16: New sorting options -f and -r added.

UnicodeCCount is a quick-and-dirty Unicode-aware replacement for CCount, the character count utility. Written in Perl, the program is available both as the Perl source (requires Perl 5.8.1 or newer) and as a stand-alone Windows EXE.

Syntax

UnicodeCCount is a command line utility. When executed without any parameters it emits a short help message:

    Usage: UnicodeCCount [-e encoding] [-o outputfile] [-c|-d] [-m] [-u|-f] [-r] file ...
           UnicodeCCount -l

    A quick and dirty character counter that understands various encodings.
    Input defaults to utf8, but you can choose other encodings with -e.
    Data is converted from the specified encoding to Unicode as it is read,
    and the output data is always utf-8.

      -l        outputs a list of available encodings.
      -c or -d  enforce Unicode normalization (NFC or NFD) as data is read.
      -m        combining mark sequences (base + diacritics) counted separately.
      -u        use the Unicode Collation Algorithm (UCA) rather than the default sort.
      -f        sort by frequency
      -r        reverse sort order

    Version 0.3

Example

Suppose I have the first paragraph of the Russian translation of the Universal Declaration of Human Rights in a plain-text UTF-8 file called mytext.txt.
The text looks like this:

    Принимая во внимание, что признание достоинства, присущего всем членам человеческой семьи, и равных и неотъемлемых прав их является основой свободы, справедливости и всеобщего мира; и

Then the following command:

    UnicodeCCount mytext.txt >counts.txt

(note the redirection of standard out in order to capture the output to a file) would result in the following UTF-8 data in counts.txt:

    Character count for 'mytext.txt':
    U+000A	1
    U+000D	1
    U+0020	25
    U+002C	,	4
    U+003B	;	1
    U+041F	П	1
    U+0430	а	9
    U+0431	б	2
    U+0432	в	13
    U+0433	г	2
    U+0434	д	3
    U+0435	е	16
    U+0437	з	1
    U+0438	и	17
    U+0439	й	2
    U+043A	к	1
    U+043B	л	5
    U+043C	м	8
    U+043D	н	10
    U+043E	о	16
    U+043F	п	4
    U+0440	р	7
    U+0441	с	12
    U+0442	т	6
    U+0443	у	1
    U+0445	х	3
    U+0447	ч	4
    U+0449	щ	2
    U+044A	ъ	1
    U+044B	ы	3
    U+044C	ь	1
    U+044F	я	4
    U+FEFF	1

Note that the output is tab-separated.
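For readers who want to see the idea behind the tab-separated report in the example, here is a minimal Python sketch of a code point counter. It is not the UnicodeCCount source (which is Perl) and makes no claim to match its behavior exactly. The function names count_characters and format_counts are hypothetical, the empty character column for non-printable code points is an assumption, and so is the ascending order used for the frequency sort.

```python
import unicodedata
from collections import Counter


def count_characters(text, normalization=None):
    """Count code points in text, optionally normalizing to 'NFC' or 'NFD' first."""
    if normalization is not None:
        text = unicodedata.normalize(normalization, text)
    return Counter(text)


def format_counts(counts, by_frequency=False, reverse=False):
    """Return tab-separated report lines: code point, character, count."""

    def by_code_point(item):
        # Single-character strings compare by code point.
        return item[0]

    def by_count(item):
        # Assumption: ascending count, ties broken by code point.
        return (item[1], item[0])

    key = by_count if by_frequency else by_code_point
    lines = []
    for char, n in sorted(counts.items(), key=key, reverse=reverse):
        # Assumption: leave the character column empty for whitespace and
        # non-printable code points such as U+000A or U+FEFF.
        shown = char if char.isprintable() and not char.isspace() else ""
        lines.append(f"U+{ord(char):04X}\t{shown}\t{n}")
    return lines


# A fragment of the sample text from the example.
sample = "и равных и неотъемлемых прав"
for line in format_counts(count_characters(sample)):
    print(line)
```

Passing normalization="NFD" splits precomposed letters into a base character plus combining marks before counting, which is roughly the effect the help text describes for -d. Counting whole combining sequences and UCA-based sorting are left out of this sketch.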
Related Resources

LetterMeter, text analysis tool (for Mac OS X only)

Support

As this program is distributed at no cost, I am unable to provide a commercial level of personal technical support. I am interested in hearing from you, however, and will try to resolve problems that are reported to me.

© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.