Hermetic Word Frequency Counter: Support for UTF-8 Files

For text to be represented by a sequence of bytes in a file, it must be encoded in some way. On Windows PCs the usual (and default) encoding is loosely called ANSI, and consists of over 200 single bytes, each associated with a particular character (letters, punctuation marks, etc.). 95 of these are the printable characters of the ASCII character set, which are the same on all PCs. Bytes with values of 128 and above are associated with characters that depend on the locale of the PC (that is, whether it's a PC in England, Poland, Turkey, etc.). Thus "ANSI" is not the name of one particular encoding; it means whatever the default encoding on your computer is. The character set (200 or so bytes) used on most PCs in Western countries is the encoding known as Windows-1252, which can encode text in English, German, Spanish, etc. (see Non-English Text). For most languages Windows-1252 is not suitable for encoding text, and some other character set is needed. Hermetic Word Frequency Counter properly handles only text which has been, or could be, encoded using Windows-1252.
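To illustrate the point about locale, the following minimal Python sketch decodes the same four bytes under three Windows code pages (Windows-1252 for Western Europe, Windows-1250 for Central Europe, Windows-1254 for Turkish). It is not part of the program; the names RAW and code_points are made up for this example. The ASCII byte decodes identically everywhere, while the bytes with values above 127 do not.

```python
# Illustrative only: the same bytes mean different characters in different
# "ANSI" code pages. RAW and code_points are hypothetical names.
RAW = bytes([0x41, 0xE6, 0xF0, 0xFE])  # one ASCII byte, three bytes above 127


def code_points(raw: bytes, codec: str) -> list:
    """Decode raw with the given codec and list the Unicode code points."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(codec)]


for codec in ("cp1252", "cp1250", "cp1254"):
    # Code points are printed instead of the characters themselves so the
    # output does not depend on the console's own encoding.
    print(codec, code_points(RAW, codec))
```

The first entry is U+0041 (the letter A) in every case; the remaining entries differ between code pages.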
Text such as English, German, etc., which can be encoded via ANSI can also be encoded in other ways. Often it is encoded using the method known as UTF-8 (where "UTF" = "Unicode Transformation Format"). HTML and XML files need not use UTF-8 encoding, but often do. This program supports files which are UTF-8 encoded, provided that the text (or more exactly, all the words in the text) could also be encoded via Windows-1252.
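The condition that UTF-8 text "could also be encoded via Windows-1252" can be checked mechanically. The sketch below is one way to do so in Python, under the assumption that a simple round trip through the cp1252 codec is an adequate test; the function name fits_windows_1252 is hypothetical and nothing here is taken from the program itself. Note that it tests the whole text, whereas the program only requires that the words be encodable.

```python
def fits_windows_1252(utf8_bytes: bytes) -> bool:
    """Return True if UTF-8 encoded text can be re-encoded as Windows-1252.

    Raises UnicodeDecodeError if the input is not valid UTF-8.
    """
    text = utf8_bytes.decode("utf-8")
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        # At least one character has no Windows-1252 equivalent.
        return False
    return True


german = "Straße über Köln".encode("utf-8")
polish = "Łódź".encode("utf-8")

print(fits_windows_1252(german))  # German letters are all in Windows-1252
print(fits_windows_1252(polish))  # "Ł" and "ź" are not
```

The German sample passes and the Polish sample does not, which matches the statement that Windows-1252 covers Western European languages only.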
When you specify an input file, this program tries to ascertain what kind of file it is. If it is a binary file (such as an MS Word .doc file), or an RTF file, or is encoded via the double-byte format UCS-2, then it cannot be scanned. It can be scanned only if the text is encoded via ANSI or UTF-8. In the latter case the UTF-8 checkbox is automatically checked.

The program will then treat the text as UTF-8 encoded. You can override this by unchecking the UTF-8 checkbox, but this is not advised. Alternatively, if you believe the text is UTF-8 encoded, but the program does not automatically recognize it as such, and so does not check the checkbox, then you can do so manually before scanning the file.
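How the program recognizes the file type is not documented here. The following Python sketch shows one plausible heuristic, purely as an assumption: byte-order marks and the RTF signature are checked first, NUL bytes are taken as a sign of binary data, and the rest is classified by whether it decodes as UTF-8. The function name classify_bytes and its return labels are hypothetical.

```python
import codecs


def classify_bytes(data: bytes) -> str:
    """Guess the file type from its raw bytes (heuristic, not the program's own)."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "ucs-2"      # double-byte text with a byte-order mark
    if data.startswith(b"{\\rtf"):
        return "rtf"        # RTF files begin with {\rtf
    if b"\x00" in data:
        return "binary"     # NUL bytes do not occur in ANSI or UTF-8 text
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "ansi"       # bytes above 127 that are not valid UTF-8
    if all(b < 0x80 for b in data):
        return "ansi"       # pure ASCII is identical in both encodings
    return "utf-8"


print(classify_bytes("café".encode("utf-8")))   # multi-byte sequence present
print(classify_bytes("café".encode("cp1252")))  # lone byte 0xE9
```

A heuristic like this can fail in both directions, for example on UCS-2 text without a byte-order mark, or on ANSI text whose high bytes happen to form valid UTF-8 sequences. That is one reason a manual checkbox is useful.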
The words in the common words file can also be UTF-8 encoded, but there is no reason to do that.
The output file is always ANSI-encoded (and, as explained elsewhere, can be read into Excel).
