Hermetic Word Frequency Counter

Counts Frequencies of Different Words in a File

Hermetic Word Frequency Counter scans an MS Word DOCX file or a text or text-like file (including HTML and XML files encoded via ANSI or UTF-8) and counts the number of occurrences of each distinct word, optionally ignoring common words such as "the" and "this". It can therefore also be used as a word-search program. You can specify exactly what counts as a word (e.g., words with or without hyphens or numerals). The words found can be listed alphabetically or by frequency, with rank and frequency count displayed for each word.
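The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual implementation: the stop list COMMON_WORDS, the function names, the ASCII-only definition of a word, and the way ties in rank are broken (alphabetically) are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop list; the program's real list of common words is not shown here.
COMMON_WORDS = {"the", "this", "a", "an", "and", "of", "to", "in", "is"}

def word_pattern(allow_hyphens=True, allow_numerals=False):
    """Build a regex for one 'word' (an assumed, ASCII-only definition)."""
    chars = "A-Za-z" + ("0-9" if allow_numerals else "")
    core = f"[{chars}]+"
    if allow_hyphens:
        # A hyphen counts only when it joins two runs of word characters.
        return re.compile(f"{core}(?:-{core})*")
    return re.compile(core)

def count_words(text, ignore_common=True, case_sensitive=False, **rules):
    """Count occurrences of each distinct word in text."""
    words = word_pattern(**rules).findall(text)
    if not case_sensitive:
        words = [w.lower() for w in words]
    if ignore_common:
        words = [w for w in words if w.lower() not in COMMON_WORDS]
    return Counter(words)

def ranked(counts, by_frequency=True):
    """Return (rank, word, count) rows; rank always reflects frequency order."""
    by_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = [(rank, word, n) for rank, (word, n) in enumerate(by_freq, start=1)]
    if not by_frequency:
        rows.sort(key=lambda row: row[1])  # alphabetical listing
    return rows

sample = "The well-known cat saw this cat and the well-known dog."
for rank, word, n in ranked(count_words(sample)):
    print(rank, word, n)
```

Passing allow_hyphens=False would count "well" and "known" as separate words, which is the kind of choice the word-definition settings control.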

There are two versions of this word count software: basic (WFC) and advanced (WFCA, which does everything that WFC does, including scanning DOCX files). The main differences are that WFC counts words in only one DOCX, text or text-like file at a time, whereas WFCA counts words in multiple files (in multiple folders) in a single operation and also counts phrases. If you need to count words in only one file at a time then WFC may be what you need. If you have many files or need more options and greater functionality, then you need WFCA. Details are given on the WFCA page.

To open a file, click on the "Input file" button and navigate to the desired folder and file. After setting the operation parameters, click on the "Count words" button. Here is a typical screenshot, showing word counts for a 123.75 KB text file, with common words ignored, upper/lower case distinguished, the words sorted by frequency, and the archaic words "hath" and "howbeit" ignored:

Hermetic Word Frequency Counter screenshot #2

The "percentage" value is the number of occurrences of a word divided by the total number of occurrences of all words shown in the list (not all words in the file), expressed as a percentage.
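As a worked illustration of this definition, here is a minimal Python sketch. The function name and the sample counts are invented for the example; the point is only that the denominator is the total for the listed words, not for the whole file.

```python
def percentages(listed_counts):
    """Map each listed word to its share of the listed occurrences, in percent."""
    total = sum(listed_counts.values())
    if total == 0:
        return {}
    return {word: 100.0 * n / total for word, n in listed_counts.items()}

# Invented counts for three listed words (50 listed occurrences in all).
listed = {"light": 30, "water": 15, "earth": 5}
for word, share in percentages(listed).items():
    print(f"{word}: {share:.2f}%")
```

If common words are ignored they do not appear in the list, so they do not contribute to the total and the percentages of the remaining words are correspondingly higher.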

If the "Disable" box (on the same line as "Output file") is checked then output is to the text box only, not to a file.

Here is another screenshot, showing word counts for a 67.37 KB MS Word docx file (the text itself, when unpacked, is 110.21 KB), with common words ignored, upper/lower case not distinguished, and the words sorted alphabetically:

Hermetic Word Frequency Counter screenshot #3

In both cases the process took less than ten seconds (with the "Don't display words as found" checkbox checked).

"Filesize" is the size of the docx file; "Text" is the size of the text within that file. The former is smaller than the latter because the text within the docx file is compressed.
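The size difference can be observed directly, because a docx file is a ZIP archive whose main text is stored in a part named word/document.xml. The following Python sketch compares the two sizes. It is an approximation under stated assumptions: the function name is hypothetical, the uncompressed size measured here includes XML markup as well as the words themselves, and the demonstration archive is a minimal stand-in built in memory, not a valid Word document.

```python
import io
import zipfile

def docx_sizes(docx_bytes):
    """Return (size of the docx file, uncompressed size of its main XML part)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        info = archive.getinfo("word/document.xml")
        return len(docx_bytes), info.file_size

# Build a stand-in archive with a compressible document part.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    body = "<w:t>" + "the quick brown fox " * 500 + "</w:t>"
    archive.writestr("word/document.xml", body)

file_size, text_size = docx_sizes(buffer.getvalue())
print(f"Filesize: {file_size} bytes, Text: {text_size} bytes")
```

For a real file, the bytes would come from open(path, "rb").read() instead of the in-memory buffer.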

In theory there is no limit on the size of an input file or the number of words in it, but in practice (because of the processing time needed) there is a limit of about 10 MB on text files (and text-like files such as XML and HTML files). There is also a limit of about 10 MB on the amount of text in an MS Word docx file (though a docx file can be larger than this if it contains many images). For a docx file, only words in the body of the document are counted, not words in footnotes or endnotes.

ANSI is the single-byte text encoding which is the default encoding on your PC. UTF-8 is a variable-byte-length encoding of Unicode characters, often used in HTML and XML files.
For text and text-like files (including HTML and XML files) the text may be encoded via ANSI or UTF-8. The program does not act directly on binary files such as PDF and MS Word doc files (as distinct from docx files); such files can be scanned if saved as "Plain Text" files (see Scannable Files).
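To illustrate the difference between the two encodings, the sketch below decodes raw bytes as UTF-8 when they are valid UTF-8 and otherwise falls back to a single-byte code page. This is a common heuristic, not necessarily how the program decides. The function name is hypothetical, and cp1252 (Western European Windows) is assumed as the ANSI code page; on a PC with a different default the fallback codec would differ.

```python
def decode_text(raw, ansi_codec="cp1252"):
    """Decode bytes as UTF-8 if valid, else as an assumed ANSI code page."""
    try:
        # "utf-8-sig" also strips a leading byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Single-byte fallback; undefined bytes become U+FFFD.
        return raw.decode(ansi_codec, errors="replace")

utf8_bytes = "café".encode("utf-8")    # 5 bytes: "é" takes two bytes
ansi_bytes = "café".encode("cp1252")   # 4 bytes: "é" takes one byte
print(len(utf8_bytes), len(ansi_bytes))
print(decode_text(utf8_bytes) == decode_text(ansi_bytes))
```

The same word thus has different byte lengths in the two encodings but decodes to the same text, which is what matters for counting.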

The program counts the frequencies of all words in the file (or optionally all words other than common words). If you just want to count the occurrences of a single word (or of each word in a set of words, or of any word matching a given pattern) then you can do this with the Advanced Version of this program.

The 'rank' and 'frequency' values may each be included in, or excluded from, the displayed results.

If the output consists only of words, with no rank or frequency count values, then you can get the words either as a list (one word per line) or as a single comma-separated line. This is done by making the appropriate selection in the "Display format" drop-down menu. Setting it to "words (+ commas)" provides an easy way to get a list of keywords, as in:

code,convert,converter,file,folder,html,software,source,windows
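The two words-only layouts can be sketched as follows. The function name and the format labels "list" and "commas" are hypothetical stand-ins for the menu choices, and the alphabetical ordering is an assumption made for this example.

```python
def format_words(words, display_format="list"):
    """Join distinct words alphabetically, one per line or comma-separated."""
    ordered = sorted(set(words))
    separator = "," if display_format == "commas" else "\n"
    return separator.join(ordered)

found = ["software", "code", "html", "windows", "file", "folder",
         "converter", "convert", "source", "code"]
print(format_words(found, "commas"))
```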

The input file need not consist simply of natural language text, but may be an HTML, XML, PHP or C/C++ file, or may mix natural language with tags such as "<table>".

When processing HTML files, HTML tags such as "<center>" are skipped. When processing XML files all text within "<" and ">" is skipped. PHP files are processed as HTML files in which C-style comments are possible. When processing PHP files, text within "<?php" and "?>" is not skipped.
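The skipping rules above can be approximated with regular expressions. This is a simplified sketch, not the program's actual parser: the names TAG, PHP_BLOCK and visible_text are hypothetical, C-style comments in PHP are not handled, and a bare "<" or ">" in ordinary text would confuse the tag pattern.

```python
import re

TAG = re.compile(r"<[^>]*>")                         # anything between "<" and ">"
PHP_BLOCK = re.compile(r"<\?php(.*?)\?>", re.DOTALL)  # body of a PHP section

def visible_text(markup, php=False):
    """Return the text that would be scanned for words."""
    if not php:
        return TAG.sub(" ", markup)
    # With one capture group, split() puts the PHP bodies at odd indexes.
    parts = PHP_BLOCK.split(markup)
    kept = [part if i % 2 else TAG.sub(" ", part) for i, part in enumerate(parts)]
    return " ".join(kept)

html = "<center>Hello <b>world</b></center>"
page = "<p>Title</p><?php echo $total; ?>"
print(visible_text(html).split())
print(visible_text(page, php=True).split())
```

With php=False the whole "<?php ... ?>" section of the second example falls between "<" and ">" and is dropped, which corresponds to the XML-style rule.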

Hermetic Word Frequency Counter User Manual