Search KWIC Concordance

Hermetic Systems

Windows software for Generating and Searching
a KWIC Concordance of a Document

Download and installation of this software has been temporarily suspended pending revision.

KWIC = "Keywords in Context". A KWIC concordance of a document is a list of the different words occurring in the document, with all instances of each word shown in context, for example:

fragment of KWIC concordance

The Search KWIC Concordance software scans an MS Word DOCX file, an HTML file, an XML file or any ANSI text file, skipping over so‑called stop words (that is, common words, or words to be ignored, as specified by the user) to generate a KWIC concordance. The context size can be set from 0 to 9 words before and after the keyword. The words found can be listed alphabetically or by frequency. After a concordance is generated it may be searched for specified keywords (which may include word patterns; see below). Furthermore:
This software may also be used with text in languages other than English, in particular, with French, German, Italian, Spanish and Latin text.

This software will also scan HTML files, ignoring tags. It may be told to skip over lists (which tend to interrupt the text).
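The idea of scanning HTML while ignoring tags, and optionally skipping lists, can be sketched in Python with the standard library. This is an illustration only, not the program's own code: the names `TextExtractor`, `html_to_text` and `LIST_TAGS` are hypothetical, and treating `ul`, `ol` and `dl` as "lists" is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these are the elements treated as "lists".
LIST_TAGS = {"ul", "ol", "dl"}


class TextExtractor(HTMLParser):
    """Collects text content, ignoring tags and (optionally) list content."""

    def __init__(self, skip_lists=False):
        super().__init__()
        self.skip_lists = skip_lists
        self.list_depth = 0  # > 0 while inside a (possibly nested) list
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS:
            self.list_depth += 1

    def handle_endtag(self, tag):
        if tag in LIST_TAGS and self.list_depth > 0:
            self.list_depth -= 1

    def handle_data(self, data):
        if self.skip_lists and self.list_depth > 0:
            return  # text inside a list is dropped
        self.chunks.append(data)


def html_to_text(html, skip_lists=False):
    parser = TextExtractor(skip_lists)
    parser.feed(html)
    parser.close()
    # Chunks are joined with a space so text from adjacent blocks does not
    # fuse; the trade-off is that a word split by inline tags is separated.
    return " ".join(" ".join(parser.chunks).split())


page = "<p>Alpha <b>beta</b></p><ul><li>one</li><li>two</li></ul><p>gamma</p>"
print(html_to_text(page))
print(html_to_text(page, skip_lists=True))
```

This sketch does not filter out the contents of `script` or `style` elements; a fuller version would need to.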


Generating a Concordance for a Document

To create a concordance for a document, first open the file by clicking on the Input file button and navigating to the desired folder and file. Optionally specify a folder and file name for the concordance file to be created. Whether or not a concordance file is specified, the concordance will be displayed in the textbox (up to the capacity of the textbox). Adjust the settings (if needed).

Search KWIC Concordance screenshot #1

Context size is the number of words preceding and following the keyword in context. Possible values are 0 through 9. If 0 then only the word (optionally with its frequency of occurrence) is shown.
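To make the roles of context size, stop words and frequency ordering concrete, here is a minimal Python sketch of generating a KWIC concordance. It is not the program's actual algorithm. The function names, the word-splitting rule, the case-insensitive grouping, and the choice to keep stop words inside the context (while never listing them as keywords) are all assumptions.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters, optionally with inner apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def build_concordance(text, stop_words=(), context_size=3):
    """Map each non-stop word to a list of (before, keyword, after) items."""
    if not 0 <= context_size <= 9:
        raise ValueError("context size must be between 0 and 9")
    stops = {w.lower() for w in stop_words}
    words = WORD_RE.findall(text)
    concordance = defaultdict(list)
    for i, word in enumerate(words):
        key = word.lower()
        if key in stops:
            continue  # stop words are never keywords
        before = words[max(0, i - context_size):i]
        after = words[i + 1:i + 1 + context_size]
        concordance[key].append((before, word, after))
    return dict(concordance)


def format_entries(concordance, by_frequency=False, show_frequencies=True):
    """List keywords alphabetically or by descending frequency."""
    if by_frequency:
        keys = sorted(concordance, key=lambda k: (-len(concordance[k]), k))
    else:
        keys = sorted(concordance)
    lines = []
    for key in keys:
        items = concordance[key]
        lines.append(f"{key} ({len(items)})" if show_frequencies else key)
        for before, word, after in items:
            if before or after:  # with context size 0 only the word is listed
                lines.append("  " + " ".join(before + [f"<{word}>"] + after))
    return lines


sample = "The cat sat on the mat. The dog sat too."
conc = build_concordance(sample, stop_words=["the", "on"], context_size=2)
for line in format_entries(conc, by_frequency=True):
    print(line)
```

With a context size of 0 the same functions produce a plain word list, optionally with frequencies.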

Click on the Create concordance button and the result (with Show frequencies checked and Show line breaks unchecked) is something like:

Screenshot #2 for Search KWIC Concordance

With Show frequencies unchecked and Show line breaks checked we obtain concordance items such as:

Buggers

where a forward slash indicates the presence of an end‑of‑line in the text file.

ANSI is the single‑byte text encoding which is the default encoding on your PC. UTF‑8 is a variable‑byte‑length encoding, often used in HTML files.
Text files may be encoded via ANSI or UTF‑8. The program does not act directly on binary files such as PDF and Word DOC files, but it does handle DOCX files. Other kinds of files can be processed if saved as "Plain Text" files.
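A common way to handle both encodings is to try UTF‑8 first and fall back to a single‑byte code page. The Python sketch below illustrates that approach; it is not a description of how this program detects encodings. The function name `decode_text` is hypothetical, and the choice of Windows‑1252 (`cp1252`) as the "ANSI" code page is an assumption that holds for Western European Windows systems only.

```python
def decode_text(data, ansi_encoding="cp1252"):
    """Decode bytes as UTF-8 if valid, otherwise as a single-byte code page."""
    try:
        # "utf-8-sig" also strips a leading byte order mark if present.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is "ANSI" text.
        return data.decode(ansi_encoding, errors="replace")


print(decode_text("café".encode("utf-8")))
print(decode_text("café".encode("cp1252")))
```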

This program can process text in most European languages, including German, Spanish, Italian, Portuguese and French. For each of these languages a user can select a file of stop words (supplied).


Searching a Concordance

After a concordance has been generated (with or without creating a concordance file) you can search the concordance for specified words or word patterns.

Select the Search for one or more keywords option to see the Search for keywords button. This also enables the Search results file and Keywords file buttons. Clicking on the Search results file button allows you to specify a folder and a file to which to write the results of a search. If no file is specified then the results are only displayed in the textbox, not written to a file.

Clicking on the Keywords file button allows you to select a file containing keywords to search for, as many as you wish. Keywords must be single words, not phrases. The keywords should be separated by spaces, not commas. They do not have to be in alphabetical order, and several keywords can occur on a single line in the file.
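The keywords file format described above (single words separated by spaces, any order, several per line) is simple to read. A minimal Python sketch follows; `parse_keywords` is a hypothetical name, and lower-casing and de-duplicating the keywords are assumptions rather than documented behaviour.

```python
def parse_keywords(text):
    """Return the distinct whitespace-separated keywords, in file order."""
    keywords = []
    seen = set()
    for line in text.splitlines():
        for word in line.split():
            key = word.lower()  # assumption: matching ignores case
            if key not in seen:
                seen.add(key)
                keywords.append(key)
    return keywords


print(parse_keywords("mother father\nsister\n\nMother brother"))
```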

In addition to the words in the keywords file you can also specify extra keywords by entering them into the Search keywords textbox. As before, the keywords should be separated by spaces, not commas. For example, mother father. Clicking on the Search for keywords button produces (in this example):

KWIC concordance -- Search mother father

KWIC concordance -- Search sister+mother

If a search term consists of two words joined by +, e.g., sister+mother, then a search will find all occurrences of the first word where the second word occurs in the same context (as shown at right). In this case we see that the term sister occurs 6 times but only twice with mother in the same context. Only the first search word is shown in angle brackets.

KWIC concordance -- Search mother+sister

The result of searching on mother+sister is shown at right. Results depend on the context size; a larger context size may return more items.
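The behaviour of a first+second search term, and its dependence on context size, can be illustrated with a short Python sketch. The helper names `kwic_items` and `search_term` are hypothetical, the sample sentence is invented, and the case-insensitive comparison is an assumption.

```python
def kwic_items(words, context_size):
    """One (before, keyword, after) item per word position."""
    return [
        (words[max(0, i - context_size):i], w, words[i + 1:i + 1 + context_size])
        for i, w in enumerate(words)
    ]


def search_term(items, term):
    """Find items for 'first' or, for 'first+second', items for first
    whose context also contains second."""
    first, _, second = term.lower().partition("+")
    hits = []
    for before, word, after in items:
        if word.lower() != first:
            continue
        context = [w.lower() for w in before + after]
        if not second or second in context:
            hits.append((before, word, after))
    return hits


words = "my sister and mother met my sister then my father left".split()
narrow = kwic_items(words, context_size=2)
wide = kwic_items(words, context_size=3)
print(len(search_term(narrow, "sister")))         # all occurrences of sister
print(len(search_term(narrow, "sister+mother")))  # only those near mother
print(len(search_term(wide, "sister+mother")))    # a wider context finds more
```

Because only the first word is the keyword, sister+mother and mother+sister generally return different items.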

Pattern-matching may be used in search terms (but only for extra keywords, not keywords in the keywords file). The character * matches any string of characters (including the empty string), the character ? matches any single character and the character # matches any numerical digit (0‑9). For example:
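The three wildcard characters can be illustrated by translating a pattern into a regular expression. This Python sketch shows one way such matching could work, not the program's own implementation; `pattern_to_regex` and `match_words` are hypothetical names, and ignoring case is an assumption.

```python
import re


def pattern_to_regex(pattern):
    """Translate *, ? and # wildcards into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")     # any string, including the empty string
        elif ch == "?":
            parts.append(".")      # exactly one character
        elif ch == "#":
            parts.append("[0-9]")  # exactly one digit
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.IGNORECASE)


def match_words(pattern, words):
    """Return the words that match the whole pattern."""
    rx = pattern_to_regex(pattern)
    return [w for w in words if rx.fullmatch(w)]


print(match_words("mo*", ["mother", "mo", "moon", "atom"]))
print(match_words("?at", ["cat", "at", "chat", "Hat"]))
print(match_words("19##", ["1984", "19x4", "19845"]))
```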


Patterns (in extra keywords) can also be juxtaposed, for example:

KWIC concordance -- juxtaposed patterns

KWIC concordance -- more juxtaposed patterns

If fa*+mo* is the search term, instead of mo*+fa*, we obtain:



Generating an Index File

An index file is a file containing information about each of the words in the document. To generate an index file along with a concordance file, check the index file check box. The index file is not used in the current session, but when the program is restarted it will look for an index file and (optionally) load it, in which case the concordance need not be re‑generated before conducting a search. An index file is unnecessary for small documents, because their concordance is generated quickly. For large documents, for which generating a concordance might take several minutes or more, it is useful to generate an index so that the concordance need not be recreated on subsequent runs.

The index file contains information about each word in the concordance at the time the concordance was generated, and so depends on the settings for allowing and ignoring words in the top frame of the Settings panel. If these settings have changed since the index file was last generated (or if a different stop words file is being used), generate a new concordance (optionally with an index) to guarantee correct search results.
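The dependence of an index on the settings in force when it was created can be illustrated with a Python sketch that stores a fingerprint of those settings alongside the index and refuses to load a stale one. This is purely hypothetical: the actual index file format is not documented here, nothing above says the program performs such a check, and the JSON layout, function names and the `allow_numbers` setting are invented for the example.

```python
import hashlib
import json


def settings_fingerprint(settings, stop_words):
    """Hash the word-filter settings and stop words into a short signature."""
    payload = json.dumps(
        {"settings": settings, "stop_words": sorted(stop_words)}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_index(path, positions, settings, stop_words):
    """Write word positions together with the fingerprint of the settings."""
    doc = {
        "fingerprint": settings_fingerprint(settings, stop_words),
        "positions": positions,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(doc, f)


def load_index(path, settings, stop_words):
    """Return the stored positions, or None if the settings have changed."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    if doc.get("fingerprint") != settings_fingerprint(settings, stop_words):
        return None  # stale index: the concordance should be regenerated
    return doc["positions"]
```

A caller would treat a `None` result as a signal to regenerate the concordance and write a fresh index.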

If you never wish to use an index file then simply leave the index file check box unchecked when creating a concordance.