# Lexicons

This directory contains lexicons used for spell checking. Each lexicon
file contains tab-delimited word-frequency pairs.
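
As a rough illustration of that format, the sketch below checks that every line of a lexicon has exactly two tab-delimited fields. The function name `check_lexicon` is hypothetical, and the sketch assumes that frequencies are non-negative integers, which this README does not state.

```shell
# Hypothetical helper: check_lexicon FILE
# Reports lines that are not "word<TAB>frequency" and returns non-zero
# if any are found. Assumes frequencies are non-negative integers.
check_lexicon() {
  awk -F '\t' '
    NF != 2 || $2 !~ /^[0-9]+$/ {
      printf "line %d: %s\n", NR, $0
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}
```

For example, `check_lexicon en.txt` prints nothing and succeeds when every line is well formed.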

Compiling a high-quality list of correctly spelled words requires the
following steps:

1. Download a unigram frequency list of all words in a given language.
1. Download a high-quality source list of correctly spelled words.
1. Filter the unigram frequency list, keeping only words in the source list.
1. Sort the filtered list by frequency in descending order.

The last two steps can be accomplished as follows:

    # Extract each unigram and its frequency if the word is in the source
    # lexicon. Reading line by line avoids word splitting and globbing.
    while IFS= read -r word; do
      grep -m 1 "^${word}"$'\t' unigram-frequencies.txt
    done < source-lexicon.txt > filtered.txt

    # Sort on the second tab-delimited column only (-t, -k2,2),
    # numerically (n) and in reverse order (r).
    sort -t $'\t' -k2,2nr filtered.txt > en.txt

Filtering this way takes a few hours on modern hardware because the loop
runs `grep` against the frequency list once per source word. There may be
more efficient ways to filter the data.
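
One possibly faster alternative is a single pass with `awk`, which loads the source words into memory and then reads the frequency list once. This is a minimal sketch, not the method used to build the lexicons here: the function name `filter_lexicon` is hypothetical, words are compared as exact strings rather than as `grep` patterns, and the source lexicon is assumed to be non-empty with one word per line.

```shell
# Hypothetical helper: filter_lexicon SOURCE UNIGRAMS
# Prints the lines of UNIGRAMS whose first tab-delimited column appears in
# SOURCE, sorted by the second column (frequency) in descending order.
filter_lexicon() {
  awk -F '\t' '
    # First file: remember every source word.
    NR == FNR { words[$1]; next }
    # Second file: print only the first entry for each remembered word.
    ($1 in words) && !seen[$1]++
  ' "$1" "$2" | sort -t "$(printf '\t')" -k2,2nr
}
```

Under those assumptions, `filter_lexicon source-lexicon.txt unigram-frequencies.txt > en.txt` would replace both the loop and the sort.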

# Resources

Numerous word and frequency lists are available, including:

* https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
* https://github.com/hermitdave/FrequencyWords/
* https://github.com/neilk/wordfrequencies