
\input{chapter.tex}


\chapter{\label{cha:Word-Split}Word Split}

Word Split is an application that splits conjoined terms into their
constituent words. The software cannot convert conjoined terms perfectly;
rather, it reduces manual labour through automation.


\section{Download}

Word Split is available for download at:

\href{http://www.whitemagicsoftware.com/software/java/wordsplit/}{http://www.whitemagicsoftware.com/software/java/wordsplit/}


\section{Overview}

Splitting conjoined words (for example, \emph{clientaccountedit})
into constituent words (such as \emph{client}, \emph{account}, and
\emph{edit}) requires the following steps:
\begin{enumerate}
\item \textbf{Corpus.} Assemble a large body of text.
\item \textbf{Probability Lexicon.} Derive a probability lexicon from the corpus
text.
\item \textbf{Conjoined.} Obtain a list of conjoined words (such as database
column names).
\item \textbf{Word Split.} Execute Word Split using the lexicon and the
conjoined terms.
\end{enumerate}

\section{Corpus}

Create the corpus text from large bodies of text that contain the
words that make up the conjoined terms. For example, a database column called
\emph{clientaccountedit} works best with corpus text that contains
the words \emph{account}, \emph{client}, and \emph{edit}. The more often
these words appear in the corpus, the more likely it is that \emph{clientaccountedit}
is split into \emph{client}, \emph{account}, and \emph{edit}.

Excellent sources for creating corpus text include:
\begin{itemize}
\item Emails
\item Website Contents
\item Technical Documents
\item Requirements Documents
\end{itemize}
The quality of the results depends directly on the amount of corpus text;
more material improves the results.

Create a corpus as follows:
\begin{enumerate}
\item Concatenate large corpus text files into a single corpus file.
\item Append a dictionary to the corpus file.\footnote{Eliminate words with uppercase characters and punctuation marks.}
\end{enumerate}
Appending a dictionary to the corpus file ensures that words in
the conjoined terms can still be split even when they do not appear in the
corpus. The corpus file probably does not contain all the words used in the
conjoined terms, whereas the dictionary probably does. Although this means
that all words in the dictionary are included in the probability lexicon,
words that appear only in the dictionary are assigned the lowest probability.
Without every possible word in the lexicon, the word split software will not
work as expected.
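
The two steps above can be sketched in Python as follows. This is a minimal
illustration rather than part of Word Split: the function names and file
paths are hypothetical, and the filter assumes that a usable dictionary entry
consists solely of lowercase ASCII letters, in keeping with the footnote.

```python
import string
from pathlib import Path


def is_plain_word(word):
    """Return True for a non-empty word made only of lowercase ASCII letters."""
    return bool(word) and all(ch in string.ascii_lowercase for ch in word)


def build_corpus(text_paths, dictionary_path, corpus_path):
    """Concatenate the text files, then append the filtered dictionary words.

    All three arguments are hypothetical file paths chosen by the caller.
    """
    with open(corpus_path, "w", encoding="utf-8") as corpus:
        # Step 1: concatenate every corpus text file into one file.
        for path in text_paths:
            corpus.write(Path(path).read_text(encoding="utf-8"))
            corpus.write("\n")
        # Step 2: append the dictionary, one word per line, skipping
        # entries with uppercase characters or punctuation marks.
        entries = Path(dictionary_path).read_text(encoding="utf-8").splitlines()
        for entry in entries:
            word = entry.strip()
            if is_plain_word(word):
                corpus.write(word + "\n")
```

Because each dictionary word is appended exactly once, a word that occurs
only in the dictionary ends up with a tally of one.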


\section{Probability Lexicon}

The shell script in Algorithm~\ref{alg:Script-Tally-Corpus} shows how to create
a tally of words.

\begin{algorithm}[H]
\includegraphics{source/scripts/tally-corpus\lyxdot sh}

\caption{\label{alg:Script-Tally-Corpus}Script - Tally Corpus}
\end{algorithm}


The lines in Algorithm~\ref{alg:Script-Tally-Corpus} perform the following
actions:
\begin{itemize}
\item \textbf{Line 4.} This terse line accomplishes a few things:

\begin{itemize}
\item Reads all the text, in paragraph form, from a file called \filename{\filetxtcorpus}.
\item Replaces spaces with new lines, so each word is tallied individually.
\item Removes all non-alphabetic characters from every word.
\end{itemize}
\item \textbf{Line 5.} Convert all the words to lower case so that they can
later be matched against a \nohyphens{dictionary} of exclusively
lower case words.
\item \textbf{Line 6.} Sort the words alphabetically, so consecutive instances
of the same word can be tallied.
\item \textbf{Line 7.} Tally each unique word; this frequency can be translated
to a relative probability using division.
\item \textbf{Line 8.} Sort the resulting counts numerically, to get the
most frequently used words at the top. This step also helps calculate
relative probabilities.
\item \textbf{Line 11.} This line executes the following tasks:

\begin{itemize}
\item Exclude any words in the frequency list that are not in the dictionary.
In practice, the corpus will likely be sourced from technical documentation,
which inevitably will include database column names. As the point
of the word splitting is to chop the column names into their equivalent
phrases (that is, separate conjoined words with spaces), having the
column names themselves as part of the lexicon defeats the purpose.
In other words, this line removes words like \texttt{clientaccountedit},
but keeps the words \emph{account}, \emph{accounted}, \emph{client},
\emph{count}, \emph{counted}, \emph{edit}, \emph{ed}, \emph{it}, \emph{lie},
\emph{tac}, and others in the lexicon with their respective probabilities.
\item Write each word and its tally to a comma-separated file (for ease of
editing in a spreadsheet).
\end{itemize}
\item \textbf{Line 14.} In practice, three-letter words cause the word splitting
software to fail because of spurious matches on words like \emph{lie}, \emph{tac},
and \emph{dit}. For the most part, longer column names are
predominantly made up of real, human-readable words. Any part of
a column name that does not match a word in the lexicon is still separated
from the words that do match, because the matching words are
surrounded by spaces. This line therefore removes very short words.
\end{itemize}
The shell script in Algorithm~\ref{alg:Script-Tally-Corpus} generates a tally
lexicon. Running the script requires a dictionary file and a corpus
file. Example output from the tally script resembles the following:
\begin{lyxcode}
account,1000

accounted,979

client,971

counted,544

edit,942
\end{lyxcode}
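
For readers who prefer not to use a shell pipeline, the same tallying steps
can be sketched in Python. This is an illustration under stated assumptions,
not a translation of the script itself: the function names are hypothetical,
and the \texttt{min\_length} default of four (which drops words of three
letters or fewer) is an assumed threshold.

```python
import re
from collections import Counter


def tally_words(corpus_text, dictionary_words, min_length=4):
    """Return (word, count) pairs, most frequent first.

    min_length is an assumed cut-off for removing very short words.
    """
    counts = Counter()
    # Split on whitespace so that each word is tallied individually.
    for token in corpus_text.split():
        # Remove non-alphabetic characters, then convert to lower case.
        word = re.sub(r"[^A-Za-z]", "", token).lower()
        if word:
            counts[word] += 1
    allowed = set(dictionary_words)
    # most_common() orders the tallies from highest to lowest count.
    return [
        (word, count)
        for word, count in counts.most_common()
        if word in allowed and len(word) >= min_length
    ]


def to_csv_lines(tallies):
    """Format each (word, count) pair as a comma-separated line."""
    return [f"{word},{count}" for word, count in tallies]
```

Filtering against the dictionary discards conjoined terms such as
\texttt{clientaccountedit} that appear in the corpus, while the length check
discards short fragments.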
Once the tally lexicon exists, create the probability lexicon.
The source code in Algorithm~\ref{alg:Script-Tallies-Probabilities}
shows an \texttt{awk} script that converts the numeric tallies into
relative probabilities.

\begin{algorithm}[H]
\includegraphics{source/scripts/probability\lyxdot awk}

\caption{\label{alg:Script-Tallies-Probabilities}Script - Tallies to Probabilities}
\end{algorithm}


Running Algorithm~\ref{alg:Script-Tallies-Probabilities} produces a probability
lexicon; an example probability lexicon file resembles the following:
\begin{lyxcode}
account,1

accounted,0.979

client,0.971

counted,0.544

edit,0.942
\end{lyxcode}
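
The example figures are consistent with dividing every tally by the largest
tally (for instance, $979 / 1000 = 0.979$). The Python sketch below makes that
assumption explicit; it is inferred from the sample output rather than taken
from the \texttt{awk} script, and the function names are hypothetical.

```python
def tallies_to_probabilities(tallies):
    """Scale (word, count) pairs so that the largest count becomes 1.

    Assumes the relative probability is count / highest count, as the
    sample tally and probability files suggest.
    """
    if not tallies:
        return []
    highest = max(count for _, count in tallies)
    return [(word, count / highest) for word, count in tallies]


def format_lexicon(probabilities):
    """Format each (word, probability) pair as a comma-separated line."""
    # The "g" format drops trailing zeros, so 1.0 is written as 1.
    return [f"{word},{probability:g}" for word, probability in probabilities]
```

Applied to the sample tallies shown earlier, these two functions reproduce the
sample probability lexicon line for line.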

\section{Conjoined}

Conjoined terms must be listed in a file with one conjoined term per
line.

An example conjoined input file resembles the following:
\begin{lyxcode}
clientaccountedit

clientaccount

accountedit
\end{lyxcode}

\section{Word Split}

Word Split is a Java-based application that, when given a probability lexicon
and a list of conjoined terms, produces a list of conjoined terms and
their segmented solutions. Review the file \texttt{run.sh} to see how to
run Word Split.

Example output from running Word Split resembles the following:
\begin{lyxcode}
clientaccountedit,client~account~edit

clientaccount,client~account

accountedit,account~edit
\end{lyxcode}
Each line pairs a conjoined term with its segmentation into meaningful words.
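
To illustrate how a probability lexicon can drive segmentation, the Python
sketch below uses a common dynamic-programming approach: it chooses the split
whose word probabilities have the largest product. This is not necessarily
the algorithm that Word Split implements. The function names and the
\texttt{unknown\_probability} penalty for characters that match no lexicon
word are assumptions made for the example.

```python
import math


def split_term(term, lexicon, unknown_probability=1e-9):
    """Split term into the most probable sequence of lexicon words.

    lexicon maps each word to a probability in (0, 1]. Characters that no
    lexicon word covers are grouped into separate unmatched pieces.
    """
    longest = max((len(word) for word in lexicon), default=1)
    # best[i] holds (log score, start of last piece) for the prefix term[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(term)
    for end in range(1, len(term) + 1):
        for start in range(max(0, end - longest), end):
            piece = term[start:end]
            if piece in lexicon:
                probability = lexicon[piece]
            elif end - start == 1:
                probability = unknown_probability  # assumed penalty
            else:
                continue
            # Adding logarithms is equivalent to multiplying probabilities.
            score = best[start][0] + math.log(probability)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk backwards through the recorded start positions.
    pieces = []
    end = len(term)
    while end > 0:
        start = best[end][1]
        piece = term[start:end]
        if pieces and piece not in lexicon and pieces[-1] not in lexicon:
            pieces[-1] = piece + pieces[-1]  # merge adjacent unmatched text
        else:
            pieces.append(piece)
        end = start
    pieces.reverse()
    return pieces


def format_result(term, pieces):
    """Pair the conjoined term with its space-separated split."""
    return f"{term},{' '.join(pieces)}"
```

With the sample probability lexicon, \texttt{split\_term} divides
\emph{clientaccountedit} into \emph{client}, \emph{account}, and \emph{edit},
because the alternative that starts with \emph{accounted} leaves the letters
\emph{it} unmatched and is heavily penalized.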