\input{chapter.tex}

\chapter{\label{cha:Word-Split}Word Split}

Word Split is an application that splits conjoined terms into their
constituent words. The software cannot split conjoined terms perfectly;
rather, it reduces manual labour through automation.

\section{Download}

Word Split is available for download at:

\href{http://www.whitemagicsoftware.com/software/java/wordsplit/}{http://www.whitemagicsoftware.com/software/java/wordsplit/}

\section{Overview}

Splitting conjoined words (for example, \emph{clientaccountedit}) into
constituent words (such as \emph{client}, \emph{account}, and \emph{edit})
requires the following steps:
\begin{enumerate}
\item \textbf{Corpus.} Gather a large body of text.
\item \textbf{Probability Lexicon.} Derive a probability lexicon from the
corpus text.
\item \textbf{Conjoined.} Obtain a list of conjoined terms (such as database
column names).
\item \textbf{Word Split.} Execute Word Split using the lexicon and the
conjoined terms.
\end{enumerate}

\section{Corpus}

Create the corpus from large bodies of text that contain the words that make
up the conjoined terms. For example, a database column called
\emph{clientaccountedit} works best with corpus text that contains the words
\emph{account}, \emph{client}, and \emph{edit}. The more often these words
appear in the corpus, the more likely that \emph{clientaccountedit} is split
into \emph{client}, \emph{account}, and \emph{edit}. Excellent sources of
corpus text include:
\begin{itemize}
\item Emails
\item Website Contents
\item Technical Documents
\item Requirements Documents
\end{itemize}
The results directly relate to the amount of corpus text; more material
improves the results. Create a corpus as follows:
\begin{enumerate}
\item Concatenate large corpus text files into a single corpus file.
\item Append a dictionary to the corpus file.\footnote{First remove
dictionary entries that contain uppercase characters or punctuation marks.}
\end{enumerate}
Appending a dictionary to the corpus file ensures that words appearing in
the conjoined terms, but not in the corpus, can still be split. The corpus
file probably does not contain every word used in the conjoined terms,
whereas the dictionary probably does. Even though this means that all words
in the dictionary are included in the probability lexicon, those words are
assigned the lowest probability. Without every possible word in the lexicon,
the word split software will not work as expected.

\section{Probability Lexicon}

The shell script in \ref{alg:Script-Tally-Corpus} shows how to create a
tally of words.

\begin{algorithm}[H]
\includegraphics{source/scripts/tally-corpus\lyxdot sh}
\caption{\label{alg:Script-Tally-Corpus}Script - Tally Corpus}
\end{algorithm}

The lines in \ref{alg:Script-Tally-Corpus} perform the following actions:
\begin{itemize}
\item \textbf{Line 4.} This terse line accomplishes a few things:

\begin{itemize}
\item Reads all the text, in paragraph form, from a file called
\filename{\filetxtcorpus}.
\item Replaces spaces with new lines, so that each word is tallied
individually.
\item Removes all non-alphabetic characters from every word.
\end{itemize}
\item \textbf{Line 5.} Convert all the words to lower case, so that they can
later be matched against a \nohyphens{dictionary} of exclusively lower-case
words.
\item \textbf{Line 6.} Sort the words alphabetically, so that consecutive
instances of the same word can be tallied.
\item \textbf{Line 7.} Tally each unique word; this frequency can be
translated to a relative probability using division.
\item \textbf{Line 8.} Sort the resulting counts numerically, placing the
most frequently used words at the top. This step also helps calculate
relative probabilities.
\item \textbf{Line 11.} This line executes the following tasks:

\begin{itemize}
\item Exclude any words in the frequency list that are not in the
dictionary. In practice, the corpus will likely be sourced from technical
documentation, which inevitably will include database column names. As the
point of the word splitting is to chop the column names into their
equivalent phrases (that is, to separate conjoined words with spaces),
having the column names themselves in the lexicon defeats the purpose. In
other words, this line removes words like \texttt{clientaccountedit}, but
keeps the words \emph{account}, \emph{accounted}, \emph{client},
\emph{count}, \emph{counted}, \emph{edit}, \emph{ed}, \emph{it}, \emph{lie},
\emph{tac}, and others in the lexicon with their respective probabilities.
\item Write each word out in a comma-separated file format (for ease of
editing in a spreadsheet).
\end{itemize}
\item \textbf{Line 14.} In practice, three-letter words cause the word
splitting software to fail due to matches on words like \emph{lie},
\emph{tac}, \emph{dit}, and others. For the most part, longer column names
are predominantly made up of real, human-readable words. Any part of the
column name that does not match a word in the lexicon is separated from the
parts that do match, by virtue of the matching words being surrounded by
spaces. This line therefore removes such short words.
\end{itemize}
The shell script in \ref{alg:Script-Tally-Corpus} generates a tally lexicon.
Running the script requires a dictionary file and a corpus file. Example
output from running the tally script resembles the following:
\begin{lyxcode}
account,1000

accounted,979

client,971

edit,942

counted,544
\end{lyxcode}
After the tally lexicon is created, the probability lexicon must be created.
The source code in \ref{alg:Script-Tallies-Probabilities} shows an
\texttt{awk} script that converts the numeric tallies into relative
probabilities.
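The conversion amounts to dividing each tally by the largest tally. As a
sketch of the idea only (not the listing itself; the file name
\texttt{tallies.csv} is hypothetical), assuming the tally file is sorted
with the largest count first:

```sh
# Example tally file, repeating the sample output shown earlier.
cat > tallies.csv <<'EOF'
account,1000
accounted,979
client,971
edit,942
counted,544
EOF

# Sketch: divide every tally by the first (largest) tally to obtain a
# relative probability in (0, 1].  Assumes the file is sorted in
# descending order, as the tally script produces.
awk -F, 'NR == 1 { max = $2 } { printf "%s,%g\n", $1, $2 / max }' tallies.csv
```

Because \texttt{account} holds the largest tally, it receives probability 1,
and every other word receives its tally divided by 1000.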
\begin{algorithm}[H]
\includegraphics{source/scripts/probability\lyxdot awk}
\caption{\label{alg:Script-Tallies-Probabilities}Script - Tallies to
Probabilities}
\end{algorithm}

Running the script in \ref{alg:Script-Tallies-Probabilities} produces a
probability lexicon; an example probability lexicon file resembles the
following:
\begin{lyxcode}
account,1

accounted,0.979

client,0.971

edit,0.942

counted,0.544
\end{lyxcode}

\section{Conjoined}

Conjoined terms must be listed in a file with one conjoined term per line.
An example conjoined input file resembles the following:
\begin{lyxcode}
clientaccountedit

clientaccount

accountedit
\end{lyxcode}

\section{Word Split}

Word Split is a Java-based application that, when given a probability
lexicon and a list of conjoined terms, produces a list of conjoined terms
and their segmented solutions. Review the file \texttt{run.sh} to see how to
run Word Split. Example output from running Word Split resembles the
following:
\begin{lyxcode}
clientaccountedit,client~account~edit

clientaccount,client~account

accountedit,account~edit
\end{lyxcode}
The text is split into meaningful words.
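Although Word Split itself is a Java application, the underlying idea can be
sketched in a few lines of \texttt{awk}: choose the segmentation that
maximises the product of the word probabilities. The following is a sketch
under that assumption, not Word Split's actual implementation; the file
names are illustrative, and the two input files repeat the examples above.

```sh
# Sketch only: maximum-probability segmentation, not Word Split's actual
# implementation.  The input files repeat the examples shown earlier.
cat > lexicon.csv <<'EOF'
account,1
accounted,0.979
client,0.971
counted,0.544
edit,0.942
EOF
cat > conjoined.txt <<'EOF'
clientaccountedit
clientaccount
accountedit
EOF

awk -F, '
  NR == FNR { p[$1] = $2; next }     # first file: word,probability lexicon
  {
    term = $0
    n = length(term)
    # best[i] holds the best probability product for the prefix term[1..i];
    # back[i] records where the last word of that best split begins.
    for (i = 0; i <= n; i++) { best[i] = 0; back[i] = 0 }
    best[0] = 1
    for (i = 1; i <= n; i++)
      for (j = 0; j < i; j++) {
        w = substr(term, j + 1, i - j)
        if ((w in p) && best[j] * p[w] > best[i]) {
          best[i] = best[j] * p[w]
          back[i] = j
        }
      }
    if (best[n] > 0) {               # walk the back pointers to rebuild the split
      out = ""
      for (i = n; i > 0; i = back[i])
        out = substr(term, back[i] + 1, i - back[i]) (out == "" ? "" : " ") out
      print term "," out
    } else
      print term "," term            # no full segmentation exists in the lexicon
  }
' lexicon.csv conjoined.txt
```

Note that \emph{client accounted} leaves an unsplittable \emph{it} behind,
so the dynamic programme settles on \emph{client account edit}. Word Split's
actual output joins words with non-breaking spaces (shown as \texttt{\textasciitilde{}}
in the example above); this sketch uses plain spaces.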