#LyX 1.6.7 created this file. For more info see http://www.lyx.org/
\lyxformat 345
\begin_document
\begin_header
\textclass scrbook
\begin_preamble
\input{preamble.tex}
\end_preamble
\use_default_options false
\language english
\inputencoding utf8
\font_roman lmodern
\font_sans helvet
\font_typewriter courier
\font_default_family default
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100

\graphics default
\float_placement H
\paperfontsize default
\spacing single
\use_hyperref false
\papersize letterpaper
\use_geometry false
\use_amsmath 2
\use_esint 1
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\bullet 0 5 11 -1
\bullet 1 5 24 -1
\bullet 2 0 0 -1
\tracking_changes false
\output_changes false
\author "" 
\author "" 
\end_header

\begin_body

\begin_layout Standard
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
input{chapter.tex}
\end_layout

\end_inset


\end_layout

\begin_layout Chapter
\begin_inset CommandInset label
LatexCommand label
name "cha:Word-Split"

\end_inset

Word Split
\end_layout

\begin_layout Standard
Word Split is an application that splits conjoined terms into their constituent
 words.
 The software cannot convert conjoined terms perfectly; rather, it reduces
 manual labour through automation.
\end_layout

\begin_layout Section
Download
\end_layout

\begin_layout Standard
Word Split is available for download at:
\end_layout

\begin_layout Standard
\begin_inset CommandInset href
LatexCommand href
target "http://www.whitemagicsoftware.com/software/java/wordsplit/"

\end_inset


\end_layout

\begin_layout Section
Overview
\end_layout

\begin_layout Standard
Splitting conjoined words (for example, 
\emph on
clientaccountedit
\emph default
) into constituent words (such as 
\emph on
client
\emph default
,
\emph on
 account
\emph default
, and
\emph on
 edit
\emph default
) requires the following steps:
\end_layout

\begin_layout Enumerate

\series bold
Corpus.

\series default
 Gather a large body of text.
\end_layout

\begin_layout Enumerate

\series bold
Probability Lexicon.

\series default
 Derive a probability lexicon from corpus texts.
\end_layout

\begin_layout Enumerate

\series bold
Conjoined.

\series default
 Obtain a list of conjoined words (such as database column names).
\end_layout

\begin_layout Enumerate

\series bold
Word Split.

\series default
 Execute Word Split using the lexicon and the conjoined terms.
\end_layout

\begin_layout Section
Corpus
\end_layout

\begin_layout Standard
Create the corpus text from large bodies of text that contain the words
 that make up the conjoined terms.
 For example, a database column called 
\emph on
clientaccountedit
\emph default
 works best with corpus text that contains the words 
\emph on
account
\emph default
, 
\emph on
client
\emph default
, and 
\emph on
edit
\emph default
.
 The more often these words appear in the corpus, the more likely it is that 
\emph on
clientaccountedit
\emph default
 is split into 
\emph on
client
\emph default
, 
\emph on
account
\emph default
, and 
\emph on
edit
\emph default
.
\end_layout

\begin_layout Standard
Excellent sources for creating corpus text include:
\end_layout

\begin_layout Itemize
Emails
\end_layout

\begin_layout Itemize
Website Contents
\end_layout

\begin_layout Itemize
Technical Documents
\end_layout

\begin_layout Itemize
Requirements Documents
\end_layout

\begin_layout Standard
The quality of the results depends directly on the amount of corpus text;
 more material improves the results.
\end_layout

\begin_layout Standard
Create a corpus as follows:
\end_layout

\begin_layout Enumerate
Concatenate large corpus text files into a single corpus file.
\end_layout

\begin_layout Enumerate
Append a dictionary to the corpus file.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Eliminate words with uppercase characters and punctuation marks.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Appending a dictionary to the corpus file ensures that words in the conjoined
 terms are split even when they do not appear in the corpus texts.
 The corpus texts probably do not contain all the words used in the conjoined
 terms, whereas the dictionary probably does.
 Although this means that every word in the dictionary is included in the
 probability lexicon, words that appear only in the dictionary are assigned
 the lowest probability.
 Without every possible word in the lexicon, the word split software will
 not work as expected.
\end_layout
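
\begin_layout Standard
The two corpus steps above can be sketched in a few lines of Python.
 This is a minimal illustration under stated assumptions, not part of the
 Word Split distribution: the function names, the file arguments, and the
 rule that a dictionary entry must consist solely of the letters a to z
 are all hypothetical.
\end_layout

```python
import re
from pathlib import Path

# Assumption: a usable dictionary entry contains only lowercase ASCII letters.
LOWERCASE_WORD = re.compile(r"[a-z]+")


def clean_dictionary(lines):
    """Drop entries that contain uppercase characters or punctuation marks."""
    stripped = (line.strip() for line in lines)
    return [word for word in stripped if LOWERCASE_WORD.fullmatch(word)]


def build_corpus(text_paths, dictionary_path, corpus_path):
    """Concatenate the text files, then append the filtered dictionary."""
    with open(corpus_path, "w", encoding="utf-8") as corpus:
        for path in text_paths:
            corpus.write(Path(path).read_text(encoding="utf-8"))
            corpus.write("\n")
        with open(dictionary_path, encoding="utf-8") as dictionary:
            for word in clean_dictionary(dictionary):
                corpus.write(word + "\n")
```

\begin_layout Standard
Because every surviving dictionary word is written exactly once, a word
 that never occurs in the corpus texts ends up with a tally of one.
\end_layout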

\begin_layout Section
Probability Lexicon
\end_layout

\begin_layout Standard
The shell script in 
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:Script-Tally-Corpus"

\end_inset

 shows how to create a tally of words.
\end_layout

\begin_layout Standard
\begin_inset Float algorithm
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset Graphics
	filename source/scripts/tally-corpus.sh.png
	display false

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "alg:Script-Tally-Corpus"

\end_inset

Script - Tally Corpus
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
The lines in 
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:Script-Tally-Corpus"

\end_inset

 perform the following actions:
\end_layout

\begin_layout Itemize

\series bold
Line 4.

\series default
 This terse line accomplishes a few things:
\end_layout

\begin_deeper
\begin_layout Itemize
Reads all the text, in paragraph form, from a file called 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
filename{
\backslash
filetxtcorpus}
\end_layout

\end_inset

.
\end_layout

\begin_layout Itemize
Replaces spaces with new lines, so each word is tallied individually.
\end_layout

\begin_layout Itemize
Removes all non-alphabetic characters from every word.
\end_layout

\end_deeper
\begin_layout Itemize

\series bold
Line 5.

\series default
 Convert all the words to lower case so that they can later be matched against
 a 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
nohyphens{dictionary}
\end_layout

\end_inset

 of exclusively lower case words.
\end_layout

\begin_layout Itemize

\series bold
Line 6.

\series default
 Sort the words alphabetically, so consecutive instances of the same word
 can be tallied.
\end_layout

\begin_layout Itemize

\series bold
Line 7.

\series default
 Tally each unique word; this frequency can be translated to a relative
 probability using division.
\end_layout

\begin_layout Itemize

\series bold
Line 8.

\series default
 Sort the resulting counts numerically, to get the most frequently used
 words at the top.
 This step also helps calculate relative probabilities.
\end_layout

\begin_layout Itemize

\series bold
Line 11.

\series default
 This line executes the following tasks:
\end_layout

\begin_deeper
\begin_layout Itemize
Exclude any words in the frequency list that are not in the dictionary.
 In practice, the corpus will likely be sourced from technical documentation,
 which inevitably will include database column names.
 As the point of the word splitting is to chop the column names into their
 equivalent phrases (that is, separate conjoined words with spaces), having
 the column names themselves as part of the lexicon defeats the purpose.
 In other words, this line removes words like 
\family typewriter
clientaccountedit
\family default
, but keeps the words 
\emph on
account
\emph default
, 
\emph on
accounted
\emph default
, 
\emph on
client
\emph default
, 
\emph on
count
\emph default
, 
\emph on
counted
\emph default
, 
\emph on
edit
\emph default
, 
\emph on
ed
\emph default
, 
\emph on
it
\emph default
, 
\emph on
lie
\emph default
, 
\emph on
tac
\emph default
, and others in the lexicon with their respective probabilities.
\end_layout

\begin_layout Itemize
Write each word out to a comma-separated file format (for ease of editing
 in a spreadsheet).
\end_layout

\end_deeper
\begin_layout Itemize

\series bold
Line 14.

\series default
 In practice, three-letter words cause the word splitting software to fail
 due to matches on words like 
\emph on
lie
\emph default
, 
\emph on
tac
\emph default
, 
\emph on
dit
\emph default
, and others.
 For the most part, longer column names are predominantly made up of real,
 human-readable words.
 Any part of the column name that does not match a word in the lexicon is
 separated from the words that do match, because the matching words are
 surrounded by spaces.
 This line therefore removes very short words.
\end_layout

\begin_layout Standard
The shell script in 
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:Script-Tally-Corpus"

\end_inset

 generates a tally lexicon.
 Running the script requires a dictionary file and a corpus file.
 An example output from running the tally script resembles the following:
\end_layout

\begin_layout LyX-Code

\family typewriter
account,1000
\end_layout

\begin_layout LyX-Code
accounted,979
\end_layout

\begin_layout LyX-Code
client,971
\end_layout

\begin_layout LyX-Code
counted,544
\end_layout

\begin_layout LyX-Code
edit,942
\end_layout
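
\begin_layout Standard
For readers who prefer to follow the tally steps in code, the following
 Python sketch mirrors the actions described above: split on whitespace,
 strip non-alphabetic characters, convert to lower case, count, keep only
 dictionary words, and drop very short words.
 It is not the shell script itself; the function names are hypothetical,
 and the minimum word length of four is an assumption inferred from the
 remark about three-letter words.
\end_layout

```python
import re
from collections import Counter


def tally_corpus(text, dictionary, min_length=4):
    """Return (word, count) pairs, most frequent first.

    `dictionary` is a set of lowercase words; `min_length` is an assumed
    cut-off for the "very short words" that are removed.
    """
    # Split on whitespace, strip non-alphabetic characters, lower-case.
    words = (re.sub(r"[^A-Za-z]", "", token).lower() for token in text.split())
    counts = Counter(word for word in words if word)
    kept = [
        (word, count)
        for word, count in counts.items()
        if word in dictionary and len(word) >= min_length
    ]
    # Highest tally first; ties are broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept


def to_csv_lines(tallies):
    """Format the tallies as comma-separated lines."""
    return [f"{word},{count}" for word, count in tallies]
```

\begin_layout Standard
A conjoined term such as clientaccountedit that appears in the corpus is
 discarded by the dictionary check, so it never reaches the lexicon.
\end_layout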

\begin_layout Standard
Once the tally lexicon exists, create the probability lexicon.
 The source code in 
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:Script-Tallies-Probabilities"

\end_inset

 shows an 
\family typewriter
awk
\family default
 script that converts the numeric tallies into relative probabilities.
\end_layout

\begin_layout Standard
\begin_inset Float algorithm
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset Graphics
	filename source/scripts/probability.awk.png
	display false

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "alg:Script-Tallies-Probabilities"

\end_inset

Script - Tallies to Probabilities
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Running 
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:Script-Tallies-Probabilities"

\end_inset

 produces a probability lexicon; an example probability lexicon file resembles
 the following:
\end_layout

\begin_layout LyX-Code

\family typewriter
account,1
\end_layout

\begin_layout LyX-Code
accounted,0.979
\end_layout

\begin_layout LyX-Code
client,0.971
\end_layout

\begin_layout LyX-Code
counted,0.544
\end_layout

\begin_layout LyX-Code
edit,0.942
\end_layout
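
\begin_layout Standard
The conversion from tallies to relative probabilities can be sketched as
 follows.
 The example files above are consistent with dividing each tally by the
 largest tally (1000 becomes 1, 979 becomes 0.979); that divisor is an
 assumption read off the sample data, and the Python function names are
 hypothetical rather than taken from the awk script.
\end_layout

```python
def tallies_to_probabilities(tallies):
    """Scale each tally by the largest tally (assumed normalisation)."""
    if not tallies:
        return []
    highest = max(count for _, count in tallies)
    return [(word, count / highest) for word, count in tallies]


def to_lexicon_lines(probabilities):
    """Format as comma-separated lines, trimming trailing zeros."""
    return [f"{word},{probability:g}" for word, probability in probabilities]


tallies = [
    ("account", 1000),
    ("accounted", 979),
    ("client", 971),
    ("counted", 544),
    ("edit", 942),
]
lexicon_lines = to_lexicon_lines(tallies_to_probabilities(tallies))
# lexicon_lines[0] is "account,1" and lexicon_lines[1] is "accounted,0.979"
```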

\begin_layout Section
Conjoined
\end_layout

\begin_layout Standard
Conjoined terms must be listed in a file with one conjoined term per line.
\end_layout

\begin_layout Standard
An example conjoined input file resembles the following:
\end_layout

\begin_layout LyX-Code
clientaccountedit
\end_layout

\begin_layout LyX-Code
clientaccount
\end_layout

\begin_layout LyX-Code
accountedit
\end_layout

\begin_layout Section
Word Split
\end_layout

\begin_layout Standard
Word Split is a Java-based application that, when given a probability lexicon
 and a list of conjoined terms, produces a list of conjoined terms and their
 segmented solutions.
 Review the file 
\family typewriter
run.sh
\family default
 to see how to run Word Split.
\end_layout

\begin_layout Standard
Example output from running Word Split resembles the following:
\end_layout

\begin_layout LyX-Code
clientaccountedit,client account edit
\end_layout

\begin_layout LyX-Code
clientaccount,client account
\end_layout

\begin_layout LyX-Code
accountedit,account edit
\end_layout

\begin_layout Standard
Each line pairs a conjoined term with its segmentation into meaningful words.
\end_layout
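
\begin_layout Standard
This chapter does not describe the algorithm inside the Java application,
 so the following Python sketch shows only one common way to perform such
 a split: a dynamic-programming search for the segmentation whose word
 probabilities have the largest product.
 The function name, the penalty for unmatched characters, and the merging
 of unmatched runs are all assumptions made for illustration.
\end_layout

```python
import math

# Assumption: each character that matches no lexicon word costs this much.
UNKNOWN_PENALTY = 1e-6


def split_term(term, lexicon):
    """Split `term` into the most probable sequence of lexicon words.

    `lexicon` maps lowercase words to probabilities in (0, 1].
    """
    n = len(term)
    longest = max((len(word) for word in lexicon), default=1)
    # best[i] holds (log-probability, pieces) for the prefix term[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - longest), end):
            piece = term[start:end]
            if piece in lexicon:
                score, known = math.log(lexicon[piece]), True
            elif end - start == 1:
                score, known = math.log(UNKNOWN_PENALTY), False
            else:
                continue
            total = best[start][0] + score
            if total > best[end][0]:
                best[end] = (total, best[start][1] + [(piece, known)])
    # Merge consecutive unmatched characters into a single piece.
    words = []
    previous_known = True
    for piece, known in best[n][1]:
        if not known and not previous_known:
            words[-1] += piece
        else:
            words.append(piece)
        previous_known = known
    return " ".join(words)


lexicon = {
    "account": 1.0,
    "accounted": 0.979,
    "client": 0.971,
    "counted": 0.544,
    "edit": 0.942,
}
print(split_term("clientaccountedit", lexicon))  # prints "client account edit"
```

\begin_layout Standard
With this lexicon, splitting accountedit as accounted followed by two
 unmatched letters scores far lower than account followed by edit, which
 illustrates why removing very short words from the lexicon helps.
\end_layout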

\end_body
\end_document