Dave Jarvis' Repositories

git clone https://repo.autonoma.ca/repo/keenwrite.git

Add DocumentParser to replace JSoup functionality, add KeenQuotes library

Author: Dave Jarvis <email>
Date: 2021-06-27 14:18:15 GMT-0700
Commit: 5cd6619ab0869afa39743db759b00e88982eb0b6
Parent: be43467
Delta: 94 lines added, 299 lines removed, 205-line decrease
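
The commit replaces JSoup calls with a `DocumentParser` class built on the JDK's W3C DOM API, but that class's source is not part of this diff. As a rough orientation only, the following sketch shows one way the three operations used in `XhtmlProcessor` (parse, walk by tag name, serialize) could be written with the JDK alone. The class name `DomSketch` and all method bodies are assumptions made for illustration; the real `DocumentParser` may work differently, for example by selecting nodes some other way.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.Consumer;

import static javax.xml.transform.OutputKeys.INDENT;
import static javax.xml.transform.OutputKeys.OMIT_XML_DECLARATION;

/** Hypothetical stand-in for DocumentParser; not the commit's code. */
final class DomSketch {
  /** Parses well-formed XHTML text into a W3C DOM document. */
  static Document parse( final String xhtml ) {
    try {
      final var dbf = DocumentBuilderFactory.newInstance();
      dbf.setIgnoringComments( true );

      final var db = dbf.newDocumentBuilder();
      return db.parse( new InputSource( new StringReader( xhtml ) ) );
    } catch( final Exception ex ) {
      throw new IllegalArgumentException( "Malformed XHTML", ex );
    }
  }

  /** Passes every element having the given tag name to the consumer. */
  static void walk(
    final Document doc, final String tag, final Consumer<Node> consumer ) {
    final var nodes = doc.getElementsByTagName( tag );

    for( int i = 0; i < nodes.getLength(); i++ ) {
      consumer.accept( nodes.item( i ) );
    }
  }

  /** Serializes the document without an XML declaration or indentation. */
  static String serialize( final Document doc ) {
    try {
      final var transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty( OMIT_XML_DECLARATION, "yes" );
      transformer.setOutputProperty( INDENT, "no" );

      final var writer = new StringWriter();
      transformer.transform( new DOMSource( doc ), new StreamResult( writer ) );
      return writer.toString();
    } catch( final Exception ex ) {
      throw new IllegalStateException( "Could not serialize document", ex );
    }
  }
}
```

Unlike JSoup, the JDK parser rejects markup that is not well-formed XML, which may be why the diff wraps the incoming HTML fragment in `html`, `head`, and `body` elements before parsing.
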
src/test/resources/com/keenwrite/quotes/smartypants.txt
-# ########################################################################
-# Decades
-# ########################################################################
-The Roaring '20s had the best music, no?
-The Roaring &apos;20s had the best music, no?
-
-Took place in '04, yes'm!
-Took place in &apos;04, yes&apos;m!
-
-# ########################################################################
-# Inside contractions (no leading/trailing apostrophes)
-# ########################################################################
-I don't like it: I love's it!
-I don&apos;t like it: I love&apos;s it!
-
-We'd've thought that pancakes'll be sweeter there.
-We&apos;d&apos;ve thought that pancakes&apos;ll be sweeter there.
-
-She'd be coming o'er when the horse'd gone to pasture...
-She&apos;d be coming o&apos;er when the horse&apos;d gone to pasture...
-
-# ########################################################################
-# Beginning contractions (leading apostrophes)
-# ########################################################################
-'Twas and 'tis whate'er lay 'twixt dawn and dusk 'n River Styx.
-&apos;Twas and &apos;tis whate&apos;er lay &apos;twixt dawn and dusk &apos;n River Styx.
-
-# ########################################################################
-# Ending contractions (trailing apostrophes)
-# ########################################################################
-Didn' get th' message.
-Didn&apos; get th&apos; message.
-
-Namsayin', y'know what I'ma sayin'?
-Namsayin&apos;, y&apos;know what I&apos;ma sayin&apos;?
-
-# ########################################################################
-# Outside contractions (leading and trailing, no middle)
-# ########################################################################
-Salt 'n' vinegar, fish-'n'-chips, sugar 'n' spice!
-Salt &apos;n&apos; vinegar, fish-&apos;n&apos;-chips, sugar &apos;n&apos; spice!
-
-# ########################################################################
-# Primes (single, double)
-# ########################################################################
-She stood 5\'7\".
-She stood 5&prime;7&Prime;.
-
-# No space after the feet sign.
-It's 4'11" away.
-It&apos;s 4&prime;11&Prime; away.
-
-Alice's friend is 6'3" tall.
-Alice&apos;s friend is 6&prime;3&Prime; tall.
-
-Bob's table is 5'' × 4''.
-Bob&apos;s table is 5&Prime; × 4&Prime;.
-
-What's this -5.5'' all about?
-What&apos;s this -5.5&Prime; all about?
-
-+7.9'' is weird.
-+7.9&Prime; is weird.
-
-Foolscap? Naw, I use 11.5"x14.25" paper!
-Foolscap? Naw, I use 11.5&Prime;x14.25&Prime; paper!
-
-An angular measurement, 3° 5' 30" means 3 degs, 5 arcmins, and 30 arcsecs.
-An angular measurement, 3° 5&prime; 30&Prime; means 3 degs, 5 arcmins, and 30 arcsecs.
-
-# ########################################################################
-# Backticks (left and right double quotes)
-# ########################################################################
-``I am Sam''
-&ldquo;I am Sam&rdquo;
-
-``Sam's away today''
-&ldquo;Sam&apos;s away today&rdquo;
-
-``Sam's gone!
-&ldquo;Sam&apos;s gone!
-
-``5'10" tall 'e was!''
-&ldquo;5&prime;10&Prime; tall &apos;e was!&rdquo;
-
-# ########################################################################
-# Consecutive quotes
-# ########################################################################
-"'I'm trouble.'"
-&ldquo;&lsquo;I&apos;m trouble.&rsquo;&rdquo;
-
-'"Trouble's my name."'
-&lsquo;&ldquo;Trouble&apos;s my name.&ldquo;&lsquo;
-
-# ########################################################################
-# Escaped quotes
-# ########################################################################
-\"What?\"
-&ldquo;What?&rdquo;
-
-# ########################################################################
-# Double quotes
-# ########################################################################
-"I am Sam"
-&ldquo;I am Sam&rdquo;
-
-"...even better!"
-&ldquo;...even better!&rdquo;
-
-"It was so," said he.
-&ldquo;It was so,&rdquo; said he.
-
-"She said, 'Llamas'll languish, they'll--
-&ldquo;She said, &lsquo;Llamas&apos;ll languish, they&apos;ll--
-
-With "air quotes" in the middle.
-With &ldquo;air quotes&rdquo; in the middle.
-
-With--"air quotes"--and dashes.
-With--&ldquo;air quotes&rdquo;--and dashes.
-
-"Not "quite" what you expected?"
-&ldquo;Not &ldquo;quite&rdquo; what you expected?&rdquo;
-
-# ########################################################################
-# Nested quotations
-# ########################################################################
-"'Here I am,' said Sam"
-&ldquo;&lsquo;Here I am,&rsquo; said Sam&rdquo;
-
-'"Here I am," said Sam'
-&lsquo;&ldquo;Here I am,&rdquo;, said Sam&rsquo;
-
-'Hello, "Dr. Brown," what's your real name?'
-&lsquo;Hello, &ldquo;Dr. Brown,&rdquo; what's your real name?&rsquo;
-
-"'Twas, t'wasn't thy name, 'twas it?" said Jim "the Barber" Brown.
-&ldquo;&apos;Twas, t&apos;wasn&apos;t thy name, &apos;twas it?&rdquo; said Jim &ldquo;the Barber&rdquo; Brown.
-
-# ########################################################################
-# Single quotes
-# ########################################################################
-'I am Sam'
-&lsquo;I am Sam&rsquo;
-
-'It was so,' said he.
-&lsquo;It was so,&rsquo; said he.
-
-'...even better!'
-&lsquo;...even better!&rsquo;
-
-With 'quotes' in the middle.
-With &lsquo;quotes&rsquo; in the middle.
-
-With--'imaginary'--dashes.
-With--&lsquo;imaginary&rsquo;--dashes.
-
-'Not 'quite' what you expected?'
-&lsquo;Not &lsquo;quite&rsquo; what you expected?&rsquo;
-
-''Cause I don't like it, 's why,' said Pat.
-&lsquo;&apos;Cause I don't like it, &apos;s why,&rsquo; said Pat.
-
-'It's a beautiful day!'
-&lsquo;It&apos;s a beautiful day!&rsquo;
-
-'He said, 'Thinkin'.'
-&lsquo;He said, &lsquo;Thinkin&rsquo;.&rsquo;
-
-# ########################################################################
-# Possessives
-# ########################################################################
-Sam's Sams' and the Ross's roses' thorns were prickly.
-Sam&apos;s Sams&apos; and the Ross&apos;s roses&apos; thorns were prickly.
-
-# ########################################################################
-# Mixed
-# ########################################################################
-"I heard she said, 'That's Sam's'," said the Sams' cat.
-&ldquo;I heard she said, &lsquo;That&apos;s Sam&apos;s&rsquo;,&rdquo; said the Sams&apos; cat.
-
-"'Janes' said, ''E'll be spooky, Sam's son with the jack-o'-lantern!'" said the O'Mally twins'---y'know---ghosts in unison.
-&ldquo;&lsquo;Janes&apos; said, &lsquo;&apos;E&apos;ll be spooky, Sam&apos;s son with the jack-o&apos;-lantern!&rsquo;&rdquo; said the O&apos;Mally twins&apos;---y&apos;know---ghosts in unison.
-
-'He's at Sams'
-&lsquo;He&apos; at Sams&rsquo;
-
-\"Hello!\"
-&ldquo;Hello!&rdquo;
-
-ma'am
-ma&apos;am
-
-'Twas midnight
-&apos;Twas midnight
-
-\"Hello,\" said the spider. \"'Shelob' is my name.\"
-&ldquo;Hello,&rdquo; said the spider. &ldquo;&lsquo;Shelob&rsquo; is my name.&rdquo;
-
-'A', 'B', and 'C' are letters.
-&lsquo;A&rsquo; &lsquo;B&rsquo; and &lsquo;C&rsquo; are letters.
-
-'Oak,' 'elm,' and 'beech' are names of trees. So is 'pine.'
-&lsquo;Oak,&rsquo; &lsquo;elm,&rsquo; and &lsquo;beech&rsquo; are names of trees. So is &lsquo;pine.&rsquo;
-
-'He said, \"I want to go.\"' Were you alive in the 70's?
-&lsquo;He said, &ldquo;I want to go.&rdquo;&rsquo; Were you alive in the 70&apos;s?
-
-\"That's a 'magic' sock.\"
-&ldquo;That&apos;s a &lsquo;magic&rsquo; sock.&rdquo;
-
-Website! Company Name, Inc. (\"Company Name\" or \"Company\") recommends reading the following terms and conditions, carefully:
-Website! Company Name, Inc. (&ldquo;Company Name&rdquo; or &ldquo;Company&rdquo;) recommends reading the following terms and conditions, carefully:
-
-Website! Company Name, Inc. ('Company Name' or 'Company') recommends reading the following terms and conditions, carefully:
-Website! Company Name, Inc. (&lsquo;Company Name&rsquo; or &lsquo;Company&rsquo;) recommends reading the following terms and conditions, carefully:
-
-Workin' hard
-Workin&apos; hard
-
-'70s are my favorite numbers,' she said.
-&lsquo;70s are my favorite numbers,&rsquo; she said.
-
-'70s fashion was weird.
-&apos;70s fashion was weird.
-
-12\" record, 5'10\" height
-12&Prime; record, 5&prime;10&Prime; height
-
-Model \"T2000\"
-Model &ldquo;T2000&rdquo;
-
-iPad 3's battery life is not great.
-iPad 3&apos;s battery life is not great.
-
-Book 'em, Danno. Rock 'n' roll. 'Cause 'twas the season.
-Book &apos;em, Danno. Rock &apos;n&apos; roll. &apos;Cause &apos;twas the season.
-
-'85 was a good year. (The entire '80s were.)
-&apos;85 was a good year. (The entire &apos;80s were.)
-
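
The deleted fixture above follows a simple layout: lines starting with `#` are comments, blank lines separate cases, and each case is an input line followed by its expected output line. The test that consumed this file is not shown in the diff, so the reader below is only a sketch of how such a file could be turned into (input, expected) pairs. The class name `FixturePairs` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

/** Hypothetical reader for the paired-line fixture format shown above. */
final class FixturePairs {
  /** Returns (input, expected) pairs, skipping comments and blank lines. */
  static List<Entry<String, String>> parse( final String text ) {
    final var pairs = new ArrayList<Entry<String, String>>();
    String input = null;

    for( final var line : text.split( "\\R" ) ) {
      if( line.isBlank() || line.startsWith( "#" ) ) {
        continue;
      }

      if( input == null ) {
        // First line of a case: the straight-quoted input.
        input = line;
      }
      else {
        // Second line of a case: the expected entity-encoded output.
        pairs.add( Map.entry( input, line ) );
        input = null;
      }
    }

    return pairs;
  }
}
```

Each pair's key would be fed to the quote converter and its value compared against the result.
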
src/test/java/com/keenwrite/quotes/TypographerTest.java
+/* Copyright 2021 White Magic Software, Ltd. -- All rights reserved. */
+package com.keenwrite.quotes;
+
+import com.keenwrite.dom.DocumentParser;
+import org.junit.jupiter.api.Test;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+
+import static java.lang.System.lineSeparator;
+import static java.lang.System.out;
+import static java.util.stream.Collectors.joining;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+
+/**
+ * Test to make sure that the {@link Typographer} can replace straight quotes
+ * without affecting preformatted HTML text elements.
+ */
+class TypographerTest {
+
+ private static final String LINE_SEP = lineSeparator();
+
+ @Test
+ void test_Quotes_Straight_Curly() throws IOException {
+ final var xhtml = read( "17165.html" );
+ final var doc = DocumentParser.parse( xhtml );
+
+ Typographer.curl( doc );
+
+ out.println( doc );
+ }
+
+ /**
+ * Opens a text file for reading. Callers are responsible for closing.
+ *
+ * @param filename The file to open.
+ * @return An instance of {@link BufferedReader} that can be used to
+ * read all the lines in the file.
+ */
+ private BufferedReader open( final String filename ) {
+ final var is = getClass().getResourceAsStream( filename );
+ assertNotNull( is );
+
+ return new BufferedReader( new InputStreamReader( is ) );
+ }
+
+ @SuppressWarnings( "SameParameterValue" )
+ private String read( final String filename ) throws IOException {
+ try( final var reader = open( filename ) ) {
+ return reader.lines().collect( joining( LINE_SEP ) );
+ }
+ }
+}
src/main/java/com/keenwrite/processors/XhtmlProcessor.java
package com.keenwrite.processors;
+import com.keenwrite.dom.DocumentParser;
import com.keenwrite.preferences.Key;
import com.keenwrite.preferences.Workspace;
import com.keenwrite.ui.heuristics.WordCounter;
import javafx.beans.property.StringProperty;
-import org.jsoup.nodes.Document;
+import org.w3c.dom.Document;
-import javax.xml.parsers.DocumentBuilderFactory;
-import javax.xml.transform.TransformerFactory;
-import javax.xml.transform.dom.DOMSource;
-import javax.xml.transform.stream.StreamResult;
import java.io.FileNotFoundException;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
-import java.util.Map.Entry;
import java.util.regex.Pattern;
import static com.keenwrite.Bootstrap.APP_TITLE_LOWERCASE;
+import static com.keenwrite.dom.DocumentParser.createMeta;
+import static com.keenwrite.dom.DocumentParser.walk;
import static com.keenwrite.events.StatusEvent.clue;
import static com.keenwrite.io.HttpFacade.httpGet;
import static java.util.regex.Pattern.UNICODE_CHARACTER_CLASS;
import static java.util.regex.Pattern.compile;
-import static javax.xml.transform.OutputKeys.INDENT;
-import static javax.xml.transform.OutputKeys.OMIT_XML_DECLARATION;
-import static org.jsoup.Jsoup.parse;
-import static org.jsoup.nodes.Document.OutputSettings.Syntax;
+import static org.w3c.dom.Node.TEXT_NODE;
/**
- * Responsible for making an HTML document complete by wrapping it with html
+ * Responsible for making an XHTML document complete by wrapping it with html
* and body elements. This doesn't have to be super-efficient because it's
* not run in real-time.
clue( "Main.status.typeset.xhtml" );
- final var doc = parse( html );
+ final var decorated =
+ "<html><head><title>untitled</title></head><body>" +
+ html +
+ "</body></html>";
+ final var doc = DocumentParser.parse( decorated );
setMetaData( doc );
- doc.outputSettings().syntax( Syntax.xml );
- for( final var img : doc.getElementsByTag( "img" ) ) {
+ walk( doc, "img", node -> {
try {
- final var imageFile = exportImage( img.attr( "src" ) );
+ final var attrs = node.getAttributes();
- img.attr( "src", imageFile.toString() );
+ if( attrs != null ) {
+ final var attr = attrs.getNamedItem( "src" );
+
+ if( attr != null ) {
+ final var imageFile = exportImage( attr.getTextContent() );
+
+ attr.setTextContent( imageFile.toString() );
+ }
+ }
} catch( final Exception ex ) {
clue( ex );
}
- }
+ } );
- return doc.html();
+ //Typographer.curl( doc );
+
+ return DocumentParser.toString( doc );
}
/**
* Applies the metadata fields to the document.
*
* @param doc The document to adorn with metadata.
*/
private void setMetaData( final Document doc ) {
- doc.title( getTitle() );
-
final var metadata = createMetaData( doc );
- final var head = doc.head();
- metadata.entrySet().forEach( entry -> head.append( createMeta( entry ) ) );
- }
- private String createMeta( final Entry<String, String> entry ) {
- return format(
- "<meta name='%s' content='%s'>", entry.getKey(), entry.getValue()
+ walk( doc, "title", node -> node.setTextContent( getTitle() ) );
+ walk( doc, "head", node ->
+ metadata.entrySet()
+ .forEach( entry -> node.appendChild( createMeta( doc, entry ) ) )
);
}
// Strip comments, superfluous whitespace, DOCTYPE, and XML declarations.
if( mediaType.isSvg() ) {
- sanitize( imageFile );
+ DocumentParser.sanitize( imageFile );
}
}
return imageFile;
- }
-
- /**
- * Remove whitespace, comments, and XML/DOCTYPE declarations to make
- * processing work with ConTeXt.
- *
- * @param path The SVG file to process.
- * @throws Exception The file could not be processed.
- */
- private void sanitize( final Path path )
- throws Exception {
- final var file = path.toFile();
-
- final var dbf = DocumentBuilderFactory.newInstance();
- dbf.setIgnoringComments( true );
- dbf.setIgnoringElementContentWhitespace( true );
-
- final var db = dbf.newDocumentBuilder();
- final var document = db.parse( file );
-
- final var tf = TransformerFactory.newInstance();
- final var transformer = tf.newTransformer();
-
- final var source = new DOMSource( document );
- final var result = new StreamResult( file );
- transformer.setOutputProperty( OMIT_XML_DECLARATION, "yes" );
- transformer.setOutputProperty( INDENT, "no" );
- transformer.transform( source, result );
}
private String getWordCount( final Document doc ) {
- final var text = doc.wholeText();
- final var wordCounter = WordCounter.create( getLocale() );
- return valueOf( wordCounter.countWords( text ) );
+ final var sb = new StringBuilder( 65536 * 10 );
+
+ walk( doc, "*", node -> {
+ if( node.getNodeType() == TEXT_NODE && node.getTextContent() != null ) {
+ sb.append( node.getTextContent() );
+ }
+ } );
+
+ return valueOf( WordCounter.create( getLocale() ).count( sb.toString() ) );
}
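
The new `getWordCount` gathers the content of text nodes into a buffer before counting, replacing JSoup's `wholeText()`. How `walk( doc, "*", ... )` reaches text nodes depends on `DocumentParser`, which this diff does not show. As an illustration under that caveat, the sketch below collects text nodes with a plain recursive descent over the W3C DOM. The class name `TextSketch` and its helpers are hypothetical.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;

/** Hypothetical text collector; not the commit's code. */
final class TextSketch {
  /** Parses well-formed XHTML text into a W3C DOM document. */
  static Document parse( final String xhtml ) {
    try {
      final var db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      return db.parse( new InputSource( new StringReader( xhtml ) ) );
    } catch( final Exception ex ) {
      throw new IllegalArgumentException( "Malformed XHTML", ex );
    }
  }

  /** Concatenates all text nodes beneath the given node, in document order. */
  static String text( final Node node ) {
    final var sb = new StringBuilder();
    append( node, sb );
    return sb.toString();
  }

  private static void append( final Node node, final StringBuilder sb ) {
    if( node.getNodeType() == Node.TEXT_NODE ) {
      sb.append( node.getNodeValue() );
    }

    // Visit children depth-first so text is appended in reading order.
    for( var child = node.getFirstChild();
         child != null;
         child = child.getNextSibling() ) {
      append( child, sb );
    }
  }
}
```

For example, `TextSketch.text( TextSketch.parse( "<p>Hello <b>big</b> world</p>" ) )` returns `Hello big world`, which could then be handed to a word counter.
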