Syntax falls into two categories: block syntax and inline syntax. Block syntax operates on one or more whole lines of content to form structures like paragraphs, lists, tables, etc. Inline syntax operates on stretches of text within a block, such as marking a word as bold or inserting a link. Inline syntax cannot span multiple blocks.
A reader is a subclass of MDReader and handles the parsing of one kind of
syntax. E.g. MDUnorderedListReader handles bulleted lists and their items,
MDStrongReader handles bold text, etc.
A reader can parse block syntax, inline syntax, or both.
Parsing occurs in three phases.
Blocks: Readers are consulted line-by-line to see if their supported
block syntax is found starting at the given line pointer. If so, the reader
returns an MDNode object encoding the block; otherwise it returns null and the
next reader is checked. See MDReader.readBlock().
Tokenizing: Readers are asked whether a supported token is located at
the beginning of a given string. If so, an MDToken is returned; otherwise
the reader returns null and the next reader is checked. See MDReader.readToken().
Substitution: Readers are given a mixed array of MDTokens (from the
previous step) and MDNodes. The goal of substitution is to eventually convert
all tokens into nodes, with any left over at the end being converted to
MDTextNodes. If a reader can make one substitution it returns true, otherwise
false and the next reader is checked. See MDReader.substituteTokens().
To parse a block format, override the readBlock method in a subclass of
MDReader. It will be passed an MDState instance. The two state properties of
importance here are lines and p. lines is an array of the lines of markdown,
and p is the current line pointer index. The job of readBlock is to check
if the block syntax of interest is located at line p. Most of the time it won’t
be, in which case the method should just return null, preferably as early as
possible for good performance.
If the format is detected at line p then the reader should look ahead and see
how many subsequent lines are also part of the block, process them as necessary,
and return an appropriate MDNode subclass with their contents. Before returning,
the reader must set the state’s p to the line index just after the last line
of the detected block.
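The readBlock contract described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: the MDState, MDNode, and MDReader classes here are stripped-down stand-ins so the snippet runs on its own, and the "!! " callout syntax, CalloutNode, and CalloutReader are invented for the example. Only lines, p, and readBlock come from the text.

```javascript
// Minimal stand-ins so the sketch runs on its own. They are NOT the library's
// real classes; only `lines`, `p`, and `readBlock` come from the text above.
class MDState {
  constructor(lines) {
    this.lines = lines; // array of markdown lines
    this.p = 0;         // current line pointer
  }
}
class MDNode {
  constructor(children = []) { this.children = children; }
}
class MDReader {
  readBlock(state) { return null; }
}

// Hypothetical node and reader for an invented "!! " callout syntax.
class CalloutNode extends MDNode {
  constructor(text) { super(); this.text = text; }
}
class CalloutReader extends MDReader {
  readBlock(state) {
    let p = state.p;
    // Bail out as early as possible when line p is not a callout.
    if (!state.lines[p].startsWith("!! ")) return null;
    const content = [];
    // Look ahead: consume every consecutive line that carries the marker.
    while (p < state.lines.length && state.lines[p].startsWith("!! ")) {
      content.push(state.lines[p].slice(3)); // strip the marker
      p++;
    }
    // Required: leave the pointer just after the last line of the block.
    state.p = p;
    return new CalloutNode(content.join(" "));
  }
}

const state = new MDState(["!! Heads up:", "!! read this.", "Plain paragraph."]);
const node = new CalloutReader().readBlock(state);
```

Here node covers the first two lines and state.p ends up pointing at the plain paragraph, so the next reader consulted starts from there.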
Most blocks will have inner content. This can be processed into one or more
MDNodes by calling state.inlineMarkdownToNode or state.inlineMarkdownToNodes
for inline content. If the content may have nested block structures, create a
subarray from state.lines and create a sub-state with state.copy(lines).
This creates a child state for the sub-document within the block structure. You
can then call substate.readBlocks() to get an array of MDBlockNode to use
as content. Note: if your block syntax has some kind of indentation or marker
at the beginning of every line, those should be stripped from the lines passed
to the copied sub-state.
A note about sub-states. States can have parent states. You can store any needed state info on an
MDState instance, but you will usually want to do so on state.root, not state itself. This ensures those properties are accessed on the original state, not a disposable substate.
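To make the sub-state advice concrete, here is a small sketch under stated assumptions. The MDState class below is a stand-in written for this example: the way root walks up the parent chain and the constructor arguments are guesses, and footnoteCount is a hypothetical property. Only copy(lines), root, lines, and p are named in the text.

```javascript
// Stand-in MDState for illustration only; the real class has more to it.
class MDState {
  constructor(lines, parent = null) {
    this.lines = lines;
    this.p = 0;
    this.parent = parent;
  }
  // Assumed behavior: follow parents up to the original state.
  get root() {
    return this.parent ? this.parent.root : this;
  }
  // Child state for the sub-document inside a block.
  copy(lines) {
    return new MDState(lines, this);
  }
}

const state = new MDState(["> quoted line", "> - nested item", "after"]);

// Strip the per-line "> " marker before handing lines to the sub-state.
const inner = state.lines.slice(0, 2).map((line) => line.replace(/^> ?/, ""));
const substate = state.copy(inner);

// Store shared info on the root so it survives after the substate is discarded.
// `footnoteCount` is a hypothetical property name.
substate.root.footnoteCount = (substate.root.footnoteCount || 0) + 1;
```

The counter lands on the original state even though the code only had the substate in hand, which is the point of going through root.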
Inline parsing is usually done by overriding both the readToken and
substituteTokens methods.
The readToken method is passed an MDState and a tail portion of a line of
markdown. The method should check if the line begins with a supported token.
This will generally be some punctuation symbol or sometimes a more complex
sequence. For inline syntax consisting of pairs of double symbols (e.g. **strong**), each symbol should usually be tokenized individually rather than
as pairs, just in case another syntax recognizes single symbols. If no token
is found at the start of the string, return null. Do not match tokens
anywhere but at the start of the given string.
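A rough sketch of a readToken override follows. The MDToken fields (original, type), the AsteriskReader class, and the tokenize driver are all assumptions made up for this example; the library's own tokenizing loop is not shown in this document and may work differently. The sketch only illustrates matching at the start of the string and tokenizing each symbol individually.

```javascript
// Stand-ins for illustration; field names are assumptions.
class MDToken {
  constructor(original, type) {
    this.original = original; // the matched source text
    this.type = type;
  }
}
class MDReader {
  readToken(state, line) { return null; }
}

// Hypothetical reader that tokenizes one asterisk at a time.
class AsteriskReader extends MDReader {
  readToken(state, line) {
    // Only look at the very start of the given string.
    if (line.startsWith("*")) return new MDToken("*", "asterisk");
    return null;
  }
}

// Hypothetical driver: offers each tail of the line to the readers in turn
// and collects unmatched characters as plain strings.
function tokenize(line, readers) {
  const out = [];
  let text = "";
  let i = 0;
  while (i < line.length) {
    const tail = line.slice(i);
    let token = null;
    for (const reader of readers) {
      token = reader.readToken(null, tail);
      if (token) break;
    }
    if (token) {
      if (text) { out.push(text); text = ""; }
      out.push(token);
      i += token.original.length;
    } else {
      text += line[i];
      i++;
    }
  }
  if (text) out.push(text);
  return out;
}

const tokens = tokenize("**strong**", [new AsteriskReader()]);
```

With this driver, "**strong**" becomes two asterisk tokens, the text "strong", and two more asterisk tokens, leaving it to the substitution phase to decide whether they pair up as strong or emphasis.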
Substitution is done by the substituteTokens method. This one is trickier.
It is called with a mixed array of tokens (MDToken) and nodes (MDNode).
Each call should perform at most one search-and-replace: only the first match
should be substituted, then the method returns true if a substitution was made
or false if not. Readers keep being consulted this way while substitutions remain possible.
There are helper methods for performing substitution. MDToken has static methods
findFirstTokens and findPairedTokens. They work like crude regexes that
operate on tokens instead of characters. For syntax consisting of text enclosed
between symbols (e.g. **strong**, _emphasis_, ~~strike~~), you can
subclass MDSimplePairInlineReader and use its attemptPair method to make
substitution easy.
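The one-substitution-per-call rule might look something like the sketch below. It deliberately does not use findFirstTokens, findPairedTokens, or attemptPair, since their signatures are not given here; the search is written by hand instead. The classes are stand-ins, EmphasisNode and EmphasisReader are hypothetical, and the parameter list of substituteTokens (just the array) is an assumption.

```javascript
// Stand-ins for illustration only.
class MDToken {
  constructor(type, original) { this.type = type; this.original = original; }
}
class MDNode {
  constructor(children = []) { this.children = children; }
}
class MDTextNode extends MDNode {
  constructor(text) { super(); this.text = text; }
}
class EmphasisNode extends MDNode {}

class EmphasisReader {
  // Assumed signature: receives the mixed token/node array and edits it in place.
  substituteTokens(items) {
    const isStar = (x) => x instanceof MDToken && x.type === "asterisk";
    const start = items.findIndex(isStar);
    if (start < 0) return false;
    // The closing asterisk must leave at least one item in between.
    const end = items.findIndex((x, i) => i > start + 1 && isStar(x));
    if (end < 0) return false;
    const inner = items.slice(start + 1, end);
    // Perform exactly one substitution, then report success.
    items.splice(start, end - start + 1, new EmphasisNode(inner));
    return true;
  }
}

const star = () => new MDToken("asterisk", "*");
const items = [star(), new MDTextNode("a"), star(), new MDTextNode(" b")];
const reader = new EmphasisReader();
const first = reader.substituteTokens(items);  // replaces the *a* span
const second = reader.substituteTokens(items); // nothing left to replace
```

The first call swaps the three matched items for a single node and returns true; the second finds no asterisk tokens and returns false, which would let the next reader take its turn.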
Many markdown syntaxes use the same characters and similar syntax. This can
cause problems depending on which reader is run first. This can be resolved by
prioritization. Each parsing phase has its own priority order, and each reader
can dictate what readers it wants to be ahead of or behind by overriding the
compareBlockOrdering, compareTokenizeOrdering, or compareSubstituteOrdering
methods. Each is called with another reader instance, and the method should
return a -1 if it wants to be run before the given reader, 1 if it wants to
be run after the given reader, or 0 as a “don’t care” value. The method should
only return -1 or 1 when it’s necessary to resolve a specific conflict;
most of the time it should return 0. The default implementation always returns
0.
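As an illustration of the ordering methods, the sketch below has one reader ask to run before another during substitution. The classes are stand-ins, the StrongReader and EmphasisReader names and the specific preference are invented for the example, and runsBefore is a hypothetical helper showing one way the two readers' answers could be combined; how the library actually resolves the ordering is not covered here.

```javascript
// Stand-in base class: the default is "don't care" in every phase.
class MDReader {
  compareBlockOrdering(other) { return 0; }
  compareTokenizeOrdering(other) { return 0; }
  compareSubstituteOrdering(other) { return 0; }
}

// Hypothetical readers for the example.
class EmphasisReader extends MDReader {}
class StrongReader extends MDReader {
  compareSubstituteOrdering(other) {
    // Only speak up for the one conflict we care about.
    if (other instanceof EmphasisReader) return -1; // run before emphasis
    return 0;
  }
}

// Hypothetical helper: a runs before b if either side asks for that order.
function runsBefore(a, b) {
  return a.compareSubstituteOrdering(b) === -1 ||
         b.compareSubstituteOrdering(a) === 1;
}

const strong = new StrongReader();
const emphasis = new EmphasisReader();
const strongFirst = runsBefore(strong, emphasis);
const emphasisFirst = runsBefore(emphasis, strong);
```

Note that StrongReader leaves the block and tokenize orderings at their default of 0, since the conflict only exists in the substitution phase.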
Prioritization can be further refined for the substitution phase using multiple
passes. Passes are good for especially ambiguous constructs, like ***strong** emphasis*. In this example, a naive MDEmphasisReader might take the first asterisk as the start of the sequence and the one immediately following “strong” as its end,
stealing the second and third asterisks as part of the inner text and leaving the
second asterisk after “strong” and the one after “emphasis” unmatched.
[*]**strong[*]* emphasis*
An incorrect, naive substitution (start and end tokens shown in brackets)
By using multiple passes, the reader can use early passes to be more conservative and hold out for a better match first, or let other readers find a better match, then on a later pass it can grab whatever’s left over.
MDStrongReader finds a match on the first pass because there are no inner asterisks to cause ambiguities:
*[**]strong[**] emphasis*
Then, MDEmphasisReader finds the outer pair on a later pass:
*<MDStrongNode> emphasis*
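The two-pass resolution of ***strong** emphasis* can be simulated with a small sketch. Everything here is a simplified stand-in: Token and Node replace MDToken and MDNode, text is kept as plain strings, and passing the pass number as a second argument to substituteTokens is an assumption about how a reader learns which pass it is in. The driver loop is likewise hypothetical.

```javascript
class Token { constructor(ch) { this.ch = ch; } } // stand-in for MDToken
class Node {                                       // stand-in for MDNode
  constructor(kind, children) { this.kind = kind; this.children = children; }
}
const isStar = (x) => x instanceof Token && x.ch === "*";

class StrongReader {
  substituteTokens(items, pass) {
    // Match ** ... ** only when no asterisk tokens sit in between.
    for (let i = 0; i + 1 < items.length; i++) {
      if (!isStar(items[i]) || !isStar(items[i + 1])) continue;
      let j = i + 2;
      while (j < items.length && !isStar(items[j])) j++;
      if (j > i + 2 && j + 1 < items.length && isStar(items[j + 1])) {
        items.splice(i, j + 2 - i, new Node("strong", items.slice(i + 2, j)));
        return true;
      }
    }
    return false;
  }
}

class EmphasisReader {
  substituteTokens(items, pass) {
    if (pass < 2) return false; // hold out so other readers can match first
    const i = items.findIndex(isStar);
    if (i < 0) return false;
    const j = items.findIndex((x, k) => k > i + 1 && isStar(x));
    if (j < 0) return false;
    items.splice(i, j - i + 1, new Node("emphasis", items.slice(i + 1, j)));
    return true;
  }
}

// Hypothetical driver: within each pass, repeat until no reader substitutes.
function substitute(items, readers, passes) {
  for (let pass = 1; pass <= passes; pass++) {
    let changed = true;
    while (changed) {
      changed = readers.some((r) => r.substituteTokens(items, pass));
    }
  }
  return items;
}

const s = () => new Token("*");
const items = [s(), s(), s(), "strong", s(), s(), " emphasis", s()];
const result = substitute(items, [new EmphasisReader(), new StrongReader()], 2);
```

Even though the emphasis reader is consulted first, it declines on pass 1, so the strong reader claims the inner pair; on pass 2 the emphasis reader wraps what remains, giving an emphasis node that contains the strong node and the trailing text.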
Readers have the option of performing tasks before and after the document has
been parsed by overriding the preProcess and/or postProcess methods.
Pre-processing is mostly useful for initializing MDState in some way.
Post-processing can be used to manipulate the MDNode tree after it has been
constructed by the parsing phases. For example, finding all the footnotes in a
document in the order they appear for proper numbering. postProcess is passed
an array of top-level MDNodes in the parsed document. This array can be
altered using .splice operations if desired to add additional structures,
such as a list of all the footnote content at the bottom of the document, to
reuse the same example.
One useful utility for post-processing is MDNode.visitChildren. This will
recursively visit every node in the tree and call a callback function with each
one. The node tree cannot be restructured, but each node can be inspected or
altered.
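The footnote example could be sketched along these lines. The visit function is a hand-written stand-in for the recursive visiting described above (it is not MDNode.visitChildren itself), the postProcess parameter list is an assumption, and FootnoteRefNode, FootnoteListNode, and FootnoteReader are hypothetical names.

```javascript
// Stand-ins for illustration only.
class MDNode {
  constructor(children = []) { this.children = children; }
}
class FootnoteRefNode extends MDNode {
  constructor(label) { super(); this.label = label; this.number = null; }
}
class FootnoteListNode extends MDNode {
  constructor(labels) { super(); this.labels = labels; }
}

// Hand-written recursive visitor: inspects or alters nodes, never restructures.
function visit(node, callback) {
  callback(node);
  for (const child of node.children) visit(child, callback);
}

class FootnoteReader {
  // Assumed signature: the state plus the array of top-level nodes.
  postProcess(state, blocks) {
    const labels = [];
    for (const block of blocks) {
      visit(block, (node) => {
        if (node instanceof FootnoteRefNode) {
          labels.push(node.label);
          node.number = labels.length; // number in order of appearance
        }
      });
    }
    // Append the footnote list at the bottom of the document.
    if (labels.length > 0) {
      blocks.splice(blocks.length, 0, new FootnoteListNode(labels));
    }
  }
}

const refB = new FootnoteRefNode("b");
const refA = new FootnoteRefNode("a");
const blocks = [new MDNode([refB]), new MDNode([new MDNode([refA])])];
new FootnoteReader().postProcess(null, blocks);
```

The references are numbered by where they appear rather than by label, and the top-level array gains one extra node holding the collected footnotes.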
To replace nodes in the tree, MDUtils.replaceNodes can be called. It will call a
replacer function with every node in the tree recursively. If the function
returns null, no change is made. If the function returns a new MDNode
instance it will replace that node in the tree. When a replacement is made,
neither the original nor the replacement is recursed into. To replace a node
with multiple other nodes, wrap those multiple nodes in a generic MDNode as
its children. To remove a node without a replacement, return an MDNode with
no children. It will be rendered as nothing when converted to HTML.
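The replacement rules above can be illustrated with a small stand-in. The replaceNodes function below is written for this example to mirror the described behavior; it is not MDUtils.replaceNodes, and its argument order is an assumption. The MDNode and MDTextNode classes are simplified stand-ins as well.

```javascript
// Stand-ins for illustration only.
class MDNode {
  constructor(children = []) { this.children = children; }
}
class MDTextNode extends MDNode {
  constructor(text) { super(); this.text = text; }
}

// Sketch of the described semantics: null keeps the node and recurses into it;
// a returned node replaces the original and neither one is recursed into.
function replaceNodes(nodes, replacer) {
  for (let i = 0; i < nodes.length; i++) {
    const replacement = replacer(nodes[i]);
    if (replacement === null) {
      replaceNodes(nodes[i].children, replacer);
    } else {
      nodes[i] = replacement;
    }
  }
}

const tree = [
  new MDNode([new MDTextNode("keep"), new MDTextNode("TODO")]),
  new MDTextNode("a/b"),
];

replaceNodes(tree, (node) => {
  if (!(node instanceof MDTextNode)) return null;
  // Remove a node: return a generic MDNode with no children.
  if (node.text === "TODO") return new MDNode();
  // Replace one node with several: wrap them in a generic MDNode.
  if (node.text.includes("/")) {
    return new MDNode(node.text.split("/").map((part) => new MDTextNode(part)));
  }
  return null;
});
```

After the call, the "TODO" text has become an empty node, "a/b" has become a wrapper around two text nodes, and "keep" is untouched. The new "a" and "b" nodes are not offered to the replacer, because replacements are not recursed into.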