# Custom Syntax Modules
Syntax falls into two categories: block syntax and inline syntax. Block syntax
operates on one or more whole lines of content to form structures like paragraphs,
lists, tables, etc. Inline syntax operates on stretches of text within a block,
such as marking a word as bold or inserting a link. Inline syntax cannot span multiple blocks.
## Readers
A reader is a subclass of `MDReader` that handles the parsing of one kind of
syntax. For example, `MDUnorderedListReader` handles bulleted lists and their
items, and `MDStrongReader` handles bold text.
A reader can parse block syntax, inline syntax, or both.
## Parsing process
Parsing occurs in three phases.
1. **Blocks:** Readers are consulted line-by-line to see if their supported
block syntax is found starting at the given line pointer. If so, the reader
returns an `MDNode` object encoding the block; otherwise it returns `null` and
the next reader is checked. See `MDReader.readBlock()`.
2. **Tokenizing:** Readers are asked to see if a supported token is located at
the beginning of a given string. If so, the reader returns an `MDToken`; otherwise
it returns `null` and the next reader is checked. See `MDReader.readToken()`.
3. **Substitution:** Readers are given a mixed array of `MDToken`s (from the
previous step) and `MDNode`s. The goal of substitution is to eventually convert
all tokens into nodes, with any left over at the end being converted to
`MDTextNode`s. If a reader can make one substitution it returns `true`, otherwise
`false` and the next reader is checked. See `MDReader.substituteTokens()`.
## Parsing Blocks
To parse a block format, override the `readBlock` method in a subclass of
`MDReader`. It will be passed an `MDState` instance. The two state properties of
importance here are `lines` and `p`. `lines` is an array of the lines of markdown,
and `p` is the current line pointer index. The job of `readBlock` is to check
if the block syntax of interest is located at line `p`. Most of the time it won't
be, in which case the method should just return `null`, preferably as early as
possible for good performance.
If the format is detected at line `p` then the reader should look ahead and see
how many subsequent lines are also part of the block, process them as necessary,
and return an appropriate `MDNode` subclass with their contents. Before returning,
the reader _must_ set the state's `p` to the line index just after the last line
of the detected block.
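
As a rough illustration of this contract, the sketch below detects a made-up `!!! ` note block. The `MDNode`, `MDState`, and `MDReader` classes here are minimal stand-ins so the example runs on its own; only `lines`, `p`, and the `readBlock` name come from the text above. `NoteNode`, `NoteReader`, and the `!!!` syntax are hypothetical.

```javascript
// Minimal stand-ins so the sketch runs on its own; the real classes do more.
class MDNode {
  constructor(children = []) { this.children = children; }
}
class MDState {
  constructor(lines) { this.lines = lines; this.p = 0; }
}
class MDReader {
  readBlock(state) { return null; }
}

// Hypothetical node type for a made-up "!!! " note block syntax.
class NoteNode extends MDNode {
  constructor(text) { super(); this.text = text; }
}

class NoteReader extends MDReader {
  readBlock(state) {
    let p = state.p;
    // Bail out as early as possible when the syntax is not at line p.
    if (!state.lines[p].startsWith('!!! ')) return null;
    // Look ahead: collect every consecutive line that carries the marker.
    const body = [];
    while (p < state.lines.length && state.lines[p].startsWith('!!! ')) {
      body.push(state.lines[p].slice(4));
      p++;
    }
    // Required: leave the pointer just after the last consumed line.
    state.p = p;
    return new NoteNode(body.join(' '));
  }
}

const noteState = new MDState(['!!! Be careful', '!!! with this.', 'Plain text']);
const note = new NoteReader().readBlock(noteState);
```

Here `note.text` holds the joined block content and `noteState.p` points at the `Plain text` line, ready for the next reader.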
Most blocks will have inner content. This can be processed into one or more
`MDNode`s by calling `state.inlineMarkdownToNode` or `state.inlineMarkdownToNodes`
for inline content. If the content may have nested block structures, create a
subarray from `state.lines` and create a sub-state with `state.copy(lines)`.
This creates a child state for the sub-document within the block structure. You
can then call `substate.readBlocks()` to get an array of `MDBlockNode` to use
as content. Note: if your block syntax has some kind of indentation or marker
at the beginning of every line, those should be stripped from the lines passed
to the copied sub-state.
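
The marker-stripping step can be sketched as a small helper. `collectQuotedLines` is a hypothetical name, and the `> ` prefix is just an example marker; the commented lines at the end show where `state.copy` and `readBlocks` from the text above would come in.

```javascript
// Hypothetical helper: gather the lines of a ">"-prefixed block starting at
// index `start`, stripping the marker so a sub-state sees clean markdown.
function collectQuotedLines(lines, start) {
  const inner = [];
  let p = start;
  while (p < lines.length && /^> ?/.test(lines[p])) {
    inner.push(lines[p].replace(/^> ?/, ''));
    p++;
  }
  // `next` is the index just after the last line of the block.
  return { inner, next: p };
}

const quoted = collectQuotedLines(['> # Title', '> - item', 'after'], 0);

// Inside readBlock, these would feed a sub-state as described above:
//   const substate = state.copy(quoted.inner);
//   const children = substate.readBlocks();
//   state.p = quoted.next;
```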
> **A note about sub-states.** States can have parent states. You can store any
> needed state info on an `MDState` instance, but you will usually want to do
> so on `state.root`, not `state` itself. This ensures those properties are
> accessed on the original state, not a disposable substate.
## Parsing Inline Syntax
Inline parsing is usually done by overriding both the `readToken` and
`substituteTokens` methods.
The `readToken` method is passed an `MDState` and a tail portion of a line of
markdown. The method should check if the line begins with a supported token.
This will generally be some punctuation symbol or sometimes a more complex
sequence. For inline syntax consisting of pairs of double symbols (e.g. `**strong**`), each symbol should usually be tokenized individually rather than
as pairs, just in case another syntax recognizes single symbols. If no token
is found at the start of the string, return `null`. Do _not_ match tokens
anywhere but at the start of the given string.
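
A minimal sketch of a tokenizer that follows these rules. The `MDToken` constructor shape, the `(state, line)` argument order, and `TildeReader` itself are assumptions made for illustration; check the real `MDReader.readToken()` before copying this.

```javascript
// Minimal stand-ins; the real MDToken and MDReader may look different.
class MDToken {
  constructor(type, original) { this.type = type; this.original = original; }
}
class MDReader {
  readToken(state, line) { return null; }
}

// Hypothetical reader that emits one token per tilde, so "~~" yields two
// tokens instead of one paired token.
class TildeReader extends MDReader {
  readToken(state, line) {
    // Only inspect the very start of the string; never search further in.
    if (line.startsWith('~')) return new MDToken('tilde', '~');
    return null;
  }
}

const tildeReader = new TildeReader();
const atStart = tildeReader.readToken(null, '~~strike~~');
const notAtStart = tildeReader.readToken(null, 'strike~~');
```

`atStart` is a single-tilde token even though two tildes are present, and `notAtStart` is `null` because the tildes are not at the beginning of the string.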
Substitution is done by the `substituteTokens` method. This one is trickier.
It is called with a mixed array of tokens (`MDToken`) and nodes (`MDNode`).
Each call should perform at most one search-and-replace on that array.
Only the first match should be substituted; the method then returns `true` if a
substitution was made or `false` if not.
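
The one-substitution-per-call rule can be sketched with a hand-rolled search. The `(state, tokens)` signature, the token `type` field, and the `EmphasisNode` and `StarEmphasisReader` classes are assumptions for this example, not the library's definitions.

```javascript
// Minimal stand-ins so the sketch runs on its own.
class MDToken {
  constructor(type, original) { this.type = type; this.original = original; }
}
class MDNode {
  constructor(children = []) { this.children = children; }
}
class MDReader {
  substituteTokens(state, tokens) { return false; }
}
class EmphasisNode extends MDNode {}

// Hypothetical reader: replaces the first `* ... *` run with an EmphasisNode.
class StarEmphasisReader extends MDReader {
  substituteTokens(state, tokens) {
    const isStar = (t) => t instanceof MDToken && t.type === 'asterisk';
    const open = tokens.findIndex(isStar);
    if (open < 0) return false;
    // Require at least one item between the opening and closing asterisk.
    const close = tokens.findIndex((t, i) => i > open + 1 && isStar(t));
    if (close < 0) return false;
    // Inner items may still be tokens; later calls or readers resolve them.
    const inner = tokens.slice(open + 1, close);
    tokens.splice(open, close - open + 1, new EmphasisNode(inner));
    return true; // exactly one substitution per call
  }
}

const star = () => new MDToken('asterisk', '*');
const text = (s) => new MDToken('text', s);
const mixed = [star(), text('hi'), star(), text(' there')];
const starReader = new StarEmphasisReader();
const firstCall = starReader.substituteTokens(null, mixed);
const secondCall = starReader.substituteTokens(null, mixed);
```

The first call rewrites the array in place and reports `true`; the second finds nothing left to do and reports `false`, which is the signal to move on to the next reader.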
There are helper methods for performing substitution. `MDToken` has static methods
`findFirstTokens` and `findPairedTokens`. They work like crude regexes that
operate on tokens instead of characters. For syntax consisting of text enclosed
between symbols (e.g. `**strong**`, `_emphasis_`, `~~strike~~`), you can
subclass `MDSimplePairInlineReader` and use its `attemptPair` method to make
substitution easy.
## Priority
Many markdown syntaxes use the same characters and similar syntax. This can
cause problems depending on which reader is run first. This can be resolved by
prioritization. Each parsing phase has its own priority order, and each reader
can dictate what readers it wants to be ahead of or behind by overriding the
`compareBlockOrdering`, `compareTokenizeOrdering`, or `compareSubstituteOrdering`
methods. Each is called with another reader instance, and the method should
return a `-1` if it wants to be run before the given reader, `1` if it wants to
be run after the given reader, or `0` as a "don't care" value. The method should
only return `-1` or `1` when it's necessary to resolve a specific conflict;
most of the time it should return `0`. The default implementation always returns
`0`.
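
A minimal sketch of an ordering override, using stand-in classes rather than the library's own readers. `StrongReader` and `EmphasisReader` here are hypothetical; only the method name and the `-1`/`0`/`1` convention come from the text above.

```javascript
// Stand-in base class with the default "don't care" behavior.
class MDReader {
  compareSubstituteOrdering(other) { return 0; }
}

class StrongReader extends MDReader {}

class EmphasisReader extends MDReader {
  compareSubstituteOrdering(other) {
    // Take a position only for the one known conflict: run after strong.
    if (other instanceof StrongReader) return 1;
    return 0;
  }
}

const emphasisReader = new EmphasisReader();
const vsStrong = emphasisReader.compareSubstituteOrdering(new StrongReader());
const vsOther = emphasisReader.compareSubstituteOrdering(new MDReader());
```

Returning `0` for everything except the specific conflict keeps the reader from imposing an ordering it has no reason to care about.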
Prioritization can be further refined for the substitution phase using multiple
passes. Passes are good for especially ambiguous constructs, like
`***strong** emphasis*`. In this example, a naive `MDEmphasisReader` might take
the first asterisk as the opening token and the asterisk immediately following
"strong" as the closing token. That swallows the second and third asterisks into
the inner text and leaves the second asterisk after "strong" and the one after
"emphasis" unmatched.
> `[*]**strong[*]* emphasis*`
> An incorrect, naive substitution (start and end tokens shown in brackets)
By using multiple passes, a reader can be conservative on early passes, holding
out for a better match or letting other readers find one, and then grab
whatever is left over on a later pass.
> 1. `MDStrongReader` finds a match on the first pass because there are no inner
> asterisks to cause ambiguities
> `*[**]strong[**] emphasis*`
>
> 2. Then, `MDEmphasisReader` finds the outer pair on a later pass
> `* emphasis*`
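
The two steps above can be modelled with a toy simulation. This does not use the library's pass API, which this section does not spell out: items are plain strings and objects, and `substituteStrong`, `substituteEmphasis`, and `runPasses` are hypothetical names invented for the sketch.

```javascript
// Toy model: '*' strings stand for asterisk tokens, other strings for text,
// and plain objects for already-substituted nodes. Not the library's types.
function substituteStrong(items) {
  for (let i = 0; i + 1 < items.length; i++) {
    if (items[i] !== '*' || items[i + 1] !== '*') continue;
    // Conservative: scan forward over content containing no asterisks.
    let j = i + 2;
    while (j < items.length && items[j] !== '*') j++;
    const hasContent = j > i + 2;
    if (hasContent && items[j] === '*' && items[j + 1] === '*') {
      const node = { tag: 'strong', children: items.slice(i + 2, j) };
      items.splice(i, j + 2 - i, node);
      return true;
    }
  }
  return false;
}

function substituteEmphasis(items) {
  const open = items.indexOf('*');
  if (open < 0) return false;
  // The closing asterisk must leave at least one item in between.
  const close = items.indexOf('*', open + 2);
  if (close < 0) return false;
  const node = { tag: 'em', children: items.slice(open + 1, close) };
  items.splice(open, close - open + 1, node);
  return true;
}

function runPasses(items, passes) {
  for (const readers of passes) {
    // Keep asking this pass's readers until none can make a substitution.
    while (readers.some((substitute) => substitute(items))) { /* repeat */ }
  }
  return items;
}

const passResult = runPasses(
  ['*', '*', '*', 'strong', '*', '*', ' emphasis', '*'],
  [[substituteStrong], [substituteEmphasis]]
);
```

Because the strong substitution runs on the earlier pass, the emphasis substitution only ever sees the outer pair of asterisks and wraps the finished strong node.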
## Pre- and Post-Processing
Readers have the option of performing tasks before and after the document has
been parsed by overriding the `preProcess` and/or `postProcess` methods.
Pre-processing is mostly useful for initializing `MDState` in some way.
Post-processing can be used to manipulate the `MDNode` tree after it has been
constructed by the parsing phases. A typical use is finding all the footnotes in
a document in the order they appear so they can be numbered properly.
`postProcess` is passed an array of top-level `MDNode`s in the parsed document.
This array can be altered using `.splice` operations if desired to add
additional structures, such as a list of all the footnote content at the bottom
of the document, to reuse the same example.
One useful utility for post-processing is `MDNode.visitChildren`. This will
recursively visit every node in the tree and call a callback function with each
one. The callback cannot restructure the node tree, but it can inspect or
alter each node.
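
A sketch of the footnote example, under several assumptions: `postProcess` is taken to receive `(state, nodes)`, `visitChildren` is modelled as an instance method that calls the callback with every descendant, and `FootnoteRefNode`, `FootnoteListNode`, and `FootnoteReader` are hypothetical classes. The stand-in `MDNode` exists only so the example runs on its own.

```javascript
// Stand-in node with an assumed visitChildren: depth-first over descendants.
class MDNode {
  constructor(children = []) { this.children = children; }
  visitChildren(fn) {
    for (const child of this.children) {
      fn(child);
      child.visitChildren(fn);
    }
  }
}
class FootnoteRefNode extends MDNode {
  constructor(label) { super(); this.label = label; this.number = null; }
}
class FootnoteListNode extends MDNode {
  constructor(labels) { super(); this.labels = labels; }
}

class FootnoteReader {
  postProcess(state, nodes) {
    const labels = [];
    // Number each reference in document order; nodes are altered in place.
    const visit = (node) => {
      if (node instanceof FootnoteRefNode) {
        labels.push(node.label);
        node.number = labels.length;
      }
    };
    for (const top of nodes) {
      visit(top);
      top.visitChildren(visit);
    }
    // Append a footnote list to the top-level array using splice.
    if (labels.length > 0) {
      nodes.splice(nodes.length, 0, new FootnoteListNode(labels));
    }
  }
}

const docNodes = [
  new MDNode([new FootnoteRefNode('b')]),
  new MDNode([new MDNode([new FootnoteRefNode('a')])]),
];
new FootnoteReader().postProcess(null, docNodes);
```

The references are numbered by position in the document rather than by label, and the top-level array gains one extra node for the footnote list.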
To replace nodes in the tree, call `MDUtils.replaceNodes`. It will call a
replacer function with every node in the tree recursively. If the function
returns `null`, no change is made. If the function returns a new `MDNode`
instance it will replace that node in the tree. When a replacement is made,
neither the original nor the replacement is recursed into. To replace a node
with multiple other nodes, wrap those multiple nodes in a generic `MDNode` as
its children. To remove a node without a replacement, return an `MDNode` with
no children. It will be rendered as nothing when converted to HTML.
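
The replacement rules above can be modelled in a few lines. The `replaceNodes` function below is a stand-in written for this example, with an assumed `(nodes, replacer)` signature; it is not `MDUtils.replaceNodes` itself, and `TextNode` is a hypothetical node type.

```javascript
// Minimal stand-ins so the sketch runs on its own.
class MDNode {
  constructor(children = []) { this.children = children; }
}
class TextNode extends MDNode {
  constructor(text) { super(); this.text = text; }
}

// Model of the described behavior: null means "no change, keep recursing";
// a returned node replaces the original and is not recursed into.
function replaceNodes(nodes, replacer) {
  for (let i = 0; i < nodes.length; i++) {
    const replacement = replacer(nodes[i]);
    if (replacement === null) {
      replaceNodes(nodes[i].children, replacer);
    } else {
      nodes[i] = replacement;
    }
  }
}

const tree = [
  new MDNode([new TextNode('keep'), new TextNode('TODO: remove')]),
  new TextNode('a|b'),
];

replaceNodes(tree, (node) => {
  if (!(node instanceof TextNode)) return null;
  // Remove: an empty generic node renders as nothing.
  if (node.text.startsWith('TODO')) return new MDNode();
  // One-to-many: wrap the new nodes in a generic parent.
  if (node.text.includes('|')) {
    return new MDNode(node.text.split('|').map((part) => new TextNode(part)));
  }
  return null;
});
```

After the call, the `TODO` text has become an empty `MDNode`, and the `a|b` text has become a generic `MDNode` wrapping two `TextNode` children.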