# Custom Syntax Modules

Syntax falls into two categories: block syntax and inline syntax. Block syntax operates on one or more whole lines of content to form structures like paragraphs, lists, tables, etc. Inline syntax operates on stretches of text within a block, such as marking a word as bold or inserting a link. Inline syntax cannot span multiple blocks.

## Readers

A reader is a subclass of `MDReader` and handles the parsing of one kind of syntax. E.g. `MDUnorderedListReader` handles bulleted lists and their items, `MDStrongReader` handles bold text, etc. A reader can parse block syntax, inline syntax, or both.

## Parsing process

Parsing occurs in three phases.

1. **Blocks:** Readers are consulted line-by-line to see if their supported block syntax is found starting at the given line pointer. If so, the reader returns an `MDNode` object encoding the block; otherwise it returns `null` and the next reader is checked. See `MDReader.readBlock()`.
2. **Tokenizing:** Readers are asked to see if a supported token is located at the beginning of a given string. If so, the reader returns an `MDToken`; otherwise it returns `null` and the next reader is checked. See `MDReader.readToken()`.
3. **Substitution:** Readers are given a mixed array of `MDToken`s (from the previous step) and `MDNode`s. The goal of substitution is to eventually convert all tokens into nodes, with any left over at the end being converted to `MDTextNode`s. If a reader can make one substitution it returns `true`; otherwise it returns `false` and the next reader is checked. See `MDReader.substituteTokens()`.

## Parsing Blocks

To parse a block format, override the `readBlock` method in a subclass of `MDReader`. It will be passed an `MDState` instance. The two state properties of importance here are `lines` and `p`. `lines` is an array of the lines of markdown, and `p` is the current line pointer index.

The job of `readBlock` is to check if the block syntax of interest is located at line `p`.
Most of the time it won't be, in which case the method should just return `null`, preferably as early as possible for good performance.

If the format is detected at line `p` then the reader should look ahead and see how many subsequent lines are also part of the block, process them as necessary, and return an appropriate `MDNode` subclass with their contents. Before returning, the reader _must_ set the state's `p` to the line index just after the last line of the detected block.

Most blocks will have inner content. This can be processed into one or more `MDNode`s by calling `state.inlineMarkdownToNode` or `state.inlineMarkdownToNodes` for inline content. If the content may have nested block structures, create a subarray from `state.lines` and create a sub-state with `state.copy(lines)`. This creates a child state for the sub-document within the block structure. You can then call `substate.readBlocks()` to get an array of `MDBlockNode` to use as content. Note: if your block syntax has some kind of indentation or marker at the beginning of every line, those should be stripped from the lines passed to the copied sub-state.

> **A note about sub-states.** States can have parent states. You can store any
> needed state info on an `MDState` instance, but you will usually want to do
> so on `state.root`, not `state` itself. This ensures those properties are
> accessed on the original state, not a disposable substate.

## Parsing Inline Syntax

Inline parsing is usually done by overriding both the `readToken` and `substituteTokens` methods.

The `readToken` method is passed an `MDState` and a tail portion of a line of markdown. The method should check if the line begins with a supported token. This will generally be some punctuation symbol or sometimes a more complex sequence. For inline syntax consisting of pairs of double symbols (e.g.
`**strong**`), each symbol should usually be tokenized individually rather than as pairs, just in case another syntax recognizes single symbols. If no token is found at the start of the string, return `null`. Do _not_ match tokens anywhere but at the start of the given string.

Substitution is done by the `substituteTokens` method. This one is trickier. It is called with a mixed array of tokens (`MDToken`) and nodes (`MDNode`). The method is expected to perform a single search-and-replace per call when one is possible: make only the first matching substitution, then return `true` if a substitution was made or `false` if not.

There are helper methods for performing substitution. `MDToken` has static methods `findFirstTokens` and `findPairedTokens`. They work like crude regexes that operate on tokens instead of characters. For syntax consisting of text enclosed between symbols (e.g. `**strong**`, `_emphasis_`, `~~strike~~`), you can subclass `MDSimplePairInlineReader` and use its `attemptPair` method to make substitution easy.

## Priority

Many markdown syntaxes use the same characters and similar syntax. This can cause problems depending on which reader is run first. This can be resolved by prioritization.

Each parsing phase has its own priority order, and each reader can dictate which readers it wants to be ahead of or behind by overriding the `compareBlockOrdering`, `compareTokenizeOrdering`, or `compareSubstituteOrdering` methods. Each is called with another reader instance, and the method should return `-1` if it wants to be run before the given reader, `1` if it wants to be run after the given reader, or `0` as a "don't care" value. The method should only return `-1` or `1` when it's necessary to resolve a specific conflict; most of the time it should return `0`. The default implementation always returns `0`.

Prioritization can be further refined for the substitution phase using multiple passes. Passes are good for especially ambiguous constructs, like `***strong** emphasis*`.
In this example, a naive `MDEmphasisReader` might grab the first asterisk as the start of the sequence and the one immediately following "strong" as the end, stealing the second and third asterisks as part of the inner text and leaving the second asterisk after "strong" and the one after "emphasis" alone.

> `[*]**strong[*]* emphasis*`
> An incorrect, naive substitution (start and end tokens shown in brackets)

By using multiple passes, the reader can use early passes to be more conservative and hold out for a better match first, or let other readers find a better match, then on a later pass it can grab whatever's left over.

> 1. `MDStrongReader` finds a match on the first pass because there are no inner
> asterisks to cause ambiguities
> `*[**]strong[**] emphasis*`
>
> 2. Then, `MDEmphasisReader` finds the outer pair on a later pass
> `* emphasis*`

## Pre- and Post-Processing

Readers have the option of performing tasks before and after the document has been parsed by overriding the `preProcess` and/or `postProcess` methods.

Pre-processing is mostly useful for initializing `MDState` in some way.

Post-processing can be used to manipulate the `MDNode` tree after it has been constructed by the parsing phases, for example to find all the footnotes in a document in the order they appear so they can be numbered properly. `postProcess` is passed an array of top-level `MDNode`s in the parsed document. This array can be altered using `.splice` operations if desired to add additional structures, such as a list of all the footnote content at the bottom of the document, to reuse the same example.

One useful utility for post-processing is `MDNode.visitChildren`. This will recursively visit every node in the tree and call a callback function with each one. The node tree cannot be restructured, but each node can be inspected or altered.

To perform substitution, `MDNode.replaceNodes` can be called. It will call a replacer function with every node in the tree recursively. If the function returns `null`, no change is made. If the function returns a new `MDNode` instance it will replace that node in the tree. When a replacement is made, neither the original nor the replacement is recursed into. To replace a node with multiple other nodes, wrap those multiple nodes in a generic `MDNode` as its children. To remove a node without a replacement, return an `MDNode` with no children. It will be rendered as nothing when converted to HTML.
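
As a rough illustration of the replacement rules described above, the sketch below numbers footnote references in document order and removes comment nodes. It deliberately does not use the library: `SketchNode`, the free-standing `replaceNodes` function, and the node types `footnoteRef` and `comment` are all hypothetical stand-ins written for this example, and the real `MDNode.replaceNodes` signature may differ.

```javascript
// Stand-in node type for illustration only; not the library's MDNode.
class SketchNode {
  constructor(type, children = [], text = "") {
    this.type = type;
    this.children = children;
    this.text = text;
  }
}

// Hypothetical helper that mirrors the documented rules: a null result keeps
// the node and recurses into its children; a non-null result replaces the
// node, and neither the original nor the replacement is recursed into.
function replaceNodes(nodes, replacer) {
  for (let i = 0; i < nodes.length; i++) {
    const replacement = replacer(nodes[i]);
    if (replacement === null) {
      replaceNodes(nodes[i].children, replacer);
    } else {
      nodes[i] = replacement;
    }
  }
}

// Assumed post-processing step: number footnote references in the order they
// appear and drop comment nodes. Returns how many references were numbered.
function numberFootnotes(topLevelNodes) {
  let counter = 0;
  replaceNodes(topLevelNodes, (node) => {
    if (node.type === "footnoteRef") {
      counter += 1;
      return new SketchNode("footnoteRef", [], String(counter));
    }
    if (node.type === "comment") {
      // A generic node with no children stands in for "removed".
      return new SketchNode("generic");
    }
    return null; // leave every other node unchanged
  });
  return counter;
}

const doc = [
  new SketchNode("paragraph", [
    new SketchNode("text", [], "First claim"),
    new SketchNode("footnoteRef", [], "a"),
  ]),
  new SketchNode("comment", [new SketchNode("footnoteRef", [], "hidden")]),
  new SketchNode("paragraph", [new SketchNode("footnoteRef", [], "b")]),
];

const footnoteCount = numberFootnotes(doc);
```

Because the comment node is replaced rather than kept, the footnote reference nested inside it is never visited, so only the two visible references are counted. In a real reader, logic of this kind would presumably live in a `postProcess` override and operate on the array of top-level nodes it receives.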