PHP and Javascript implementations of a simple markdown parser

Custom Syntax Modules

Syntax falls into two categories: block syntax and inline syntax. Block syntax operates on one or more whole lines of content to form structures like paragraphs, lists, tables, etc. Inline syntax operates on stretches of text within a block, such as marking a word as bold or inserting a link. Inline syntax cannot span multiple blocks.

Readers

A reader is a subclass of MDReader and handles the parsing of one kind of syntax. E.g. MDUnorderedListReader handles bulleted lists and their items, MDStrongReader handles bold text, etc.

A reader can parse block syntax, inline syntax, or both.

Parsing process

Parsing occurs in three phases.

  1. Blocks: Readers are consulted line-by-line to see if their supported block syntax is found starting at the given line pointer. If so, the reader returns an MDNode object encoding the block; otherwise it returns null and the next reader is checked. See MDReader.readBlock().

  2. Tokenizing: Readers are asked to see if a supported token is located at the beginning of a given string. If so, the reader returns an MDToken; otherwise it returns null and the next reader is checked. See MDReader.readToken().

  3. Substitution: Readers are given a mixed array of MDTokens (from the previous step) and MDNodes. The goal of substitution is to eventually convert all tokens into nodes, with any left over at the end being converted to MDTextNodes. If a reader can make one substitution it returns true, otherwise false and the next reader is checked. See MDReader.substituteTokens().
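
The block phase above can be pictured as a loop that offers the current line to each reader in turn. The sketch below is a simplified stand-in, not the library's actual driver: the reader objects, the readBlocks function, and the plain-object nodes are all hypothetical, and skipping lines that no reader claims is an assumption made so the loop always advances.

```javascript
// Hypothetical stand-in readers: each returns a node or null, like readBlock().
const headingReader = {
  readBlock(state) {
    const line = state.lines[state.p];
    if (!line.startsWith("# ")) return null; // not our syntax, bail out early
    state.p += 1;                            // consume the heading line
    return { type: "heading", text: line.slice(2) };
  },
};

const paragraphReader = {
  readBlock(state) {
    const line = state.lines[state.p];
    if (line.trim() === "") return null;
    state.p += 1;
    return { type: "paragraph", text: line };
  },
};

// Simplified driver: ask each reader in order until one claims the line.
function readBlocks(state, readers) {
  const nodes = [];
  while (state.p < state.lines.length) {
    let node = null;
    for (const reader of readers) {
      node = reader.readBlock(state);
      if (node !== null) break;
    }
    if (node !== null) {
      nodes.push(node);
    } else {
      state.p += 1; // no reader matched (e.g. a blank line): skip it
    }
  }
  return nodes;
}

const driverState = { lines: ["# Title", "", "Some text"], p: 0 };
const blocks = readBlocks(driverState, [headingReader, paragraphReader]);
```

Here blocks ends up holding a heading node followed by a paragraph node, and the line pointer finishes past the last line.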

Parsing Blocks

To parse a block format, override the readBlock method in a subclass of MDReader. It will be passed an MDState instance. The two state properties of importance here are lines and p. lines is an array of the lines of markdown, and p is the current line pointer index. The job of readBlock is to check if the block syntax of interest is located at line p. Most of the time it won’t be, in which case the method should just return null, preferably as early as possible for good performance.

If the format is detected at line p then the reader should look ahead and see how many subsequent lines are also part of the block, process them as necessary, and return an appropriate MDNode subclass with their contents. Before returning, the reader must set the state’s p to the line index just after the last line of the detected block.
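
As a rough illustration of the two paragraphs above, here is a sketch of a readBlock override for a made-up block syntax in which every line starts with "!! ". The MDReader and MDNode classes in the snippet are minimal stand-ins so that it runs on its own; CalloutNode, CalloutReader, and the "!! " syntax are hypothetical and not part of the library.

```javascript
// Minimal stand-ins so the sketch runs on its own; the real MDReader and
// MDNode classes come from the library and have more to them.
class MDReader {
  readBlock(state) { return null; }
}
class MDNode {
  constructor(children = []) { this.children = children; }
}

// Hypothetical node for a hypothetical syntax: lines starting with "!! ".
class CalloutNode extends MDNode {
  constructor(text) { super(); this.text = text; }
}

class CalloutReader extends MDReader {
  readBlock(state) {
    let p = state.p;
    // Return early when line p does not start the block.
    if (!state.lines[p].startsWith("!! ")) return null;

    // Look ahead: collect every following line that is part of the block.
    const content = [];
    while (p < state.lines.length && state.lines[p].startsWith("!! ")) {
      content.push(state.lines[p].slice(3));
      p++;
    }

    // Required: leave p just past the last line of the block.
    state.p = p;
    return new CalloutNode(content.join(" "));
  }
}

const calloutState = { lines: ["!! Watch out", "!! for this", "plain text"], p: 0 };
const calloutNode = new CalloutReader().readBlock(calloutState);
const noMatch = new CalloutReader().readBlock(calloutState); // now at "plain text"
```

The second call returns null and leaves the pointer untouched, which is the common case the text describes.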

Most blocks will have inner content. This can be processed into one or more MDNodes by calling state.inlineMarkdownToNode or state.inlineMarkdownToNodes for inline content. If the content may have nested block structures, create a subarray from state.lines and create a sub-state with state.copy(lines). This creates a child state for the sub-document within the block structure. You can then call substate.readBlocks() to get an array of MDBlockNode to use as content. Note: if your block syntax has some kind of indentation or marker at the beginning of every line, those should be stripped from the lines passed to the copied sub-state.
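
The nested-content case might look roughly like the sketch below. StubState models only lines, p, copy(lines), and readBlocks(), and its readBlocks() is reduced to one paragraph per non-blank line; the quote-style "> " syntax and the readQuoteBlock function are hypothetical, chosen only to show the per-line marker being stripped before the sub-state is created.

```javascript
// Stand-in state: only lines, p, copy() and readBlocks() are modelled here,
// with readBlocks() reduced to "one paragraph node per non-blank line".
class StubState {
  constructor(lines, parent = null) {
    this.lines = lines;
    this.p = 0;
    this.parent = parent;
  }
  copy(lines) { return new StubState(lines, this); }
  readBlocks() {
    return this.lines
      .filter((line) => line.trim() !== "")
      .map((line) => ({ type: "paragraph", text: line }));
  }
}

// Hypothetical quote-style block: every line is prefixed with ">".
function readQuoteBlock(state) {
  let p = state.p;
  if (!state.lines[p].startsWith(">")) return null;

  const inner = [];
  while (p < state.lines.length && state.lines[p].startsWith(">")) {
    // Strip the per-line marker before handing lines to the sub-state.
    inner.push(state.lines[p].replace(/^> ?/, ""));
    p++;
  }
  state.p = p;

  // The child state parses the stripped lines as their own sub-document.
  const substate = state.copy(inner);
  return { type: "quote", children: substate.readBlocks() };
}

const quoteState = new StubState(["> first", ">", "> second", "after"]);
const quote = readQuoteBlock(quoteState);
```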

A note about sub-states. States can have parent states. You can store any needed state info on an MDState instance, but you will usually want to do so on state.root, not state itself. This ensures those properties are accessed on the original state, not a disposable substate.
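
A small sketch of why state.root matters, using a stand-in class. RootedState, its parent chain, and the footnoteCount property are assumptions made for illustration; only the idea of storing shared data on the root comes from the text above.

```javascript
// Stand-in state with a parent chain; `root` walks up to the original state.
class RootedState {
  constructor(parent = null) { this.parent = parent; }
  get root() { return this.parent ? this.parent.root : this; }
  copy() { return new RootedState(this); }
}

// Hypothetical helper: number footnotes across the whole document by
// keeping the counter on the root state rather than on the current one.
function nextFootnoteNumber(state) {
  const root = state.root;
  root.footnoteCount = (root.footnoteCount ?? 0) + 1;
  return root.footnoteCount;
}

const topState = new RootedState();
const nestedState = topState.copy().copy(); // e.g. a list inside a quote
const firstNumber = nextFootnoteNumber(topState);
const secondNumber = nextFootnoteNumber(nestedState);
```

Because the counter lives on the root, the nested sub-state continues the numbering instead of starting again from 1.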

Parsing Inline Syntax

Inline parsing is usually done by overriding both the readToken and substituteTokens methods.

The readToken method is passed an MDState and a tail portion of a line of markdown. The method should check if the line begins with a supported token. This will generally be some punctuation symbol or sometimes a more complex sequence. For inline syntax consisting of pairs of double symbols (e.g. **strong**), each symbol should usually be tokenized individually rather than as pairs, just in case another syntax recognizes single symbols. If no token is found at the start of the string, return null. Do not match tokens anywhere but at the start of the given string.
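
A minimal sketch of the tokenizing rule described above, for a strikethrough-style syntax. StubToken and StrikeReaderSketch are stand-ins, the token fields are invented for the example, and the (state, line) parameter order is assumed from the description rather than taken from the library's source.

```javascript
// Stand-in token; the real MDToken class has its own constructor and fields.
class StubToken {
  constructor(type, original) {
    this.type = type;         // kind of token, e.g. "tilde"
    this.original = original; // the source text the token was read from
  }
}

class StrikeReaderSketch {
  // `line` is the unread tail of the current line.
  readToken(state, line) {
    // Only look at the very start of the string, and emit one "~" per
    // token even though the full syntax is "~~strike~~".
    if (line.startsWith("~")) return new StubToken("tilde", "~");
    return null;
  }
}

const strikeReader = new StrikeReaderSketch();
const atStart = strikeReader.readToken({}, "~~gone~~ text");    // a tilde token
const notAtStart = strikeReader.readToken({}, "text ~~gone~~"); // null
```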

Substitution is done by the substituteTokens method. This one is trickier. It is called with a mixed array of tokens (MDToken) and nodes (MDNode). The method is called repeatedly, and each call should perform at most one search-and-replace: make only the first substitution the reader can find, then return true, or return false if nothing matched.

There are helper methods for performing substitution. MDToken has static methods findFirstTokens and findPairedTokens. They work like crude regexes that operate on tokens instead of characters. For syntax consisting of text enclosed between symbols (e.g. **strong**, _emphasis_, ~~strike~~), you can subclass MDSimplePairInlineReader and use its attemptPair method to make substitution easy.
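
To make the one-substitution-per-call contract concrete, here is a hand-rolled sketch that works on plain objects. It deliberately does not use findFirstTokens, findPairedTokens, or attemptPair, since their signatures are not given here; the item shapes (kind, type, children) and the substituteStrike function are assumptions for illustration only.

```javascript
const isTilde = (item) => item.kind === "token" && item.type === "tilde";

// Performs at most one substitution, in place, and reports whether it did.
function substituteStrike(items) {
  for (let start = 0; start + 1 < items.length; start++) {
    if (!isTilde(items[start]) || !isTilde(items[start + 1])) continue;
    // Found an opening "~~"; look for a closing "~~" after some content.
    for (let end = start + 3; end + 1 < items.length; end++) {
      if (isTilde(items[end]) && isTilde(items[end + 1])) {
        const inner = items.slice(start + 2, end);
        const node = { kind: "node", type: "strike", children: inner };
        items.splice(start, end + 2 - start, node);
        return true; // stop after the first match
      }
    }
  }
  return false;
}

const tilde = () => ({ kind: "token", type: "tilde" });
const text = (value) => ({ kind: "token", type: "text", text: value });

// Token form of "~~a~~ and ~~b~~".
const strikeItems = [
  tilde(), tilde(), text("a"), tilde(), tilde(),
  text(" and "),
  tilde(), tilde(), text("b"), tilde(), tilde(),
];

const firstCall = substituteStrike(strikeItems);  // replaces "~~a~~"
const secondCall = substituteStrike(strikeItems); // replaces "~~b~~"
const thirdCall = substituteStrike(strikeItems);  // nothing left to do
```

Each call handles one pair, so a caller that keeps invoking the method until it returns false ends up with every pair replaced.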

Priority

Many markdown syntaxes use the same characters and similar syntax. This can cause problems depending on which reader is run first. This can be resolved by prioritization. Each parsing phase has its own priority order, and each reader can dictate what readers it wants to be ahead of or behind by overriding the compareBlockOrdering, compareTokenizeOrdering, or compareSubstituteOrdering methods. Each is called with another reader instance, and the method should return a -1 if it wants to be run before the given reader, 1 if it wants to be run after the given reader, or 0 as a “don’t care” value. The method should only return -1 or 1 when it’s necessary to resolve a specific conflict; most of the time it should return 0. The default implementation always returns 0.
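
A sketch of an ordering override, using stand-in classes rather than the library's readers. ReaderBase, StrongSketch, EmphasisSketch, and LinkSketch are hypothetical, and the single-argument signature is assumed from the description above.

```javascript
// Stand-in base class; per the text, the default answer is 0 ("don't care").
class ReaderBase {
  compareSubstituteOrdering(other) { return 0; }
}

class EmphasisSketch extends ReaderBase {}
class LinkSketch extends ReaderBase {}

class StrongSketch extends ReaderBase {
  compareSubstituteOrdering(other) {
    // Only speak up for the one conflict this reader knows about.
    if (other instanceof EmphasisSketch) return -1; // run before emphasis
    return 0;                                       // no opinion otherwise
  }
}

const strongSketch = new StrongSketch();
const versusEmphasis = strongSketch.compareSubstituteOrdering(new EmphasisSketch());
const versusLink = strongSketch.compareSubstituteOrdering(new LinkSketch());
const emphasisOpinion = new EmphasisSketch().compareSubstituteOrdering(strongSketch);
```

Only one side of a conflict needs to express a preference; the emphasis stand-in keeps the default 0.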

Prioritization can be further refined for the substitution phase using multiple passes. Passes are good for especially ambiguous constructs, like ***strong** emphasis*. In this example, a naive MDEmphasisReader might take the first asterisk as its opening marker and the asterisk immediately following “strong” as its closing marker. That swallows the second and third asterisks into the inner text and leaves the second asterisk after “strong” and the one after “emphasis” unmatched.

[*]**strong[*]* emphasis*
An incorrect, naive substitution (start and end tokens shown in brackets)

By using multiple passes, the reader can use early passes to be more conservative and hold out for a better match first, or let other readers find a better match, then on a later pass it can grab whatever’s left over.

  1. MDStrongReader finds a match on the first pass because there are no inner asterisks to cause ambiguities
    *[**]strong[**] emphasis*

  2. Then, MDEmphasisReader finds the outer pair on a later pass
    *<MDStrongNode> emphasis*
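
The two steps above can be reproduced with a toy model. In this sketch the string "*" stands in for an asterisk token, other strings are text, and plain objects are nodes. The pass argument, the two-pass driver, and all function names are assumptions made for illustration; the library's own readers and pass handling may differ.

```javascript
const STAR = "*"; // stands in for an asterisk token; other strings are text

// Replace the first "**" ... "**" run whose inner content has no asterisks.
function substituteStrong(items, pass) {
  for (let i = 0; i + 1 < items.length; i++) {
    if (items[i] !== STAR || items[i + 1] !== STAR) continue;
    for (let j = i + 3; j + 1 < items.length; j++) {
      if (items[j] !== STAR || items[j + 1] !== STAR) continue;
      const inner = items.slice(i + 2, j);
      if (inner.includes(STAR)) break; // ambiguous, try a later opening
      items.splice(i, j + 2 - i, { type: "strong", children: inner });
      return true;
    }
  }
  return false;
}

// Takes the first available pair of asterisks, but only from pass 2 on.
function substituteEmphasis(items, pass) {
  if (pass < 2) return false; // hold back so other readers can match first
  const open = items.indexOf(STAR);
  if (open === -1) return false;
  const close = items.indexOf(STAR, open + 2);
  if (close === -1) return false;
  const inner = items.slice(open + 1, close);
  items.splice(open, close + 1 - open, { type: "emphasis", children: inner });
  return true;
}

// Toy driver: within each pass, keep substituting until no reader can.
function substituteAll(items, readers, passes) {
  for (let pass = 1; pass <= passes; pass++) {
    let changed = true;
    while (changed) {
      changed = readers.some((substitute) => substitute(items, pass));
    }
  }
  return items;
}

// Token form of "***strong** emphasis*".
const starTokens = [STAR, STAR, STAR, "strong", STAR, STAR, " emphasis", STAR];
const passResult = substituteAll(starTokens, [substituteEmphasis, substituteStrong], 2);
```

Even though the emphasis function is listed first, it declines on pass 1, so the strong pair is claimed before the outer emphasis pair is matched on pass 2.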

Pre- and Post-Processing

Readers have the option of performing tasks before and after the document has been parsed by overriding the preProcess and/or postProcess methods.

Pre-processing is mostly useful for initializing MDState in some way.

Post-processing can be used to manipulate the MDNode tree after it has been constructed by the parsing phases. For example, a reader can find all the footnotes in a document in the order they appear so they can be numbered properly. postProcess is passed an array of top-level MDNodes in the parsed document. This array can be altered using .splice operations if desired to add additional structures. To reuse the same example, a reader could append a list of all the footnote content at the bottom of the document.
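
The footnote example might be sketched as follows. Plain objects stand in for MDNode instances, and the node types, the note and number fields, and both function names are hypothetical. The real postProcess method may receive additional arguments.

```javascript
// Plain objects stand in for MDNode instances; `children` holds nested nodes.
function collectFootnoteRefs(nodes, found = []) {
  for (const node of nodes) {
    if (node.type === "footnoteRef") found.push(node);
    if (node.children) collectFootnoteRefs(node.children, found);
  }
  return found;
}

// Hypothetical postProcess body: number references in document order, then
// append a list of footnote content after the last top-level block.
function postProcessFootnotes(blocks) {
  const refs = collectFootnoteRefs(blocks);
  if (refs.length === 0) return;
  refs.forEach((ref, index) => { ref.number = index + 1; });
  const list = {
    type: "footnoteList",
    children: refs.map((ref) => ({
      type: "footnoteItem",
      number: ref.number,
      text: ref.note,
    })),
  };
  blocks.splice(blocks.length, 0, list); // insert at the end of the document
}

const footnoteDoc = [
  { type: "paragraph", children: [
    { type: "text", text: "Intro" },
    { type: "footnoteRef", note: "First note" },
  ] },
  { type: "paragraph", children: [
    { type: "footnoteRef", note: "Second note" },
  ] },
];
postProcessFootnotes(footnoteDoc);
```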

One useful utility for post-processing is MDNode.visitChildren. This will recursively visit every node in the tree and call a callback function with each one. The node tree cannot be restructured, but each node can be inspected or altered.

To perform substitution, MDUtils.replaceNodes can be called. It will call a replacer function with every node in the tree recursively. If the function returns null, no change is made. If the function returns a new MDNode instance it will replace that node in the tree. When a replacement is made, neither the original nor the replacement is recursed into. To replace a node with multiple other nodes, wrap those multiple nodes in a generic MDNode as its children. To remove a node without a replacement, return an MDNode with no children. It will be rendered as nothing when converted to HTML.
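
The replacement rules above can be mimicked on plain objects as in the sketch below. replaceNodesSketch is a stand-in written for this example, not the library's MDUtils.replaceNodes, and its signature, the node types, and the "generic" wrapper shape are assumptions.

```javascript
// Stand-in that mimics the described behaviour on plain objects; it is not
// the library's MDUtils.replaceNodes and its signature is an assumption.
function replaceNodesSketch(nodes, replacer) {
  for (let i = 0; i < nodes.length; i++) {
    const replacement = replacer(nodes[i]);
    if (replacement !== null) {
      nodes[i] = replacement; // swap in place, do not recurse any further
    } else if (nodes[i].children) {
      replaceNodesSketch(nodes[i].children, replacer);
    }
  }
}

const replaceTree = [
  { type: "paragraph", children: [
    { type: "text", text: "keep" },
    { type: "comment", text: "drop me" },
    { type: "shout", text: "loud" },
  ] },
];

replaceNodesSketch(replaceTree, (node) => {
  // Remove: an empty generic node stands for "render nothing".
  if (node.type === "comment") return { type: "generic", children: [] };
  // Replace one node with several by wrapping them in a generic node.
  if (node.type === "shout") {
    return { type: "generic", children: [
      { type: "text", text: node.text.toUpperCase() },
      { type: "text", text: "!" },
    ] };
  }
  return null; // leave everything else untouched
});
```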