Reintroduce AST parse/walk #276

Merged: 20 commits, Apr 30, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -51,6 +51,6 @@ build/
actual.txt
test.txt
test/progit
test/benchinput.md
test/benchmark/large.md

*.orig
8 changes: 8 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions Gemfile
@@ -20,6 +20,7 @@ end

group :benchmark do
gem "benchmark-ips"
gem "markly"
gem "kramdown"
gem "kramdown-parser-gfm"
gem "redcarpet"
192 changes: 150 additions & 42 deletions README.md
@@ -31,10 +31,108 @@ require 'commonmarker'
Commonmarker.to_html('"Hi *there*"', options: {
parse: { smart: true }
})
# <p>“Hi <em>there</em>”</p>\n
# => <p>“Hi <em>there</em>”</p>\n
```

The second argument is optional--[see below](#options) for more information.
(The second argument is optional--[see below](#options-and-plugins) for more information.)

### Generating a document

You can also parse a string to receive a `:document` node. You can then print that node to HTML, iterate over the children, and do other fun node stuff. For example:

```ruby
require 'commonmarker'

doc = Commonmarker.parse("*Hello* world", options: {
parse: { smart: true }
})
puts(doc.to_html) # => <p><em>Hello</em> world</p>\n

doc.walk do |node|
puts node.type # prints each type in turn: document, paragraph, emph, text, text
end
```

(The second argument is optional--[see below](#options-and-plugins) for more information.)

When it comes to modifying the document, you can perform the following operations:

- `insert_before`
- `insert_after`
- `prepend_child`
- `append_child`
- `delete`

You can also get the source position of a node by calling `source_position`:

```ruby
doc = Commonmarker.parse("*Hello* world")
puts doc.first_child.first_child.source_position
# => {:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>7}
```
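
As a rough illustration of what that hash lets you do, the sketch below uses it to slice the original string. It assumes columns are 1-based and inclusive (as the output above suggests), that the node sits on a single line, and that the text is ASCII. `slice_source` is a hypothetical helper written for this example, not part of Commonmarker.

```ruby
# Hypothetical helper (not part of Commonmarker): returns the part of
# `source` covered by a single-line, source_position-style hash.
# Assumes 1-based, inclusive columns and ASCII-only text.
def slice_source(source, pos)
  unless pos[:start_line] == pos[:end_line]
    raise ArgumentError, "multi-line spans are not handled in this sketch"
  end

  line = source.lines[pos[:start_line] - 1].chomp
  line[(pos[:start_column] - 1)..(pos[:end_column] - 1)]
end

# The hash from the example above, written out by hand.
pos = { start_line: 1, start_column: 1, end_line: 1, end_column: 7 }
puts slice_source("*Hello* world", pos)
# => *Hello*
```

Multi-line nodes and multi-byte characters would need extra care, so treat this as a starting point only.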

You can also modify the following attributes:

- `url`
- `title`
- `header_level`
- `list_type`
- `list_start`
- `list_tight`
- `fence_info`

#### Example: Walking the AST

You can use `walk` or `each` to iterate over nodes:

- `walk` will iterate on a node and recursively iterate on a node's children.
- `each` will iterate on a node and its children, but no further.

```ruby
require 'commonmarker'

# parse some string
doc = Commonmarker.parse("# The site\n\n [GitHub](https://www.github.com)")

# Walk tree and print out URLs for links
doc.walk do |node|
if node.type == :link
printf("URL = %s\n", node.url)
end
end
# => URL = https://www.github.com

# Transform links to regular text
doc.walk do |node|
if node.type == :link
node.insert_before(node.first_child)
node.delete
end
end
# => <h1><a href=\"#the-site\"></a>The site</h1>\n<p>GitHub</p>\n
```
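
To make the difference between `walk` and `each` concrete without needing the gem installed, here is a toy tree that mimics the two traversals. `ToyNode` is a hypothetical stand-in, not `Commonmarker::Node`, and it assumes that `each` yields only a node's direct children while `walk` yields the node itself and then every descendant.

```ruby
# ToyNode is a hypothetical stand-in for illustration only; it makes no
# claim about how Commonmarker::Node is implemented.
class ToyNode
  attr_reader :type, :children

  def initialize(type, children = [])
    @type = type
    @children = children
  end

  # Yields each direct child, but goes no deeper.
  def each(&block)
    children.each(&block)
  end

  # Yields this node, then recurses into every descendant (pre-order).
  def walk(&block)
    yield self
    children.each { |child| child.walk(&block) }
  end
end

# Hand-built tree shaped like the parse of "*Hello* world".
doc = ToyNode.new(:document, [
  ToyNode.new(:paragraph, [
    ToyNode.new(:emph, [ToyNode.new(:text)]),
    ToyNode.new(:text)
  ])
])

walked = []
doc.walk { |node| walked << node.type }
p walked # => [:document, :paragraph, :emph, :text, :text]

direct = []
doc.each { |node| direct << node.type }
p direct # => [:paragraph]
```

In practice this means `walk` is the one to reach for when looking for nodes anywhere in the document (such as the links above), and `each` is enough when only the top-level blocks matter.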

#### Example: Converting a document back into raw CommonMark

You can use `to_commonmark` on a node to render it as raw text:

```ruby
require 'commonmarker'

# parse some string
doc = Commonmarker.parse("# The site\n\n [GitHub](https://www.github.com)")

# Transform links to regular text
doc.walk do |node|
if node.type == :link
node.insert_before(node.first_child)
node.delete
end
end

doc.to_commonmark
# => # The site\n\nGitHub\n
```

## Options and plugins

@@ -53,21 +151,23 @@ Note that there is a distinction in comrak for "parse" options and "render" opti

### Parse options

| Name | Description | Default |
| --------------------- | ------------------------------------------------------------------------------------ | ------- |
| `smart` | Punctuation (quotes, full-stops and hyphens) are converted into 'smart' punctuation. | `false` |
| `default_info_string` | The default info string for fenced code blocks. | `""` |
| Name | Description | Default |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------- |
| `smart` | Punctuation (quotes, full-stops and hyphens) are converted into 'smart' punctuation. | `false` |
| `default_info_string` | The default info string for fenced code blocks. | `""` |
| `relaxed_autolinks` | Enable relaxing of the autolink extension parsing, allowing links to be recognized when in brackets, as well as permitting any url scheme. | `false` |

### Render options

| Name | Description | Default |
| ----------------- | ------------------------------------------------------------------------------------------------------ | ------- |
| `hardbreaks` | [Soft line breaks](http://spec.commonmark.org/0.27/#soft-line-breaks) translate into hard line breaks. | `true` |
| `github_pre_lang` | GitHub-style `<pre lang="xyz">` is used for fenced code blocks with info tags. | `true` |
| `width` | The wrap column when outputting CommonMark. | `80` |
| `unsafe` | Allow rendering of raw HTML and potentially dangerous links. | `false` |
| `escape` | Escape raw HTML instead of clobbering it. | `false` |
| `sourcepos` | Include source position attribute in HTML and XML output. | `false` |
| Name | Description | Default |
| -------------------- | ------------------------------------------------------------------------------------------------------ | ------- |
| `hardbreaks` | [Soft line breaks](http://spec.commonmark.org/0.27/#soft-line-breaks) translate into hard line breaks. | `true` |
| `github_pre_lang` | GitHub-style `<pre lang="xyz">` is used for fenced code blocks with info tags. | `true` |
| `width` | The wrap column when outputting CommonMark. | `80` |
| `unsafe` | Allow rendering of raw HTML and potentially dangerous links. | `false` |
| `escape` | Escape raw HTML instead of clobbering it. | `false` |
| `sourcepos` | Include source position attribute in HTML and XML output. | `false` |
| `escaped_char_spans` | Wrap escaped characters in span tags | `true` |

As well, there are several extensions which you can toggle in the same manner:

@@ -80,19 +180,21 @@ Commonmarker.to_html('"Hi *there*"', options: {

### Extension options

| Name | Description | Default |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------- | ------- |
| `strikethrough` | Enables the [strikethrough extension](https://github.github.com/gfm/#strikethrough-extension-) from the GFM spec. | `true` |
| `tagfilter` | Enables the [tagfilter extension](https://github.github.com/gfm/#disallowed-raw-html-extension-) from the GFM spec. | `true` |
| `table` | Enables the [table extension](https://github.github.com/gfm/#tables-extension-) from the GFM spec. | `true` |
| `autolink` | Enables the [autolink extension](https://github.github.com/gfm/#autolinks-extension-) from the GFM spec. | `true` |
| `tasklist` | Enables the [task list extension](https://github.github.com/gfm/#task-list-items-extension-) from the GFM spec. | `true` |
| `superscript` | Enables the superscript Comrak extension. | `false` |
| `header_ids` | Enables the header IDs Comrak extension. | `""` |
| `footnotes` | Enables the footnotes extension per `cmark-gfm`. | `false` |
| `description_lists` | Enables the description lists extension. | `false` |
| `front_matter_delimiter` | Enables the front matter extension. | `""` |
| `shortcodes` | Enables the shortcodes extension. | `true` |
| Name | Description | Default |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------- | ------- |
| `strikethrough` | Enables the [strikethrough extension](https://github.github.com/gfm/#strikethrough-extension-) from the GFM spec. | `true` |
| `tagfilter` | Enables the [tagfilter extension](https://github.github.com/gfm/#disallowed-raw-html-extension-) from the GFM spec. | `true` |
| `table` | Enables the [table extension](https://github.github.com/gfm/#tables-extension-) from the GFM spec. | `true` |
| `autolink` | Enables the [autolink extension](https://github.github.com/gfm/#autolinks-extension-) from the GFM spec. | `true` |
| `tasklist` | Enables the [task list extension](https://github.github.com/gfm/#task-list-items-extension-) from the GFM spec. | `true` |
| `superscript` | Enables the superscript Comrak extension. | `false` |
| `header_ids` | Enables the header IDs Comrak extension. | `""` |
| `footnotes` | Enables the footnotes extension per `cmark-gfm`. | `false` |
| `description_lists` | Enables the description lists extension. | `false` |
| `front_matter_delimiter` | Enables the front matter extension. | `""` |
| `shortcodes` | Enables the shortcodes extension. | `true` |
| `multiline_block_quotes` | Enables the multiline block quotes extension. | `false` |
| `math_dollars`, `math_code` | Enables the math extension. | `false` |

For more information on these options, see [the comrak documentation](https://github.com/kivikakk/comrak#usage).
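
To tie the three tables together, here is a hedged sketch of an options hash that switches on a few of the options listed above. It assumes that render and extension options are nested under `render:` and `extension:` keys in the same way `parse:` is in the earlier example, so check the gem's documentation before relying on those key names.

```ruby
# Assumption: :render and :extension are nested the same way :parse is
# in the README's earlier example.
options = {
  parse: { relaxed_autolinks: true },
  render: { escaped_char_spans: false },
  extension: { multiline_block_quotes: true, math_dollars: true }
}

# With the gem installed, the hash would be passed along as:
#   Commonmarker.to_html(markdown, options: options)
options.each do |group, settings|
  puts "#{group}: #{settings.keys.join(', ')}"
end
```

Grouping the settings this way mirrors the parse/render/extension split in the tables, which makes it easier to see which table a given key belongs to.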

@@ -202,26 +304,32 @@ If there were no errors, you're done! Otherwise, make sure to follow the comrak

## Benchmarks

Some rough benchmarks:

```
$ bundle exec rake benchmark

❯ bundle exec rake benchmark
input size = 11064832 bytes

ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
Warming up --------------------------------------
redcarpet 2.000 i/100ms
commonmarker with to_html
1.000 i/100ms
kramdown 1.000 i/100ms
Markly.render_html 1.000 i/100ms
Markly::Node#to_html 1.000 i/100ms
Commonmarker.to_html 1.000 i/100ms
Commonmarker::Node.to_html
1.000 i/100ms
Kramdown::Document#to_html
1.000 i/100ms
Calculating -------------------------------------
redcarpet 22.317 (± 4.5%) i/s - 112.000 in 5.036374s
commonmarker with to_html
5.815 (± 0.0%) i/s - 30.000 in 5.168869s
kramdown 0.327 (± 0.0%) i/s - 2.000 in 6.121486s
Markly.render_html 15.606 (±25.6%) i/s - 71.000 in 5.047132s
Markly::Node#to_html 15.692 (±25.5%) i/s - 72.000 in 5.095810s
Commonmarker.to_html 4.482 (± 0.0%) i/s - 23.000 in 5.137680s
Commonmarker::Node.to_html
5.092 (±19.6%) i/s - 25.000 in 5.072220s
Kramdown::Document#to_html
0.379 (± 0.0%) i/s - 2.000 in 5.277770s

Comparison:
redcarpet: 22.3 i/s
commonmarker with to_html: 5.8 i/s - 3.84x (± 0.00) slower
kramdown: 0.3 i/s - 68.30x (± 0.00) slower
Markly::Node#to_html: 15.7 i/s
Markly.render_html: 15.6 i/s - same-ish: difference falls within error
Commonmarker::Node.to_html: 5.1 i/s - 3.08x slower
Commonmarker.to_html: 4.5 i/s - 3.50x slower
Kramdown::Document#to_html: 0.4 i/s - 41.40x slower
```
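
The "x slower" figures in the newer comparison are ratios of iterations per second against the fastest entry. A quick sketch that recomputes them from the numbers above (the hash is just those figures copied by hand, and the two-decimal rounding is an assumption about how the report formats them):

```ruby
# Iterations per second, copied by hand from the run above.
ips = {
  "Markly::Node#to_html"       => 15.692,
  "Markly.render_html"         => 15.606,
  "Commonmarker::Node.to_html" => 5.092,
  "Commonmarker.to_html"       => 4.482,
  "Kramdown::Document#to_html" => 0.379
}

fastest = ips.values.max

# How many times slower each entry is than the fastest one.
slowdowns = ips.transform_values { |value| (fastest / value).round(2) }

slowdowns.each do |name, factor|
  puts format("%-28s %6.2fx slower", name, factor)
end
```

`Markly.render_html` comes out at roughly 1.01x here, which the benchmark reports as falling within the margin of error rather than as a real difference.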
2 changes: 2 additions & 0 deletions ext/commonmarker/Cargo.toml
@@ -9,6 +9,8 @@ publish = false
magnus = "0.6"
comrak = { version = "0.23", features = ["shortcodes"] }
syntect = { version = "5.2", features = ["plist-load"] }
typed-arena = "2.0"
rctree = "0.6"

[lib]
name = "commonmarker"
36 changes: 33 additions & 3 deletions ext/commonmarker/src/lib.rs
@@ -5,13 +5,14 @@ use std::path::PathBuf;
use ::syntect::highlighting::ThemeSet;
use comrak::{
adapters::SyntaxHighlighterAdapter,
markdown_to_html, markdown_to_html_with_plugins,
markdown_to_html, markdown_to_html_with_plugins, parse_document,
plugins::syntect::{SyntectAdapter, SyntectAdapterBuilder},
ComrakOptions, ComrakPlugins,
};
use magnus::{
define_module, exception, function, r_hash::ForEach, scan_args, Error, RHash, Symbol, Value,
};
use node::CommonmarkerNode;

mod options;
use options::iterate_options_hash;
@@ -21,11 +22,36 @@ use plugins::{
syntax_highlighting::{fetch_syntax_highlighter_path, fetch_syntax_highlighter_theme},
SYNTAX_HIGHLIGHTER_PLUGIN,
};
use typed_arena::Arena;

mod node;
mod utils;

pub const EMPTY_STR: &str = "";

fn commonmark_parse(args: &[Value]) -> Result<CommonmarkerNode, magnus::Error> {
let args = scan_args::scan_args::<_, (), (), (), _, ()>(args)?;
let (rb_commonmark,): (String,) = args.required;

let kwargs =
scan_args::get_kwargs::<_, (), (Option<RHash>,), ()>(args.keywords, &[], &["options"])?;
let (rb_options,) = kwargs.optional;

let mut comrak_options = ComrakOptions::default();

if let Some(rb_options) = rb_options {
rb_options.foreach(|key: Symbol, value: RHash| {
iterate_options_hash(&mut comrak_options, key, value)?;
Ok(ForEach::Continue)
})?;
}

let arena = Arena::new();
let root = parse_document(&arena, &rb_commonmark, &comrak_options);

CommonmarkerNode::new_from_comrak_node(root)
}

fn commonmark_to_html(args: &[Value]) -> Result<String, magnus::Error> {
let args = scan_args::scan_args::<_, (), (), (), _, ()>(args)?;
let (rb_commonmark,): (String,) = args.required;
@@ -145,9 +171,13 @@ fn commonmark_to_html(args: &[Value]) -> Result<String, magnus::Error> {

#[magnus::init]
fn init() -> Result<(), Error> {
let module = define_module("Commonmarker")?;
let m_commonmarker = define_module("Commonmarker")?;

m_commonmarker.define_module_function("commonmark_parse", function!(commonmark_parse, -1))?;
m_commonmarker
.define_module_function("commonmark_to_html", function!(commonmark_to_html, -1))?;

module.define_module_function("commonmark_to_html", function!(commonmark_to_html, -1))?;
node::init(m_commonmarker).expect("cannot define Commonmarker::Node class");

Ok(())
}