Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of clobbering #31

Closed
4 tasks done
crazygolem opened this issue Jun 29, 2024 · 7 comments
Closed
4 tasks done

Improve handling of clobbering #31

crazygolem opened this issue Jun 29, 2024 · 7 comments
Labels
🤷 no/invalid This cannot be acted upon 👎 phase/no Post cannot or will not be acted on

Comments

@crazygolem
Copy link

Initial checklist

Problem

The way DOM clobbering prevention is implemented, this plugin leaves sanitized HTML broken: it correctly rewrites the attributes that can cause clobbering (id and name, configured by default) and by default some attributes that reference those IDs (ariaDescribedBy and ariaLabelledBy).

Notably missing is the href attribute, which can be used for internal links using a fragment which value matches the id or name attribute of an element. This breaks GFM footnotes (cf. #30, #29) but also internal links such as those targetting headings slugified by rehype-slug.

The current solution has been deemed acceptable because 1) it keeps the internal links "pretty" for the users; and 2) github likes it like that.

I think however that it is not very good, because it requires awareness of the issue otherwise the plugin just appears broken, and by default breaks the document and requires a workaround to be applied later. And the suggested workaround (the github script) also has limitations, like it wouldn't support multiple generated documents being included with different clobberPrefix to avoid collisions.

I would much prefer if the sanitizer wouldn't break my documents by default where there is no reason to. Pretty links are nice, but it should be a conscious decision to enable them as it requires extra effort to un-break the document afterwards.

Solution

I propose to improve the rewriting of attributes that reference another element via their ID as follows:

While visiting nodes:

  1. Prefix the id and name attributes as currently;
  2. Add the original value of the rewritten id and name attributes to a list of targets;
  3. If the node contains an attribute that can reference an ID (e.g. ariaDescribedBy, ariaLabelledBy but also href), add it to a list of references.

Now that the whole tree has been traversed, we know all the IDs in the sanitized document, and can start rewriting the references. Doing it after the traversal instead of during allows to be a bit smarter with the rewriting, and allows (maybe as an option) to keep references untouched when they don't target an ID within the document (if the document is later included into another one, this would allow referencing elements in the enclosing document, e.g. #top for an internal link, or some common aria labels or descriptions).

For the href attribute, I propose to only rewrite them if they contain just the fragment. So for example href="foobar.htm#section-xyz" would not be rewritten, because without extra context we can't know whether foobar.htm is the current page, but href="#section-xyz" would be rewritten to href="#user-content-section-xyz". This would at least fix the broken GFM footnotes, and generally internal links, which I think is a common case when converting from markdown.

This would require to special-case href because the assumption on the format used currently for the two aria attributes doesn't hold. Instead of that, I also propose to slightly change the format of the configuration to allow passing a function to handle the rewriting, which would let users add support for more exotic attributes, such as custom data-* attributes (not that anybody requested such a thing yet) or simply let the user handle the difficult things like the URL fragment not being on its own.

Alternatives

  • Do nothing: If github is happy, everyone's happy 😿
  • Monkey-patching the hast-util-sanitize lib, as mentioned by the reporter of Don't apply the clobber prefix when it's already present #29: Yes, it works, but we're basically all on our own. Since sanitization is security-relevant, I think it would be better to do it once right better. It would also be quite brittle, and could break with any update.
  • Write another plugin just to fix the links: This plugin would have less context to work with than the sanitize plugin, e.g. no idea which IDs have been rewritten (it would need to assume that specifically hast-util-sanitize was used and all IDs have been rewritten). Also an extra walk of the tree. And the plugin's name would be somewhat stupid: hast-util-sanitize-clobbering-fix.
  • Write another plugin to handle just clobbering: This would make the functionality in hast-util-sanitize redundant, while probably ending up lifting a big chunk of the current code, would require configuring another plugin to make it work (i.e. disabling the clobber handling in hast-util-sanitize), extra tree walk, annoying plugin name.

I hope you are at least open to the idea that my proposal can be better than the alternatives. On my part, I am also open to suggestions if my proposition doesn't quite fit the bill. I can implement it, but I can't promise to implement it fast =)

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Jun 29, 2024
@wooorm
Copy link
Member

wooorm commented Jun 30, 2024

Sounds like you wrote a long post explaining how your solution is actually rather complex and brittle, and how the current solution isn’t that bad? As you mentioned, “without extra context we can't know […]”.

As for the difference between id/name vs ariaDescribedBy/ariaLabelledBy, that is interesting indeed. I can think of division 2 ways: a) one goes in the URL indeed, so should be pretty, b) one could link “outside” the comment/post, so #top or #readme, whereas the other should likely stay “inside”.

The unique clobber prefix per post, so as not to conflict, that is interesting too. I guess GH doesn’t normally run into it because headings are not slugged in comments. But they did have this problem with footnotes.

Perhaps you can share how you run into this. What kind of markup are your users writing?

@crazygolem
Copy link
Author

Perhaps you can share how you run into this. What kind of markup are your users writing?

It's for a personal project that's not yet deployed anywhere to be seen, and I'm currently the sole user. I generate HTML from markdown.

I want to fix the GFM footnotes, and it took me long enough to figure out what was breaking them. Checking the code, I thought I might put in the extra effort to improve the sanitize plugin more than just for my use-case, and the extra effort for the extra improvements still seems reasonable for me to do.

Sounds like you wrote a long post explaining how your solution is actually rather complex and brittle, and how the current solution isn’t that bad? As you mentioned, “without extra context we can't know […]”.

It doesn't cover all possible cases because it's not possible ("without extra context"), but I don't think it's complex and brittle? But maybe I did not manage to properly convey what I meant, I'll try to be more concise and lay down my proposition differently.

The most basic change that would fix GFM footnotes and internal links, while following the assumptions that were made for the current code (I should maybe have given that as one of the alternatives too?) would be something like this:

diff --git lib/index.js lib/index.js
index 7fa8b25..1535a94 100644
--- lib/index.js
+++ lib/index.js
@@ -647,11 +647,17 @@ function propertyValuePrimitive(state, definition, key, value) {
     if (!ok) return
   }
 
-  return state.schema.clobber &&
-    state.schema.clobberPrefix &&
-    state.schema.clobber.includes(key)
-    ? state.schema.clobberPrefix + value
-    : value
+  if (!state.schema.clobberPrefix || !state.schema.clobber?.includes(key)) {
+    return value
+  }
+
+  if (key == 'href') {
+    return value.startsWith('#')
+      ? '#' + state.schema.clobberPrefix + value.substring(1)
+      : value
+  }
+
+  return state.schema.clobberPrefix + value
 }

We can still keep the pretty links by default by not adding href to the default schema, and it would behave as currently (although I think it would be less surprising the other way around).

And the following improvement areas:

  1. I find it a bit annoying to have to hardcode href, and it doesn't allow to rewrite internal links in all possible ways that they can be written, only if they are "naked". The solution to this is to make it configurable through the schema, as users do have that extra context we can't have here. And it would allow users to handle other attributes that are not just an ID or a list of IDs.
  2. This blindly rewrites the internal links, not allowing internal links outside of the generated document (which also currently holds for ariaDescribedBy/ariaLabelledBy). The solution to this would be to do it in two phases, first gathering the IDs, and at the end rewriting when we know which IDs to rewrite. References to IDs don't clobber (only name and id that create IDs do) so it doesn't lower security. It's a bit of work and I'm willing to do it. Making it an option to still rewrite all references if desired comes basically for free.

Does that seem more reasonable to you?

@wooorm
Copy link
Member

wooorm commented Jul 10, 2024

Hey again! Sorry, rather busy.

It's for a personal project that's not yet deployed anywhere to be seen, and I'm currently the sole user. I generate HTML from markdown.

Then why are you clobbering and sanitizing? You can turn things off? 🤔

I want to fix the GFM footnotes

How do you use them and what is broken? Can you make and MVP of what doesn’t work? I thought they worked by default?

because it's not possible ("without extra context")

this is what worries me, as the defaults should be good without extra context.

complex

I mean that the technique is much more involved that a singular rule, as it is now?

brittle

I mean that it doesn’t work on all hrefs; it’s also possible to use URLs outside href. the technique works for some cases but doesn’t for others.

I find it a bit annoying to have to hardcode href […] it would allow users to handle other attributes that are not just an ID or a list of IDs.

Same. But for me the proposed alternative doesn’t seem acceptable. It would get too complex for me. You’d need an allowlist of URL attributes (https://github.com/rehypejs/rehype-minify/tree/main/packages/html-url-attributes) and extensibility to that list, then you need to know the exact URL a document is shown on; however that could be several URLs. You’d need to understand many URLs (data:?).

The solution to this would be to do it in two phases, first gathering the IDs, and at the end rewriting when we know which IDs to rewrite. References to IDs don't clobber (only name and id that create IDs do) so it doesn't lower security. It's a bit of work and I'm willing to do it. Making it an option to still rewrite all references if desired comes basically for free.

This gets near the difference between linking in a document (say, footnotes scoped to a single comment, there you don’t want them to “escape”), and linking outside it (say there are comments under a readme, those could link to the headings in the readme). None of these tools have the knowldege of whether they are processing a single comment. Or a whole document. I feel like you’d need that to answer this.

@crazygolem
Copy link
Author

I want to fix the GFM footnotes

How do you use them and what is broken? Can you make and MVP of what doesn’t work? I thought they worked by default?

Sorry I wasn't very clear, the GFM footnotes plugin on its own works fine, but if you sanitize afterwards then "the footnotes appear broken". This has already been reported in #30 with an example, but closed as a duplicate of #29 (also closed). The problem I have is the same.

I am not satisfied with the workaround mentioned in the README, and propose that the sanitize plugin not break footnotes in the first place unless the user wants it that way (e.g. to have good looking URLs). Given this premise, my proposal has a solution that I think is better than the one in #29.

because it's not possible ("without extra context")

this is what worries me, as the defaults should be good without extra context.

My proposal follows the observation that the current defaults are already not that good (broken GFM footnotes and internal links in general), and could be significantly improved even without extra context (not broken GFM footnotes, as well as internal links with a varying degree of support depending on which part of the proposal is implemented).

complex

I mean that the technique is much more involved that a singular rule, as it is now?

It depends how complete you want the solution to be. I think the "most basic change" in my previous answer is simple enough to cover the obvious case, and will stop breaking the footnotes. Allowing user-configurable rewrites doesn't change the concept that much either: instead of just two hardcoded rules (single ID or list of IDs), we add another possibility with user-configured rules.

brittle

I mean that it doesn’t work on all hrefs; it’s also possible to use URLs outside href. the technique works for some cases but doesn’t for others.

You already don't cover all cases for simple ID rewriting in the default schema, for example the for attribute on the label element is not included. Why do you insist that my proposal must cover all imaginable cases?

It covers by default at minimum a common case (naked internal links, as generated by GFM footnotes), and in the next answer below I explain how it can cover all urls through configuration, in the same way the user can already add support for more attributes that the default doesn't handle through configuration (like the for attribute).

I find it a bit annoying to have to hardcode href […] it would allow users to handle other attributes that are not just an ID or a list of IDs.

Same. But for me the proposed alternative doesn’t seem acceptable. It would get too complex for me. You’d need an allowlist of URL attributes (https://github.com/rehypejs/rehype-minify/tree/main/packages/html-url-attributes) and extensibility to that list, then you need to know the exact URL a document is shown on; however that could be several URLs. You’d need to understand many URLs (data:?).

That's where making it configurable through the schema comes in handy: we don't need to make this complex on our side, in particular we don't need lists, and we can let the user handle it themself.

From the user's perspective, the configuration could look something like this:

clobber: [
    'ariaDescribedBy',
    'ariaLabelledBy',
    'id',
    'name',
    {
        'attribute': 'href',
        'rewrite': (value, state) => value.startsWith('#') ? '#' + state.schema.clobberPrefix + value.substring(1) : value
    }
]

If the user wants to take care of cases we don't cover, they can do it by specifying a different function, with the knowledge of their own context.

I haven't looked too much into this yet, but I think it shouldn't be very complex to implement handling this in the lib. As in the example, we can keep the simple attribute names for backwards compatibility. We can even keep the current default (no handling of href) if you're totally against it: although it won't fix the issue with the GFM footnotes out of the box, at least the user is not required to disable clobbering altogether in this plugin to fix it on their own afterwards.

The solution to this would be to do it in two phases, first gathering the IDs, and at the end rewriting when we know which IDs to rewrite. References to IDs don't clobber (only name and id that create IDs do) so it doesn't lower security. It's a bit of work and I'm willing to do it. Making it an option to still rewrite all references if desired comes basically for free.

This gets near the difference between linking in a document (say, footnotes scoped to a single comment, there you don’t want them to “escape”), and linking outside it (say there are comments under a readme, those could link to the headings in the readme). None of these tools have the knowldege of whether they are processing a single comment. Or a whole document. I feel like you’d need that to answer this.

Correct, and I don't propose to solve the whole thing ourselves, but rather to enable the user to do it with our support.

This part of my proposal requires far more work, and will likely require some changes to the structure of the code. Let's leave it out for now, so we can focus on the other parts of the proposal.

@wooorm
Copy link
Member

wooorm commented Jul 12, 2024

Sorry I wasn't very clear, the GFM footnotes plugin on its own works fine, but if you sanitize afterwards then "the footnotes appear broken". This has already been reported in #30 with an example, but closed as a duplicate of #29 (also closed). The problem I have is the same.

30 points to this specific comment in 29, did you see that? #29 (comment)?
mdast-util-to-hast/remark-rehype also have a clobberPrefix option: https://github.com/remarkjs/remark-rehype#options.

I am not satisfied with the workaround mentioned in the README

Which workaround? I do not think what GH does is a workaround. I think it’s quite good.
Your proposal has some problems, which what GH does, doesn’t have.

You already don't cover all cases for simple ID rewriting in the default schema, for example the for attribute on the label element is not included. Why do you insist that my proposal must cover all imaginable cases?

Please think about what this project is for and what it does.
This project is extensible. You define schemas. It defaults to GH. So that users don’t XSS you.

@crazygolem
Copy link
Author

crazygolem commented Jul 21, 2024

This project is extensible. You define schemas.

It's not extensible enough for me, and the changes I propose to implement would make it extensible in a way that could help me solve my use-case without forcing everyone to do it my way.

It defaults to GH. So that users don’t XSS you.

With all due respect, I don't think you really get what clobbering is about and how the sanitizer mitigates that issue, especially since rewriting the aria attributes as configured in the default schema has nothing to do with clobbering and all to do with fixing the document after sanitization, which was exactly the area I wanted to improve.

Please think about what this project is for and what it does.

Yeah I get it now, this project is for github.

I'll close this issue and implement my own sanitizer just for clobbering, disabling the one from this plugin.

I'm still willing to put in the effort and do it as part of this plugin, if you change your mind let me know. If you want to just continue the discussion to see if you'd be willing to accept some variation of my proposal (and even if in the end it doesn't amount to anything) I'm also willing to explore that with you, as long as you're willing to entertain the notion that github is not the only valid use-case. Cheers.

This comment has been minimized.

@wooorm wooorm added the 🤷 no/invalid This cannot be acted upon label Jul 22, 2024
@github-actions github-actions bot added 👎 phase/no Post cannot or will not be acted on and removed 🤞 phase/open Post is being triaged manually labels Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤷 no/invalid This cannot be acted upon 👎 phase/no Post cannot or will not be acted on
Development

No branches or pull requests

2 participants