Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Primitive Tainting #211

Draft
wants to merge 84 commits into
base: main
Choose a base branch
from
Draft

Adding Primitive Tainting #211

wants to merge 84 commits into from

Conversation

tmbrbr
Copy link
Contributor

@tmbrbr tmbrbr commented May 23, 2024

This PR enables tainting for numbers, and is a mergable version of the https://github.com/SAP/project-foxhound/tree/primitaint branch.

This PR adds the following features:

  • Adding tainting support for number objects by adding an dedicated slot
  • Propagation of taint labels in and out of strings (e.g. via string to number conversions)
  • An extension of the code generator to allow marking of taint sources directly in the webidl file, no need to edit the C++

There is still some work to do before the merge:

  • Enable JIT compilation (currently completely disabled). Probably want to do this Similar to how we do for Strings, ie bail out if a number is tainted, but compile if not.
  • Come up with a sensible way to combine taint flows from 2 flows, e.g. a = b + c - if b and c are tainted, how does the taint flow look for c?
  • Double check the performance impact
  • See whether existing sources can be moved to the webidl generator
  • Remove the existing https://github.com/SAP/project-foxhound/tree/primitaint branch
  • Enable generated source enable/disable via about:config options
  • Fix playwright integration
  • Fix memory leaks

@leeN
Copy link
Collaborator

leeN commented Jun 11, 2024

I noticed another thing during debugging #213. Currently, we can't disable sources related to primitive tainting in about:config.

@leeN
Copy link
Collaborator

leeN commented Aug 1, 2024

So, as an update/status recap from my point of view:

  • Playwright integration seems to be fixed due to af544cf.

  • Memory Leaks seem to be resolved due to 301b663

  • See whether existing sources can be moved to the webidl generator

I looked at adding more sources/sinks to the WebIDL generator. My first impression was this would be a fairly involved change that certainly does not work for every operation. I will investigate further to check for cost/benefit tradeoffs.

  • Enable generated source enable/disable via about:config options

This works as is; it is just user-unfriendly in its current form. This is fixed once #2 lands.

So the major remaining points seem to be:

  • Performance Impact

I think we can get a good grasp by running a) the benchmarks (e.g., Ares 6) that mach supports and perform the same tests we did for FP tracer. There, the primitive tainting was fairly fast (<15% overhead on page load times)

  • Reenable the JIT.
    I think this would generally be beneficial, although we would probably have to disable a bunch of sources by default; otherwise, we gain fairly little, as my impression is that, at some point, every script has tainted values.
    That's something we might want to discuss @tmbrbr; I think having too many sources/sinks active by default overwhelms the user and makes it difficult to store all the flows. With #2, we can make them directly present in the config, and the user can tailor her experiment to it. I will add an example of how to configure this to the Wiki.

  • Rework the taint flow storage/representation
    We had a chat about this at PETS. I spoke with Simon regarding storing taint flows in a graph database, and I was left with the impression that our use case seems not to fit perfectly well (millions of small graphs vs. large graphs) and that the question of selecting a graph database for this use case is fraught with issues. Say neo4j, the best-supported graph db according to my understanding, is fairly restrictive regarding features in the free version. The product brief is all over the place, but it seems like it can only use up to 4 CPU cores. While academic licenses are supposedly available, this probably does not help SAP/industrial users, and the getting Foxhound up and running hurdle is already fairly high. Adding "buy expensive storage engine" as step 2 seems like we'd hinder adoption.

I will look further, but I believe it might be best to decouple this change from this PR. Reworking the flow representation has benefits that go beyond primitive tainting. I certainly recall fighting this for the hand sanitizer study. As this change seems fairly disruptive (i.e., all tooling would require a significant rework), we probably have to get this right. What might be an option would be to store the flow internally as a directed graph and then offer the code to lineralize the flows to their current form. My impression is that the current Foxhound users mainly use Playwright (we can make an example available here) or extensions (same applies).
This would preserve compatibility with existing tooling while offering the benefits of storing a graph instead of a bunch of vectors. However, this impression might be completely wrong, as I have limited insight into how others use Foxhound!

So, to sum up this point: I believe we should decouple the flow representation from this PR and offer to disable flow creation for primitive sources via the configuration, similar to how we allow disabling taint sources. This would allow us to merge this PR earlier and avoid divergence between mainline Foxhound and primitaint Foxhound.

@tmbrbr
Copy link
Contributor Author

tmbrbr commented Aug 1, 2024

I agree that we should avoid diverging the main and primitaint branches, the idea of this PR is to have something we can merge to main.

I would also be in favour of splitting up the work if possible. As you suggest I think a full blown move to graph based taint flows is going to be a big effort and should be handled in a separate PR (if at all...).

leeN and others added 16 commits August 2, 2024 13:59
In this branch properties of WebIDL files can be marked as taint
sources with an attribute. This is super nice, but if one wants to
disable one of them one has to note down their names during a build.
This is pretty cumbersome. This commit tries to resolve this issue,
as it simply adds every IDL based source to the properties
defaults.
Before Array.indexOf and Array.includes had subtle differences in
behavior when it came to tainted values or keys when using said
functions.

This adds some special case for tainted numbers and unboxes them, so
that they behave the same. This requires to flag tainted numbers as not
bitwise comparable, which might have downstream effects (?)
@leeN leeN added enhancement New feature or request primitaint Tainting of primitive types (numbers, bool, etc.) labels Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request primitaint Tainting of primitive types (numbers, bool, etc.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants