[Question/Open Discussion] #567

Open

Fidget-Spinner opened this issue Mar 5, 2025 · 5 comments
Fidget-Spinner commented Mar 5, 2025

Hi Shopify YJIT folks. I work on CPython and its tracing JIT.

In CPython, we're seeing almost no gains at all from our baseline JIT (though this depends heavily on the architecture). By baseline I mean a straightforward translator from bytecode to machine code with just a few simple optimization passes. Our inline caching is already available in the interpreter, so the JIT only extends it rather than implementing it from scratch.

I understand that YJIT's "baseline" JIT is not just a one-to-one translator but performs some basic optimizations as well. I was wondering if there is any secret sauce behind the +90% speedup on railsbench that YJIT reports over its interpreter? When I read the original BBV paper, there were no heavy optimizations being done by BBV; it was mostly type-check removal, which was cheap to begin with anyway.

We have plans to make CPython's tracing work more like lazy BBV by using type information to guide trace projection, but those are just plans and not yet implemented.

Thank you!

maximecb commented Mar 5, 2025

Hello Ken,

Our first JIT attempts also tried to do fairly straightforward translation from bytecode to machine code and struggled to show performance gains. One of the first things we realized is that we were often switching back and forth between interpreted and JIT code and that was very expensive. If you are also exiting to the interpreter, I would recommend implementing some counters so you can measure how often that happens, and trying to stay in JITted code as much as possible.
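For illustration, a minimal sketch of such counters (the keying scheme and names are invented here; in a real VM these would be C-level counters bumped from the generated exit stubs):

```python
# Hypothetical sketch: count side exits per exit site so the hottest
# fallback paths become visible. Keyed by (function name, bytecode offset).
from collections import Counter

side_exits = Counter()

def record_side_exit(func_name, insn_offset):
    # Called (conceptually) from each exit stub the JIT emits.
    side_exits[(func_name, insn_offset)] += 1

def report_top_exits(n=20):
    # The hottest exit sites tell you which bytecodes to handle inline next.
    for (name, offset), count in side_exits.most_common(n):
        print(f"{count:>10}  {name} @ {offset}")
```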

> I was wondering if there is any secret sauce behind the +90% speedup on railsbench that YJIT reports over its interpreter? When I read the original BBV paper, there were no heavy optimizations being done by BBV; it was mostly type-check removal, which was cheap to begin with anyway.

Probably a couple of things:

  1. As much as possible, don't exit to the interpreter. Stay in JITted code.
  2. Ruby calls are very heavyweight/expensive, and I suspect it's the same in Python. If you are calling into C functions every time you need to set up or pop a frame, it's going to kill your performance. You want to be able to generate inline code for that, at least for the most common cases. You should see big perf gains there.
  3. Inline caches: you may want to improve your JIT's handling of those. Again, if you're calling into C functions, try to avoid that and generate some inline code instead (see the sketch after this list).
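As a concrete illustration of points 2 and 3, here is the shape of an inline fast path for a guarded instance-variable read, written as Python pseudocode (everything here is invented for illustration, not CPython's or YJIT's actual code); the emitted machine code would be just a compare, a branch, and an indexed load, with no C call on the hot path:

```python
class SideExit(Exception):
    """Stands in for a jump back to the interpreter (rarely taken)."""

def specialize_ivar_read(expected_shape_id, slot_index):
    # Specialized at compile time on the receiver shape observed so far.
    def fast_path(obj):
        if obj["shape_id"] != expected_shape_id:  # guard: one compare+branch
            raise SideExit                        # cold path
        return obj["slots"][slot_index]           # direct indexed load
    return fast_path

# e.g. read = specialize_ivar_read(7, 0); read({"shape_id": 7, "slots": [42]})
```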

Something that's kind of "in plain sight" but might not be obvious about LBBV is that it's very good at generating stretches of linear machine code that just "fall through". There are side-exit branches, but they are rarely taken; the normal execution flow is a linear path. It's also very good at dead code elimination. So you can end up with code that's very instruction-cache friendly and also works well with the CPU's prefetching.

Modern CPU branch prediction is quite good, but when it fails it can be very expensive, so ideally you don't want to rely on it; you want to generate machine code so that the normal execution path is linear. The default heuristic for branches the predictor has no data on is that backward branches (loop edges) are taken and forward branches are not. We've taken special care to lay out our machine code accordingly.
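To make the layout point concrete, here is a toy block-layout pass (the CFG and the likely/unlikely annotations are invented): blocks are emitted along the hot edge so the expected path falls through, and cold side-exit targets are placed after it, reached only by forward branches.

```python
# Toy layout pass: follow each block's likely successor so the hot path is
# one straight run; unlikely successors (side exits) are emitted last and
# become forward branches the default predictor assumes not-taken.
cfg = {              # block -> (likely successor, unlikely successor)
    "A": ("B", "exitA"),
    "B": ("C", "exitB"),
    "C": (None, None),
}

layout, block = [], "A"
while block is not None:
    layout.append(block)
    block = cfg[block][0]                    # keep following the hot edge

layout += [cold for _, cold in cfg.values() if cold]   # cold code last
print(layout)   # ['A', 'B', 'C', 'exitA', 'exitB']
```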

> We have plans to make CPython's tracing work more like lazy BBV by using type information to guide trace projection, but those are just plans and not yet implemented.

I'm obviously biased, but you could also try LBBV directly. The basic implementation can be quite simple. You can even do it with no type information, relying on the versioning only for code placement and dead code elimination.
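To make "quite simple" concrete, here is a from-scratch toy of the lazy-compilation core; the two-block program, the type contexts, and the closure-based "machine code" are all invented for illustration and are not YJIT's design:

```python
# Toy LBBV core: a block version is generated, and cached per type context,
# only the first time execution actually reaches it. In a real JIT the
# branch would be patched after the first compile; the dictionary lookup
# below stands in for that stub-and-patch mechanism.
versions = {}      # (block_id, type_ctx) -> compiled closure
compile_log = []   # record compilations so the laziness is observable

def get_version(block_id, type_ctx):
    key = (block_id, type_ctx)
    if key not in versions:
        compile_log.append(key)
        versions[key] = compile_block(block_id, type_ctx)
    return versions[key]

def compile_block(block_id, type_ctx):
    if block_id == "entry":
        def entry(x):
            # The type test acts as a guard; each outcome continues in a
            # separately versioned successor, so that version can omit
            # further guards on x (dead code elimination falls out free).
            ctx = "int" if isinstance(x, int) else "other"
            return get_version("body", ctx)(x)   # compiled only if reached
        return entry
    if block_id == "body":
        if type_ctx == "int":
            return lambda x: x + 1        # guard-free: type already proven
        return lambda x: str(x) + "!"     # separate version for other types
    raise KeyError(block_id)

print(get_version("entry", None)(41))   # 42
print(compile_log)  # [('entry', None), ('body', 'int')]: "other" never built
```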

maximecb self-assigned this Mar 5, 2025
eregon commented Mar 6, 2025

Regarding inline caching, since that doesn't seem directly answered: my understanding (correct me if I'm wrong) is that the CRuby interpreter has single-entry (i.e. non-polymorphic) inline caches in the bytecode for method lookup, constant lookup, and @ivar accesses with Shapes, and maybe a few other things (e.g. @@class_variables).
If I recall a discussion with k0kubun correctly, YJIT has its own inline caches, notably polymorphic inline caches for method lookups.
In my understanding, LBBV doesn't really need inline caches in the interpreter: it can build its own by "pausing" code generation, peeking at live values (e.g. to find out the receiver's class for a method lookup), and continuing to generate code based on that. The second and later entries in the cache are then, I guess, just side exits.
@maximecb Is that accurate? Does YJIT reuse the inline caches from the interpreter?

maximecb commented Mar 6, 2025

Yeah we don't need the interpreter's inline caches.

We relied on the interpreter's inline caches in early versions of YJIT, but we ran into issues because there were often branches that hadn't been executed yet, and so we had no useful information. Those caches were just not that useful as a source of information for the JIT.

YJIT has polymorphic dispatch chains, which can be faster than inline caches but have the downside that they are append-only and can't be rewritten. They are constructed by peeking at values at compilation time, as Benoit said.
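In Python pseudocode, such a chain might look like the sketch below (names invented; in YJIT the guards are emitted machine code, and extending the chain regenerates only the final side exit):

```python
# Hypothetical append-only dispatch chain: a fixed run of class guards
# built from receiver classes peeked at compilation time. New entries are
# appended at the tail; earlier guards are never rewritten.
def make_dispatch_chain(entries, side_exit):
    # entries: list of (klass, compiled_method) pairs, in discovery order
    def dispatch(receiver, *args):
        for klass, target in entries:       # in machine code: a short run
            if type(receiver) is klass:     # of compare-and-branch guards
                return target(receiver, *args)
        return side_exit(receiver, *args)   # unseen class: exit or append
    return dispatch
```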

Fidget-Spinner (Author) commented
Thanks for all the comments. I guess the main thing I'm confused by is the following:

  1. Laying out code in a linear fashion doesn't seem to gain us much compared to a computed-goto interpreter. Either branch prediction has gotten incredibly good, or we have small jumps that we're not efficient at removing (probably both).
  2. The folklore is that for languages like JS and Lua, each bytecode is quite simple, so the interpreter is dominated by interpretive overhead. In those cases, a baseline compiler makes a lot of sense. In Ruby and Python, each bytecode is gigantic (as Maxime pointed out in the other issue, method lookup alone is complicated), so execution is not dominated by interpretive overhead. However, LBBV's success seems to disprove that folklore in the case of CRuby. Is CRuby's base interpreter not doing that much compared to LBBV? I would think not, since on the benchmarks game the CRuby interpreter alone is faster than the CPython interpreter.

Regarding switching between the interpreter and the JIT, we see that a lot too. I think we measured a 1-2% geomean slowdown on programs that don't JIT well and switch a lot, so yes, that's experience we share with YJIT.

We use copy-and-patch for the JIT's machine code generator. I don't know how that compares to YJIT's assembler, but I'd assume it's roughly similar to a straightforward template generator.
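For readers unfamiliar with the term, a heavily simplified sketch of the copy-and-patch idea (this is nothing like CPython's actual stencil machinery): a precompiled machine-code template containing a "hole" is copied, and the hole is patched with a runtime constant.

```python
# Toy copy-and-patch: an x86-64 template for `mov rax, imm64; ret` whose
# 8-byte immediate is a hole filled per instantiation. Real stencils are
# produced by a C compiler and carry relocations for many kinds of holes.
import struct

TEMPLATE = b"\x48\xb8" + b"\x00" * 8 + b"\xc3"   # mov rax, imm64; ret
HOLE_OFFSET = 2                                  # byte offset of imm64

def instantiate(value: int) -> bytes:
    code = bytearray(TEMPLATE)                        # copy the template
    struct.pack_into("<Q", code, HOLE_OFFSET, value)  # patch the hole
    return bytes(code)

print(instantiate(42).hex())   # 48b82a00000000000000c3
```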

maximecb commented Mar 10, 2025

> Laying out code in a linear fashion doesn't seem to gain us much compared to a computed-goto interpreter. Either branch prediction has gotten incredibly good, or we have small jumps that we're not efficient at removing (probably both).

We had much the same problem early on, and what we found was that we were constantly side-exiting to the interpreter, which was killing performance. How often do you exit to the interpreter in large benchmarks?

> Regarding switching between the interpreter and the JIT, we see that a lot too. I think we measured a 1-2% geomean slowdown on programs that don't JIT well and switch a lot, so yes, that's experience we share with YJIT.

For YJIT, we measured how often we side-exit: we compute the percentage of bytecode instructions executed by YJIT versus by the interpreter. Alternatively, you can measure the average number of bytecode instructions your JIT executes before it exits to the interpreter.
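Spelled out with invented numbers, the two metrics are:

```python
# Hypothetical counters the VM would maintain; all numbers are made up.
jit_insns    = 980_000_000   # bytecode instructions retired in JIT code
interp_insns =  20_000_000   # bytecode instructions retired interpreting
exits        =     150_000   # JIT -> interpreter transitions

jit_ratio = jit_insns / (jit_insns + interp_insns)   # 0.98: 98% in the JIT
avg_insns_per_exit = jit_insns / exits               # ~6533; higher is better
```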

Generally I'm inclined to say: if removing dispatch overhead doesn't gain you anything, it's probably because the bottleneck is somewhere else. What kinds of benchmarks are you looking at? For YJIT, we have microbenchmarks, much larger benchmarks that try to be representative of real-world use cases, and also some large synthetic benchmarks.

If you don't have large benchmark programs that you can run, you can still write some code to generate large synthetic benchmarks. I generated a synthetic benchmark that is intentionally very large and has many conditional branches, to test how we do with larger pieces of code that can't all fit in the instruction cache.

Look at this synthetic benchmark:
https://raw.githubusercontent.com/Shopify/yjit-bench/refs/heads/main/benchmarks/30k_ifelse.rb

This is the code that generated that benchmark; it could easily be ported to Python:
https://github.com/Shopify/yjit-bench/blob/main/misc/gen_call_bench2.rb
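A hedged sketch of the same idea in Python, written from scratch rather than as a line-for-line port of gen_call_bench2.rb: generate one huge function full of conditional branches, so the code the JIT emits cannot all fit in the instruction cache.

```python
# Generate a benchmark file containing one large function of sequential
# `if` arms (kept un-nested so very large N still compiles), plus a driver.
N = 30_000   # knob: make this large enough to exceed your icache

lines = ["def big(i):"]
for k in range(N):
    lines.append(f"    if i == {k}:")
    lines.append(f"        return {k} + 1")
lines.append("    return -1")

with open("big_ifelse_bench.py", "w") as f:
    f.write("\n".join(lines) + "\n\n")
    f.write("import time\n")
    f.write("t = time.perf_counter()\n")
    f.write(f"for i in range({N}):\n    big(i)\n")
    f.write("print(time.perf_counter() - t)\n")
```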
