[Question/Open Discussion] #567
Comments
Hello Ken, Our first JIT attempts also tried to do fairly straightforward translation from bytecode to machine code and struggled to show performance gains. One of the first things we realized is that we were often switching back and forth between interpreted and JIT code and that was very expensive. If you are also exiting to the interpreter, I would recommend implementing some counters so you can measure how often that happens, and trying to stay in JITted code as much as possible.
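As a concrete starting point, here is a minimal sketch of such counters (the hook names are hypothetical, not actual CPython or YJIT APIs): each exit stub records the instruction it bailed out on, so you can see where the interpreter/JIT transitions concentrate.

```python
# Minimal sketch of per-exit-site counters; record_side_exit / report_top_exits
# are hypothetical hooks, not real CPython or YJIT internals.
from collections import Counter

exit_counts = Counter()

def record_side_exit(opcode_name, bytecode_offset):
    # Call this from every exit stub; the key identifies the exit site.
    exit_counts[(opcode_name, bytecode_offset)] += 1

def report_top_exits(n=10):
    if not exit_counts:
        print("no side exits recorded")
        return
    total = sum(exit_counts.values())
    print(f"total side exits: {total}")
    for (op, off), count in exit_counts.most_common(n):
        print(f"  {op} @ bytecode offset {off}: {count} ({count / total:.1%})")
```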
Probably a couple of things:
Something that's kind of "in plain sight" but might not be obvious about LBBV is that it's very good at generating stretches of linear machine code that just "fall through". There are side-exit branches, but they are rarely taken; the normal execution flow is a linear path. It's also very good at dead code elimination. So you can end up with code that's very instruction-cache friendly and also works well with the CPU's prefetching. Modern CPU branch prediction is quite good, but when it fails it can be very expensive, so ideally you don't want to rely on it: you want to generate machine code so that the normal execution path is linear. The default heuristic for branches where the branch predictor has no data is that backwards branches (loop edges) will be taken and forward branches will not be taken. We've taken special care to lay out our machine code accordingly.
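To make the layout concrete, here is a toy sketch in Python that emits pseudo-assembly in that style. It is only an illustration of the idea, not YJIT's actual code generator, and the instruction strings are not real encodings: each guard is a forward branch to an out-of-line exit stub, and all stubs are placed after the hot path so the common case is a straight line.

```python
# Toy illustration of fall-through layout: guards branch forward to
# out-of-line exit stubs placed after the hot path.
def emit_guarded_block(guards_and_fast_paths):
    hot_path, stubs = [], []
    for i, (guard_check, fast_path) in enumerate(guards_and_fast_paths):
        stub = f"side_exit_{i}"
        hot_path.append(f"    {guard_check}")
        hot_path.append(f"    jne {stub}        ; forward branch, rarely taken")
        hot_path.append(f"    {fast_path}")
    # Exit stubs live out of line, after the hot path, so they never
    # pollute the instruction cache on the common path.
    for i in range(len(guards_and_fast_paths)):
        stubs.append(f"side_exit_{i}:")
        stubs.append("    ; count the exit, reconstruct interpreter state, return")
    return "\n".join(hot_path + stubs)

print(emit_guarded_block([
    ("cmp [type_of_x], FIXNUM", "add rax, 2      ; unboxed x + 2"),
    ("cmp [type_of_y], FIXNUM", "add rax, rbx    ; unboxed x + y"),
]))
```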
I'm obviously biased, but you could also try LBBV directly. The basic implementation can be quite simple. You can even do it with no type information, relying on the versioning only for code placement and dead code elimination.
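Here is a compressed, self-contained toy of the lazy part of that approach (a made-up mini-IR, not any real VM's internals): each block is "compiled" only the first time control actually reaches it, so cold branches stay uncompiled, which is where much of the code placement and dead code elimination comes from.

```python
# Toy sketch of lazy per-block compilation: blocks of a tiny made-up IR
# are turned into Python closures only when first reached.
BLOCKS = {
    "entry": [("add", 1), ("branch_if_neg", "cold"), ("jump", "exit")],
    "cold":  [("add", 1000), ("jump", "exit")],   # never reached below
    "exit":  [("ret", None)],
}
compiled = {}

def compile_block(name):
    print(f"compiling block {name!r}")            # shows what actually gets compiled
    ops = BLOCKS[name]
    def run(acc):
        for op, arg in ops:
            if op == "add":
                acc += arg
            elif op == "branch_if_neg":
                if acc < 0:
                    return get_block(arg)(acc)    # side branch, compiled lazily
            elif op == "jump":
                return get_block(arg)(acc)
            elif op == "ret":
                return acc
        return acc
    return run

def get_block(name):
    if name not in compiled:
        compiled[name] = compile_block(name)
    return compiled[name]

print(get_block("entry")(41))   # "cold" is never compiled: free dead code elimination
```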
Regarding inline caching, since that doesn't seem directly answered: my understanding (correct me if I'm wrong) is that the CRuby interpreter has single-entry (i.e. non-polymorphic) inline caches in the bytecode for method lookup and constant lookup, and the question is whether YJIT reuses those or builds its own.
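For context, a minimal sketch of what a single-entry (monomorphic) inline cache looks like in general; this illustrates the idea rather than CRuby's actual data structures, and it ignores invalidation on method redefinition.

```python
# Minimal sketch of a single-entry (monomorphic) inline cache for method lookup.
class InlineCache:
    __slots__ = ("cached_class", "cached_method")
    def __init__(self):
        self.cached_class = None
        self.cached_method = None

def cached_call(cache, receiver, name, *args):
    klass = type(receiver)
    if cache.cached_class is not klass:          # miss: do the slow lookup once
        cache.cached_class = klass
        cache.cached_method = getattr(klass, name)
    return cache.cached_method(receiver, *args)  # fast path on repeat hits

send_site_cache = InlineCache()                  # one cache per call site
print(cached_call(send_site_cache, "abc", "upper"))   # miss, fills the cache
print(cached_call(send_site_cache, "def", "upper"))   # hit: same receiver class
```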
Yeah, we don't need the interpreter's inline caches. We relied on the interpreter's inline caches in early versions of YJIT, but we ran into issues because there were often branches that hadn't been executed yet, so we had no useful information; those caches were just not that useful in terms of providing information to the JIT. YJIT has polymorphic dispatch chains, which can be faster than inline caches but have the downside that they are append-only and can't be rewritten. They are constructed by peeking at values at compilation time, like Benoit said.
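A rough sketch of the dispatch-chain idea as described above (a hypothetical structure, not YJIT's actual implementation): a linear, append-only run of class guards, each guarding a direct call, with a generic fallback once the chain is full.

```python
# Rough sketch of an append-only dispatch chain: a run of class checks,
# each guarding a direct call, extended when a new class is seen.
MAX_CHAIN_LEN = 4

def make_dispatch_chain(method_name):
    chain = []                                   # list of (klass, target) pairs

    def call(receiver, *args):
        klass = type(receiver)
        for guard_klass, target in chain:        # compiled as a run of guards
            if klass is guard_klass:
                return target(receiver, *args)
        if len(chain) < MAX_CHAIN_LEN:           # append-only: never rewritten
            target = getattr(klass, method_name)
            chain.append((klass, target))
            return target(receiver, *args)
        return getattr(receiver, method_name)(*args)   # generic fallback

    return call

dispatch_len = make_dispatch_chain("__len__")
print(dispatch_len([1, 2, 3]), dispatch_len("abcd"), dispatch_len((1,)))
```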
Thanks for all the comments. I guess the main thing I'm confused by is the following:
Regarding switching between the interpreter and the JIT, we get that a lot too; I think we measured a 1-2% geomean slowdown on programs that don't JIT well and switch a lot, so yes, that's experience we also have, similar to YJIT. We use copy-and-patch for the JIT's machine code generator. I don't know how that compares to YJIT's assembler, but I'd assume it's roughly similar to a straightforward template generator.
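For readers unfamiliar with copy-and-patch, here is a very rough sketch of the general idea (illustrative only, not CPython's actual stencil machinery): precompiled machine-code templates contain placeholder "holes" that are copied and then patched with real immediates and addresses at JIT time.

```python
# Very rough sketch of the copy-and-patch idea: a stencil with a hole
# for a 64-bit immediate gets copied and patched at JIT time.
HOLE = b"\xde\xad\xbe\xef\xde\xad\xbe\xef"       # 8-byte placeholder

STENCILS = {
    # "load a constant into a register", with a hole for the immediate
    "LOAD_CONST": b"\x48\xb8" + HOLE,            # movabs rax, <imm64>
}

def instantiate(stencil_name, value):
    code = bytearray(STENCILS[stencil_name])     # copy the stencil...
    offset = code.find(HOLE)
    code[offset:offset + 8] = value.to_bytes(8, "little")   # ...and patch the hole
    return bytes(code)

print(instantiate("LOAD_CONST", 42).hex())
```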
We had kind of the same problem early on, but what we also found was that we were constantly side-exiting to the interpreter and that was killing performance. How often do you exit to the interpreter in large benchmarks?
For YJIT, we measured how often we side-exit. We compute the percentage of bytecode instructions executed by YJIT vs. by the interpreter. Alternatively, you can measure the average number of bytecode instructions executed by your JIT before it exits to the interpreter. Generally I'm inclined to say: if removing dispatch overhead doesn't gain you anything, the bottleneck is probably somewhere else.

What kinds of benchmarks are you looking at? For YJIT, we have microbenchmarks, much larger benchmarks that try to be representative of real-world use cases, and also some large synthetic benchmarks. If you don't have large benchmark programs that you can run, you can still write code to generate large synthetic benchmarks. I generated a synthetic benchmark that is intentionally very large and has conditional branches, to test how we did with larger pieces of code that can't all fit in the instruction cache. Look at this synthetic benchmark: This is the code that generated said benchmark. Could easily be ported to Python:
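A hedged sketch of what such a generator could look like in Python follows; this is not the generator referenced above, and the exact shape of the emitted code is an assumption, but it shows the idea of generating a lot of branchy code whose working set exceeds the instruction cache.

```python
# Sketch of a generator for a large, branchy synthetic benchmark.
import random

def gen_benchmark(num_funcs=500, branches_per_func=20, seed=1234):
    random.seed(seed)
    lines = []
    for f in range(num_funcs):
        lines.append(f"def fn_{f}(x):")
        for _ in range(branches_per_func):
            # Data-dependent branches spread over many functions, so the
            # generated code does not fit in the instruction cache.
            lines.append(f"    if (x + {random.randint(0, 1000)}) % {random.randint(2, 7)} == 0:")
            lines.append(f"        x += {random.randint(1, 9)}")
            lines.append("    else:")
            lines.append(f"        x -= {random.randint(1, 9)}")
        lines.append("    return x")
        lines.append("")
    calls = " + ".join(f"fn_{f}(i)" for f in range(num_funcs))
    lines.append("def run(n):")
    lines.append("    total = 0")
    lines.append("    for i in range(n):")
    lines.append(f"        total += {calls}")
    lines.append("    return total")
    return "\n".join(lines)

# Small demo: print a tiny generated benchmark instead of writing it to a file.
print(gen_benchmark(num_funcs=3, branches_per_func=2))
```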
Hi Shopify YJIT folks. I work on CPython and its tracing JIT.
In CPython, we're seeing almost no gains at all for our baseline JIT (though this depends heavily on the architecture). When I say baseline, I mean just a straightforward translator from bytecode to machine code with a few simple optimization passes. Our inline caching is already available in the interpreter, so it is only extended by the JIT, not implemented from scratch.
I understand that YJIT's "baseline" JIT is not just a one-to-one translator but does some basic optimizations as well. I was wondering if there is any secret sauce behind the +90% speedup on railsbench that YJIT reports over its interpreter? When I read the original BBV paper, there were no heavy optimizations being done by BBV; it was mostly type check removal, which was cheap to begin with anyway.
We have plans to make CPython tracing work more like lazy BBV by using type information to guide trace projection, but those are just plans, and not yet implemented.
Thank you!