
feat: unbounded lookahead using adaptive predict #1714

Merged
merged 14 commits into Chevrotain:master-LL-star
Jan 8, 2022

Conversation

@msujew (Collaborator) commented Nov 13, 2021

Implements an LL(*) lookahead algorithm and removes the existing LL(k) lookahead. The design is heavily inspired by Antlr's existing LL(*) solution titled "Adaptive LL(*)". In fact, the general structure of the Antlr lookahead code has been adapted to provide this feature.

There is a caveat that differentiates this approach from the LL(k) lookahead algorithm: as we don't simply iterate over a sorted list of alternatives, we can't decide on taking the "first" alternative in the alternation. Instead, the LL(*) algorithm will always return the longest matching alternative, just like Antlr does.
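
To make the difference concrete, here is a minimal sketch (hypothetical tokens and rule, not taken from this PR) of an alternation that no fixed k can decide: both alternatives begin with an unbounded run of `Id`, so the existing LL(k) engine rejects this alternation as ambiguous at self-analysis, while an LL(*) automaton can simply scan ahead until `:=` or `;` settles the choice.

```typescript
import { createToken, CstParser, Lexer } from "chevrotain";

const Id = createToken({ name: "Id", pattern: /[a-z]+/ });
const Assign = createToken({ name: "Assign", pattern: /:=/ });
const Semi = createToken({ name: "Semi", pattern: /;/ });
const WS = createToken({ name: "WS", pattern: /\s+/, group: Lexer.SKIPPED });

class DemoParser extends CstParser {
  constructor() {
    super([Id, Assign, Semi, WS]);
    this.performSelfAnalysis();
  }

  public statement = this.RULE("statement", () => {
    this.OR([
      {
        // "a b c := x ;" -- assignment-like alternative
        ALT: () => {
          this.AT_LEAST_ONE(() => this.CONSUME(Id));
          this.CONSUME(Assign);
          this.CONSUME1(Id);
          this.CONSUME(Semi);
        },
      },
      {
        // "a b c ;" -- expression-like alternative; only the token after
        // the Id run (Assign vs. Semi) distinguishes the two alternatives
        ALT: () => {
          this.AT_LEAST_ONE1(() => this.CONSUME2(Id));
          this.CONSUME1(Semi);
        },
      },
    ]);
  });
}
```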

While this PR is mostly feature complete, there are still some things that have yet to be addressed. Nevertheless, I wanted to create the PR at this stage to get some feedback on the design of the feature. The documentation for this feature is still missing, as is validation.

@bd82 (Member) commented Nov 14, 2021

Hello @msujew
I did not dive into the algorithm yet. Below are my initial impressions.

Feature Questions

  • What is the relationship (if any) between adaptivePredict and maxLookahead?
  • If adaptivePredict is enabled, does it mean that all lookahead logic is done using the new logic?
  • If the above is true, does a hybrid approach make sense?
  • If a hybrid approach does not make sense, could maxLookahead = "*" remove the need for the new adaptivePredict config option?
  • What is the performance impact of enabling adaptive predict? (I may check this tomorrow.)

Code Organization / Beginning Code Review

  • I think it may be better to separate the new logic and tests into separate new files.
  • Also, some of the functions are getting quite large, e.g. the lookahead function builder should probably be split
    into multiple helper functions.
  • The use of isArray(alternatives) in multiple places is not the most readable style; we should more clearly distinguish between the different flows, and maybe consider whether it is possible to avoid "overloading" the return function of the lookahead helpers (depending on how distinct the LL(k) and LL("*") flows are).

@msujew (Collaborator, Author) commented Nov 14, 2021

What is the relationship (if any) between adaptivePredict and maxLookahead? If adaptivePredict is enabled, does it mean that all lookahead logic is done using the new logic?

So, basically, the adaptivePredict flag indicates that once k=maxLookahead has been reached while computing the lookahead function and an ambiguous alternative has been found (i.e. LL(k) is not powerful enough to parse the alternative), a prediction automaton is built instead. See this code. It does this on demand; the usual approach will be used if LL(k) is enough to parse the alternative. The current implementation is a hybrid approach. You can force the prediction automaton to be used everywhere by setting maxLookahead=0, as I've done in a few of my test cases.
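
A minimal configuration sketch of that hybrid behavior, assuming the `adaptivePredict` option proposed in this PR (it is not a released Chevrotain setting, hence the cast), with `allTokens` standing in for your token vocabulary:

```typescript
import { CstParser, TokenType } from "chevrotain";

declare const allTokens: TokenType[];

class HybridParser extends CstParser {
  constructor() {
    super(allTokens, {
      maxLookahead: 1,       // decide with plain LL(1) where possible...
      adaptivePredict: true, // ...and fall back to the LL(*) prediction
                             // automaton for anything LL(1) cannot decide
    } as any); // cast: the option only exists on this PR's branch
    this.performSelfAnalysis();
  }
}

// Per the comment above, maxLookahead: 0 would skip LL(k) entirely and
// route every alternation through the prediction automaton.
```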

What is the performance impact of enabling the adaptive predict? (I may check this tomorrow).

Good question. I haven't tested it so far. I plan to evaluate the performance more in depth (mostly CPU/memory consumption) for my thesis, but that will take some time. I would be grateful if you could do some basic performance testing for this feature.

In theory the performance of this should be quite good, especially after it has completed its "warmup" phase. Compared to the usual LL(k) approach, the lookahead strategy in adaptive LL(*) is built during parsing, not during the self analysis. When it parses an alternative for the first time, it will run the possiblePathsFrom function with a lookahead of 1 for each token it parses, creating a new state for that token. Afterwards, the new state is cached and used on subsequent runs through that alternative. This should be faster than the usual approach of iterating over every alternative for every token encountered (up to k). The Antlr paper states that this approach is O(n) for most grammars (although O(n^4) in the worst case, with n being the number of tokens needed to identify the alternative).
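
Rough pseudocode of that warmup-then-cache behavior (illustrative names; `possiblePathsFrom` is an existing Chevrotain helper, but this driver loop and `computeTargetState` are simplifications, not the PR's code):

```typescript
import type { IToken } from "chevrotain";

interface DFAState {
  edges: Map<number, DFAState>; // outgoing edges keyed by tokenTypeIdx
  prediction: number;           // alternative index once unique, else -1
}

// Hypothetical: simulate the grammar one token further (conceptually a
// possiblePathsFrom call with k = 1) to derive the successor state.
declare function computeTargetState(s: DFAState, tokenTypeIdx: number): DFAState;

function predict(start: DFAState, tokens: IToken[], offset: number): number {
  let state = start;
  for (let i = offset; i < tokens.length; i++) {
    const tokType = tokens[i].tokenTypeIdx;
    let next = state.edges.get(tokType);
    if (next === undefined) {
      // Cold path (first pass through this alternative): compute and cache.
      next = computeTargetState(state, tokType);
      state.edges.set(tokType, next);
    }
    state = next;
    if (state.prediction !== -1) {
      // Warm path afterwards is just one cached lookup per token.
      return state.prediction;
    }
  }
  return -1; // ran out of tokens without a unique prediction
}
```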

Thanks for the initial review/feedback, I will get to that soon.

@bd82 (Member) commented Nov 15, 2021

Hybrid Approach

Generally, this is the optimal approach IMHO.
Some things to consider:

  • If the performance impact is not large (or even positive), we could consider only using LL(*) lookahead.
  • Alternatively, we could consider a hybrid approach (hidden from the user) where the old LL(1) lookahead is used and LL(*) is used for anything else (similar to maxLookahead=1 + adaptivePredict=true).
  • If the end-user can choose adaptivePredict on the grammar level, we may also want to enable choosing it on a specific
    DSL production level, just as maxLookahead can be used (see the sketch after this list): https://chevrotain.io/docs/guide/resolving_grammar_errors.html#AMBIGUOUS_ALTERNATIVES
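
For illustration, a per-alternation opt-in could mirror how MAX_LOOKAHEAD can already be set on a single OR today (see the linked docs); the ADAPTIVE_PREDICT flag in this sketch is hypothetical, and the snippet assumes it sits inside a CstParser subclass with `object`/`array` rules:

```typescript
private value = this.RULE("value", () => {
  this.OR({
    MAX_LOOKAHEAD: 3, // existing per-alternation override
    // ADAPTIVE_PREDICT: true, // hypothetical per-alternation opt-in
    DEF: [
      { ALT: () => this.SUBRULE(this.object) },
      { ALT: () => this.SUBRULE(this.array) },
    ],
  });
});
```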

@bd82 (Member) commented Nov 15, 2021

I've tried running some benchmarks, but I am getting parsing errors. For example, with the basic JSON grammar example: if I change it to use "maxLookahead=0" and "adaptivePredict=true", it fails the very simple unit test.

@msujew (Collaborator, Author) commented Nov 15, 2021

I've tried running some benchmarks but I am getting parsing errors.

Interesting, it seems like none of the example grammars work with that config. Weird that my tests couldn't catch that. Thanks for the info, I'll investigate it.

@msujew (Collaborator, Author) commented Nov 16, 2021

@bd82 I fixed most grammars for now. The buildSingleAlternativeLookaheadFunction implementation for my DFAs was incorrect. While most simple grammars run well now, there are still a few issues:

  1. CSV parser: Empty alternatives are currently ignored.
  2. GraphQL parser: I have no idea what's currently wrong there, more investigation needed.
  3. ECMA5 parser: Due to an ambiguity the DFA never reaches an accepting state. Antlr somehow deals with this, but my implementation doesn't.

I was able to test my approach for the JSON and CSS grammars using the benchmark. However, I am not happy with the current results:

[benchmark results screenshot]

I will rethink my approach and take a closer look at what Antlr does.

@bd82 (Member) commented Nov 16, 2021

I will rethink my approach and take a closer look at what Antlr does.

  • Is your benchmark in "parser only" mode?
  • I would start with correctness first and only then move to performance optimizations.
  • Some things are easier to optimize if code generation is involved; I am unsure if a code generation step fits Chevrotain
    (probably not). But it may still be interesting to inspect what could be gained by generating the lookahead functions.
  • A hybrid solution may still have value if it can make it easier to parse certain languages, particularly if the performance can be improved.

@msujew (Collaborator, Author) commented Nov 16, 2021

Is your benchmark in "parser only" mode?

Yes.

I would start with correctness first and only then move to performance optimizations.

Sure, I planned on correcting the issues first. Nevertheless, I think my current approach to the issue is too naive. I looked at Antlr's implementation of LL(*) (which is actually interpreted; there is no code generation done for the lookahead automaton), and I'm missing some crucial elements of the algorithm, which should be computed during the self analysis phase. For example, Antlr actually builds two separate automata. The lookahead automaton is built during parsing, while the other is built at startup and is used to compute the actual lookahead automaton.
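
For readers unfamiliar with the paper, the split looks conceptually like this (illustrative types, not the PR's actual code): the grammar-derived automaton (ANTLR's ATN) is built once up front, and the lookahead DFA is materialized lazily from it during parsing.

```typescript
interface ATNState {
  // a grammar position, plus epsilon/terminal transitions to other positions
  transitions: Array<{ target: ATNState; tokenTypeIdx?: number }>;
}

interface ATN {
  decisionStates: ATNState[]; // one entry point per alternation in the grammar
}

interface LookaheadDFA {
  // DFA states keyed by the set of ATN configurations they represent
  states: Map<string, { prediction: number }>;
}

// Built once, during the self-analysis phase:
declare function buildATN(grammarRules: object[]): ATN;

// Invoked per alternation at parse time; extends `dfa` with new states as
// needed and returns the predicted alternative index:
declare function adaptivePredict(atn: ATN, dfa: LookaheadDFA, nextTokenTypes: number[]): number;
```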

a Hybrid solution may still have value if it can make it easier to parse certain languages, particularly if the performance can be improved.

That's true, however my main incentive for building this feature - besides writing about it in my master's thesis - is using it in Langium for every lookahead k > 1. That would heavily improve the usability of the framework for more complex alternatives. Therefore, I aim to keep the performance impact to a minimum.

@bd82 (Member) commented Nov 16, 2021

is using it in Langium for every lookahead k > 1

If we can achieve good enough performance, I'd like to evaluate using adaptive predict like this in Chevrotain
as well. Basically, get rid of the old LL(k) lookahead logic, mainly for long-term maintainability concerns
(fewer lookahead modes --> simpler --> easier to support).

@bd82 (Member) commented Nov 16, 2021

BTW @msujew WDYT about implementing the new lookahead logic as a separate package?

  • e.g: like the utils package.

I am trying to slowly transition Chevrotain to a mono-repo structure.
So it might make sense for significant new functionality to be added as a new sub-package.

Naturally, this would only be possible if the new lookahead logic has no dependencies on
concrete implementations inside the "main" chevrotain package; otherwise we would have cyclic package dependencies...

@msujew (Collaborator, Author) commented Nov 16, 2021

WDYT about implementing the new lookahead logic as a separate package?

Sounds reasonable. That would probably even help me quite a lot, basically giving me a "clean slate" to start the feature off; I've limited myself a bit by reusing existing interfaces as much as possible. It would probably also make testing a bit easier.

@msujew (Collaborator, Author) commented Dec 5, 2021

@bd82 I have spent the last few weekends working on a new implementation based on the original ANTLR implementation. The results for JSON, CSS and ECMA5, processed purely through LL(*), are quite promising, even without the optimization of using LL(1) where appropriate (that should improve performance noticeably):

[benchmark results screenshot]

I assume I'll have something ready for a basic review by next weekend. :)

@msujew (Collaborator, Author) commented Dec 5, 2021

I just added the LL(1) optimization as well. I'm a bit confused by the CSS results, but I assume this is a consequence of not caching token categories correctly in the lookahead DFA. Everything else looks as expected:

[benchmark results screenshot]

@bd82 (Member) commented Dec 9, 2021

I will try to have a look at this over the weekend.

@msujew (Collaborator, Author) commented Dec 10, 2021

@bd82 Alright, pushed everything to the upstream branch connected to this PR. 14 tests are currently failing, but that's because they either rely on old behavior (I completely removed LL(k) lookahead for now) or override the related APIs, which aren't called anymore.

Take your time with reviewing, this isn't time sensitive for Langium and my thesis doesn't depend on getting this merged.

@msujew (Collaborator, Author) commented Dec 29, 2021

@bd82 Great news: I just added a small optimization, replacing the DFA cache (previously a Map, used in a really hot loop) with a plain Object. This increased performance quite drastically:

[benchmark results screenshot]

It is now way faster than LL(k) for more complex grammars (at least in my benchmarks). Additionally, I was able to fix a few more tests that were failing due to a bad implementation of LL(1) on my part. Only 6 tests are still failing; they rely on overriding the LL(k) lookahead method.
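
A minimal sketch of the kind of cache swap described above (illustrative names, not the PR's identifiers):

```typescript
type DFAState = { prediction: number };

// Before: a Map lookup on every token in the hot prediction loop.
const mapCache = new Map<number, DFAState>();
function nextViaMap(tokenTypeIdx: number): DFAState | undefined {
  return mapCache.get(tokenTypeIdx);
}

// After: a plain object. With small integer keys, engines like V8 can often
// serve this as a fast indexed/property access, which is a plausible source
// of the speedup observed in these benchmarks.
const objCache: Record<number, DFAState> = Object.create(null);
function nextViaObj(tokenTypeIdx: number): DFAState | undefined {
  return objCache[tokenTypeIdx];
}
```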

Also please ignore the changed configs and debug code. I plan on cleaning this up sometime.

@bd82 (Member) commented Jan 5, 2022

Great news, I just added a small optimization, replacing the DFA cache (previously a Map) with an Object, which was used in a really hot loop. This increased performance quite drastically:

@msujew
Oh wow, very impressive. This gives me more incentive to find time on the weekend to dive into this 😄

@bd82 (Member) commented Jan 7, 2022

Okay, I'm finally starting to dive into this topic.
I've read the LL-star document and hopefully understood most of it (at least at a high level).

Also ran the benchmark on:

  • Chrome Canary
  • "Parser Only" flow
  • versus release 9.1.0 (should be a similar, though not exact, baseline to current master)

[benchmark results screenshot]

It shows slightly different results from what you posted, but they are still very good (non-blocking), and there can be differences between different engines/CPUs/etc., so that does not matter.

@bd82 bd82 changed the base branch from master to master-LL-star January 8, 2022 15:52
@bd82 (Member) commented Jan 8, 2022

Hi @msujew

I am merging the current work into the new master-LL-star branch on this repo,
and I've sent you an invite for write permissions so we can more easily collaborate on this topic.

@bd82 bd82 merged commit df2a11f into Chevrotain:master-LL-star Jan 8, 2022
@bd82 bd82 mentioned this pull request Jan 12, 2022
msujew added a commit that referenced this pull request Jan 17, 2022
msujew added a commit to msujew/chevrotain that referenced this pull request Feb 26, 2022
msujew added a commit that referenced this pull request Feb 26, 2022
msujew added a commit that referenced this pull request Apr 8, 2022