Darmo rakzati rakzitaH doesn't get correctly parsed as a vAkya #179
Comments
It does get parsed correctly for one of the two options:

    from sanskrit_parser import Parser
    from indic_transliteration import sanscript

    parser = Parser(output_encoding=sanscript.SLP1)
    splits = parser.split('Darmo rakzati rakzitaH', limit=2)
    for split in splits:
        parses = split.parse(limit=2)
        if parses is not None:
            for parse in parses:
                print(str(parse))

produces:
In the first split, where […] However, when we pass […]
(sanskrit_parser/sanskrit_parser/api.py, lines 112 to 115 in 9fbae62)
This then results in no valid parse being found. After splitting, the UI uses pre-segmented mode, which is what produces this error in the UI. I'm not sure of the best way to fix this. Visarga handling has always been a pain point for me.
rakzitaH can be
In sanskrit_parser/sanskrit_parser/api.py, lines 112 to 124 in 9fbae62, we prioritize the r form over the s form, if it exists, during a pre-segmented split (used by the UI). Ideally, we should be supplying both downstream. This case shows that prioritizing either form doesn't work.
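To make "supplying both downstream" concrete, here is a minimal sketch, assuming SLP1 input where a word-final `H` is a visarga. The helpers `visarga_variants` and `expand_presegmented` are hypothetical names for illustration and are not part of sanskrit_parser's API.

```python
from typing import List


def visarga_variants(word: str) -> List[str]:
    """Return the candidate underlying forms of an SLP1 word.

    Hypothetical helper: a word-final visarga ('H') may stand for
    either 's' or 'r', so both candidates are returned instead of
    picking one. Any other word is returned unchanged.
    """
    if word.endswith("H"):
        stem = word[:-1]
        return [stem + "s", stem + "r"]
    return [word]


def expand_presegmented(words: List[str]) -> List[List[str]]:
    """Map each pre-segmented word to its list of candidate forms."""
    return [visarga_variants(w) for w in words]


print(expand_presegmented(["Darmas", "rakzati", "rakzitaH"]))
# [['Darmas'], ['rakzati'], ['rakzitas', 'rakzitar']]
```

Whether the s or the r candidate is listed first is arbitrary here; the point is only that neither is discarded before the parse step.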
This is my summary of a potential fix
Indeed, I should have said that we prioritize the
Strict mode ( That suggests a better fix. We should expect pre-segmented input in strict mode (since it's used mostly for testing and for the UI, both of which can do that). The current s/r tests can be retained as a fallback (or removed, if you prefer). I've implemented it in the
I have merged the PR.
How do we handle this in the command line script? Should we turn
This raises the larger question of what belongs in the parser and what belongs in a UI (either text or graphical). Especially with an application such as ambuda, which has a capable UI, shouldn't we be leaving display decisions to them? On the other hand, those who would like to use the command-line script shouldn't be left high and dry either. Earlier, #56 opened the question of handling visargas, anusvaras, etc. I would like to suggest that we refactor to separate the core sandhi/parse functionality from the anusvara/visarga handling.
IMHO, the majority of users who would want to use a parser are probably not looking to understand the nuances of a visarga that arose from sasajuSho ruH vs something else. I am in favor of just displaying one split, by collapsing the two options in this case. Later, for parsing, we could try replacing the visarga with both 'r' and 's' and return the combined results as you suggested. We can do that in a different layer, or at the API entry points.
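A rough sketch of the "try both 'r' and 's' and return the combined results" idea follows. It assumes SLP1 input and takes the parser as an opaque callable, so `parse_fn`, `final_forms`, and `parse_all_variants` are all hypothetical names, not sanskrit_parser functions. `parse_fn` is assumed to return an iterable of hashable results, or `None` when no parse is found.

```python
from itertools import product
from typing import Callable, Iterable, List, Optional


def final_forms(word: str) -> List[str]:
    """Candidate forms for a word: a final visarga 'H' becomes 's' or 'r'."""
    if word.endswith("H"):
        return [word[:-1] + "s", word[:-1] + "r"]
    return [word]


def parse_all_variants(
    words: List[str],
    parse_fn: Callable[[List[str]], Optional[Iterable[str]]],
) -> List[str]:
    """Run parse_fn on every s/r combination and merge the results.

    Duplicate results are collapsed, preserving first-seen order.
    """
    seen = set()
    combined = []
    options = [final_forms(w) for w in words]
    for candidate in product(*options):
        for result in parse_fn(list(candidate)) or []:
            if result not in seen:
                seen.add(result)
                combined.append(result)
    return combined


def fake_parse(candidate: List[str]) -> Optional[List[str]]:
    # Stand-in for a real parser: accepts only a final word ending in 's'.
    if candidate[-1].endswith("s"):
        return [" ".join(candidate)]
    return None


print(parse_all_variants(["Darmas", "rakzati", "rakzitaH"], fake_parse))
# ['Darmas rakzati rakzitas']
```

Note that the number of candidates doubles with every visarga-final word, so a real implementation would likely need a cap or a smarter merge than this brute-force product.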
Could you please elaborate on what you have in mind? I thought that as a result of #56, the normalization is only done at the entry points anyway, and the core parser/sandhi don't deal with it.
Initial debug - looks like rakzitaH (and other kta forms) are not being treated fully correctly in the vAkya parser. I need to dig into my notes about why this limitation exists and figure out how to remove it.