Metrics for evaluating performance of lexical/morphological analyzer #84

Open
avinashvarna opened this issue Jan 9, 2018 · 9 comments
@avinashvarna
Collaborator

We need to develop metrics for evaluating the performance of the analyzers. This would be useful when choosing between databases for tag lookup, or between different approaches to lexical/morphological analysis.

From #82 (comment)

Perhaps precision can be defined as the % pass in the UoHD test suite, and recall as some check of whether each reported split re-joins to the input sentence?

This would be a good start. Currently we do not pay much attention to the pass/fail counts in the test suite. My concern is that the UoHD dataset entries are not broken down into simple roots, and we use the splitter to split them until we get words that are in the db (as discussed before - #19 (comment)). I am not sure this will give us an accurate representation of the performance.

We should start looking into the DCS database to see if it is more appropriate. E.g. for the Level 1 database/tag lookups, we could perhaps start with the roots provided in the DCS database and see how many are identifiable using the level 1/tag lookup db. We can then build the tests up to the lexical/morphological levels.
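The two proxy metrics proposed above could be sketched as follows. This is a hedged illustration, not project code: the `results` booleans and the `join` callable are assumptions supplied by the caller, and for real Sanskrit input `join` would have to apply sandhi rules (a plain space-join only works for sandhi-free examples).

```python
def suite_precision(results):
    """Precision proxy: fraction of UoHD test cases that pass.
    `results` is an iterable of booleans, one per test case."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


def join_check_recall(cases, join):
    """Recall proxy: fraction of inputs whose reported split
    re-joins to the original sentence.  `join` must undo the
    split (sandhi-aware for real Sanskrit input)."""
    cases = list(cases)
    ok = sum(1 for sentence, split in cases if join(split) == sentence)
    return ok / len(cases) if cases else 0.0
```

For a sandhi-free example, `join_check_recall([("naraH vA", ["naraH", "vA"])], " ".join)` evaluates to 1.0; a split like [kaH, cit] for "kaScit" would need a sandhi-aware join to be counted.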

@avinashvarna
Collaborator Author

First results from a "quick and dirty" script I wrote to evaluate word lookup accuracy (recall, if you will):
The script goes through the DCS database and, for every word tagged as a single word (i.e. no samAsa/sandhi), checks whether the word is recognized as valid by the two level 1 lookup options.

Inria lookup recognized 1,447,362 out of 2,333,485 words (~62.0%)
Sanskrit data recognized 1,735,547 out of 2,333,485 words (~74.4%)

At a first pass, it looks like the sanskrit data based lookup recognized about 288k more words. I think it is definitely worthwhile to move to it. As we incorporate more and more of the Inria db into it, it will remain the better choice from a recall perspective.
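The evaluation loop boils down to a recall count over the DCS single words. A minimal sketch, where the `is_known` predicate stands in for either level 1 lookup (it is an assumption, not the actual lookup API):

```python
def lookup_recall(words, is_known):
    """Recall of a level-1 lookup over DCS single words.
    Returns (number recognized, total, recall fraction)."""
    words = list(words)
    recognized = sum(1 for w in words if is_known(w))
    fraction = recognized / len(words) if words else 0.0
    return recognized, len(words), fraction
```

Plugging in the reported counts gives recalls of 1447362/2333485 ≈ 62.0% (Inria) and 1735547/2333485 ≈ 74.4% (sanskrit data).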

It may look like the overall accuracy is quite low, but there are two mitigating factors:

  • Due to this issue, some words in the DCS that are samastapadas are currently treated as akhandapadas, which reduces the overall accuracy.
  • kriyApadas with upasargas are stored in the DCS as one word, e.g. vyAkhyAsyAmaH. These cannot be recognized by the L1 lookup in our setting.

So the actual accuracy may be somewhat higher.

Next steps:

  • Need to ensure that the tags from the lookup contain the annotation in the DCS.
  • We could look at a measure of avg. precision (= 1/no. of candidates retrieved per lookup). In the sanskrit data based approach, false retrievals are possible because we try to predict whether a form can be derived from the anta. Alternatively, high recall with low precision may be acceptable for the form lookup, since the higher layers will filter out the incorrect forms.
  • Repeat the above for the lexical and morphological analyzer. (Will need to handle the upasarga problem at this stage).
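The average-precision idea in the second bullet (one over the number of candidates each lookup returns) could be sketched as below. This is an illustration only; the candidate counts are assumed to come from the lookup layer:

```python
def average_precision_proxy(candidate_counts):
    """Mean of 1/(candidates returned per lookup).
    A lookup that returns no candidates contributes 0.
    Fewer candidates per lookup means a sharper (more precise) lookup."""
    counts = list(candidate_counts)
    scores = [1.0 / n if n else 0.0 for n in counts]
    return sum(scores) / len(scores) if scores else 0.0
```

E.g. three lookups returning 1, 2, and 4 candidates score (1 + 0.5 + 0.25)/3 ≈ 0.583.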

I will clean up my "quick and dirty" script to make it more amenable for the next steps and check it in by the weekend.

@avinashvarna avinashvarna self-assigned this Jan 16, 2018
@avinashvarna
Collaborator Author

I have added some metrics for word level accuracy on the sanskrit_util branch here - https://github.com/kmadathil/sanskrit_parser/tree/sanskrit_util/metrics

I have also started working on evaluating lexical split accuracy using the dataset as part of the project referred to in #85 . Currently planning to use the BLEU score or chrF score (from the machine translation literature) to evaluate the accuracy of these splits. Please let me know if there are any other ideas for evaluating accuracy.
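For reference, chrF compares character n-grams of the hypothesis against the reference and combines n-gram precision and recall into an F-beta score. A simplified self-contained sketch of Popović's 2015 metric (my own illustration, not the implementation used in the branch):

```python
from collections import Counter


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: character n-gram precision and recall,
    averaged over n = 1..max_n, combined as an F-beta score
    (beta = 2 weights recall twice as much as precision)."""
    def ngrams(s, n):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        if h:
            precs.append(overlap / sum(h.values()))
        if r:
            recs.append(overlap / sum(r.values()))
    if not precs or not recs:
        return 0.0
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A perfect split scores 1.0, and a split sharing no characters with the reference scores 0.0; partial splits fall in between.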

@kmadathil
Owner

I concur

@avinashvarna
Collaborator Author

Scripts for evaluating lexical split accuracy added to scoring branch here - https://github.com/kmadathil/sanskrit_parser/blob/scoring/metrics/lexical_split_scores.py

@codito codito mentioned this issue Jun 1, 2018
@codito
Collaborator

codito commented Jun 1, 2018

Adding a use case below where scoring may help select the best split. Can the tool choose [kaH, cit, naraH, vA, nArI] as the best output?

> python -m sanskrit_parser.lexical_analyzer.sanskrit_lexical_analyzer "kaScit naraH vA nArI" --debug --split
Input String: kaScit naraH vA nArI
Input String in SLP1: kaScit naraH vA nArI
Start Split
End DAG generation
End pathfinding 1527393212.680358
Splits:
[kaH, cit, naraH, vAna, arI]
[kaH, cit, naraH, vAH, nArI]
[kaH, cit, naraH, vA, nArI]
[kaH, cit, na, raH, vAna, arI]
[kaH, cit, naraH, vAH, na, arI]
[kaH, cit, naraH, vA, na, arI]
[kaH, cit, na, raH, vAH, nArI]
[kaH, cit, naraH, vA, AnA, arI]
[kaH, cit, na, raH, vA, nArI]
[kaH, cit, naraH, vA, A, nArI]
-----------
Performance
Time for graph generation = 0.024774s
Total time for graph generation + find paths = 0.032885s
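One way to rank candidates like these, purely as an illustration: score each split under a unigram frequency model, so that common short words like vA and nArI beat rare fragments like vAna/arI. The counts below are made up for the sketch; real counts would have to come from a corpus such as the DCS.

```python
import math

# Toy unigram counts -- illustrative only, not real DCS frequencies.
FREQ = {"kaH": 900, "cit": 800, "naraH": 700, "vA": 2000, "nArI": 300,
        "na": 2500, "raH": 5, "vAna": 10, "vAH": 20, "arI": 8, "AnA": 3, "A": 50}
TOTAL = sum(FREQ.values())


def split_score(split, smoothing=0.5):
    """Log-probability of a split under a unigram model with add-k
    smoothing.  Splits made of frequent words score higher, and each
    extra word adds a penalty, favouring fewer, more common words."""
    vocab = len(FREQ)
    return sum(math.log((FREQ.get(w, 0) + smoothing) / (TOTAL + smoothing * vocab))
               for w in split)


candidates = [["kaH", "cit", "naraH", "vAna", "arI"],
              ["kaH", "cit", "naraH", "vA", "nArI"],
              ["kaH", "cit", "naraH", "vA", "na", "arI"]]
best = max(candidates, key=split_score)
```

Under these toy counts, `best` comes out as [kaH, cit, naraH, vA, nArI], the split codito is asking for.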

@drdhaval2785

I worked a lot on this problem, and can vouch that https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words/11642687 is the best solution around.

All we need is a frequency count for lexemes. drdhaval2785/samasasplitter#3 (comment) has some ideas about where the frequencies could come from.
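The linked Stack Overflow answer is a dynamic-programming segmenter whose word costs follow Zipf's law: for a lexeme list sorted by descending frequency, cost(word) ∝ log(rank · log N). A minimal sketch with a toy lexeme list (the list and its ranking are illustrative, not real frequencies):

```python
import math

# Toy lexeme list in descending frequency order -- a real list would be
# built from corpus frequency counts, as drdhaval2785 suggests.
WORDS = ["na", "vA", "kaH", "cit", "naraH", "nArI", "raH", "arI"]
COST = {w: math.log((i + 1) * math.log(len(WORDS))) for i, w in enumerate(WORDS)}
MAXLEN = max(map(len, WORDS))


def infer_spaces(s):
    """Dynamic-programming split of unspaced text, minimizing the total
    Zipf cost of the chosen words (the Stack Overflow approach)."""
    # best[i] = (cost, length of last word) for the cheapest split of s[:i]
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        candidates = ((best[i - k][0] + COST.get(s[i - k:i], float("inf")), k)
                      for k in range(1, min(MAXLEN, i) + 1))
        best.append(min(candidates))
    if best[-1][0] == float("inf"):
        return None  # no full segmentation exists for this lexeme list
    out, i = [], len(s)
    while i > 0:
        k = best[i][1]
        out.append(s[i - k:i])
        i -= k
    return list(reversed(out))
```

With this toy list, `infer_spaces("kaHcitnaraH")` prefers [kaH, cit, naraH] over [kaH, cit, na, raH], because one frequent word is cheaper than two where the second is rare. Note this only segments at plain concatenation boundaries; sandhi transformations would still need separate handling.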

@kmadathil
Owner

@codito - Not sure how the whitespace problem and this issue are related. This issue is about evaluating accuracy, is it not? Yours is about picking one split over another.

@codito
Collaborator

codito commented Jun 2, 2018

I thought this issue also tracks using a score to ensure the most likely split gets higher priority in the output. Please ignore if I have confused two different things.

@gasyoun

gasyoun commented Apr 2, 2021

An Automatic Sanskrit Compound Processing (attached: anil.pdf)

How would you classify the approach?
