Homonyms with different pronunciations are not handled correctly #6

Open
KingSupernova31 opened this issue Aug 6, 2018 · 6 comments


@KingSupernova31

Some homonyms have the same spelling but different pronunciations depending on context. For example, it's "a unionized group of workers" but "an unionized atom". Handling these correctly would require looking at multiple words.

@EamonNerbonne
Owner

That's a great example! The question is: how far would one need to look ahead? Doing this correctly would require somewhat more advanced NLP, because you'd need to disambiguate the homographs, and that can require more context, e.g. "a unionized (usually at least!) workforce" vs. "an unionized (usually at least!) atom".
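To make the look-ahead question concrete, here is a minimal sketch in Python. It is not how the library works: the `CUES` table, the `article_for` function, and the `window` parameter are all hypothetical, and the cue words are hand-picked for this one homograph. It only illustrates that a small fixed window handles the two examples above, while returning `None` when the context gives no usable hint.

```python
import re

# Hypothetical cue words, hand-picked for this one homograph only.
CUES = {
    "unionized": {
        "a": {"workers", "workforce", "labor"},        # "union-ized"
        "an": {"atom", "atoms", "molecule", "gas"},    # "un-ionized"
    },
}


def article_for(word, following_text, window=5):
    """Return "a", "an", or None when the look-ahead finds no usable cue."""
    cues = CUES.get(word.lower())
    if cues is None:
        return None  # not a known homograph; a normal lookup would apply

    # Look at the next `window` alphabetic tokens, ignoring punctuation.
    tokens = re.findall(r"[a-z]+", following_text.lower())[:window]
    votes = {
        article
        for article, words in cues.items()
        for token in tokens
        if token in words
    }
    # Exactly one article must win; no cue or conflicting cues mean "unsure".
    return votes.pop() if len(votes) == 1 else None
```

With this toy table, `article_for("unionized", "(usually at least!) workforce")` picks "a" and the same call ending in "atom" picks "an". A longer parenthetical would push the cue out of the window, which is exactly the "how far" problem.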

@EamonNerbonne
Owner

(I'd be willing to bet that truly ambiguous text is possible too; the a vs. an determination is almost certainly not solvable in general with complete accuracy, no matter what you do).

@EamonNerbonne
Owner

Do you have a suggestion that might help?

I've been playing with the idea of compressing the dictionary less; that might at least surface the ambiguity. That is: currently, if a shorter prefix suffices, the longer one is dropped. But what's sufficient? Right now, it's that the longer prefix doesn't unambiguously contradict the shorter one. Of course, it's still interesting to know when a longer one is more contradictory. I'm not sure whether a longer prefix of "unionized" than "uni" is currently in the dictionary, but the raw data probably helps at least a little here.

That would probably mean an API change (to better represent "unsure"), and it might make the library much larger (to include all kinds of additional data on certainty, even if the prefix agrees with its parent).

I'm not 100% convinced it's worth it.
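As a rough illustration of what "compressing less" and an API that can say "unsure" might look like, here is a sketch in Python. Everything in it is an assumption: the `Guess` type, the `guess` function, and the `COUNTS` table are hypothetical, the counts are made up, and none of it reflects the library's actual data format or API. The point is only that keeping a longer prefix that agrees with its parent ("unionized" under "uni") lets the result carry a lower certainty.

```python
from typing import NamedTuple


class Guess(NamedTuple):
    article: str      # "a" or "an"
    prefix: str       # the dictionary prefix that decided it
    certainty: float  # share of observations agreeing with `article`


# Made-up counts: prefix -> (times seen after "a", times seen after "an").
# "unionized" agrees with "uni" on the article, so a dictionary that drops
# non-contradicting longer prefixes would not keep it; here it is kept.
COUNTS = {
    "": (900, 100),
    "u": (40, 60),
    "uni": (95, 5),
    "unionized": (70, 30),
}


def guess(word: str) -> Guess:
    """Use the longest known prefix of `word` and report how sure it is."""
    word = word.lower()
    for end in range(len(word), -1, -1):
        prefix = word[:end]
        if prefix in COUNTS:
            a, an = COUNTS[prefix]
            article = "a" if a >= an else "an"
            return Guess(article, prefix, max(a, an) / (a + an))
    raise AssertionError("unreachable: the empty prefix is always present")
```

With these invented numbers, `guess("unicorn")` resolves via "uni" with high certainty, while `guess("unionized")` still answers "a" but with a visibly lower certainty that a caller could treat as "unsure". The cost is the one mentioned above: every retained prefix and its counts add to the size of the data.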

@KingSupernova31
Author

I don't have a suggestion, and I agree that solving the edge cases like this is probably not worth it. I just wanted to mention it.

@KingSupernova31
Author

One example of truly ambiguous text (if somewhat "cheaty") would be in a quote where someone was interrupted and thus the necessary context doesn't exist.

"It's an unionized- what? No, that's not what I said."

@EamonNerbonne
Owner

Yeah, but language is just full of that kind of stuff. I remember playing with an NLP parser a long time ago, and it's somewhat language dependent, but even reasonable-sounding sentences often have thousands of parses - most of which are so weird that it takes quite some effort to understand how the machine could come up with them, because we're so practiced at implicitly using context and semantic knowledge to resolve ambiguity.


2 participants