Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maintain infrequent characters' order #583

Merged
merged 3 commits into from
Feb 3, 2025
Merged

Conversation

lukhnos
Copy link
Contributor

@lukhnos lukhnos commented Feb 2, 2025

User description

This pre-populates the phrases dictionary from the entries in BPMFBase.txt such that the dictionary will maintain the original insertion order (guaranteed since Python 3.7). The dictionary entries will then be overwritten with the actual data from PhraseFreq.txt.

This fixes #582.

How Tested

A diff of the resulting data.txt shows that only infrequently used characters are affected: https://gist.github.com/lukhnos/a93d72da8c77671b2b978f8ac4a3da42


PR Type

Bug fix, Enhancement


Description

  • Pre-populates phrases dictionary with BPMFBase.txt entries to maintain order.

  • Ensures infrequent characters' order matches traditional phonetic order.

  • Fixes key mismatch errors during dictionary population.

  • Updates cook.py to handle phrases dictionary more robustly.


Changes walkthrough 📝

Relevant files
Bug fix
cook.py
Pre-populate and maintain order in `phrases` dictionary   

Source/Data/bin/cook.py

  • Added pre-population of phrases dictionary with BPMFBase.txt entries.
  • Ensured dictionary maintains insertion order for infrequent
    characters.
  • Fixed key mismatch handling during dictionary population.
  • Updated logic for processing PhraseFreq.txt data.
  • +14/-1   

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • Summary by Bito

    This pull request fixes a bug in the McBopomofo input method by ensuring the correct ordering of infrequent characters. It modifies the `cook.py` script to pre-populate the `phrases` dictionary from `BPMFBase.txt` and resolves key mismatch errors, enhancing the script's robustness. Additionally, it preserves the traditional phonetic order for rarely used characters, utilizing Python 3.7+'s dictionary insertion order guarantee.

    Unit tests added: False

    Estimated effort to review (1-5, lower is better): 2

    Copy link

    bito-code-review bot commented Feb 2, 2025

    Code Review Agent Run #adee72

    Actionable Suggestions - 1
    • Source/Data/bin/cook.py - 1
      • Consider consolidating duplicate validation blocks · Line 146-150
    Review Details
    • Files reviewed - 1 · Commit Range: 5439311..5439311
      • Source/Data/bin/cook.py
    • Files skipped - 0
    • Tools
      • Whispers (Secret Scanner) - ✔︎ Successful
      • Detect-secrets (Secret Scanner) - ✔︎ Successful
      • MyPy (Static Code Analysis) - ✔︎ Successful
      • Astral Ruff (Static Code Analysis) - ✔︎ Successful

    AI Code Review powered by Bito Logo

    Copy link

    github-actions bot commented Feb 2, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🎫 Ticket compliance analysis 🔶

    582 - Partially compliant

    Compliant requirements:

    • Ensure that infrequent characters' order in auto-selection matches the traditional phonetic order.
    • Pre-populate the phrases dictionary with entries from BPMFBase.txt to maintain the original insertion order.
    • Fix any key mismatch errors during dictionary population.

    Non-compliant requirements:

    []

    Requires further human verification:

    []

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Possible Logic Issue

    The added logic for pre-populating the phrases dictionary with BPMFBase.txt entries (lines 127-132) and subsequent overwriting (lines 144-150) should be reviewed to ensure it does not introduce unintended side effects or performance issues.

    # Populate phrases dict with entries from BPMFBase.txt first: this is so
    # that the resulting phrases dict will maintain the order from
    # BPMFBase.txt; this is important for rarely used characters.
    for key in bpmf_chars:
        phrases[key] = UNK_LOG_FREQ
    
    while True:
        line = handle.readline()
        if not line: break
        if line[0] == '#': continue
        elements = line.rstrip().split()
        mykey = elements.pop(0)
        myvalue = elements.pop(0)
        try:
            readings = bpmf_phrases[mykey]
        except:
            sys.exit('[ERROR] %s key mismatches.' % mykey)
        phrases[mykey] = myvalue
    
    for (mykey, myvalue) in phrases.items():
        try:
            readings = bpmf_phrases[mykey]
        except:
            sys.exit('[ERROR] %s key mismatches.' % mykey)
    Error Handling

    The error handling for key mismatches (lines 147-150) appears duplicated from earlier logic. Verify if this redundancy is necessary or if it can be optimized.

    try:
        readings = bpmf_phrases[mykey]
    except:
        sys.exit('[ERROR] %s key mismatches.' % mykey)

    Copy link

    github-actions bot commented Feb 2, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    General
    Avoid overwriting existing dictionary values

    Ensure that phrases[key] = UNK_LOG_FREQ is only executed if key does not already
    exist in the phrases dictionary to prevent overwriting existing values.

    Source/Data/bin/cook.py [130-131]

     for key in bpmf_chars:
    -    phrases[key] = UNK_LOG_FREQ
    +    if key not in phrases:
    +        phrases[key] = UNK_LOG_FREQ
    Suggestion importance[1-10]: 8

    Why: The suggestion ensures that existing values in the phrases dictionary are not overwritten when populating it with entries from bpmf_chars. This is a valid and impactful improvement as it prevents potential data loss and maintains the integrity of the dictionary.

    8

    Copy link

    bito-code-review bot commented Feb 2, 2025

    Changelist by Bito

    This pull request implements the following key changes.

    Key Change Files Impacted
    Bug Fix - Dictionary Order Preservation Fix

    cook.py - Modified dictionary population logic to maintain order of infrequent characters from BPMFBase.txt

    Source/Data/bin/cook.py Outdated Show resolved Hide resolved
    This pre-populates the `phrases` dictionary from the entries in
    `BPMFBase.txt` such that the dictionary will maintain the original
    insertion order (guaranteed since Python 3.7). The dictionary entries
    will then be overwritten with the actual data from `PhraseFreq.txt`.
    
    This fixes #582.
    `bpmf_phrases` is guaranteed to have entries from `bpmf_chars`, and so
    we only need to validate data when reading from `PhraseFreq.txt`.
    @lukhnos lukhnos force-pushed the dev/fix-cook-script branch from 499b363 to d776acb Compare February 2, 2025 19:01
    Copy link

    bito-code-review bot commented Feb 2, 2025

    Code Review Agent Run #b72fb0

    Actionable Suggestions - 0
    Review Details
    • Files reviewed - 1 · Commit Range: 7b710fc..d776acb
      • Source/Data/bin/cook.py
    • Files skipped - 0
    • Tools
      • Whispers (Secret Scanner) - ✔︎ Successful
      • Detect-secrets (Secret Scanner) - ✔︎ Successful
      • MyPy (Static Code Analysis) - ✔︎ Successful
      • Astral Ruff (Static Code Analysis) - ✔︎ Successful

    AI Code Review powered by Bito Logo

    Source/Data/bin/cook.py Outdated Show resolved Hide resolved
    Also removes a section of code that is now redundant, since the
    `phrases` dictionary is now pre-populated with all entries in
    `bpmf_chars`.
    @mjhsieh mjhsieh assigned mjhsieh and unassigned mjhsieh Feb 3, 2025
    @mjhsieh mjhsieh self-requested a review February 3, 2025 16:50
    Copy link
    Contributor

    @mjhsieh mjhsieh left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    LGTM thanks!

    @lukhnos lukhnos merged commit 3914357 into master Feb 3, 2025
    7 checks passed
    @lukhnos lukhnos deleted the dev/fix-cook-script branch February 3, 2025 18:49
    Comment on lines +130 to +131
    for key in bpmf_chars:
    phrases[key] = UNK_LOG_FREQ
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    @lukhnos sorry for being late to the party. +CC @mjhsieh

    I think this PR nicely demonstrated the limitation of non-AST LLM-based code review.
    and faster. I will make a new PR to emphasize improvements that should be there in my opinions.

    For example, this initialization has a famous Pythonic pattern that is slightly safer

    Suggested change
    for key in bpmf_chars:
    phrases[key] = UNK_LOG_FREQ
    phrases = {key: UNK_LOG_FREQ for key in bpmf_chars}

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    自動選字注音跟傳統注音的罕用字順序不同
    4 participants