Maintain infrequent characters' order #583

Merged
merged 3 commits into from
Feb 3, 2025
18 changes: 11 additions & 7 deletions Source/Data/bin/cook.py
@@ -124,6 +124,12 @@

output = []

# Populate phrases dict with entries from BPMFBase.txt first: this is so
# that the resulting phrases dict will maintain the order from
# BPMFBase.txt; this is important for rarely used characters.
for key in bpmf_chars:
    phrases[key] = UNK_LOG_FREQ
Comment on lines +130 to +131

@lukhnos sorry for being late to the party. +CC @mjhsieh

I think this PR nicely demonstrates the limitations of non-AST, LLM-based code review. I will make a new PR to highlight the improvements that I think should be there.

For example, this initialization has a well-known Pythonic pattern that is slightly safer and faster:

Suggested change
-for key in bpmf_chars:
-    phrases[key] = UNK_LOG_FREQ
+phrases = {key: UNK_LOG_FREQ for key in bpmf_chars}
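
A minimal sketch of how the two forms compare. The values of `bpmf_chars` and `UNK_LOG_FREQ` below are made-up stand-ins, since the real ones live in cook.py and are not shown in this diff. One caveat, stated as an assumption about the surrounding code: the comprehension rebinds `phrases`, so it matches the loop only if `phrases` is still empty at this point.

```python
# Hypothetical stand-ins; the real bpmf_chars and UNK_LOG_FREQ come from cook.py.
UNK_LOG_FREQ = -99.0
bpmf_chars = {"甲": ["ㄐㄧㄚˇ"], "乙": ["ㄧˇ"], "丙": ["ㄅㄧㄥˇ"]}

# Form used in the PR: fill an existing dict in place.
phrases_loop = {}
for key in bpmf_chars:
    phrases_loop[key] = UNK_LOG_FREQ

# Suggested form: build a new dict in a single expression.
phrases_comp = {key: UNK_LOG_FREQ for key in bpmf_chars}

# Starting from an empty dict, both give the same items in the same order.
same_items = list(phrases_loop.items()) == list(phrases_comp.items())

# If the dict already held entries, the in-place loop keeps them,
# whereas rebinding to a comprehension would drop them.
existing = {"丁": -3.5}
for key in bpmf_chars:
    existing[key] = UNK_LOG_FREQ
kept_prior_entry = "丁" in existing
```

If `phrases` might already contain entries, `phrases.update(dict.fromkeys(bpmf_chars, UNK_LOG_FREQ))` is another option that keeps the in-place behaviour of the loop.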


while True:
    line = handle.readline()
    if not line: break
@@ -135,8 +141,11 @@
readings = bpmf_phrases[mykey]
except:
sys.exit('[ERROR] %s key mismatches.' % mykey)
phrases[mykey] = True
# print mykey
phrases[mykey] = myvalue

for mykey, myvalue in phrases.items():
readings = bpmf_phrases.get(mykey)

if readings:
# The length of exactly one Chinese character is currently still 3 (punctuation and tone marks seem to be 2)
if len(mykey) > 3:
@@ -182,11 +191,6 @@
# Very rarely used Bopomofo readings should preferably not be listed in
# heterophony?.list; that way they fall straight into this condition
handle.close()
for k in bpmf_chars:
    if k not in phrases:
        for v in bpmf_chars[k]:
            output.append((k, v, UNK_LOG_FREQ))
pass
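
The removed loop above appended characters missing from the frequency data only after everything else had been processed. With `phrases` seeded up front, those characters already carry `UNK_LOG_FREQ`, which appears to be why the trailing loop could be dropped. A rough sketch of the ordering effect, assuming Python 3.7 or later (where dicts preserve insertion order) and using invented data rather than the real BPMFBase.txt contents:

```python
# Hypothetical data; names mirror the diff, values are invented for illustration.
UNK_LOG_FREQ = -99.0
bpmf_chars = {"甲": ["r1"], "乙": ["r2"], "丙": ["r3"]}
freq_entries = [("丙", -2.0), ("詞語", -4.0), ("甲", -1.0)]  # (key, log frequency)

# Step 1: seed the dict so its keys follow the order of bpmf_chars.
phrases = {key: UNK_LOG_FREQ for key in bpmf_chars}

# Step 2: overwrite values from the frequency data. Keys that already exist
# keep their position; keys not seen before are appended at the end.
for key, value in freq_entries:
    phrases[key] = value

order = list(phrases)
```

Here `order` starts with the three seeded characters in their original sequence, and "乙", which never appears in `freq_entries`, keeps the placeholder frequency.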

with open(sys.argv[4]) as punctuations_file:
for line in punctuations_file: