Text Encoding
Gilberto Diaz edited this page Feb 4, 2024 · 2 revisions
- Computers only understand 1's and 0's. The letter `A` is ultimately a series of 1's and 0's. How does a computer know to display `A`, `a`, `à`, or `あ`? By using a standardized encoding scheme.
- Due to various horrible and historical reasons, there is no way for computers to deterministically detect arbitrary character encodings from files.
- Automatic encoding detection is a lie. Detection tools just use heuristics, which can and will eventually fail catastrophically.
- Thus, the encodings for the text files and the console must be specified at runtime, or something might break.
- For the supported encodings see: standard-encodings. Common encodings:
  - `utf-8`
    - If at all possible, please only use `utf-8`, and use it for absolutely everything.
      - py3TranslateLLM uses `utf-8` as the default encoding for everything except kirikiri.
  - `shift-jis`
    - Required by the kirikiri game engine and many Japanese visual novels, games, programs, media, and text files in general.
  - `utf-16-le`
    - a.k.a. `ucs2-bom-le`. Alternative encoding used by the kirikiri game engine. TODO: Double check this.
  - `cp437`
    - This is the old IBM/DOS code page for English that Windows with an English locale often uses by default.
  - `cp1252`
    - This is the code page for Western European languages that Windows with an English locale often uses by default.
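As a rough illustration of the encodings listed above, the following Python sketch encodes the same short string with each of them and then decodes `shift-jis` bytes with the wrong code page. The helper name `encode_all` and the sample text are made up for this example and are not part of py3TranslateLLM.

```python
# Hypothetical helper for illustration only; not part of py3TranslateLLM.
COMMON_ENCODINGS = ("utf-8", "shift-jis", "utf-16-le", "cp437", "cp1252")


def encode_all(text):
    """Map each common encoding to the encoded bytes, or None if it cannot represent the text."""
    results = {}
    for name in COMMON_ENCODINGS:
        try:
            results[name] = text.encode(name)
        except UnicodeEncodeError:
            # The code page has no byte sequence for at least one character.
            results[name] = None
    return results


# The same two characters become different bytes under each encoding.
for name, data in encode_all("Aあ").items():
    shown = data.hex() if data is not None else "(cannot represent this text)"
    print(f"{name:<10} {shown}")

# Decoding with the wrong encoding does not fail here; it silently yields other characters.
sjis_bytes = "あ".encode("shift-jis")
print(ascii(sjis_bytes.decode("shift-jis")))  # the original character, shown as an escape
print(ascii(sjis_bytes.decode("cp437")))      # two unrelated characters (mojibake)
```

The wrong decode raises no error in this case, which is why the encoding has to be stated explicitly rather than guessed.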
- Due to English locales being very common on Windows, both `cp437` and `cp1252` are very often the encodings used by `cmd.exe`.
- On newer versions of Windows (~Win 10 1809+), consider changing the console encoding to native `utf-8`.
  - There is a checkbox for it in the change locale window. Check it and restart the PC for changes to take effect.
  - After restarting, set the command prompt to use a font that can display utf-8 glyphs correctly, like MS Gothic.
- Historically, setting the Windows command prompt to ~utf-8 would reliably make it crash, which makes dealing with `cp437` and `cp1252` inevitable.
- To print the currently active code page on Windows, open a command prompt and type `chcp`
  - To change the code page for that session type `chcp <codepage #>` as in: `chcp 1252`
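The code page questions above can also be checked from inside Python. This is a minimal sketch, not how py3TranslateLLM does it: `console_encoding_report` and `safe_for_console` are hypothetical helper names, and the Windows branch assumes the Win32 `GetConsoleOutputCP` function is reachable through `ctypes`.

```python
import locale
import os
import sys


def console_encoding_report():
    """Collect the encodings that appear to be in effect for this process."""
    report = {
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
        "console_code_page": None,
    }
    if os.name == "nt":
        import ctypes  # Windows-only branch: ctypes.windll does not exist elsewhere

        # GetConsoleOutputCP returns 0 on failure, e.g. when no console is attached.
        code_page = ctypes.windll.kernel32.GetConsoleOutputCP()
        report["console_code_page"] = code_page or None
    return report


def safe_for_console(text, encoding=None):
    """Swap characters the given encoding cannot represent for '?' so printing does not raise."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


print(console_encoding_report())
print(safe_for_console("Aあ", encoding="cp437"))
```

On Windows the reported code page should match what `chcp` prints for the same session. Replacing unrepresentable characters loses information, so this is only a fallback for display, not for writing files.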
Back to Wiki home page.