
Python implementation of UTS46, Unicode IDNA Compatibility Processing. Converts internationalized domain names to and from ASCII representations.


UTS46: Unicode Compatibility Processing for Internationalized Domain Names


A Python implementation of Unicode Technical Standard #46, Unicode IDNA Compatibility Processing. Converts internationalized domain names (IDNs) to and from ASCII representations (RFC 5890 A-labels, sometimes called "Punycode" [1]), using the UTS46 ToASCII and ToUnicode operations.

import uts46

uts46.encode("солідарні.ua")  # b'xn--80ahukbpc4oe.ua'
uts46.decode(b"xn--80ahukbpc4oe.ua")  # 'солідарні.ua'

# UTS46 converts to lowercase (Ε → ε, Ä → ä)
uts46.encode("Είναι.Är.صغير.世界")  # b'xn--kxaekss.xn--r-zfa.xn--wgbhp8d.xn--rhqv96g'
uts46.decode(b"xn--kxaekss.xn--r-zfa.xn--wgbhp8d.xn--rhqv96g")  # 'είναι.är.صغير.世界'

# UTS46 allows most symbols and emoji (unlike IDNA 2008)
uts46.encode("🧪.test")  # b'xn--0v9h.test'
uts46.decode(b"xn--0v9h.test")  # '🧪.test'

The uts46 package:

  • implements UTS46 v16.0.0
  • passes the full suite of UTS46 conformance tests
  • provides both high level encode() and decode() functions plus lower level UTS46 operations
  • optionally registers Python codecs ("тест.test".encode("uts46"))
  • also implements related IDNA algorithms from the WHATWG URL Standard
  • is written in pure Python and has no dependencies outside the standard library
  • includes type annotations
  • supports Python 3.10 and later
  • is released under the MIT License

The implementation is inspired by (and directly adapts some code from) the idna package, which provides IDNA 2008 encoding and decoding. The uts46 package does not behave identically to idna's uts46=True option; the differences are described later in this document.

This Python implementation is an independent project, and is not endorsed or supported by the Unicode Consortium or WHATWG.

Installation

The uts46 package is available for installation from PyPI:

pip install uts46

Usage

For typical usage, the uts46 encode() and decode() functions convert Unicode IDNs to ASCII and back:

import uts46

uts46.encode("δοκιμή.test")  # b'xn--jxalpdlp.test'
uts46.decode(b"xn--jxalpdlp.test")  # 'δοκιμή.test'

Both functions accept either str or bytes input. (If bytes input contains non-ASCII characters, it must be UTF-8 encoded.) encode() returns ASCII bytes and decode() returns a Unicode str.

By default, the functions perform strict validation checks. UTS46 defines input parameters that can skip some of these checks. See the inline documentation for encode() and decode() for details.

Failed validation checks and other problems will raise uts46.Uts46Error, a subclass of UnicodeError.
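
Because Uts46Error subclasses UnicodeError, code that already handles UnicodeError needs no uts46-specific except clause. The following is a minimal sketch of that pattern. The helper name encode_or_none is hypothetical, and the standard library's IDNA 2003 "idna" codec stands in for uts46.encode so that the snippet runs without the package installed.

```python
from typing import Callable, Optional


def encode_or_none(encode: Callable[[str], bytes], domain: str) -> Optional[bytes]:
    """Return the encoded domain, or None if the encoder rejects it."""
    try:
        return encode(domain)
    except UnicodeError:
        # Per the text above, uts46.Uts46Error is a UnicodeError subclass,
        # so this clause would also catch it.
        return None


# Stand-in encoder (an assumption for this sketch): the built-in IDNA 2003 codec.
def stdlib_idna(domain: str) -> bytes:
    return domain.encode("idna")


print(encode_or_none(stdlib_idna, "example.test"))  # b'example.test'
print(encode_or_none(stdlib_idna, "a..test"))  # None (the empty label is rejected)
```

With the package installed, passing uts46.encode as the first argument should fit the same pattern.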

Transitional processing

UTS46 is built on top of IDNA 2008, which handles a small set of characters differently from the earlier IDNA 2003 standard. UTS46 normally follows IDNA 2008 for these "deviation characters," but it also defines a "transitional processing" option that uses the IDNA 2003 mappings instead.

The uts46 package provides non-transitional processing by default:

# "ß" is an IDNA 2003/2008 "deviation character"
uts46.encode("faß.de")  # b'xn--fa-hia.de' -- non-transitional
uts46.encode("faß.de", transitional_processing=True)  # b'fass.de'

At this point, the vast majority of applications and infrastructure have updated to IDNA 2008. Newer versions of UTS46 have deprecated the transitional processing option, and it should be used only for legacy compatibility.

Unicode's Internationalized Domain Names FAQ explains more.
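
The IDNA 2003 mapping that transitional processing reproduces can be observed without uts46. Python's built-in "idna" codec implements IDNA 2003 (RFC 3490), so it maps "ß" to "ss" in the same way as the transitional example above. This sketch only illustrates the legacy behavior; it says nothing about how uts46 implements the option.

```python
# The standard library's "idna" codec follows IDNA 2003, where "ß" maps to "ss".
legacy = "faß.de".encode("idna")
print(legacy)  # b'fass.de'

# Consequence: under IDNA 2003, two distinct names collapse to one ASCII name.
assert "faß.de".encode("idna") == "fass.de".encode("idna")
```

This collapsing of distinct names is the reason the deviation characters matter: non-transitional processing keeps "faß.de" and "fass.de" separate.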

Codecs

The uts46 package includes codecs that work with Python's str.encode() and bytes.decode() functions. These are not installed by default. To register the codecs, import uts46.codecs.

import uts46.codecs  # registers codecs

"Próf.test".encode("uts46")  # b'xn--prf-hna.test'
b"xn--prf-hna.test".decode("uts46")  # 'próf.test'

The available encodings are:

  • "uts46" (aliases "uts-46", "UTS 46", "idna-uts46")
  • "uts46-transitional"—uses deprecated transitional processing (aliases "uts-46-transitional", "UTS 46 Transitional", "idna-uts46-transitional")

The uts46 codecs support errors="strict" (the default) and errors="ignore", but no other error handling schemes:

b"xn--oops.test".decode("uts46")  # raises Uts46Error
b"xn--oops.test".decode("uts46", errors="ignore")  # '䨿.test'
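
Python makes a codec name available to str.encode() and bytes.decode() once a search function has been passed to codecs.register(). Importing uts46.codecs presumably does something along these lines, though that is an assumption about its internals. The sketch below registers a toy codec to show the mechanism. The codec name "demolower" and all function names are hypothetical and are not part of uts46.

```python
import codecs
from typing import Optional, Tuple


def _encode(text: str, errors: str = "strict") -> Tuple[bytes, int]:
    # A codec encoder returns (output, number_of_input_items_consumed).
    return text.lower().encode("ascii", errors), len(text)


def _decode(data: bytes, errors: str = "strict") -> Tuple[str, int]:
    raw = bytes(data)  # accept any bytes-like input
    return raw.decode("ascii", errors).lower(), len(raw)


def _search(name: str) -> Optional[codecs.CodecInfo]:
    # Python calls each registered search function with the normalized
    # (lowercased) encoding name; return None for names we do not handle.
    if name == "demolower":
        return codecs.CodecInfo(encode=_encode, decode=_decode, name="demolower")
    return None


codecs.register(_search)

print("Example.TEST".encode("demolower"))  # b'example.test'
print(b"Example.TEST".decode("demolower"))  # example.test
```

Registering at import time keeps the global codec registry untouched for applications that only call uts46.encode() and uts46.decode() directly.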

UTS46 operations

The uts46 package also provides the individual operations specified by UTS46 as separate functions.

See each function's inline documentation for more information.

(Avoid using any uts46 functions or variables that start with an underscore, or that are imported from a submodule beginning with an underscore. These are not part of the uts46 public API, and they may change without notice.)

WHATWG IDNA algorithms

The uts46 package provides implementations of the WHATWG URL Standard IDNA "domain to ASCII" and "domain to Unicode" algorithms in the uts46.whatwg module:

from uts46.whatwg import domain_to_ascii, domain_to_unicode

domain_to_ascii("ޓެސްޓް.test")  # 'xn--xqbfb2hvab.test'
domain_to_unicode("xn--xqbfb2hvab.test")  # 'ޓެސްޓް.test'

The uts46.whatwg functions take and return str domains (not bytes). Both functions take be_strict (default True) and transitional (default False) keyword arguments. See the domain_to_ascii() and domain_to_unicode() inline documentation for details.

Difference from idna package

The Python idna (IDNA 2008) package includes some UTS46 support through its uts46=True option. There are differences between the packages:

  • idna implements only UTS46 section 4.4, Preprocessing for IDNA 2008
  • uts46 implements the complete UTS46 specification

One notable difference is the handling of emoji domains:

uts46.encode("☕.example")  # b'xn--53h.example'
idna.encode("☕.example", uts46=True)  # raises idna.core.InvalidCodepoint

In terms of UTS46 Conformance and conformance testing:

  • The uts46 package satisfies all three conformance clauses C1, C2 and C3, and it passes the full suite of conformance tests.
  • The idna package's uts46=True option satisfies only clause C3, and it therefore skips conformance tests involving IDNA 2008 disallowed characters.

For any domain, idna.encode(domain, uts46=True) should produce the same results as idna.encode(uts46.preprocessing_for_idna2008(domain)).

Development

The uts46 project is hosted on GitHub: https://github.com/medmunds/uts46.

Contributions are welcome, including bug reports, fixes, and improvements to the tests and documentation. Suggestions for enhancements are also welcome, keeping in mind the package's focus on precisely implementing UTS46.

Development setup

To set up a development environment:

# Clone the repository
git clone https://github.com/medmunds/uts46.git
cd uts46

# Install editable package and development dependencies
pip install -e '.[dev]'

# Set up pre-commit hooks (optional, but encouraged for PRs)
pre-commit install

This project uses ruff for linting and formatting Python code, mypy for static type checking, and several pre-commit-hooks. If you have installed pre-commit as shown above, the tools will run automatically when you commit code. To run them manually:

pre-commit run --all-files

# Or individual tools
ruff check .
ruff format .
mypy .

Building

To build a package distribution from source, use PyPA's standard build tool:

python -m build

(Run pip install build first if the build package is not already installed.)

Generating _data.py

The _data.py file in the package source is generated from the UTS46 IDNA Mapping Table and other Unicode data.

If you need to rebuild it (e.g., to update to a newer Unicode version), edit the configuration in the [tool.uts46.generate-data] section of pyproject.toml and then run:

python tools/generate-data.py

The updated _data.py should be committed to the repository.

Testing

Package tests use unittest. To run all tests:

python tests
# or
python -m unittest

To run a specific test suite, such as the conformance tests:

python -m unittest tests.test_conformance

The package includes a comprehensive test suite that verifies conformance with the UTS46 specification, using the official test data provided by Unicode. The tests automatically download that data when needed, and cache it locally in the data directory (which is ignored by git).

Footnotes

  1. Technically, only the part after the "xn--" has been encoded using the Punycode algorithm. The "xn--" is an RFC 5890 "ACE Prefix," and the prefix plus the Punycode-encoded segment makes an "A-label."
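
The split described in this footnote can be checked with Python's built-in "punycode" codec, which encodes a single label without the ACE prefix. This is a small sketch: the helper name to_a_label is hypothetical, and it performs none of the UTS46 mapping or validation.

```python
ACE_PREFIX = "xn--"  # RFC 5890 ACE prefix


def to_a_label(u_label: str) -> str:
    """Prefix the raw Punycode encoding of one already-normalized label."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


print("faß".encode("punycode"))  # b'fa-hia'  (Punycode only)
print(to_a_label("faß"))  # xn--fa-hia  (ACE prefix + Punycode = A-label)
```

The result matches the first label of the non-transitional "faß.de" example in the Transitional processing section.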
