Commit

formatting test
emily-roth committed Feb 5, 2025
1 parent 2eda2bd commit 23fa072
Showing 3 changed files with 30 additions and 27 deletions.
17 changes: 9 additions & 8 deletions src/content/docs/topics/writingsystems/cldr-and-sldr.md
@@ -4,7 +4,7 @@ sidebar:
order: 1510
---

# What is the CLDR?
### What is the CLDR?

From https://cldr.unicode.org/:

@@ -18,7 +18,7 @@ From https://cldr.unicode.org/:
>
> CLDR uses the XML format provided by [UTS #35: Unicode Locale Data Markup Language (LDML)][uts35]. LDML is a format used not only for CLDR, but also for general interchange of locale data, such as in Microsoft’s .NET.
# What is the SLDR?
### What is the SLDR?

The SLDR is the [SIL Locale Data Repository][sldrrepo], a repo that builds upon the structure and data of the CLDR with locale data that might not yet meet the minimum requirements for CLDR inclusion.

@@ -41,17 +41,18 @@ Other data beyond the scope of the list above may also be included in an SLDR fi

SLDR data is sourced from manually curated research, data generated (with permission) from the contents of the [Digital Bible Library][dbl], and external submissions via [ScriptSource Contributions][scrsourcontr] and [GitHub Issues][sldrissues]. While the SLDR strives to be as accurate as possible, the data within is not perfect and should not be treated as an unquestionable source of information. Corrections from external sources are extremely welcome.

# How is the SLDR Used?
### How is the SLDR Used?

### langtags.json
#### langtags.json

The `langtags.json` file is generated by the [LangTags repository][langtag] and is used to parse tag equivalence. This is explained in-depth in the [langtags documentation](https://github.com/silnrsi/langtags/blob/master/doc/langtags.md) in the langtags repo.

In addition to using the data from its parent repository, `langtags.json` also pulls autonym data from the SLDR. Specifically, the autonym data from the SLDR is listed under the field "localname" in `langtags.json`. This should not be confused with the field "localnames", which is an array featuring all of the names sourced from the [Ethnologue][ethnologue].
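
As a rough illustration of the difference between the two fields, the following Python sketch reads a langtags-style record. The sample entries, the `tag` key, and the helper name `autonym_for` are made up for this example; only the field names `localname` and `localnames` come from the description above, and the real file may be organised differently.

```python
import json

# Hypothetical, hand-written sample; not copied from the real langtags.json.
SAMPLE = """
[
  {"tag": "xx-Latn", "localname": "Example Autonym",
   "localnames": ["Example Name A", "Example Name B"]},
  {"tag": "yy-Latn", "localnames": ["Only Ethnologue Name"]}
]
"""


def autonym_for(entries, tag):
    """Return the SLDR-sourced autonym ("localname") for a tag, or None."""
    for entry in entries:
        if entry.get("tag") == tag:
            # "localname" is a single string; "localnames" is a separate list.
            return entry.get("localname")
    return None


entries = json.loads(SAMPLE)
print(autonym_for(entries, "xx-Latn"))
print(autonym_for(entries, "yy-Latn"))
```

The second entry has a `localnames` array but no `localname`, so the lookup returns `None` for it rather than falling back to the Ethnologue names.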

Whenever new data is pushed to the SLDR, `langtags.json` is automatically rebuilt through GitHub Actions.

### The LDML API
#### The LDML API

SLDR information is primarily accessed and utilized by applications via the [LDML API][ldmlapi]. This API also utilizes and distributes `langtags.json`.

Here are some examples of how the LDML API is used:
@@ -64,19 +65,19 @@ Since `langtags.json` is an important element of the LDML API, it is good practi

Examples of applications that use the SLDR via the LDML API include Bloom, Paratext, and Flex.

### Language Font Finder API
#### Language Font Finder API

The [Language Font Finder API (LFF)][lff] is an API that returns recommended fonts for a specific language tag. The font recommendations are pulled from the font data located in the SLDR file for that locale. If there is no SLDR file for the passed language tag, or if the SLDR file does not contain any font data, a predefined fallback value is returned instead.

### ScriptSource
#### ScriptSource

The [ScriptSource site][scriptsource] uses the exemplar data of locales contained within the SLDR to populate the "Symbols & Characters" sections of the pages relating to said locales.

For example, the ["Symbols & Characters" tab of the "Enga written with Latin script" page](https://scriptsource.org/cms/scripts/page.php?item_id=wrSys_detail_sym&uid=rfsnw2cbyd) contains two lists of characters (main and auxiliary) that are pulled directly from the "main" and "auxiliary" exemplars of the `enq.xml` file in the SLDR.

This is one of the most human-friendly ways that SLDR data can be accessed by the general public, as opposed to the data-driven formats of the SLDR itself and the aforementioned APIs. This is also why ScriptSource contributions are one of the most common methods used by individuals to submit corrections to the SLDR.

### CLDR Submissions
#### CLDR Submissions

If enough data is gathered in an SLDR file that it can fulfill the minimum requirements for CLDR inclusion, the locale will be submitted to the CLDR.

32 changes: 17 additions & 15 deletions src/content/docs/topics/writingsystems/ldml.md
@@ -6,7 +6,7 @@ sidebar:

!!!! UNFINISHED. also some of the links dont work yet do not be alarmed by that !!!

# What is LDML?
### What is LDML?

Locale Data Markup Language (LDML) is an XML format used for locale data. The most prolific user of LDML is the CLDR.

@@ -34,6 +34,7 @@ The specifications for LDML structure are described in [Unicode Technical Standa
</layout>
<characters>
<exemplarCharacters>[a á b c d e é f g h i í j k l m n ñ o ó p q r s t u ú ü v w x y z]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[ª à ă â å ä ã ā æ ç è ĕ ê ë ē ì ĭ î ï ī º ò ŏ ô ö ø ō œ ù ŭ û ū ý ÿ]</exemplarCharacters>
<exemplarCharacters type="index">[A B C D E F G H I J K L {LL} M N Ñ O P Q R S T U V W X Y Z]</exemplarCharacters>
<exemplarCharacters type="punctuation">[\- ‐‑ – — , ; \: ! ¡ ? ¿ . … '‘’ "“” « » ( ) \[ \] § @ * / \\ \&amp; # † ‡ ′ ″]</exemplarCharacters>
</characters>
@@ -71,7 +72,7 @@ This is not an all-inclusive list of the potential elements that could be includ

Note that I also added the traditional separated 'LL' back into this example for the purpose of demonstration. It is no longer present as a separate multigraph in the current version of the CLDR.

# The Building Blocks of LDML
### The Building Blocks of LDML

!!!!!!!!!! THIS BIT IS UNFINISHED BTW THIS IS A PLACEHOLDER !!!!!!!!!

@@ -101,52 +102,53 @@ This next section will not explain in-detail the different elements of an LDML f
- [Annotations](https://unicode.org/reports/tr35/tr35-general.html#Annotations)
- [Metadata](https://unicode.org/reports/tr35/tr35-info.html#Metadata_Elements)
- References: Deprecated, but still referenced in the DTDs
- [Special]
- Special

Of the elements listed above, a handful benefit from a more in-depth description on this site:

### Identity
#### Identity

The "identity" element contains information about the locale described in the LDML file. The most important child elements are "language", "script", "territory", "variant", and the SLDR-specific "special/sil:identity".

Not all of these elements are required. Only the elements used in the locale's minimal langtag are included. For example, in the file `enq.xml`, only the language element will be included. In the file `sat_Deva_IN`, the language ("sat"), script ("Deva"), and territory ("IN") elements will all be included.

The sil:identity element is the child of a "special" element within the identity element. It contains attributes for the script and region of the locale, regardless of their inclusion in the previous elements. In addition, it contains a "source" attribute that indicates whether the file was imported from the CLDR. If there is no "source" attribute in the sil:identity element, the file is unique to the SLDR.
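
The "source" check described above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the namespace URI bound to the `sil` prefix, the attribute names `script` and `region`, the placeholder language code `xx`, and the helper name `is_sldr_only` are not taken from this page and should be verified against a real SLDR file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "sil" prefix; verify against a real SLDR file.
SIL_NS = "urn://www.sil.org/ldml/0.1"

# Hypothetical, simplified LDML fragment written for this example.
SAMPLE = f"""
<ldml xmlns:sil="{SIL_NS}">
  <identity>
    <language type="xx"/>
    <special>
      <sil:identity script="Latn" region="ZZ" source="cldr"/>
    </special>
  </identity>
</ldml>
"""


def is_sldr_only(ldml_text):
    """Return True when sil:identity carries no "source" attribute."""
    root = ET.fromstring(ldml_text)
    node = root.find(f"identity/special/{{{SIL_NS}}}identity")
    if node is None:
        raise ValueError("no sil:identity element found")
    return "source" not in node.attrib


print(is_sldr_only(SAMPLE))
```

Removing the `source` attribute from the sample makes the function return `True`, matching the rule that a file without it is unique to the SLDR.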

### Locale Display Names
#### Locale Display Names

vocab relating to the locale (lang, script, region). most important value is the autonym (name of lang in lang AND USING THE CORRECT SCRIPT). make sure to note that you do not have to list the full tag in the type attribute, even if the file is a long tag (i.e. sat_Deva has its autonym listed under sat, not sat_Deva).

### Characters
#### Characters

exemplar time, dont forget to explain the difference between main, aux, and index. and what can overlap and what cant.

### Dates
#### Dates
oh boy. someone (me)(emily) needs to track down the difference between uppercase H and lowercase h again. which one is 24 hr? i never remember.

### Collations
#### Collations

oh boy collation

### Special
#### Special

FONT DATA AND KEYBOARDS AND FUN SIL STUFF GOES HERE

## Draft Attributes
### Draft Attributes

Draft attributes are important. i took a ton of notes on this in the cldr import doc, get them and put them here. bc they are not intuitive.

# Text Formatting Tips
### Text Formatting Tips

For those who are primarily interacting with the SLDR and the data within, here are some useful tips about text formatting when manually entering and modifying data in an LDML file.

## Formatting Text in an Exemplar:
#### Formatting Text in an Exemplar:

For the most part, the contents of an LDML file follow the standard rules of an XML file. With the exception of collation (see below), the contents within the square brackets (including the square brackets themselves) are Regular Expressions (regexes).

Information about regexes can be found online in a number of places, though not all of it will be relevant to an LDML file. The most important things to know are how to escape non-ASCII characters and how to notate multigraphs and combining diacritics.

### Escaping
***Escaping***

Escaping in a regex is done by adding a backslash immediately before the character that needs escaping. You can see examples of this in the punctuation exemplar in the example above: the very first character, a hyphen (`\-`), is escaped in this way. Similarly, the backslash listed as a punctuation mark in this list is also escaped by adding a second backslash (`\\`).

Finally, a handful of characters must be replaced entirely with an XML character entity reference. The ampersand, for example, is written as `\&amp;`. Notice that the escaping backslash is still present. The other two commonly used character references are `&lt;` and `&gt;`, aka 'less than' (<) and 'greater than' (>). These do need to be written as their character references in an LDML file, but they do not need a backslash.
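
To make the two layers of escaping concrete, here is a minimal Python sketch. The helper name `exemplar_escape` and the `BACKSLASHED` set are assumptions made for this example; the set only covers the backslashed characters visible in the punctuation exemplar shown earlier and is not an exhaustive rule.

```python
from xml.sax.saxutils import escape

# Characters that carry a backslash in the punctuation exemplar shown earlier.
# Illustrative subset only, not a complete list.
BACKSLASHED = set("-\\:[]&")


def exemplar_escape(char):
    """Return one character as it might be written inside an exemplar list."""
    # XML layer: & becomes &amp;, < becomes &lt;, > becomes &gt;.
    xml_safe = escape(char)
    # Set layer: some characters additionally need a leading backslash.
    if char in BACKSLASHED:
        return "\\" + xml_safe
    return xml_safe


for ch in ["-", "\\", "&", "<", "a"]:
    print(repr(ch), "->", exemplar_escape(ch))
```

The ampersand goes through both layers, while `<` only goes through the XML layer, which mirrors the distinction drawn in the paragraph above.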
@@ -157,7 +159,7 @@ For example, 'A' has the unicode codepoint 'U+41', aka 'U+0041'. Therefore, the

This is most commonly used when a character does not display well in a coding environment, such as a combining diacritic or a PUA character. It's also sometimes used when working on non-Latin scripts, when the person working on the file doesn't have easy access to a keyboard that types the characters and doesn't want to copy-paste the entire list. The latter use case isn't necessarily recommended, but it technically works the same either way.

### Multigraphs and Combining Diacritics
***Multigraphs and Combining Diacritics***

Multigraphs are an orthographic phenomenon in which two or more characters put together are treated as one single unit. In an LDML file, these are denoted by surrounding the grouped characters in curly brackets, such as the {LL} in the example above. This is important because the spaces between individual characters are only in these lists for human convenience; they carry no meaning for software, nor are they required for the LDML file to function properly. To a computer, [s t] and [st] mean the same thing, so if you want to specifically indicate that "st" is a multigraph, you need to enter it as [{st}].
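
The way spaces and curly brackets are read can be sketched with a small Python tokenizer. This is a simplified illustration, not a full parser for the exemplar syntax: the function name `exemplar_units` is hypothetical, and ranges, nested sets, and other advanced notation are deliberately ignored.

```python
def exemplar_units(exemplar):
    """Split an exemplar such as "[a b {st}]" into its units."""
    body = exemplar.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    units = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            # Spaces are for human readers only and are skipped.
            i += 1
        elif ch == "{":
            # Everything up to the closing brace is one multigraph.
            end = body.index("}", i)
            units.append(body[i + 1:end])
            i = end + 1
        elif ch == "\\" and i + 1 < len(body):
            # A backslash escapes the next character.
            units.append(body[i + 1])
            i += 2
        else:
            units.append(ch)
            i += 1
    return units


print(exemplar_units("[s t]"))
print(exemplar_units("[st]"))
print(exemplar_units("[{st}]"))
```

The first two calls produce the same two units, while the third produces a single unit, which is the distinction the curly brackets exist to express.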

@@ -167,7 +169,7 @@ For example, there is no single codepoint for 'a̱'. It consists of 'a' (U+0061)

A good rule of thumb if you aren't sure if a diacritic is part of the same codepoint or not: hit the backspace after typing/copying the character. If the diacritic disappears, but the base character remains, the combined character is made of multiple codepoints. If both the base character and diacritic disappear simultaneously, they are already a single unique codepoint. Feel free to try it out with 'á' and 'a̱' right now, if you'd like. Just be sure you understand [normalization][normalization] and ensure that you are using the most composed version of the character possible (i.e. if there is a codepoint such as U+00E1 that combines the character and diacritic, prioritize using the composed one instead of placing two codepoints inside of the curly brackets).
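
The backspace test can also be done programmatically. The sketch below uses Python's standard `unicodedata` module to compose a string as far as possible (NFC) and then lists the codepoints that remain. The function name `codepoints_after_nfc` is made up for this example, and U+0331 (combining macron below) is simply the combining mark chosen here to stand in for a diacritic with no composed form on 'a'.

```python
import unicodedata


def codepoints_after_nfc(text):
    """Compose text as far as possible (NFC) and list its codepoints."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(c):04X}" for c in composed]


# 'a' + combining acute accent composes into the single codepoint U+00E1.
print(codepoints_after_nfc("a\u0301"))  # ['U+00E1']

# 'a' + combining macron below has no composed form, so two codepoints remain.
print(codepoints_after_nfc("a\u0331"))  # ['U+0061', 'U+0331']
```

If the result still contains more than one codepoint after composition, the sequence needs curly brackets in an exemplar; if it collapses to one, the composed character can be listed on its own.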

## Formatting Text in Collation
#### Formatting Text in Collation

Collation and Sorting is a complex enough topic to require a separate page on this site, found [here][collation]. However, for the sake of this article, it should be noted that tailored collation follows different formatting rules than most other data found within the text sections of an XML file, particularly in regard to escaping and multigraphs.

8 changes: 4 additions & 4 deletions src/content/docs/topics/writingsystems/locale-data.md
@@ -4,7 +4,7 @@ sidebar:
order: 1500
---

# What is a Locale?
### What is a Locale?

Locale, in the context of computing, is a collection of parameters that affect how information is expressed or presented within a particular group of users, generally distinguished from one another on the basis of language or location (usually country).

@@ -14,7 +14,7 @@ For example, an English in the United States using Latin Script is a different l

Locales are identified with a key called a [Language Tag][langtag]. This is a three-part key defined by [BCP 47][BCP 47], which consists of the language, script, and region. For example, English in the United States using Latin Script would have a full tag of en-Latn-US. For the purposes of this page, it is enough to be able to recognize a tag; for more information on Language Tags themselves, see the [Language Tagging][langtag] page on this site.
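
As a minimal sketch of the three-part structure, the following Python function splits a full tag such as en-Latn-US into its parts. The function name `split_full_tag` is hypothetical, and the sketch assumes the tag is already in its full language-script-region form; shorter tags and tags with extra subtags are not handled here.

```python
def split_full_tag(tag):
    """Split a full language-script-region tag into its three parts."""
    parts = tag.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected language-script-region, got {tag!r}")
    language, script, region = parts
    return {"language": language, "script": script, "region": region}


print(split_full_tag("en-Latn-US"))
# {'language': 'en', 'script': 'Latn', 'region': 'US'}
```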

# What is Locale Data?
### What is Locale Data?

Locale Data refers to the data needed to present a user from a specific locale with information that would be familiar to them. This includes, but is not limited to:
- Important vocabulary
@@ -28,12 +28,12 @@ Locale Data refers to the data needed to present a user from a specific locale w

While companies such as Meta and Microsoft often have their own internal systems for defining locale data, this site will primarily focus on the CLDR and SLDR. These repositories contain files written in LDML (Locale Data Markup Language) that define locale data for a wide range of locales.

# More on this site:
### More on this site:

- [CLDR and SLDR][cldr and sldr]
- [LDML][ldml]

# More from External Sources:
### More from External Sources:

- [UTR #35: "What is a Locale?"][unicodelocaledef]
