From 493b161c2d54564ac371ca7233fba2ed7810f07a Mon Sep 17 00:00:00 2001 From: Lorna Evans Date: Fri, 15 Nov 2024 15:18:03 -0600 Subject: [PATCH] Updated glossary to include some new words. Questions are at the top whether to include words. --- src/content/docs/reference/glossary.md | 39 +++++++++++++++++++++++--- 1 file changed, 35 insertions(+), 4 deletions(-) diff --git a/src/content/docs/reference/glossary.md b/src/content/docs/reference/glossary.md index 6d06465..6e865f6 100644 --- a/src/content/docs/reference/glossary.md +++ b/src/content/docs/reference/glossary.md @@ -3,11 +3,17 @@ title: Glossary description: Glossary sidebar: order: 9100 -lastUpdated: 2024-10-22 +lastUpdated: 2024-11-15 --- This glossary covers a wide range of terms used for discussing writing systems. +**Note:** Should we reference things like Shoebox, Lingualinks, WorldPad, PERL, SGML, SDF? or is it time to remove them? Some of them are cross referenced like PERL and "Practical Extraction and Reporting Language" so be sure to remove both if we remove them. + +**Note:** Decide whether to add these: GlyphsApp, smith, Docker, Container, Image, anvil, WSL, TeamCity, Google Fonts, + +**Note:** Added some words that need definitions added. +   |   |   |   |   | | | | | | | |   -- | -- | -- | -- | -- | -- | -- | -- | -- | --| -- | -- | -- @@ -61,6 +67,7 @@ Term | Definition character encoding form|a system for representing the codepoints associated with a particular coded character set in terms of code values of a particular datatype or size. For many situations, this is a trivial mapping: codepoints are represented by bytes with the same integer value as the codepoint. Some encoding forms may represent codepoints in terms of 16- or 32-bit values, though, and some 8-bit encoding forms may be able to represent a codespace that has more than 256 codepoints by using multiple-byte sequences. 
Most encoding forms are designed specifically for use in connection with a particular coded character set; e.g. UTF-8 is used specifically for encoded representation of the Universal Character Set defined by Unicode and ISO/IEC 10646. Some encoding forms may be designed for use with multiple repertoires, however. For example, the ISO 2022 encoding form supports an open collection of coded character sets and specifies changes between character sets in a data stream using escape sequences. See also [Unicode TR17 Character Encoding Form](https://www.unicode.org/reports/tr17/#CharacterEncodingForm). character encoding scheme|a character encoding form with a specific byte order serialization (relevant mainly for 16- or 32-bit encoding forms) character set encoding|a system for encoded representation of textual data that specifies the following: (1) a coded character set, (2) one or more character encoding forms and (3) one or more character encoding schemes. +Character Variant|**Add definition** charset|an identifier used to specify a set of characters. Used particularly in Microsoft Windows and TrueType fonts, and in HTML and other Internet or Web protocols to refer to identifiers for particular subsets of the Universal Character Set. CJKV (Chinese, Japanese, Korean and Vietnamese)|the significance of this grouping of languages is that all have writing systems that use Han ideographic characters. CLDR|The _Common Locale Data Repository_. An extensive repository of locale data, where a locale is a language, spoken in a particular country, written in a particular script. The CLDR is designed to provide key building blocks for software to support the world's languages, and is hosted by the Unicode Consortium. @@ -76,6 +83,7 @@ Term | Definition conjunct|a ligature, in particular, a ligature representing a consonant cluster in an Indic script. continuant|in phonetics, a speech sound which is produced without complete closure of the vocal tract. 
That is, any sound other than a stop, or any sound which can be articulated continuously. contrastive distribution|the relation between two or more variants of a given entity (sound, morpheme etc) which distinguish between units. For example, a pair of phonemes such as [p] and [b], which distinguish between the words _pin_ [pɪn] and _bin_ [bɪn] are said to be in contrastive distribution. Suffixes such as -ed and -s which distinguish between the past tense (e.g. _walked_) and the present tense (e.g. _walks_) are similarly said to be in contrastive distribution. The counterparts to contrastive distribution are complementary distribution and free variation. See also [Contrastive Distribution](https://en.wikipedia.org/wiki/Contrastive_distribution). +CoreText|**Add definition** creole|a fully-functioning language which has developed as a result of interaction between two (or more) parent languages. Often, a creole develops from a pidgin if the pidgin is used for long enough for a sophisticated grammar and vocabulary to evolve, and if the pidgin acquires native speakers (if children learn it as their first language). CSS|see cascading style sheets. @@ -83,16 +91,21 @@ Term | Definition Term | Definition ---- | ---------- +DBL|see Digital Bible Library. +Digital Bible Library| **Add definition** +DDL|see Digitally Disadvantaged Language. +Digitally Disadvantaged Language| **Add definition** dead key|a key in a particular keyboard layout that does not generate a character, but rather changes the character generated by a following keystroke. Dead keys are commonly used to enter accented forms of letters in writing systems based on Roman script. deep encoding|see semantic encoding. defective|with regard to writing systems, a writing system which does not represent all the distinctive sounds of the language it represents. -deprecate/deprecated|???write something???
+deprecate/deprecated| **Add definition** descent|the distance between the bottom of the line of text and the baseline, or the distance from the baseline to the bottom of the lowest glyph in a font. determinative|in semantics, a class of words that indicates, specifies or limits a noun, such as the definite or indefinite article, the genitive (possessive) marker, or cardinal numbers. In logographic writing systems, determinatives are one of three types of logograph, the other two being phonographs and ideographs. Determinatives generally have no spoken equivalent but perform a grammatical function to disambiguate between multiple possible interpretations of a phonograph or ideograph. diacritic|a written symbol which is structurally dependent upon another symbol; that is, a symbol that does not occur independently, but always occurs with and is visually positioned in relation to another character, usually above or below. Diacritics are also sometimes referred to as accents. For example, acute, grave, circumflex, etc. diaeresis|a diacritic mark (¨), also called tréma, commonly placed over the second of two adjacent vowels to indicate that they are to be pronounced as separate sounds rather than as a diphthong, as in the English word _naïve_. It can also be used to indicate that an otherwise unpronounced vowel is to be pronounced, as in the English name _Brontë_ or the French word _cigüe_. In Welsh orthography, it is often written on the first of two adjacent vowels to indicate that the first vowel bears stress. The same mark when used over a single vowel in Germanic writing is called an umlaut and indicates a change in vowel quality. digraph|a multigraph composed of two components. diphthong|in phonetics, a complex speech sound occupying one syllable, which begins with one vowel and ends with another. For example [eɪ] in British (RP) pronunciation of the word _lane_. See also monophthong. 
+DirectWrite| **Add definition** display encoding|See presentation-form encoding. distinctive|also _contrastive_. An element which makes a distinction between units. In phonology, a process or a pair of sounds, the alternation of which changes the meaning of a word. See also phoneme, minimal pair. For example, voicing is distinctive in most non-tonal languages, as illustrated by the difference between English _fan_ and _van_, or German _Kern_ and _gern_. document|a collection of information. This includes the common sense of the word, i.e. an organization of primarily textual information that can be produced by a word processing or data processing application. It goes beyond this, however, to include structured information held within an XML file. Each XML file is considered to contain one document, whatever the structure and type of that information. @@ -138,6 +151,7 @@ Term | Definition ---- | ---------- haček|See caron. heterogram|a term used mostly in the study of ancient texts, referring to a special kind of a logogram consisting of the written representation of a word in a foreign language. +HarfBuzz|**Add definition** heteronym|homographs which, although spelled the same way, are pronounced differently and have different meanings. For example, in English 'wind' (noun, as in weather) and 'wind' (verb, to coil something). homograph|one of multiple words having the same spelling but different meanings. They may be pronounced differently (for example in English 'tear: rip' and 'tear: secreted when crying'), in which case they are also heteronyms, or they may be pronounced the same (for example in American English 'tire: cause to be fatigued' and 'tire: wheel of a car'), in which case they are also homophones. homophone|one of multiple words having the same pronunciation but different meanings. 
They may be spelled differently (for example in English 'write' and 'right'), in which case they are called heterographs, or the same (for example in English 'bark: on a tree' and 'bark: of a dog'), in which case they are also homographs. @@ -172,6 +186,8 @@ Term | Definition LANGID|in the Microsoft Win32 API, a 16-bit integer used to identify a language or locale. A LANGID is composed of a 10-bit primary language identifier together with a 6-bit sub-language identifier (the latter being used to indicate regional distinctions for locales that use the same language). language ID|a constant value within some system used for metadata identification of the language in which information is expressed. May be numeric or character based, depending on the system. Latin script|see Roman script. +LFF|see Language Font Finder. +Language Font Finder| **Add definition** left side-bearing|the white space at the left edge of a glyph's visual representation, or more specifically, the distance between the current horizontal display position and the left edge of the glyph's bounding box. A positive left side-bearing indicates white space between the glyph and the previous one; a negative left side-bearing indicates overlap or overhang between them. ligature|a single shape or glyph that represents two or more underlying characters. See also conjunct. locale|a collection of parameters that affect how information is expressed or presented within a particular group of users, generally distinguished from one another on the basis of language or location (usually country). Locale settings affect things such as number formats, calendrical systems and date and time formats, as well as language and writing system. @@ -193,7 +209,7 @@ Term | Definition minimal pair|a pair of words distinguished by only one phoneme, for example in German /kern/ 'centre' and /gern/ 'like, with pleasure'. mnemonic keyboard|a keyboard layout based on the characters appearing on the keytops of the keyboard. 
See also positional keyboard. monophthong|a vowel sound which does not change in quality as it is articulated. (Contrast with diphthong.) It can be short, as in English _bed_ [b?d], or long, as in English _bead_ [bi:d]. A single short monophthong is the shortest syllable in any language. The process by which monophthongs change to diphthongs or vice versa is an important factor in language change. Diphthongization in the 15th or 16th century changed the long German monophthong [i?] to [a?], as in _Eis_ 'ice', and long [u?] to [a?] as in _Haus_ 'house'. A characteristic of Southern American English is the monophthongization of certain dipthongs such as [a?] to long [a:] in words such as _kite_. -mora|a unit of rhythmic measurement based syllable weight, which is distinctive in some languages. Japanese is one of the most well-documented of these languages. Short (or light) syllables are _monomoraic_, consisting of one mora. Long (or heavy) syllables are _bimoraic_, consisting of two morae. Some languages contain superheavy syllables, for example Hindi, in which a long vowel can be followed by a geminate consonant. These syllables are said to be _trimoraic_. The first consonant of a syllable does not represent any morae, as it does not constitute a syllable in itself. Syllable-final consonants can either form the final part of a bi- or trimoraic syllable, as is the case in Goidelic Irish, or they can represent a mora in themselves, as is the case in Japanese. Although there is a relation between syllables and morae, they are not necessarily interchangeable. For example, the Japanese word for 'photograph', [sjasin], consists of 2 syllables: sja + sin, but 3 morae: sja + si + n. (source: Jouji Miwa at [Mora and Syllable](http://sp.cis.iwate-u.ac.jp/sp/lessonj/doc/mora.html) +mora|a unit of rhythmic measurement based on syllable weight, which is distinctive in some languages. Japanese is one of the most well-documented of these languages.
Short (or light) syllables are _monomoraic_, consisting of one mora. Long (or heavy) syllables are _bimoraic_, consisting of two morae. Some languages contain superheavy syllables, for example Hindi, in which a long vowel can be followed by a geminate consonant. These syllables are said to be _trimoraic_. The first consonant of a syllable does not represent any morae, as it does not constitute a syllable in itself. Syllable-final consonants can either form the final part of a bi- or trimoraic syllable, as is the case in Goidelic Irish, or they can represent a mora in themselves, as is the case in Japanese. Although there is a relation between syllables and morae, they are not necessarily interchangeable. For example, the Japanese word for 'photograph', [sjasin], consists of 2 syllables: sja + sin, but 3 morae: sja + si + n. (source: Jouji Miwa at [Mora and Syllable](http://sp.cis.iwate-u.ac.jp/sp/lessonj/doc/mora.html) (**No longer there**)) multi-language enabling|see script enabling. multi-script enabling|see script enabling. multi-script encoding|an encoding implementation for some particular language that is designed to enable input to and rendering from that encoding using more than one writing system. When such an implementation is used, the different writing systems are normally based on different scripts. @@ -241,6 +257,7 @@ Term | Definition presentation-form encoding|a character encoding system in which the abstract characters that are encoded match one-for-one with the glyphs required for text display. Such encodings allow correct rendering of writing systems on 'dumb' rendering systems by having distinct codepoints for contextual forms, positional variants, etc. and are designed on the basis of rendering needs rather than on the basis of character semantics (the linguistically relevant information). Also known as glyph encoding, display encoding or surface encoding; distinguished from semantic encoding. 
Private Use Area (PUA)|a range of Unicode codepoints (E000 - F8FF and planes 15 and 16) that are reserved for private definition and use within an organization or corporation for creating proprietary, non-standard character definitions. For more information see The Unicode Consortium, 1996, pp. 619 ff. PUA|see Private Use Area. +Python|**Add definition** ## Q @@ -265,7 +282,7 @@ Term | Definition ---- | ---------- schema|in markup, a set of rules for document structure and content. script|a maximal collection of characters used for writing languages or for transcribing linguistic data that share common characteristics of appearance, share a common set of typical behaviours, have a common history of development, and that would be identified as being related by some community of users. Examples: Roman (or Latin) script, Arabic script, Cyrillic script, Thai script, Devanagari script, Chinese script, etc. -Script Description File (SDF)|a file describing certain kinds of complex script behaviour, used to control a rendering engine to which it has given its name. Created by Tim Erickson and used in [Shoebox](https://software.sil.org/shoebox), [LinguaLinks](https://www.sil.org/resources/publications/lingualinks), and ScriptPad. +Script Description File (SDF)|a file describing certain kinds of complex script behaviour, used to control a rendering engine to which it has given its name. Created by Tim Erickson and used in [Shoebox](https://software.sil.org/shoebox), [LinguaLinks](https://www.sil.org/resources/publications/lingualinks), and ScriptPad. **None of these products are in use anymore.** script enabling|providing the capability in software to allow documents to include text in multiple languages or scripts, and to handle input, display, editing and other text-related operations of text data in multiple languages and scripts. 
Script enabling has to do with the script in which language data is written, as opposed to localization, which has to do with the language and script of the user interface. SDF|see Script Description File. segmental writing system|one of two categories of phonologically-based (that is, not logographic or featural) writing systems, the other being syllabic writing systems, or syllabaries. Segmental writing systems represent consonants and vowels, rather than whole syllables, as individual units. Alphabets, abugidas, and abjads are all classed as segmental writing systems. There is potential for confusion over the inclusion of abugidas in this category, as each character in this type of script does represent a full syllable. However, individual consonants and vowels are acknowledged as discrete elements in that there is a systematic graphic similarity between characters which represent syllables sharing a particular consonant or vowel. As a test to determine whether a writing system is a syllabary or a segmental abugida, if it is syllabic there will be no systematic visual similarity between, for example, the characters or character sequences representing [ka], [ke], [ko], or [ki], [pi], [ti], but in an abugida there will be. @@ -273,13 +290,18 @@ Term | Definition SFM|see Standard Format Marker. SGML|See Standard Generalized Markup Language. side bearing|the white space at the edge of a glyph; see left side-bearing, right side-bearing. There can also be top and bottom side bearings, of use when rendering text vertically. +SIL Locale Data repository| **Add definition** +SLDR|see SIL Locale Data repository. smart font|a font capable of performing transformations on complex patterns of glyphs, above and beyond the simple character-to-glyph mapping that is a basic function of font rendering (see cmap). 
The information specifying the smart behavior is typically in the form of extra tables embedded in the font, and will generally allow layered transformations involving one-to-many, many-to-one, and many-to-many mappings of glyphs. +shaping engine|**See [HarfBuzz](#hb), [DirectWrite](#dw), [CoreText](#ct), [Universal Shaping Engine](#use).** smart rendering|a rendering process that uses a smart font. sort key|a sequence of numbers that when appropriately processed using a particular standard algorithm will position the corresponding string in the correct sort position in relation to other strings. The sort key need not correspond one number to one codepoint in the input string. Standard Format Marker (SFM)|an element of a proprietary format developed by SIL International and used by some linguistic software applications. A standard format marker begins with a backslash (\\); for example, `\p` would represent a paragraph tag. It is possible (and even probable) that SFMs in a single document have different character encodings. When converting to one encoding (Unicode) these must be converted with different mapping files. Standard Generalized Markup Language (SGML)|a notation for generalized markup developed by the [International Organization for Standardization](https://www.iso.org) (ISO). It separates textual information from the processing function used for formatting. It was found difficult to parse, due to the many variants possible, and so XML was developed as a subset to resolve the ambiguities and to make parsing easier. +smith|**Add definition** stop|also called a _plosive_. In phonology, a speech sound whose production involves a complete blockage of the air flow. This may include only consonants in which the air flow is blocked through both the mouth and the nose, such as [p] or [k], or those in which the air flow is blocked through the mouth only, such as [m] or [n].
Sounds in which the airflow is blocked through both the mouth and the nose cannot be articulated continuously. stress accent|one of two types of phonological accent by which one syllable is heard to be more prominent than others, its counterpart being a pitch accent. Phonetically, stress is due to a difference in length, volume, vowel quality, or a combination of these. These differences are thought to reflect a greater muscular energy in the production of the stressed syllable. The placement of stress may determine the meaning of a word, for example in the case of the two English words /conˈtent/ and /ˈcontent/. Accents may or may not be marked in writing, depending on the orthographic conventions of a particular language. +Stylistic set|**Add definition** supplementary planes|Unicode Planes 1 through 16, consisting of the supplementary code points, corresponding to codepoints U+10000 to U+10FFFF. In The Unicode Standard 3.1, characters were assigned in the supplementary planes for the first time, in Planes 1, 2 and 14. See also Basic Multilingual Plane. suprasegmental|a unit or feature whose domain extends over more than one minimal element. For example, stress is classed as a suprasegmental feature because its domain is a whole syllable, comprised of the smaller minimal elements consonants and vowels. Suprasegmental features may be marked in writing; in these cases, the area in which they are written is called the suprasegmental box. surface encoding|see presentation form encoding. @@ -294,7 +316,10 @@ Term | Definition ---- | ---------- tokenization|the process of analysing a string into a contiguous sequence of smaller units: for example, word breaking or syllable breaking or the creation of a sort key. tone|a unit belonging to a set characterized primarily by differences or changes in the levels of pitch. 
In a tone language, tone is used to distinguish each syllable, as illustrated by the three Ngbaka words /mà/ (with a low tone) _magic_, /mā/ (mid tone) _I_, and /má/ (high tone) _to me_. Tone can also be used in a system of intonation; for example, in English a rising tone may indicate surprise while a falling tone may indicate disappointment. +transcription|**a written representation of something spoken.** +transliteration|**converting written text from one script into another. Usually this would be from a non-Roman script into Roman (Latin) script.** TrueType font|font format used primarily in Windows and on the Mac, allows for glyph scaling and hinting. +TypeTuner| **Add definition** ## U @@ -307,6 +332,8 @@ Term | Definition Unicode Scalar Value (USV)|a number written as a hexadecimal (base 16) value that serves as the codepoint for Unicode characters. Characters in the BMP are written with four hex digits, eg: U+0061, U+AA32. Characters in supplementary planes use five or six digits. Uniscribe (Unicode Script Processor)|due to technical limitations in OpenType, it is necessary to pre-process strings before applying OpenType smart behaviour. Microsoft uses a particular DLL (Dynamic Link Library) called Uniscribe to do this pre-processing. Uniscribe does all of the script specific, font generic processing of a string (such as reordering) leaving the font specific processing (such as contextual forms) to the OpenType lookups of a font. Universal Character Set (UCS)|the coded character set defined by Unicode and ISO/IEC 10646, intended to support all commonly used characters from all writing systems, current and past. +Universal Shaping Engine|**Add definition** +USE|see Universal Shaping Engine. USV|see Unicode Scalar Value. UTF-16|an encoding form for storing Unicode codepoints in 16-bit words. It includes the concept of surrogate pairs to encode values from U+10000 - U+10FFFF as two 16-bit words. UTF-32|an encoding form for storing Unicode codepoints in 32-bit words.
Since 32 bits encompasses the entire range of Unicode, every codepoint is encoded as a single 32-bit word. See Unicode Technical Report #19. @@ -327,6 +354,10 @@ Term | Definition Term | Definition ---- | ---------- +Web Open Font Format|**Add definition** +WOFF|see Web Open Font Format. +WOFF2|see WOFF File Format 2.0. +WOFF File Format 2.0| **Add definition** writing system|an implementation of one or more scripts to form a complete system for writing a particular language. Most writing systems are based primarily upon a single script; writing systems for Japanese and Korean are notable exceptions. Many languages have multiple writing systems, however, each based on different scripts; e.g. the Mongolian language can be written using Mongolian or Cyrillic scripts. A writing system uses some subset of the characters of the script or scripts on which it is based with most or all of the behaviours typical to that script and possibly certain behaviours that are peculiar to that particular writing system. ## X