Add short form encoding alternative for highly predictable and regular ruby #4
Comments
I have to say I really don't like this. I don't see why you shouldn't do it in your project if you want to, but it doesn't look like something I would want to recommend in the Guidelines.
Well there are two questions about the |
On contents of Rubies (and thus @duncdrum’s Q2 & Q3)
I would really like to keep That said, I can see no reason to limit the length of a string in I should make clear, though, that I see absolutely no reason to exclude

On ZeroWidthSpace (U+200B, ZWSP; and thus @duncdrum’s Q4)
I don’t see any strong reason to include reference to Unicode characters that should not be used in marked up documents. I suppose U+200B is not shunned the way, say, U+E0001 is; nonetheless, seems to me it is in the same general category of “a character-level way to do something when you do not have access to markup”. That said, I do not see any strong reason not to include reference to it, either. (Although I would be in favor of getting Ruby into the Guidelines first, and adding a “roads not taken” discussion of U+200B later, if discussion of it were to slow us down. And it is quite reasonable to think such a discussion should be in WD with discussion of all the various other Unicode characters that should be avoided in an XML document.)

On @duncdrum’s Q1
I don’t get it. This “force But more importantly, I am not sure I see how this works. How does a processor know that the (BTW,
Thank you @sydb, I'll prepare another PR, using the contents of the PDF as marked-up examples. If everybody but me thinks
No it was not, that's the point. Bopomofo was included in the original proposal; it can use whitespace characters to delimit word boundaries. CJK does not, not even punctuation marks. This doesn't really make much of a difference if you only look at a single term like The max line-length is 15 characters, so if we use This is not suitable for complex cases, but again, here are 900 pages where complexity doesn't occur (and there are many more documents like it). Even within a complex document, I would like to restrict the use of complex markup to where it is absolutely necessary (such as multiple ruby streams annotating overlaying hierarchies of semantic units). And a shout-out to @747 in the other thread: this document actually doesn't use emphasis markers; instead, just the punctuation marks appear next to the characters. One could argue about whether the punctuation marks are actually part of the rt stream or should be encoded as part of the base.
Re |
I don't think it should be warned against; I think it should not be mentioned at all. Certainly not in the initial simple implementation/guidance intended for the coming release of the Guidelines.
Thank you. In the spirit of encouraging consistent markup, I would suggest including a code snippet that shows this; currently the breaks in the various samples here appear outside of As for my Q1, this is significant: I think that the possibility to not use @knagasaki, what do you think about the option to assume a one-to-one relationship by default? If you see the value, I think we can rather quickly come up with better elements
The best solution depends on the trends in programming languages (and, recently, their libraries) and on the skill sets of the people hired for a text encoding project. The mechanism of the Guidelines also provides several solutions implicitly. At least in the initial phase, the Guidelines should not mention such deep markup on the basis of a few small examples. After the initial publication, once some projects have actually adopted the elements, it should be discussed as necessary in the SIG, on TEI-L, on GitHub, and so on.
@knagasaki This is exactly how I feel; I think we need to get a simple implementation and a basic prose introduction done soon -- you have all been waiting long enough for this -- and then we can start raising tickets for specific issues which are more complicated. If the current suggested prose is acceptable as far as it goes, I would like to introduce one more example which shows a longer block of text with two or three ruby instances -- there's a nice example in the original proposal (Figure 1), although someone will have to provide the transcription for me because I can't decipher the calligraphy. I think the only thing we should consider adding to the schema at this point is |
The default place of |
Because the text-directionality is established by the use of |
Thank you for your explanation. I hope both would be implemented.
Import from main repo
Add a lightweight alternative means of encoding ruby, for cases where complexity is not needed and a standoff approach could save ~50% of markup.
See HZ04-004-01.pdf: 800+ pages, ruby on every character, no irregularities or special cases.
Based on proposal
This switches the markup logic of the original proposal around by not specifying anchors and segments in the ruby base, instead assuming a 1-1 relationship by default between sequences of rb and rt, and only adding markup where this is not the case (<supplied>, <group>). This would greatly reduce the markup load on long, regular documents. It is not intended to replace the full-fledged nested example, but to serve as an alternative for cases where more lightweight markup is desirable so as not to interfere with other markup.
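To make the default 1-1 assumption concrete, here is a minimal Python sketch of how a processor might align a base stream with an annotation stream. It is only an illustration, not part of the proposal: the function name `pair_ruby`, the whitespace-delimited annotation string, and the sample text are all assumptions, and it treats every non-whitespace code point in the base as one unit (so combining sequences or variation selectors would need extra handling).

```python
def pair_ruby(base: str, annotations: str) -> list[tuple[str, str]]:
    """Pair each base character with one whitespace-delimited annotation.

    Hypothetical helper: assumes one annotation per base character and
    raises if the two streams do not line up.
    """
    # One unit per non-whitespace code point in the base text.
    base_units = [ch for ch in base if not ch.isspace()]
    # Annotations are assumed to be separated by whitespace.
    rt_units = annotations.split()
    if len(base_units) != len(rt_units):
        raise ValueError(
            f"{len(base_units)} base characters but {len(rt_units)} annotations"
        )
    return list(zip(base_units, rt_units))


# Illustrative input only; not taken from the PDF.
pairs = pair_ruby("漢字", "hàn zì")
print(pairs)
```

A mismatch in counts is exactly the point at which an encoder would have to fall back to explicit markup; the sketch signals that case with an error instead of guessing.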
Question 1: Is there anything speaking against this, given that there is a full-fledged means to deal with tricky cases by using e.g. nested <ruby> elements?
Question 2: Chunk length. Technically the whole 900-page PDF could be captured by something like this:
We can leave it up to encoders to decide on acceptable chunk lengths, make a suggestion, or even limit the max length of <rb> via schema (not my preferred option, but it exists).
Question 3: <pb/>, <lb/> etc. Can or should they go into rb? Do we want to exclude that possibility?
Question 4: U+200B to introduce otherwise invisible word separation.
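As a rough illustration of the U+200B idea, the following Python sketch splits a base string into words at zero-width spaces. The helper name `split_words` and the sample string are assumptions made for this example; nothing here is taken from the commit referenced below.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def split_words(base: str) -> list[str]:
    """Split a base string into words at zero-width spaces.

    Hypothetical helper: empty segments (e.g. from doubled separators)
    are dropped.
    """
    return [word for word in base.split(ZWSP) if word]


# Illustrative input only: two two-character words joined by an invisible separator.
words = split_words("漢字" + ZWSP + "文化")
print(words)
```

The separator is invisible in rendered text, which is both its appeal for unspaced scripts and the reason it is easy to lose or overlook in an XML document.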
see ab9b7d1