---------------------------------------------------------------
HOWTO-bidi.txt: Editing Bidirectional Documents With Yudit
---------------------------------------------------------------
From version 2.7 on, Yudit should show bidirectional text
just like any other Unicode application that implements the
Unicode Bidirectional Algorithm.

Paragraphs with initial directionality LR, like English
text, will be aligned to the left, while text with RL
initial directionality will be aligned to the right.

As the Unicode Standard allows higher-level protocols to
impose a document embedding, Yudit can enforce an LR
or RL embedding on the whole document if the user
sets it with the text embedding button. This will
force left or right alignment on the whole text.
---------------------------------------------------------------
Usage
---------------------------------------------------------------
1. What is implicit bidirectional behavior?
 All characters in Unicode belong to one of several
 bidirectional classes. Depending on these character properties,
 all characters in the document must be reordered into a visual
 order dictated by a rather convoluted algorithm in UAX#9.
 By implicit bidirectional behavior I mean the behavior
 that relies purely on the characters' bidirectional class
 property.
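
 The bidirectional class of a character can be looked up directly.
 The following is a minimal sketch using Python's standard
 unicodedata module. Python is only used for illustration here, it
 is not part of Yudit, and the helper name bidi_classes is made up
 for this example.

```python
import unicodedata

def bidi_classes(text):
    """Return (character name, bidirectional class) for each character."""
    result = []
    for ch in text:
        # Control characters such as TAB have no name, so fall back to U+XXXX.
        name = unicodedata.name(ch, "U+%04X" % ord(ch))
        result.append((name, unicodedata.bidirectional(ch)))
    return result

# Latin letter, Arabic letter, Hebrew letter, European digit,
# Arabic-Indic digit, TAB and RLM.
for name, cls in bidi_classes("a\u0633\u05d01\u0663\t\u200f"):
    print(f"{cls:3} {name}")
```

 The classes printed for the digits and for TAB (EN, AN and S) are
 the ones that matter in the Surprise Effects part of this document.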

2. How to invoke implicit bidi?
 You don't need to do anything, just type:

   He said “سلام!‏”

 Please note that I cheated here: I added an RLM (Right-to-Left Mark)
 U+200F at the end. I wanted to make the text more digestible in
 this English document. This mark is visible in the editor window
 but it will not appear when printing or when used in labels.

3. What is explicit embedding and override?
 In addition to the inherent bidirectional properties of the
 characters, Unicode allows text between certain markers to be
 rendered Left to Right or Right to Left (override), or to behave
 as if its embedding context were Left to Right or Right to Left
 (embedding).

 These markers can be nested. The PDF (Pop Directional Format)
 marker restores the last embedding state.

 a) Directional Override
  Text between 
      RLO (Right to Left Override) … PDF (Pop Directional Format)
      LRO (Left to Right Override) … PDF (Pop Directional Format)
  will have an RL or LR explicit directionality, regardless of
  the bidirectional properties of its characters. However, this
  explicit directionality is (unfortunately) not used when the
  initial directionality is determined, so your text might not be
  aligned as you expect.
  UAX#9 P2:
     In each paragraph, find the first character of type L, AL, or R.

     Because paragraph separators delimit text in this algorithm,
     this will generally be the first strong character after a 
     paragraph separator or at the very beginning of the text. 
     Note that the characters of type LRE, LRO, RLE, RLO are
     ignored in this rule. This is because typically they are used
     to indicate that the embedded text is the opposite direction
     than the paragraph level.
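
  The rule quoted above can be sketched in a few lines. This is an
  illustration only, assuming Python's standard unicodedata module;
  the function name paragraph_level is hypothetical, and later
  revisions of UAX#9 add further exceptions that are not modeled here.

```python
import unicodedata

def paragraph_level(text):
    """Return 0 (LR) or 1 (RL) from the first strong character (P2/P3)."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        # LRE, LRO, RLE, RLO and PDF have their own classes, so they
        # are skipped here just like any other non-strong character.
        if cls == "L":
            return 0
        if cls in ("R", "AL"):
            return 1
    return 0  # no strong character found: default to LR

# An Arabic word forced to LR with LRO ... PDF still yields an RL paragraph.
print(paragraph_level("\u202d\u0633\u0644\u0627\u0645\u202c"))
# An English phrase forced to RL with RLO ... PDF still yields an LR paragraph.
print(paragraph_level("\u202eNO WAY!\u202c"))
```

  This is why overridden text may end up aligned to the side you did
  not expect.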

 b) Directional Embedding
  Text between 
      RLE (Right to Left Embedding) … PDF (Pop Directional Format)
      LRE (Left to Right Embedding) … PDF (Pop Directional Format)
  is embedded. Embeddings are supposed to give some protection for
  the embedding context. The text in the embedding is (in most cases)
  rendered as if the initial embedding of the text were RL or LR.
  Please note that there are some characters that make this mission
  impossible: in fact it is not really possible to make use of RLE
  or LRE if you use those characters. (Should they be forbidden?
  Read on.)
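
  How nesting and PDF interact can be sketched with a small stack.
  This is a simplified illustration and not Yudit's code: it only
  tracks embedding levels, it ignores the override status and the
  maximum nesting depth of UAX#9, and the name explicit_levels is
  made up for the example.

```python
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"

def explicit_levels(text, base_level=0):
    """Return [(char, level)] for all characters that are not markers."""
    stack = []          # levels to return to when a PDF is met
    level = base_level
    result = []
    for ch in text:
        if ch in (RLE, RLO):
            stack.append(level)
            # least odd level greater than the current one
            level = level + 1 if level % 2 == 0 else level + 2
        elif ch in (LRE, LRO):
            stack.append(level)
            # least even level greater than the current one
            level = level + 2 if level % 2 == 0 else level + 1
        elif ch == PDF:
            if stack:   # a PDF without a matching opener does nothing
                level = stack.pop()
        else:
            result.append((ch, level))
    return result

print(explicit_levels("a" + RLE + "b" + LRE + "c" + PDF + "d" + PDF + "e"))
```

  Odd levels are RL and even levels are LR, so each PDF simply
  returns to the level that was active before the matching marker.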

 In Yudit you do not need to care about markers for a) and b),
 they are totally hidden. Your embedded text will have a brighter
 or darker ‪background‬, this way you can tell the embedding range.

 Unicode allows for 3 levels of the bidirectional algorithm:

  1. No bidirectional formatting. This implies that the system
     does not visually interpret characters from right-to-left
     scripts. 
  2. Implicit bi-directionality. The implicit bidirectional algorithm
     and the directional marks RLM and LRM are supported. 
  3. Full bi-directionality. The implicit bidirectional algorithm,
     the implicit directional marks, and the explicit directional
     embedding codes are supported: RLM, LRM, LRE, RLE, LRO, RLO, PDF.

 Yudit now has full bidirectional support (level 3).
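
 For reference, the seven marks named above can be listed with their
 code points. This sketch assumes Python's standard unicodedata
 module; the dictionary name MARKS is only illustrative.

```python
import unicodedata

MARKS = {
    "LRM": "\u200e",
    "RLM": "\u200f",
    "LRE": "\u202a",
    "RLE": "\u202b",
    "PDF": "\u202c",
    "LRO": "\u202d",
    "RLO": "\u202e",
}

# Print abbreviation, code point and the official Unicode name.
for abbrev, ch in MARKS.items():
    print(f"{abbrev}  U+{ord(ch):04X}  {unicodedata.name(ch)}")
```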

4. How to do Explicit Direction Override?
 To override the implicit directionality of characters, press
 Override Direction <ctrl><d> to change direction. Then simply
 continue typing. You can get out of this with the <ctrl><y>
 (Yield Direction) button. You can clearly distinguish the
 embedded text.

     I said “‮NO WAY!‬”.

5. How to do simple Explicit Embedding?
 Similarly, embedding Right-to-Left text in a Left-to-Right document
 needs <ctrl><e> (Embedding Override). This is useful, for instance,
 if you want to say:

     He said: “‫سلام!‬”

 Without the Right-to-Left embedding this would look pretty bad in
 this English document:

     He said “سلام!”

6. I already have a text that I need to embed/un-embed. How to
 do this?

 Select the text before embedding or un-embedding it. The selection
 can be made, for instance, with <alt> and the arrow keys. After
 selecting, keep <alt> pressed and press <d> for Direction Override
 or <e> for Embedding Override. You can bring the text back to
 no embedding with <alt><y> (Yield Embedding).

7. What is Document Text Embedding?
 Yudit can enforce an initial embedding level to the whole document.
 When Yudit is started the initial embedding is reset to none.
 The text is also saved without initial embedding enforcement tags.
 When no initial embedding is enforced, your text can show up 
 aligned to the left or to the right, depending on the natural 
 paragraph embedding level.

8. I want to embed LR text but my embedding arrow is RL.
 The embedding arrows on the tool-bar always point in the
 direction opposite to the current embedding, that is, the
 embedding of the context where the cursor is. This makes the
 operation faster and less error-prone. It is usually
 not desired to embed a text in an LR document as LR. However,
 you can do this with this trick:
 If you want to embed LR text in a document with LR embedding,
 change the Document Text Embedding to RL. Now you can make
 the LR embedding.
 
9. Notes
 In po file translations you might want to consider embedding your
 RL text with explicit RLE so that you will see what you will get
 on that label:
 Without explicit embedding:
     msgstr "سلام Gáspár,  محمد"
 With explicit embedding, you will see what the label will eventually
 show:
     msgstr "‫سلام Gáspár,  محمد‬"
 Please note that most applications do not support Explicit Embedding,
 so use it sparingly. Moreover, explicit embedding does not
 save you from the effects of the Unicode Bidirectional Algorithm.

 You have this text:
     msgstr "‫سلام Gáspár  محمد‬"
 I put the whole thing into RL embedding marks because I want to see
 it this way in my RL text label. It works. But what if I replace the
 leftmost space with a tab?
     msgstr "‫سلام Gáspár	محمد‬"
 Now try to put this in a label. (Try pressing the Document Text
 Embedding button in Yudit for the same effect.) Now you
 see what you will see in that label. To tell the truth, nothing
 saves you from these effects of the Unicode Bidirectional Algorithm.
 If you want to see why this happens, please read Surprise Effects
 in this document.
 Fortunately, if you use gettext you will be able to use the '\t'
 escape for TAB. So when translating a po file please always use
 '\t', like this:
     msgstr "‫سلام Gáspár\tمحمد‬"
 But in short: do not use segment separators in your po translation
 text as is. In case of a non-computer, non-gettext text you are on
 your own.
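
 A translation can be checked for raw segment separators before it
 goes into a po file. The following is a small sketch, assuming
 Python's standard unicodedata module; the helper name
 raw_segment_separators is hypothetical and not part of gettext.

```python
import unicodedata

def raw_segment_separators(msgstr):
    """Return the indexes of raw segment separators (bidi class S)."""
    return [i for i, ch in enumerate(msgstr)
            if unicodedata.bidirectional(ch) == "S"]

RLE, PDF = "\u202b", "\u202c"
with_tab = (RLE + "\u0633\u0644\u0627\u0645 G\u00e1sp\u00e1r"
            + "\t" + "\u0645\u062d\u0645\u062f" + PDF)
# The two-character escape backslash + t contains no class S character.
escaped = with_tab.replace("\t", "\\t")

print(raw_segment_separators(with_tab))
print(raw_segment_separators(escaped))
```

 An empty list means the string contains no raw TAB or other segment
 separator that could be reset to the paragraph level.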

10. Comparing With Other Applications

 I tried to compare Yudit bidi to other applications, but
 the applications had problems even with this simple text:

Hello  ‫العربية 14محمد‬  ‮RLTXT‬ nothing

 I may try it again at a later time.

---------------------------------------------------------------
                     Technical Details
---------------------------------------------------------------
The current Yudit Bidi implementation is a reversible
algorithm when resolving explicit levels, that is, for text
embedded within LRE-PDF, RLE-PDF, LRO-PDF and RLO-PDF
pairs. The algorithm can re-create the text from the
view. This also means that superfluous embedding tags
will be dropped when saving alien (non-Yudit) texts.
These tags will be dropped from portions of the document
that were viewed at least once. I will not give you an
exhaustive list of such cases.

1. While the alien Unicode stream

    <RLO>text1<PDF>text2<RLO>text3<PDF>

 will be saved the same way,

    <RLO>text1<PDF><RLO>text2<PDF>

 will be saved as

    <RLO>text1text2<PDF>

 as they are equivalent, and the latter is shorter.

2. Empty pairs of

    <RLO><PDF>

 or

    <LRO><RLE><PDF><PDF>

 will be deleted from the text, as they have no effect.

3. Spurious 

    <PDF>

 with no matching embedding marks will be deleted 
 from the document.

4. To keep the text editable, LRM and RLM zero-width marks are
 displayed in the editing window, but they will not appear when
 printing or when used in non-editable places, like labels.
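
 The clean-ups in items 1 to 3 can be sketched as follows. This is
 not the code Yudit uses; it is a minimal illustration under the
 assumption that the text is parsed into a tree of embeddings and
 written back in the shortest equivalent form. The names simplify
 and _emit are made up for this example.

```python
PDF = "\u202c"
OPENERS = ("\u202a", "\u202b", "\u202d", "\u202e")  # LRE, RLE, LRO, RLO

def simplify(text):
    """Drop spurious PDFs and empty pairs, merge adjacent equal pairs."""
    root = [None]          # node layout: [opener, child, child, ...]
    stack = [root]
    for ch in text:
        if ch in OPENERS:
            node = [ch]
            stack[-1].append(node)
            stack.append(node)
        elif ch == PDF:
            if len(stack) > 1:
                stack.pop()   # closes the innermost open marker
            # otherwise: a PDF with no matching marker is dropped
        else:
            stack[-1].append(ch)
    return _emit(root)

def _emit(node):
    out = []               # plain characters or (opener, content) pairs
    for child in node[1:]:
        if isinstance(child, str):
            out.append(child)
            continue
        content = _emit(child)
        if not content:
            continue          # empty pair: no effect, drop it
        if out and isinstance(out[-1], tuple) and out[-1][0] == child[0]:
            out[-1] = (child[0], out[-1][1] + content)  # merge neighbours
        else:
            out.append((child[0], content))
    # A marker left open at the end of the text is closed explicitly here.
    return "".join(i if isinstance(i, str) else i[0] + i[1] + PDF for i in out)
```

 Only pairs that touch each other are merged in this sketch; pairs
 separated by any text are written back unchanged.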

---------------------------------------------------------------
‪                         Surprise Effects                      ‬
---------------------------------------------------------------
---------------------------------------------------------------
The Problem Of Not Having Arabic RLM 
---------------------------------------------------------------

According to the Unicode algorithm (Unicode Standard Annex #9):
 
   W2: search backward from each instance of a European number 
       until the first strong type (R, L, AL, or sor) is found.
       If an AL is found, change the type of the European number 
       to Arabic number.

 Probably nobody realized that sor can never be AL at the beginning
 of the line. This proves it:

    X10:  The remaining rules are applied to each run of characters 
       at the same level. For each run, determine the start-of-level-run
       (sor) and end-of-level-run (eor) type, either L or R. This depends 
       on the higher of the two levels on either side of the boundary
       (at the start or end of the paragraph, the level of the 'other'
       run is the base embedding level). If the higher level is odd,
       the type is R, otherwise it is L.

 I think this is ridiculous. In an Arabic context you will get:
  Logical:                  Visual:
  -10% TEST ARABIC          TSET CIBARA -10% 
   ARABIC -10% TEST         TSET %10- CIBARA 
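
 The interaction of the two quoted rules can be sketched like this.
 It is an illustration only, assuming Python's standard unicodedata
 module and a single level run; the function names boundary_type and
 apply_w2 are hypothetical, and rule W1 is not modeled.

```python
import unicodedata

def boundary_type(level_a, level_b):
    """X10: sor/eor is R if the higher of the two levels is odd, else L."""
    return "R" if max(level_a, level_b) % 2 else "L"

def apply_w2(text, sor):
    """W2: EN becomes AN when the nearest preceding strong type is AL."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    last_strong = sor
    for i, cls in enumerate(classes):
        if cls in ("L", "R", "AL"):
            last_strong = cls
        elif cls == "EN" and last_strong == "AL":
            classes[i] = "AN"
    return classes

arabic = "\u0639\u0631\u0628\u064a"
sor = boundary_type(1, 1)   # RL paragraph, run at level 1: never AL
# Number first: the backward search ends at sor, so the digits stay EN.
print(apply_w2("-10% " + arabic, sor))
# Number after Arabic letters: the digits become AN.
print(apply_w2(arabic + " -10%", sor))
```

 Since boundary_type can only return L or R, a number at the start
 of an Arabic paragraph is never treated as an Arabic number.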

 So what is the solution? The standard says that Higher-Level 
 Protocols can:

   Override the number handling to use information provided by a
   broader context. For example, information from other paragraphs
   in a document could be used to conclude that the document was
   fundamentally Arabic, and that EN should generally be converted
   to AN.

 In Yudit I decided not to do this hack. The reason is this:

   When text using a higher-level protocol is to be converted to
   Unicode plain text, formatting codes should be inserted to ensure
   that the order matches that of the higher-level protocol...

 No, with Yudit I don't want to save -10% TEST ARABIC
 as <RLO>-10%<PDF>ARABIC, unless it is requested by the user. Please
 use explicit directionality markers in this case.
 
---------------------------------------------------------------
The Problem Of Characters That Have Global Effects
---------------------------------------------------------------
 What are these characters?
 Segment Separator - its effect is well defined, but surprising.
 Boundary Neutral - its location is not defined; it can
    pop up at any place.
 So let's see what we get for at least the one that is defined:
 Segment Separator - like Tab.
 I tried to use RLE in my translation, so that I could see what
 this label will show as its text:

  msgstr "‫سلام Gáspár	محمد‬"

 Well, as you see, I cannot. If you set Yudit Editor's Document
 Text Alignment to the right, you will see what the label will
 show. Something ‪totally‬ different.

 Unfortunately the Unicode Algorithm requires me to do this.
 UAX #9 L1:
   “On each line, reset the embedding level of the following
    characters to the paragraph embedding level:
    1. Segment Separators.”
 This means that even though this tab is embedded in our text,
 I have to reset it to this English document's embedding
 level. If you use gettext, please use '\t' instead of Tab.
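
 The effect of this part of L1 can be sketched as follows. Only the
 item quoted above is modeled, the levels are typed in by hand as an
 assumption instead of being computed, and the function name
 reset_segment_separators is made up for this example.

```python
import unicodedata

def reset_segment_separators(text, levels, paragraph_level):
    """L1, item 1 only: segment separators drop to the paragraph level."""
    return [paragraph_level if unicodedata.bidirectional(ch) == "S" else lvl
            for ch, lvl in zip(text, levels)]

text = "\u0633\u0644\u0627\u0645 G\u00e1sp\u00e1r\t\u0645\u062d\u0645\u062f"
# Assumed levels inside an RL embedding in an LR paragraph:
# Arabic letters, space and tab at level 1, Latin letters at level 2.
levels = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1]

print(reset_segment_separators(text, levels, paragraph_level=0))
```

 The tab falls to level 0 and cuts the level 1 run in two, so the
 two halves are placed as separate runs of an LR paragraph.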
---------------------------------------------------------------
The Problem Of Having Only One Set Of + - / * . % Characters
---------------------------------------------------------------
 You might find it surprising that programs conforming to
 Unicode Standard Annex #9 must render the following
 this way (I substituted HEBREW with ‫עברית‬ and ARABIC
 with ‫العربية‬, and I also inserted a Right to Left embedding so
 that you see what is going on):

 Surprise #1:
  Input  : HEBREW ~~~23%%% HEBREW abc
  Output : ‫עברית ~~~23%%% עברית abc‬
  Input  : ARABIC ~~~23%%% ARABIC abc
  Output : ‫العربية ~~~23%%% العربية abc‬

 Surprise #2:
  Input: HEBREW 1*5 1-5 1/5 1+5
  Output: ‫עברית 1*5 1-5 1/5 1+5‬
  Input: ARABIC 1*5 1-5 1/5 1+5
  Output: ‫العربية 1*5 1-5 1/5 1+5‬

 I have checked this with the Java reference code from the Unicode Consortium

  http://www.unicode.org/unicode/reports/tr9/BidiReferenceJava/

 so what you see here in Yudit is correct.
 Did you expect this? I feel that there is a fundamental flaw in the
 official Unicode Bidirectional Algorithm that cannot be solved unless
 there are separate character pairs for
   + - / * %
 Without that, all you can do is embed your mathematical equations
 with explicit direction overrides.
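
 Wrapping an equation in an explicit override is a one-liner. This
 is a minimal sketch with a hypothetical helper name, and it only
 helps in applications that support the explicit marks.

```python
LRO, PDF = "\u202d", "\u202c"

def force_left_to_right(expression):
    """Wrap expression in LRO ... PDF so it is rendered left to right."""
    return LRO + expression + PDF

# A Hebrew word followed by an equation that must keep its order.
label = "\u05e2\u05d1\u05e8\u05d9\u05ea " + force_left_to_right("10-2*5")
print([f"U+{ord(ch):04X}" for ch in label])
```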
---------------------------------------------------------------
The Problem Of Irreversibility
---------------------------------------------------------------
 The Unicode Bidirectional Algorithm is irreversible. In other
 words, the logical text can be reordered into visual order, but
 there is no way to guess what the logically ordered text is
 just by looking at the visual text.
 This is a serious problem for digital signatures. If you want
 to sign a document, what you sign is the bit-stream, but what you
 see is the text. As no reverse algorithm is provided, you cannot
 possibly know what you are signing if you are just looking at the
 text.
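
 A toy case shows the problem. The sketch below is not a bidi
 implementation: toy_visual is a made-up helper that only handles an
 LR paragraph of Latin letters, digits and spaces with non-nested
 RLO ... PDF runs, where every overridden character is treated as R
 and the run is therefore shown reversed.

```python
import hashlib

RLO, PDF = "\u202e", "\u202c"

def toy_visual(text):
    """Visual order for the restricted input described above."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == RLO:
            end = text.find(PDF, i)
            if end == -1:
                end = len(text)       # unterminated run ends with the text
            out.append(text[i + 1:end][::-1])
            i = end + 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

plain = "pay 100"
tricky = "pay " + RLO + "001" + PDF

print(toy_visual(plain) == toy_visual(tricky))
print(hashlib.sha256(plain.encode("utf-8")).hexdigest())
print(hashlib.sha256(tricky.encode("utf-8")).hexdigest())
```

 Both strings look the same on screen, but the signed bytes and
 therefore the digests differ.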
---------------------------------------------------------------
The Problem Of Stateful Encoding
---------------------------------------------------------------
 Unicode always laughed at stateful encodings like
 iso-2022-x. In fact the statefulness it introduced with the
 explicit bidirectional marks is even worse, and it would make
 binary editing of Unicode text files with proper undo operation
 next to impossible.
---------------------------------------------------------------
Remarks
---------------------------------------------------------------
 I tested Yudit and found that it is, probably, 100% compliant
 with the full Unicode Bidirectional UAX #9 algorithm. However

   ‪I do not think that that UAX #9 algorithm is good.‬ 

 Moreover, I think that that algorithm should be replaced with
 one that makes more sense.  My clean-room implementation of
 the implicit algorithm mostly lies in
   
   stoolkit/SBiDi.h
   stoolkit/SBiDi.cpp
 
 You can use it in your GNU programs. If the Unicode Consortium ever
 changes its mind it would be very easy to replace that file.

 So how much is:
  Input: HEBREW 10-2*5
  Output: ‫עברי 10-2*5‬
  Input: ARABIC 10-2*5
  Output: ‫العربية 10-2*5‬
 It is your choice. They both have 0 values, literally.

Related documents:
  http://www.yudit.org/bidi/

Document version 1.6
Gaspar Sinai <gsinai@yudit.org>
2002-11-19
