This manual is for GuICU, a Guile internationalization library (version 0.0, 22 December 2007). GuICU provides bindings to functions from the International Components for Unicode library, which provides functionality for multilingualization.
Copyright © 2007 Michael L. Gran
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License”.
--- The Detailed Node Listing ---
Introduction
Tutorial
Reference
GuICU is a Guile module that provides Unicode string functions. This manual corresponds to version 0.0.
Guile is one of GNU's implementations of the Scheme language.
Unicode is a standard that aims to provide encoding and text processing algorithms for all the world's languages (see Unicode Consortium 2007).
Among the implementations of the Unicode standard is that of the International Components for Unicode (ICU). It provides a Unicode library for C/C++ and Java (see http://www.icu-project.org/).
This library, GuICU, wraps some of the functionality of the ICU library as a Guile module.
The Unicode standard has two main ideas: how to make a binary representation for text in any language, and how to do standard string operations on those representations.
GuICU wraps a subset of the operations suggested by the Unicode standard. Here are the primary services provided by GuICU:
First and foremost, to use this library, one must have the Guile language itself. Guile must be installed before this library can be installed. GNU Guile's homepage is http://www.gnu.org/software/guile/. This library has been tested with version 1.8.2. As always, check your operating system distribution to see if it has been packaged, but, if not, the source can be downloaded from the homepage.
Second, the C-language version of the ICU library (ICU4C) must be installed. ICU's homepage is http://icu-project.org/, and the ICU4C library can be obtained from there. This library has been tested with ICU4C version 3.8.0.
Once the prerequisites are installed, this library can be installed. Standard installation directions are found in the distribution in the file INSTALL.
GuICU, once installed, provides one public module (guicu icu) and one private module (guicu raw) that get placed in the Guile site directory (usually /usr/local/share/guile/site). It also installs a Guile C library, libguile-guicu-v-0.la, in the library directory (usually /usr/local/lib).
Depending on your install, it may be necessary to adjust your library search path (LTDL_LIBRARY_PATH) and your Guile search path (GUILE_LOAD_PATH) to get it to work.
Guile programs that use this library should load it with
(use-modules (guicu icu))
In this chapter, a demonstration of some of the features of the library will be given. Unicode strings will be created, modified, and displayed.
For these examples to operate as expected, the terminal emulator used must be set up to display UTF-8 characters. For most users of modern GNU/Linux/BSD distributions, this will probably already be true.
On my GNU/Linux distribution, I accomplish this as follows. First, I ensure that the LANG environment variable is set up to use UTF-8. On my machine I have LANG=en_US.utf8. This indicates that my machine uses American English and displays strings using the UTF-8 encoding. I chose this option from the catalog of options that my distribution had preinstalled in the /usr/lib/locale directory. Your system will likely differ somewhat.
For my examples, I used xterm as my terminal emulator, which uses as default a font (-misc-fixed-*) that contains a useful subset of the Unicode glyphs, including a fair number of the Chinese, Japanese and Korean (CJK) glyphs. As of this date, no commonly available terminal emulator has a font that can print all of the glyphs. You should ensure that your terminal emulator is using a font that covers many Unicode characters.
For Arabic and Hebrew, I instead use the Monospace family of fonts (xterm -fa Monospace).
GuICU functions typically operate on one of two types, unicode characters and unicode strings.
Unicode characters are encoded as Guile integers that range from 0 to 1114111 (aka hex 10FFFF). In general, common characters have lower values and obscure or rare characters have higher values. Unicode characters 32 to 127 are the same as the ASCII characters 32 to 127.
Since, for Guile and C, the words char and uchar already have meaning, the term codepoint will be used in the names of functions that take Unicode characters.
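Because codepoints are plain integers, core Scheme procedures are enough to move between Guile characters and codepoint values. The following is a minimal sketch that does not call GuICU at all; the helper name string->codepoints is hypothetical, and the example sticks to ASCII, where character numbers and Unicode codepoints coincide.

```scheme
;; Hypothetical helper (not part of GuICU): list the integer values of
;; the characters in an ASCII string.  For ASCII these are also the
;; Unicode codepoints.
(define (string->codepoints str)
  (map char->integer (string->list str)))

(display (string->codepoints "abc"))   ; prints (97 98 99)
(newline)
(display (integer->char #x41))         ; prints A
(newline)
;; The upper limit quoted above, in both notations.
(display (= #x10FFFF 1114111))         ; prints #t
(newline)
```

Integers produced this way are in the form that the codepoint functions in this manual expect.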
In this library, Unicode strings are encoded as an opaque type called a ustring. A ustring is a SMOB that contains an efficient representation of a Unicode string. Most of the GuICU functions operate on ustring-typed values. To create a ustring, one would use a function such as string->ustring, which converts a standard Guile string into the opaque type.
Here are some simple examples to demonstrate some of the codepoint functions. As noted in Types and encodings, a Unicode character (a codepoint) is encoded as an integer between 0 and hexadecimal 10FFFF, inclusive.
The common practice for referring to a unicode letter in printed material is by using the following format.
U+00F1 latin small letter n with tilde
The first part, “U+”, indicates that this is a Unicode character. The four-digit hexadecimal number “00F1” is its number. The remainder is its official Unicode name, which is usually written in all capital letters.
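As a small illustration of this notation, the sketch below formats an integer codepoint as a “U+” label using only core Guile string procedures. The helper name codepoint->notation is hypothetical and is not part of GuICU.

```scheme
;; Hypothetical helper (not part of GuICU): format an integer codepoint
;; in the conventional U+XXXX notation.
(define (codepoint->notation cp)
  (let* ((hex (string-upcase (number->string cp 16)))
         ;; Pad to at least four digits, but never truncate longer values.
         (width (max 4 (string-length hex))))
    (string-append "U+" (string-pad-left hex width #\0))))

(display (codepoint->notation #xF1))      ; prints U+00F1
(newline)
(display (codepoint->notation #x1E51))    ; prints U+1E51
(newline)
(display (codepoint->notation #x10FFFF))  ; prints U+10FFFF
(newline)
```

The explicit width calculation matters because string-pad-left would otherwise cut five- and six-digit values down to four digits.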
The Unicode name for a codepoint can be found by using codepoint-name.
(codepoint-name #x1e51) => "LATIN SMALL LETTER O WITH MACRON AND GRAVE"
A character can have one or more of many binary properties: it can be alphabetic, numeric, a space character, a control character, etc. Consider the following properties of U+1E51 latin small letter o with macron and grave:
;; Is it alphabetic?
(codepoint-alpha? #x1e51) => #t
;; Is it uppercase?
(codepoint-upper? #x1e51) => #f
;; Is it a space-type character?
(codepoint-blank? #x1e51) => #f
For the complete list of binary properties, see Codepoints.
Codepoints can be converted between uppercase, lowercase, foldcase, and titlecase.
Titlecase letters are usually single codepoints that can be decomposed into two characters where the first character is uppercase and the second character is lowercase. For example, some languages consider the letter pair DZ (U+01F1 latin capital letter dz) to be a single logical letter. As such, dz (U+01F3 latin small letter dz) would be the lowercase form and Dz (U+01F2 latin capital letter d with small letter z) would be the titlecase form.
Converting a letter to uppercase and then to lowercase is called case folding.
As an example of the case conversion functions, consider the following statements:
(codepoint-name #x0041) => "LATIN CAPITAL LETTER A"
(codepoint-name (codepoint-downcase #x0041)) => "LATIN SMALL LETTER A"
(codepoint-name #x0031) => "DIGIT ONE"
(codepoint-name (codepoint-upcase #x0031)) => "DIGIT ONE"
In the first pair of statements, codepoint-downcase returns the lowercase version of “A”. In the second pair of statements, codepoint-upcase returns its input value, because there is no uppercase version of U+0031 digit one.
For terminal applications where monospaced fonts are used, most graphical characters will occupy one to two columns. Latin letters and numbers are usually 1 column wide. Most Chinese, Japanese, and Korean ideographs take two columns per character. Some codepoints, such as combining accents, are not meant to be used as standalone characters, and thus can be said to “occupy” zero columns.
There are a couple of functions that guess the screen width of a character. codepoint-xterm-width returns a value that is appropriate for the xterm terminal emulator. codepoint-wcwidth uses the library function wcwidth to determine the screen width on the console. Whether either one will be correct depends on the user's system.
Most East Asian characters are wide characters, but a few are not. The function codepoint-east-asian-width returns a category for each character, indicating if it is narrow, wide, fullwidth, or halfwidth.
For example:
;; DIGIT ONE
(codepoint-xterm-width #x0031) => 1
;; HIRAGANA LETTER A
(codepoint-xterm-width #x3042) => 2
(codepoint-east-asian-width #x3042) => wide
;; HALFWIDTH KATAKANA LETTER SMALL E
(codepoint-xterm-width #xff6a) => 1
(codepoint-east-asian-width #xff6a) => halfwidth
Most of the ICU functions operate on values of type ustring, which is an opaque type used for operating with the lower-level library. So, the first step will usually be to convert Scheme data into a ustring.
To make a ustring from a UTF-8-encoded string, use either string->ustring or its shorthand version _u.
(string->ustring "abc") => #<ustring 0x80c7fb8>
(_u "abc") => #<ustring 0x8104f30>
The hex value in the result is the C pointer location of the ustring.
To make a ustring from a string encoded in something other than UTF-8, use codepage-string->ustring.
(codepage-string->ustring "más" "iso-8859-1") => #<ustring 0x80c7fb8>
And to make a ustring from a list of codepoints, use codepoint-list->ustring.
(codepoint-list->ustring '(#x0061 #x006C #x0067 #x00FA #x006E)) => #<ustring 0x807d050>
The reverse operations are similar.
(ustring->string (codepoint-list->ustring '(100 101 102))) => "def"
There is also a shorthand for ustring->string, namely _s.
This brings us to the first useful operation that the library can provide: codepage conversion. By combining pairs of codepage-string->ustring and ustring->codepage-string, codepage conversion is simple.
In the following example, the Spanish word más gets converted from ISO-8859-1 to UTF-8.
;; The word más
(define str "m\xe1s")
(string->list str)
=>(#\m #\341 #\s)
(string->list (_s (codepage-string->ustring str "iso-8859-1")))
=> (#\m #\303 #\241 #\s)
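The two bytes #\303 and #\241 in the result are the UTF-8 encoding of U+00E1, which ISO-8859-1 stores as the single byte #\341. As a rough illustration of where those bytes come from, the sketch below encodes one codepoint by hand with core Guile bit operations. The helper name codepoint->utf8-bytes is hypothetical; this is not how GuICU performs the conversion, and the sketch does no validity checking (surrogates, for instance, are not rejected).

```scheme
;; Hypothetical helper (not part of GuICU): return the UTF-8 byte values
;; for one codepoint, as a list of integers.
(define (codepoint->utf8-bytes cp)
  (define (continuation shift)
    ;; A continuation byte carries six payload bits: 10xxxxxx.
    (logior #x80 (logand (ash cp (- shift)) #x3F)))
  (cond
   ;; One byte: 0xxxxxxx
   ((< cp #x80) (list cp))
   ;; Two bytes: 110xxxxx 10xxxxxx
   ((< cp #x800)
    (list (logior #xC0 (ash cp -6)) (continuation 0)))
   ;; Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
   ((< cp #x10000)
    (list (logior #xE0 (ash cp -12)) (continuation 6) (continuation 0)))
   ;; Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   (else
    (list (logior #xF0 (ash cp -18))
          (continuation 12) (continuation 6) (continuation 0)))))

;; U+00E1 latin small letter a with acute, shown in octal to match
;; the #\303 #\241 notation used above.
(display (map (lambda (b) (number->string b 8))
              (codepoint->utf8-bytes #xE1)))   ; prints (303 241)
(newline)
```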
String Presentation or visualization is the process of making a visual representation of the logical string. In this section, a series of increasingly complex presentation engines will be demonstrated.
For ASCII-encoded English, string presentation is not hard. An ASCII string has letters stored in logical first-to-last order. Each character in the string creates exactly one grapheme on the screen. Each grapheme is one column wide. To present the string, one must write the letters left-to-right. To find an appropriate place to wrap a line, usually one just breaks after a space.
The procedure ustring-line-break returns the indices where line breaks can occur. It classifies line break possibilities into two categories: soft and hard. Soft line breaks are optional. Hard line breaks are mandatory, usually because there is a <CR> in the string. For example,
(ustring-line-break (_u "Hi Mom!")) => ((3 7) (soft soft))
This string “Hi Mom!” can begin a new line in one of two places: beginning with the first letter in “Mom!”, which is position 3, or after the end of the string, which is position 7. Both possibilities are soft because there is no <CR> or other line-ending character in the string.
For another example, imagine a system where the screen is 22 columns wide. If the first hard break possibility of a string occurs before 22 columns have passed, the string should break there. Otherwise, the string should break at the last soft break possibility before 22 columns.
Here is a fragment of a poem.
(define poem (_u "What passing-bells for these who die as cattle?\n"))
(ustring-line-break poem)
=> ((5 13 19 23 29 33 37 40 48) (soft soft soft soft soft soft soft soft hard))
The function ustring-line-break has returned a list of candidate line breaks. Among these candidate line breaks, on the hypothetical 22-column display, the best place to begin a new line would be at character 19, which is the beginning of the word “for”. So, our first presentation engine should display characters 0 to 18, and then repeat the wrapping process with the remainder of the string.
The function usubstring extracts a substring from the ustring. Below, the first screen line of the poem is removed, and the remaining characters are checked for the next line break.
(set! poem (usubstring poem 19))
(ustring-line-break poem)
=> ((4 10 14 18 21 29) (soft soft soft soft soft hard))
Here, the last soft break before our 22 columns are used up is at column 21, the beginning of the word “cattle”.
After the second screen line, only 7 characters and a <CR> remain. Thus, the remainder of the string would fit on line three.
Now, let me present a complete toy presentation engine.
First, let's define a string.
(define str
  (_u (string-append "The studio was filled with the rich "
                     "odour of roses, and when the light "
                     "summer wind stirred amidst the trees "
                     "of the garden,")))
In this case, our hypothetical terminal has 40 columns.
(define max-cols 40)
Now, let's create a utility function to find the last line break that fits on a line of N columns.
;; Take line break possibilities COLS and TYPES, and find
;; the best line break position for a screen of N columns
(define (find-best-wrap n cols types)
  (let loop ((prev 0) (cols cols) (types types))
    (cond
     ;; Reached end of line
     ((or (null? cols) (>= (car cols) n)) prev)
     ;; Reached a hard break
     ((eq? 'hard (car types)) (car cols))
     ;; Keep looking
     (else (loop (car cols) (cdr cols) (cdr types))))))
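To see how this function behaves without a GuICU installation at hand, it can be run on literal break lists. In the sketch below, the definition is repeated so that the fragment is self-contained, the first two calls use the break lists returned for the poem fragment earlier in this chapter, and the third uses a made-up break list purely to exercise the hard-break branch.

```scheme
;; Same definition as in the text, repeated so this fragment runs alone.
(define (find-best-wrap n cols types)
  (let loop ((prev 0) (cols cols) (types types))
    (cond
     ;; Ran out of candidates, or the next one is past the screen edge
     ((or (null? cols) (>= (car cols) n)) prev)
     ;; A hard break inside the screen width must be taken
     ((eq? 'hard (car types)) (car cols))
     ;; Otherwise remember this soft break and keep looking
     (else (loop (car cols) (cdr cols) (cdr types))))))

;; Break list for the whole poem fragment, 22 columns.
(display (find-best-wrap 22
                         '(5 13 19 23 29 33 37 40 48)
                         '(soft soft soft soft soft soft soft soft hard)))
(newline)   ; prints 19

;; Break list after the first 19 characters were removed.
(display (find-best-wrap 22
                         '(4 10 14 18 21 29)
                         '(soft soft soft soft soft hard)))
(newline)   ; prints 21

;; Made-up data: a hard break at 12 wins over later soft breaks.
(display (find-best-wrap 22 '(5 12 20) '(soft hard soft)))
(newline)   ; prints 12
```

Note that a result of 0 means no candidate fits in N columns; a caller that loops on the result needs to handle that case to avoid making no progress.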
Then loop over the string, printing it line by line. This introduces a new function, ustring-null?, that tests if a ustring contains zero codepoints.
;; Display a wrapped string given a long string STR
(let loop ((str str))
  (if (not (ustring-null? str))
      (let* ((breaks (ustring-line-break str))
             (break-cols (car breaks))
             (break-types (cadr breaks))
             (wrap-at (find-best-wrap max-cols break-cols break-types)))
        (display (_s (usubstring str 0 (- wrap-at 1))))
        (newline)
        (loop (usubstring str wrap-at)))))
This should display the following:
The studio was filled with the rich
odour of roses, and when the light
summer wind stirred amidst the trees
of the garden
To expand on this to make a more complete presentation engine, it will be necessary to deal with wide and narrow characters, as well as bidirectionalization.
The procedure ustring-xterm-width-list can be used to get the column locations of the characters in a string.
(ustring-xterm-width-list (_u "abc")) => (1 2 3)
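The result for "abc" reads like a running total of per-character widths. Assuming that interpretation (the text above does not spell it out), the relationship between individual widths and column locations can be sketched in core Guile as follows. The helper name widths->columns is hypothetical and the width lists are hand-written sample data, not GuICU output.

```scheme
;; Hypothetical helper (not part of GuICU): turn a list of per-character
;; widths into running column totals.
(define (widths->columns widths)
  (let loop ((rest widths) (total 0) (acc '()))
    (if (null? rest)
        (reverse acc)
        (let ((next (+ total (car rest))))
          (loop (cdr rest) next (cons next acc))))))

;; Three narrow characters, as in "abc".
(display (widths->columns '(1 1 1)))     ; prints (1 2 3)
(newline)
;; Narrow, wide, narrow, then a zero-width combining mark.
(display (widths->columns '(1 2 1 0)))   ; prints (1 3 4 4)
(newline)
```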
There is a rich set of string transforms available.
To show their effects, a debugging function, ustring-dump, is used. It verbosely prints the names of each character in a ustring. For the examples, the call to ustring-dump is implied, even if it is not included explicitly.
(ustring-dump (_u "abc"))
=> U+0061 LATIN SMALL LETTER A
   U+0062 LATIN SMALL LETTER B
   U+0063 LATIN SMALL LETTER C
First, there are the case transforms: ustring-upcase, ustring-downcase, ustring-foldcase, and ustring-titlecase. Note that these are affected by the locale.
(_s (ustring-upcase (_u "aBc"))) => ABC
(_s (ustring-downcase (_u "aBc"))) => abc
(_s (ustring-titlecase (_u "aBc"))) => Abc
There are string normalization transforms. In Unicode, an accented letter can be represented either in a precomposed form, where the letter and its accent share one codepoint, or in a decomposed form, where the letter and the combining accents have separate codepoints. Here a string is decomposed.
(ustring #x00d1)
=> U+00D1 LATIN CAPITAL LETTER N WITH TILDE
(ustring-normalize-nfd (ustring #x00d1))
=> U+004E LATIN CAPITAL LETTER N
   U+0303 COMBINING TILDE
Now, a string is composed, combining letters and accents into precomposed forms when possible.
(ustring #x006e #x0303)
=> U+006E LATIN SMALL LETTER N
   U+0303 COMBINING TILDE
(ustring-normalize-nfc (ustring #x006e #x0303))
=> U+00F1 LATIN SMALL LETTER N WITH TILDE
There is the bidirectionalization transform, which takes a string in logical, first-to-last order and transforms it into left-to-right order. Hebrew and Arabic characters are swapped when they are found.
(ustring #x05d7 #x5d1 #x05e8)
=> U+05D7 HEBREW LETTER HET
   U+05D1 HEBREW LETTER BET
   U+05E8 HEBREW LETTER RESH
(ustring-bidi-visualize (ustring #x05d7 #x5d1 #x05e8)
                        *ubidi-default-ltr*
                        *ubidi-reorder-default*
                        *ubidi-option-default*
                        0)
=> U+05E8 HEBREW LETTER RESH
   U+05D1 HEBREW LETTER BET
   U+05D7 HEBREW LETTER HET
There is the Arabic shaping transform, where logical Arabic letters are changed to their correct cursive forms for proper presentation.
(ustring #x0627 #x0644 #x0628 #x0627 #x0628)
=> U+0627 ARABIC LETTER ALEF
   U+0644 ARABIC LETTER LAM
   U+0628 ARABIC LETTER BEH
   U+0627 ARABIC LETTER ALEF
   U+0628 ARABIC LETTER BEH
(ustring-shape-arabic (ustring #x0627 #x0644 #x0628 #x0627 #x0628)
                      *u-shape-letters-shape*)
=> U+FE8D ARABIC LETTER ALEF ISOLATED FORM
   U+FEDF ARABIC LETTER LAM INITIAL FORM
   U+FE92 ARABIC LETTER BEH MEDIAL FORM
   U+FE8E ARABIC LETTER ALEF FINAL FORM
   U+FE8F ARABIC LETTER BEH ISOLATED FORM
Okay, that was your quick tour of GuICU. There are a lot more functions listed in the reference than have been demonstrated here.
The following procedures operate on Unicode characters, or codepoints. It is invalid to pass them character constants, such as #\x, without first converting them to integers.
A codepoint is an integer between 0 and #x10ffff. Some values in that range are not considered Unicode characters. The single surrogate codepoints (U+D800 to U+DFFF), the noncharacters U+FFFE and U+FFFF, and the noncharacter range U+FDD0 to U+FDEF are not valid codepoints.
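The ranges listed above can be written down as a simple predicate. This is a minimal sketch in core Guile that mirrors only the ranges named in this section; the name listed-valid-codepoint? is hypothetical, and GuICU itself may apply different or additional checks.

```scheme
;; Hypothetical helper (not part of GuICU): #t if CP is an exact integer
;; in 0..#x10FFFF that falls outside the excluded ranges named above.
(define (listed-valid-codepoint? cp)
  (and (integer? cp)
       (exact? cp)
       (<= 0 cp #x10FFFF)
       (not (<= #xD800 cp #xDFFF))     ; surrogates
       (not (<= #xFFFE cp #xFFFF))     ; U+FFFE and U+FFFF
       (not (<= #xFDD0 cp #xFDEF))))   ; U+FDD0 to U+FDEF

(display (listed-valid-codepoint? #x0041))                   ; prints #t
(newline)
(display (listed-valid-codepoint? #xD800))                   ; prints #f
(newline)
(display (listed-valid-codepoint? #x110000))                 ; prints #f
(newline)
;; A character constant must be converted to an integer first.
(display (listed-valid-codepoint? (char->integer #\x)))      ; prints #t
(newline)
```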
Returns #t if the codepoint is alphabetic.
Returns #t if char is lowercase.
Returns #t if char is uppercase.
Returns #t if char is a number. Non-decimal numbers, like Roman numerals, are not included.
Returns #t if char is whitespace. This includes vertical whitespace like carriage returns, linefeeds, vertical tabs, and form feeds.
Returns #t if char is a control character.
Returns #f if char is a space, control, surrogate, or unassigned character. Essentially, it returns #t if printing this character would use ink.
Returns #f if char is vertical whitespace, a control character, a surrogate, or is unassigned.
Returns the uppercase or lowercase version of char if it exists and can be represented as a codepoint. If it does not exist or cannot be represented as a codepoint, then char is returned.
Returns the titlecase or foldcase version of char if it exists and can be represented as a codepoint. If it does not exist or cannot be represented as a codepoint, then char is returned.
Returns the foldcase of char, excluding special processing for Turkish letters “i”.
Returns the number of screen columns that the char is likely to occupy on a terminal or terminal emulator. As this is dependent on the user's setup, it may not be correct in all circumstances.
Returns the number of screen columns that char should have according to the system's wcwidth routine, or -1 if the width is unknown. The behavior of this procedure depends on the LC_CTYPE of the current locale.
Returns the width of char according to the Unicode East Asian width database. The return value will be one of the following symbols: wide, narrow, fullwidth, halfwidth, neutral, or ambiguous.
Since codepoints are integers, they can be compared using the standard operators =, <, etc.
There are procedures for case insensitive comparison of codepoints.
Given zero or more codepoints, do a case-insensitive comparison. If only zero or one codepoints are given, the result is #t. The default behavior with reference to Turkish letters “i” is assumed.
Functions that help test if a ustring has a given property.
Returns a non-false value if pred is true for any of the codepoints in ustring str. pred can be a ustring, or it can be a procedure that takes one argument which is a codepoint and returns a boolean value.
Returns a non-false value if pred is true for all of the codepoints in ustring str. pred can be a ustring, or it can be a procedure that takes one argument which is a codepoint and returns a boolean value.
These functions create ustring types from other Guile types.
If str is a properly-encoded UTF-8 string, a ustring is returned. If str is not proper UTF-8, an error is thrown.
Returns a string that contains the UTF-8 representation of the ustring str.
Given a string cp that is the name of a codepage, this function returns ICU's preferred name for that codepage, or #f if it does not understand that codepage. The returned string can be used as an available conversion locale in the functions codepage-string->ustring and ustring->codepage-string.
Display the codepages from which the library can convert strings to Unicode strings. An entry from this list can be used as an available conversion locale in the functions codepage-string->ustring and ustring->codepage-string.
Some sample codepages are
US-ASCII
ISO-8859-1
iso-8859_10-1988
iso-8859_11-2001
iso-8859_14-1988
Given a string str encoded in codepage cp, a ustring is returned.
Returns a string that contains the codepage-dependent representation of ustring str. Codepoints that cannot be represented in the indicated locale will be dropped.
Here's a tip. Try converting a string to its compatibility (NFKD) decomposition before converting a ustring to a limited character set, like US-ASCII or ISO-8859-1.
Returns a list of codepoints generated from the ustring str.
A debugging convenience function that prints to the screen the names of the characters in str.
Returns a new ustring extracted from str between codepoint locations start (inclusive) and end (exclusive). start will default to zero, and end will default to the number of codepoints in str.
Return a ustring containing all but the last k codepoints of str.
Store chr in codepoint position k of ustring str. The position k must be a valid position for str.
Returns #t if s1 and s2 (and, optionally, more ustrings) have a given lexicographic relationship. s1 and s2 will be equal if their normalized forms are the same length and have the same codepoints.
Returns #t if s1 and s2 (and, optionally, more ustrings) have a given lexicographic relationship. s1 and s2 will be equal if their normalized, case-folded forms are the same length and have the same codepoints.
Returns #t if s1 and s2 have the same UTF-16 representation. No normalization occurs before comparison.
Returns the index where s2 occurs in s1, or #f if it is not found. If start1 and end1 are set, it restricts the search to the substring of s1 between start1 (inclusive) and end1 (exclusive). If start2 and end2 are set, it tries to match that substring of s2.
For letters that have cases, these functions modify their case. Titlecase is when the first letter of each word is capitalized. Foldcase is when a string is converted to uppercase and then back to lowercase. For some languages, this is different from the conversion to lowercase.
Note that the rules for case conversion depend on the locale. To check the locale, one could try (setlocale LC_ALL) to read the system locale. To check that GuICU has understood the system locale, the following function is given.
Returns a string containing the underlying library's understanding of the locale to use for locale-specific ustring functions.
Returns a ustring containing the uppercase of str. If start and end are set, it returns a ustring of str where the codepoints from start to end have been converted to uppercase. The returned ustring may have a different number of codepoints than str.
Similar to ustring-upcase, but for the lowercase, titlecase, and case folding operations.
As above, except that the ustring str is modified in place. The return value is unspecified.
As above, except that it does not distinguish between dotted and dotless letters “i”.
Return a ustring that is formed by appending the args of type ustring.
String normalization is the process of replacing one sequence of characters with another equivalent sequence.
For Latin alphabets, there is canonical equivalence between precomposed characters, like U+00E1 latin small letter a with acute, and their decompositions, here U+0061 latin small letter a and U+0301 combining acute accent.
There is compatibility equivalence between characters that appear approximately the same, such as U+FF21 fullwidth latin capital letter a and U+0041 latin capital letter a, or U+00B5 micro sign and U+03BC greek small letter mu.
As a consequence, there are functions to transform a ustring to one of four normalizations.
Normalization Form D (NFD) is a canonical decomposition, typically splitting precomposed characters into a base character and a set of combining marks.
Normalization Form C (NFC) is a canonical composition. It combines base characters with combining marks into precomposed characters.
Normalization Form KD (NFKD) is a compatibility decomposition, splitting precomposed characters and replacing characters with their more common compatibility equivalents.
Normalization Form KC (NFKC) is like NFKD, with the additional step that base characters with combining marks are replaced with their precomposed forms.
Given a ustring str, returns a new ustring that contains the normalized form of str.
Given a ustring str, returns a new ustring that contains the normalized form of str.
These functions return lists of positions where a given type of break is allowed.
A grapheme is, for Latin languages, a base character with its optional combining marks. Usually a grapheme appears as a single “character” when printed, even if its representation has multiple code points. Some Hangul codepoints also combine to form single graphemes.
Note that all of these functions are affected by the current locale. The command setlocale can be used to get or set the locale, and get-icu-locale can be used to return this library's understanding of the current locale.
Returns a pair of lists. The first list is the locations of graphemes in ustring str (optionally between positions start and end). The second list is the type of each grapheme. For now, this is always the symbol graph.
Returns the locations and types of allowable word breaks. It returns a list of two lists. The first list is the codepoint location of a word break. The second returned list has one of the following symbols for each possible word break:
number
- for “words” that appear to be numbers,
letter
- for words that contain non-CJK letters,
kana
- for words containing kana characters,
ideo
- for words containing ideographic characters,
none
- for “words” that do not fit in any other category.
Returns the locations and types of allowable line breaks. It returns a list of two lists. The first list is the codepoint location of a line break. The second returned list has one of the following symbols for each possible line break:
soft
- where a line break is acceptable but not required
hard
- where a line break is mandatory, usually because there is a <CR> in the string
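The break positions in the first returned list can be used directly as cut points. As a minimal sketch, the helper below splits an ordinary ASCII Guile string at a literal list of positions; split-at-breaks is a hypothetical name, the break list is copied from the “Hi Mom!” example in the tutorial rather than computed, and the code assumes the last position is the end of the string, as it is in the examples in this manual.

```scheme
;; Hypothetical helper (not part of GuICU): cut STR into pieces that end
;; at each position in BREAKS.  Works on plain strings, where string
;; indices and codepoint positions coincide for ASCII text.
(define (split-at-breaks str breaks)
  (let loop ((start 0) (breaks breaks) (acc '()))
    (if (null? breaks)
        (reverse acc)
        (loop (car breaks)
              (cdr breaks)
              (cons (substring str start (car breaks)) acc)))))

;; Break list (3 7) as returned for "Hi Mom!" in the tutorial.
(write (split-at-breaks "Hi Mom!" '(3 7)))   ; prints ("Hi " "Mom!")
(newline)
```

With a real ustring, the same loop shape would apply with usubstring in place of substring.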
Returns the locations and types of allowable sentence breaks. It returns a list of two lists. The first list is the codepoint location of a sentence break. The second returned list has one of the following symbols for each possible sentence break:
term
- for sentences ended by punctuation
sep
- for sentences ended by <CR>, <LF> or the end of input.
Since console applications are common, here are functions that describe the number of columns a given character or string would take when printed on a console, such as an xterm.
Return the number of columns that ustring str would take to print on a console. If start and end are provided, it returns the console width of the substring between start and end. While the width should be valid for common terminal emulators, your mileage may vary.
Return a list of the column locations that the codepoints of ustring str would take when printed on an xterm-like console. If start and end are provided, only the substring between start and end is considered.
Most European languages are written left to right. Arabic and Hebrew are among the languages written right to left. In both cases, Unicode strings are stored in logical order: they are not stored left-to-right in the string, but instead are stored from first letter to be output to last letter to be output.
For some windowing systems or consoles that have a preference for left-to-right text, it will be necessary to convert strings from logical order to left-to-right visual order before they are displayed.
Given a ustring str that represents a line of text, return a new ustring where the text is in left-to-right visual order. Hebrew and Arabic substrings will be reversed where they occur.
The level indicates the underlying directional preference for the paragraph as a whole. A level of zero will firmly set the underlying directional preference for the paragraph as left-to-right. A level of 1 will give right-to-left. A level of *ubidi-default-ltr* will try to determine the level from str, and default to left-to-right if it cannot be determined. A level of *ubidi-default-rtl* does the opposite.
The mode is one of the following constants.
*ubidi-reorder-default*
- Use default behavior.
*ubidi-reorder-numbers-special*
- When a word begins with digits and ends with right-to-left letters, the visualization will have the visualized right-to-left letters followed by digits. The default behavior would be to have digits followed by the visualized right-to-left letters.
*ubidi-reorder-group-numbers-with-r*
- Numbers will usually be visualized right-to-left unless bookended by text that is left-to-right.
The options argument is one of the following constants.
*ubidi-option-default*
- Use default behavior.
*ubidi-option-remove-controls*
- If the special Unicode characters U+200E left-to-right mark and U+200F right-to-left mark were used in str to clarify the ordering of a passage, this option removes those characters in the returned string.
The write-options is zero or a logior of zero or more of the following integers.
*ubidi-do-mirroring*
- Replace characters with their mirror-image mappings, if they have mirror-image mappings. Primarily this is for characters such as parentheses. In the visualized text, a right parenthesis will be replaced by a left parenthesis and vice versa.
*ubidi-keep-base-combining*
- When outputting right-to-left text, combining characters will still appear to the “right” of the base characters in visualized strings. This option is likely necessary in console applications.
*ubidi-output-reverse*
- After all other processing is complete, reverse the codepoints in the text before output.
*ubidi-remove-bidi-controls*
- If the special Unicode characters U+200E left-to-right mark and U+200F right-to-left mark were used in str to clarify the ordering of a passage, this option removes those characters in the returned string.
This option is redundant, as *ubidi-option-remove-controls* could have been set in options.
This is the same as ustring-bidi-visualize except that the return value is a list in which the first element is the visualized ustring, and the second element is a list that provides a logical-to-visual index map. Some values in the map may be #f if *ubidi-option-remove-controls* was set, which indicates that a codepoint in str does not exist in the returned ustring.
Same as ustring-bidi-visualize and ustring-bidi-visualize-and-map, respectively.
In Arabic languages, a letter looks different depending on its position in a word. An Arabic letter can have one of four presentations: initial, medial, final, or isolated. A Unicode string containing Arabic text will usually contain the logical letter and rely on the display to choose its presentation form. Some displays have the capability to determine the presentation form and some do not.
For systems that do not have that capability, each Arabic letter must be converted into one of four presentation forms. This is called shaping.
Also, Arabic languages have their own characters for the digits 0 through 9. In the shaping process, European digits can be replaced with Arabic digits.
Given a ustring containing unshaped Arabic, return a string with shaped Arabic.
Somewhat oddly, if zero is used as the (default) option, this procedure just returns an identical copy of str: Arabic letters are untouched and digits are unmodified.
options is a logior of the following:
*u-shape-letters-shape*
- Replace abstract letters with shaped presentations. This should usually be used.
*u-shape-letters-unshape*
- Replace shaped presentations with abstract letters.
*u-shape-letters-shape-tashkeel-isolated*
- Replace abstract letters with shaped presentations, including tashkeel forms.
*u-shape-text-direction-logical*
- Assume that the characters in the string are in logical order.
*u-shape-text-direction-visual-ltr*
- Assume that the characters in the string are in visual left-to-right order.
*u-shape-digits-en2an*
- Replace European digits with Arabic digits.
*u-shape-digits-an2en*
- Replace Arabic digits with European digits.
*u-shape-digits-alen2an-init-lr*
- Replace European digits with Arabic digits if, by context, the surrounding text is Arabic. For digits that appear without sufficient context, European digits are assumed.
*u-shape-digits-alen2an-init-al*
- Replace European digits by Arabic digits if, by context, the surrounding text is Arabic. For digits that appear without sufficient context, Arabic digits are assumed.
*u-shape-digit-type-an*
- When converting to Arabic digits, use standard Arabic digits.
*u-shape-digit-type-an-extended*
- When converting to Arabic digits, use a digit form more common in Persian or Urdu.
*u-shape-aggregate-tashkeel*
- When an Arabic shadda appears before one of dammatan, kasratan, fatha, damma or kasra, replace it with ligature forms.
*u-shape-preserve-presentation*
- When shaping a string that already has some characters converted to presentation forms, do not alter the presentation forms.
Except when the option *u-shape-aggregate-tashkeel* has been chosen, the returned ustring should have the same number of codepoints as str.
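The unconditional digit options described above amount to a one-to-one substitution between three blocks of ten digits. The Python sketch below shows that substitution using only the standard library. It is not GuICU's implementation, the function names `digits_en2an` and `digits_an2en` are hypothetical, and it does not attempt the context-sensitive variants.

```python
EUROPEAN = "0123456789"
# Standard Arabic digits, U+0660 through U+0669.
ARABIC_INDIC = "".join(chr(0x0660 + d) for d in range(10))
# Extended digits used in Persian and Urdu, U+06F0 through U+06F9.
EXTENDED = "".join(chr(0x06F0 + d) for d in range(10))


def digits_en2an(text, extended=False):
    """Replace European digits with Arabic digits."""
    target = EXTENDED if extended else ARABIC_INDIC
    return text.translate(str.maketrans(EUROPEAN, target))


def digits_an2en(text):
    """Replace Arabic digits of either type with European digits."""
    table = str.maketrans(ARABIC_INDIC + EXTENDED, EUROPEAN * 2)
    return text.translate(table)


shaped = digits_en2an("2007")
print([f"U+{ord(ch):04X}" for ch in shaped])
print(digits_an2en(shaped))  # prints 2007
```

Because each digit is replaced by exactly one digit, the number of codepoints is unchanged, which is consistent with the remark above about the length of the returned string.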
[Unicode Consortium 2007]
The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0) (http://www.unicode.org/version/Unicode5.0.0/).
Copyright © 2000,2001,2002 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
The purpose of this License is to make a manual, textbook, or other functional and useful document free in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.
This License is a kind of “copyleft”, which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.
This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The “Document”, below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as “you”. You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law.
A “Modified Version” of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
A “Secondary Section” is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.
The “Invariant Sections” are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none.
The “Cover Texts” are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.
A “Transparent” copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not “Transparent” is called “Opaque”.
Examples of suitable formats for Transparent copies include plain ascii without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only.
The “Title Page” means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, “Title Page” means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.
A section “Entitled XYZ” means a named subunit of the Document whose title either is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in another language. (Here XYZ stands for a specific section name mentioned below, such as “Acknowledgements”, “Dedications”, “Endorsements”, or “History”.) To “Preserve the Title” of such a section when you modify the Document means that it remains a section “Entitled XYZ” according to this definition.
The Document may include Warranty Disclaimers next to the notice which states that this License applies to the Document. These Warranty Disclaimers are considered to be included by reference in this License, but only as regards disclaiming warranties: any other implication that these Warranty Disclaimers may have is void and has no effect on the meaning of this License.
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may publicly display copies.
If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.
You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.
You may add a section Entitled “Endorsements”, provided it contains nothing but endorsements of your Modified Version by various parties—for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.
You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections Entitled “History” in the various original documents, forming one section Entitled “History”; likewise combine any sections Entitled “Acknowledgements”, and any sections Entitled “Dedications”. You must delete all sections Entitled “Endorsements.”
You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.
A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an “aggregate” if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.
Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.
If a section in the Document is Entitled “Acknowledgements”, “Dedications”, or “History”, the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.
You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License “or any later version” applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation.
To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:
Copyright (C) year your name. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled ``GNU Free Documentation License''.
If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the “with...Texts.” line with this:
with the Invariant Sections being list their titles, with the Front-Cover Texts being list, and with the Back-Cover Texts being list.
If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation.
If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.
_s: UStrings
_u: UStrings
codepage-name: UStrings
codepage-string->ustring: UStrings
codepoint: Types and encodings
codepoint-alnum?: Codepoints
codepoint-alpha?: Codepoints
codepoint-alpha?: Characters
codepoint-alphabetic?: Codepoints
codepoint-blank?: Codepoints
codepoint-blank?: Characters
codepoint-ci<=?: Codepoints
codepoint-ci<?: Codepoints
codepoint-ci=?: Codepoints
codepoint-ci>=?: Codepoints
codepoint-ci>?: Codepoints
codepoint-cntrl?: Codepoints
codepoint-control?: Codepoints
codepoint-digit?: Codepoints
codepoint-downcase: Codepoints
codepoint-east-asian-width: Codepoints
codepoint-foldcase: Codepoints
codepoint-foldcase-exclude-special-i: Codepoints
codepoint-graph?: Codepoints
codepoint-list->ustring: UStrings
codepoint-lower-case?: Codepoints
codepoint-lower?: Codepoints
codepoint-name: Codepoints
codepoint-name: Characters
codepoint-numeric?: Codepoints
codepoint-print?: Codepoints
codepoint-punct?: Codepoints
codepoint-space?: Codepoints
codepoint-titlecase: Codepoints
codepoint-upcase: Codepoints
codepoint-upper-case?: Codepoints
codepoint-upper?: Codepoints
codepoint-upper?: Characters
codepoint-wcwidth: Codepoints
codepoint-whitespace?: Codepoints
codepoint-xdigit?: Codepoints
codepoint-xterm-width: Codepoints
codepoint?: Codepoints
display-codepage-names: UStrings
get-icu-locale: UStrings
make-ustring: UStrings
string->ustring: UStrings
ustring: UStrings
ustring: Types and encodings
ustring->codepage-string: UStrings
ustring->codepoint-list: UStrings
ustring->string: UStrings
ustring-any: UStrings
ustring-append: UStrings
ustring-bidi-visualise: UStrings
ustring-bidi-visualise-and-map: UStrings
ustring-bidi-visualize: UStrings
ustring-bidi-visualize-and-map: UStrings
ustring-ci<=?: UStrings
ustring-ci<?: UStrings
ustring-ci=?: UStrings
ustring-ci>=?: UStrings
ustring-ci>?: UStrings
ustring-contains: UStrings
ustring-downcase: UStrings
ustring-downcase!: UStrings
ustring-drop: UStrings
ustring-drop-right: UStrings
ustring-dump: UStrings
ustring-every: UStrings
ustring-foldcase: UStrings
ustring-foldcase!: UStrings
ustring-foldcase-exclude-special-i: UStrings
ustring-foldcase-exclude-special-i!: UStrings
ustring-grapheme-break: UStrings
ustring-length: UStrings
ustring-line-break: UStrings
ustring-line-break: Strings
ustring-normalise-nfc: UStrings
ustring-normalise-nfd: UStrings
ustring-normalise-nfkc: UStrings
ustring-normalise-nfkd: UStrings
ustring-normalize-nfc: UStrings
ustring-normalize-nfd: UStrings
ustring-normalize-nfkc: UStrings
ustring-normalize-nfkd: UStrings
ustring-null?: UStrings
ustring-null?: Strings
ustring-raw=?: UStrings
ustring-ref: UStrings
ustring-sentence-break: UStrings
ustring-set!: UStrings
ustring-shape-arabic: UStrings
ustring-take: UStrings
ustring-take-right: UStrings
ustring-titlecase: UStrings
ustring-titlecase!: UStrings
ustring-upcase: UStrings
ustring-upcase!: UStrings
ustring-word-break: UStrings
ustring-xterm-width: UStrings
ustring-xterm-width-list: UStrings
ustring<=?: UStrings
ustring<?: UStrings
ustring=?: UStrings
ustring>=?: UStrings
ustring>?: UStrings
ustring?: UStrings
usubstring: UStrings
usubstring: Strings
[1] What! A one-sentence section? Apparently it is a requirement that there be an “invoking” section in every GNU manual. So here it is.
[2] Yeah, rockin' the VIC-20. 22 characters by 23 lines is all anyone should need.
[3] Wilfred Owen, Anthem for Doomed Youth
[4] I'd like to just display the strings, but, apparently, texinfo is kind of crap at non-European languages.