Unicode
Unicode.isassigned (function)
Unicode.isassigned(c) -> Bool
Returns true if the given char or integer is an assigned Unicode code point.
Examples
julia> Unicode.isassigned(101)
true
julia> Unicode.isassigned('\x01')
true
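The examples above only show assigned code points. As a minimal sketch of the negative case, the snippet below assumes that U+0378 is unassigned in the Unicode version bundled with your Julia build (it is a reserved gap in the Greek and Coptic block at the time of writing, but this could change in a future Unicode release).

```julia
using Unicode

# An ordinary letter is an assigned code point.
letter_assigned = Unicode.isassigned('a')

# Integers are interpreted as code points.
# ASSUMPTION: U+0378 is unassigned in the bundled Unicode tables.
gap_assigned = Unicode.isassigned(0x0378)

println(letter_assigned)  # true
println(gap_assigned)     # false, under the assumption above
```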
Unicode.normalize (function)
Unicode.normalize(s::AbstractString, normalform::Symbol)
Normalize the string s according to one of the four "normal forms" of the Unicode standard: normalform can be :NFC, :NFD, :NFKC, or :NFKD. Normal forms C (canonical composition) and D (canonical decomposition) convert different visually identical representations of the same abstract string into a single canonical form, with form C being more compact. Normal forms KC and KD additionally canonicalize "compatibility equivalents": they convert characters that are abstractly similar but visually distinct into a single canonical choice (e.g. they expand ligatures into the individual characters), with form KC being more compact.
Alternatively, finer control and additional transformations may be obtained by calling Unicode.normalize(s; keywords...), where any number of the following boolean keyword options (which all default to false except for compose) are specified:
- compose=false: do not perform canonical composition
- decompose=true: do canonical decomposition instead of canonical composition (compose=true is ignored if present)
- compat=true: compatibility equivalents are canonicalized
- casefold=true: perform Unicode case folding, e.g. for case-insensitive string comparison
- newline2lf=true, newline2ls=true, or newline2ps=true: convert various newline sequences (LF, CRLF, CR, NEL) into a linefeed (LF), line-separation (LS), or paragraph-separation (PS) character, respectively
- stripmark=true: strip diacritical marks (e.g. accents)
- stripignore=true: strip Unicode's "default ignorable" characters (e.g. the soft hyphen or the left-to-right marker)
- stripcc=true: strip control characters; horizontal tabs and form feeds are converted to spaces; newlines are also converted to spaces unless a newline-conversion flag was specified
- rejectna=true: throw an error if unassigned code points are found
- stable=true: enforce Unicode Versioning Stability
For example, NFKC corresponds to the options compose=true, compat=true, stable=true.
Examples
julia> "μ" == Unicode.normalize("µ", compat=true) #LHS: Unicode U+03bc, RHS: Unicode U+00b5
true
julia> Unicode.normalize("JuLiA", casefold=true)
"julia"
julia> Unicode.normalize("JúLiA", stripmark=true)
"JuLiA"
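The existing examples cover only the keyword form. The sketch below illustrates the normal-form symbols described above, using an "é" written two ways (the variable names are illustrative, not part of the API), and then combines two keyword options to build a loose comparison key.

```julia
using Unicode

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed   = "\u00e9"    # precomposed U+00E9

# The two strings render alike but differ code point by code point.
raw_equal = decomposed == composed

# NFC composes and NFD decomposes; each maps both spellings to one form.
nfc_equal = Unicode.normalize(decomposed, :NFC) == composed
nfd_equal = Unicode.normalize(composed, :NFD) == decomposed

# NFKC additionally expands compatibility characters, such as the
# single-character "fi" ligature U+FB01.
ligature = Unicode.normalize("\ufb01", :NFKC)

# Keyword form: fold case and strip accents in one call.
key = Unicode.normalize("Caf\u00e9", casefold=true, stripmark=true)

println(raw_equal)               # false
println(nfc_equal && nfd_equal)  # true
println(ligature)                # fi
println(key)                     # cafe
```

Normalizing both sides to the same form before comparing is the usual way to treat such visually identical strings as equal.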
Unicode.graphemes (function)
graphemes(s::AbstractString) -> GraphemeIterator
Returns an iterator over substrings of s that correspond to the extended graphemes in the string, as defined by Unicode UAX #29. (Roughly, these are what users would perceive as single characters, even though they may contain more than one code point; for example, a letter combined with an accent mark is a single grapheme.)
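Examples

As a small illustration of the difference between code points and graphemes, the sketch below uses a string in which an accent is stored as a separate combining character; the variable names are illustrative only.

```julia
using Unicode

s = "e\u0301a"   # 'e' + U+0301 COMBINING ACUTE ACCENT, then 'a'

# length counts code points, so the combining accent is counted separately.
n_codepoints = length(s)

# graphemes groups the letter and its accent into one substring.
gs = collect(graphemes(s))
n_graphemes = length(gs)

println(n_codepoints)        # 3
println(n_graphemes)         # 2
println(gs[1] == "e\u0301")  # true
```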
© 2009–2019 Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and other contributors
Licensed under the MIT License.
https://docs.julialang.org/en/v1.1.1/stdlib/Unicode/