License: BSD style: see license.txt
Version: Initial release: Oct 2004
Author: Kris
Fast Unicode transcoders. These are particularly sensitive to
minor changes on 32-bit x86 devices, because the register set of
those devices is so small. Beware of subtle changes, which might
increase execution time by as much as 200%. Because of this tuning,
three of the six transcoders might read past the end of the input
by one, two, or three bytes before stopping. Note that support
for streaming adds a 15% overhead to the dchar => char
conversion, but has little effect on the others.
These routines were tuned on an Intel P4; other devices may work
more efficiently with a slightly different approach, though this
is likely to be reasonably optimal on AMD x86 CPUs also. These
algorithms would benefit significantly from the extra registers
available on AMD64. On a 3GHz P4, the dchar/char conversions take
around 2500ns to process an array of 1000 ASCII elements. Invoking
the memory manager doubles that time, and quadruples it for
arrays of 100 elements. Memory allocation can slow down notably
in a multi-threaded environment, so avoid it where possible.
Surrogate pairs are handled in a non-optimal fashion when
transcoding between UTF-16 and UTF-8. Such cases are treated
as boundary conditions for this module.
There are three common cases where the input may be incomplete:
the 'widening' conversions UTF-8 => UTF-16, UTF-8 => UTF-32,
and UTF-16 => UTF-32. An edge case is UTF-16 => UTF-8, if surrogate
pairs are present. Such cases throw an exception unless streaming
mode is enabled (by supplying the 'ate' argument); in streaming
mode, an additional integer is returned indicating how many
elements of the input have been consumed. In all cases, a correct
slice of the output is returned.
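The streaming behaviour described above can be sketched as follows. This is an illustration only: it assumes the module is imported as tango.text.convert.Utf with the D1-style signatures listed on this page, that passing a non-null 'ate' pointer selects streaming mode, and that the caller supplies an output buffer large enough for the chunk. The helper name 'drain' is hypothetical.

```d
import tango.text.convert.Utf;

// Hypothetical helper: transcodes the complete part of 'pending' into
// 'buffer' and returns the unconsumed tail, which the caller would
// prepend to the next chunk of input.
char[] drain (char[] pending, wchar[] buffer, out wchar[] converted)
{
        uint ate;

        // a non-null 'ate' is assumed to enable streaming mode, so an
        // incomplete trailing sequence is reported rather than thrown
        converted = toString16 (pending, buffer, &ate);

        // everything from 'ate' onwards has not been consumed yet
        return pending [ate .. $];
}
```

With an input of 'a' followed by only the first two bytes of a three-byte sequence, the expectation is that 'converted' holds the single wchar 'a' and the returned tail holds the two leftover bytes.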
For details on Unicode processing see:
- char[] toString(char[] src, char[] dst, uint* ate = null)
Symmetric calls for equivalent types; these return the provided
input with no conversion.
- char[] toString(wchar[] input, char[] output = null, uint* ate = null)
Encode UTF-8 sequences of up to 4 bytes (five- and six-byte
variations are not supported).
If the output buffer is allocated on the stack, it should be large
enough to hold the entire transcoding; otherwise the result
is placed on the heap instead.
Returns a slice of the output buffer, corresponding to the
converted characters. For optimum performance, the returned
buffer should be specified as 'output' on subsequent calls.
For example:
    char[] output;
    char[] result = toString (input, output);

    // reset output after a realloc
    if (result.length > output.length)
        output = result;
- wchar[] toString16(char[] input, wchar[] output = null, uint* ate = null)
Decode UTF-8 produced by the above toString() method.
If the output buffer is allocated on the stack, it should be large
enough to hold the entire transcoding; otherwise the result
is placed on the heap instead.
Returns a slice of the output buffer, corresponding to the
converted characters. For optimum performance, the returned
buffer should be specified as 'output' on subsequent calls.
- char[] toString(dchar[] input, char[] output = null, uint* ate = null)
Encode UTF-8 sequences of up to 4 bytes (five- and six-byte
variations are not supported). Throws an exception
where the input dchar is greater than 0x10ffff.
If the output buffer is allocated on the stack, it should be large
enough to hold the entire transcoding; otherwise the result
is placed on the heap instead.
Returns a slice of the output buffer, corresponding to the
converted characters. For optimum performance, the returned
buffer should be specified as 'output' on subsequent calls.
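To make the one-to-four-byte limit concrete, here is a minimal, self-contained sketch of encoding a single code point as UTF-8. It is not the module's tuned routine: the name 'encodeUtf8' is hypothetical, the destination is assumed to hold at least 4 chars, and surrogate values are not rejected.

```d
// Sketch only: encodes one code point as UTF-8 into 'dst' (which must
// hold at least 4 chars) and returns the slice that was written.
char[] encodeUtf8 (char[] dst, dchar c)
{
        if (c < 0x80)
           {
           // 7 bits: a single byte, 0xxxxxxx
           dst[0] = cast(char) c;
           return dst [0 .. 1];
           }
        if (c < 0x800)
           {
           // 11 bits: 110xxxxx 10xxxxxx
           dst[0] = cast(char) (0xC0 | (c >> 6));
           dst[1] = cast(char) (0x80 | (c & 0x3F));
           return dst [0 .. 2];
           }
        if (c < 0x10000)
           {
           // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
           dst[0] = cast(char) (0xE0 | (c >> 12));
           dst[1] = cast(char) (0x80 | ((c >> 6) & 0x3F));
           dst[2] = cast(char) (0x80 | (c & 0x3F));
           return dst [0 .. 3];
           }
        if (c <= 0x10FFFF)
           {
           // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
           dst[0] = cast(char) (0xF0 | (c >> 18));
           dst[1] = cast(char) (0x80 | ((c >> 12) & 0x3F));
           dst[2] = cast(char) (0x80 | ((c >> 6) & 0x3F));
           dst[3] = cast(char) (0x80 | (c & 0x3F));
           return dst [0 .. 4];
           }

        // beyond the Unicode range, mirroring the documented exception
        throw new Exception ("invalid dchar: greater than 0x10ffff");
}
```

Because 0x10ffff fits in 21 bits, four bytes always suffice, which is why the five- and six-byte forms are never needed.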
- dchar[] toString32(char[] input, dchar[] output = null, uint* ate = null)
Decode UTF-8 produced by the above toString() method.
If the output buffer is allocated on the stack, it should be large
enough to hold the entire transcoding; otherwise the result
is placed on the heap instead.
Returns a slice of the output buffer, corresponding to the
converted characters. For optimum performance, the returned
buffer should be specified as 'output' on subsequent calls.
- wchar[] toString16(dchar[] input, wchar[] output = null, uint* ate = null)
Encode UTF-16 using at most 2 wchars (one surrogate pair) per
character. Throws an exception where the input dchar is greater
than 0x10ffff.
If the output buffer is allocated on the stack, it should be large
enough to hold the entire transcoding; otherwise the result
is placed on the heap instead.
Returns a slice of the output buffer, corresponding to the
converted characters. For optimum performance, the returned
buffer should be specified as 'output' on subsequent calls.
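The surrogate-pair arithmetic behind UTF-16 encoding can be sketched as below. This is a self-contained illustration rather than the module's own code: 'encodeUtf16' is a hypothetical name, and the destination is assumed to hold at least 2 wchars.

```d
// Sketch only: encodes one code point as UTF-16 into 'dst' (which must
// hold at least 2 wchars) and returns the slice that was written.
wchar[] encodeUtf16 (wchar[] dst, dchar c)
{
        if (c < 0x10000)
           {
           // within the basic plane: a single wchar
           dst[0] = cast(wchar) c;
           return dst [0 .. 1];
           }
        if (c <= 0x10FFFF)
           {
           // subtracting 0x10000 leaves 20 significant bits, which are
           // split 10/10 across the high and low surrogates
           uint v = cast(uint) c - 0x10000;
           dst[0] = cast(wchar) (0xD800 | (v >> 10));
           dst[1] = cast(wchar) (0xDC00 | (v & 0x3FF));
           return dst [0 .. 2];
           }

        // beyond the Unicode range, mirroring the documented exception
        throw new Exception ("invalid dchar: greater than 0x10ffff");
}
```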
- dchar[] toString32(wchar[] input, dchar[] output = null, uint* ate = null)
Decode UTF-16 produced by the above toString16() method.
If the output buffer is allocated on the stack, it should be large
enough to hold the entire transcoding; otherwise the result
is placed on the heap instead.
Returns a slice of the output buffer, corresponding to the
converted characters. For optimum performance, the returned
buffer should be specified as 'output' on subsequent calls.
- dchar decode(char[] src, ref uint ate)
Decodes a single dchar from the given src text, and indicates how
many chars were consumed from src to do so.
- dchar decode(wchar[] src, ref uint ate)
Decodes a single dchar from the given src text, and indicates how
many wchars were consumed from src to do so.
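One way to use the single-character decode() is to walk through a text one code point at a time, advancing by the reported count. The sketch below assumes the module is imported as tango.text.convert.Utf with the D1-style signature shown above, and that the input is well-formed and complete. The helper name 'codePoints' is hypothetical.

```d
import tango.text.convert.Utf;

// Hypothetical helper: collects the code points of well-formed UTF-8
// text by decoding one dchar at a time.
dchar[] codePoints (char[] text)
{
        dchar[] result;
        uint    ate;

        while (text.length)
              {
              // 'ate' receives the number of chars consumed from 'text'
              result ~= decode (text, ate);

              // advance past the sequence just decoded
              text = text [ate .. $];
              }
        return result;
}
```

For whole arrays the bulk toString32() conversion is the better fit; a loop like this is mainly useful when each character needs inspecting as it is read.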
- char[] encode(char[] dst, dchar c)
Encode a dchar into the provided dst array, and return a slice of
it representing the encoding.
- wchar[] encode(wchar[] dst, dchar c)
Encode a dchar into the provided dst array, and return a slice of
it representing the encoding.
- bool isValid(dchar c)
Returns true if the given character is valid.
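The page does not spell out what "valid" means here. A plausible reading, based on the Unicode definition of a scalar value, is sketched below; both the criterion and the name 'isValidCodePoint' are assumptions rather than the module's documented behaviour.

```d
// Sketch only: treats a code point as valid when it lies within the
// Unicode range (up to 0x10FFFF) and is not a UTF-16 surrogate
// (0xD800 through 0xDFFF). This criterion is an assumption.
bool isValidCodePoint (dchar c)
{
        return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
}
```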
- T[] fromString8(T)(char[] s, T[] dst)
Convert from a char[] into the type of the dst provided.
Returns a slice of the given dst, where it is sufficiently large
to house the result, or a heap-allocated array otherwise. Returns
the original input where no conversion is required.
- T[] fromString16(T)(wchar[] s, T[] dst)
Convert from a wchar[] into the type of the dst provided.
Returns a slice of the given dst, where it is sufficiently large
to house the result, or a heap-allocated array otherwise. Returns
the original input where no conversion is required.
- T[] fromString32(T)(dchar[] s, T[] dst)
Convert from a dchar[] into the type of the dst provided.
Returns a slice of the given dst, where it is sufficiently large
to house the result, or a heap-allocated array otherwise. Returns
the original input where no conversion is required.
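The fromString templates are mostly useful inside generic code, where the target element type is itself a template parameter. The sketch below assumes the module is imported as tango.text.convert.Utf with the D1-style signatures shown above. The helper 'encodedLength' and the 64-element scratch size are hypothetical.

```d
import tango.text.convert.Utf;

// Hypothetical generic helper: reports how many elements of type T
// (char, wchar or dchar) the given UTF-8 text occupies once converted.
size_t encodedLength (T) (char[] text)
{
        // small stack buffer; per the description above, a result that
        // does not fit is expected to be heap-allocated instead
        T[64] scratch;

        // for T == char no conversion is needed and the input is returned
        return fromString8!(T) (text, scratch).length;
}
```

For example, the text 'a' followed by a euro sign (U+20AC) occupies four chars as UTF-8, but two elements as either wchar or dchar.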
- T[] cropLeft(T)(T[] s)
Adjust the content such that no partial encodings exist on the
left side of the provided text.
Returns a slice of the input.
- T[] cropRight(T)(T[] s)
Adjust the content such that no partial encodings exist on the
right side of the provided text.
Returns a slice of the input.
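To illustrate what cropping the right side involves for UTF-8, here is a minimal, self-contained sketch. It is not the module's implementation: 'cropRightUtf8' is a hypothetical name, only char[] is handled, and the input is assumed to be well-formed apart from a possibly truncated final sequence.

```d
// Sketch only: drops an incomplete UTF-8 sequence from the end of 's'
// and returns a slice of the input.
char[] cropRightUtf8 (char[] s)
{
        size_t i = s.length;

        // step back over trailing continuation bytes (10xxxxxx)
        while (i > 0 && (s[i-1] & 0xC0) == 0x80)
               --i;

        // empty input, or no lead byte at all: nothing complete remains
        if (i == 0)
            return s [0 .. 0];

        size_t lead = i - 1;              // index of the final lead byte
        size_t have = s.length - lead;    // bytes present in that sequence
        ubyte  b    = cast(ubyte) s[lead];

        // sequence length implied by the lead byte
        size_t need = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;

        // keep the text whole if the last sequence is complete,
        // otherwise cut just before its lead byte
        return have < need ? s [0 .. lead] : s;
}
```

Cropping the left side is the mirror image: skip any leading continuation bytes until a lead byte or an ASCII character is found.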