mirror of
https://github.com/tcltk/tcl.git
synced 2026-05-29 00:27:49 +08:00
252 lines
8.8 KiB
Plaintext
252 lines
8.8 KiB
Plaintext
'\"
|
|
'\" Copyright (c) 1998 Scriptics Corporation.
|
|
'\"
|
|
'\" See the file "license.terms" for information on usage and redistribution
|
|
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
|
|
'\"
|
|
.TH encoding n "8.1" Tcl "Tcl Built-In Commands"
|
|
.so man.macros
|
|
.BS
|
|
.SH NAME
|
|
encoding \- Manipulate encodings
|
|
.SH SYNOPSIS
|
|
\fBencoding \fIoption\fR ?\fIarg arg ...\fR?
|
|
.BE
|
|
.SH INTRODUCTION
|
|
.PP
|
|
Strings in Tcl are logically a sequence of Unicode characters.
|
|
These strings are represented in memory as a sequence of bytes that
|
|
may be in one of several encodings: modified UTF\-8 (which uses 1 to 4
|
|
bytes per character), or a custom encoding start as 8 bit binary data.
|
|
.PP
|
|
Different operating system interfaces or applications may generate
|
|
strings in other encodings such as Shift\-JIS. The \fBencoding\fR
|
|
command helps to bridge the gap between Unicode and these other
|
|
formats.
|
|
.SH DESCRIPTION
|
|
.PP
|
|
Performs one of several encoding related operations, depending on
|
|
\fIoption\fR. The legal \fIoption\fRs are:
|
|
.\" METHOD: convertfrom
|
|
.TP
|
|
\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR
|
|
.VS "TIP607, TIP656"
|
|
.TP
|
|
\fBencoding convertfrom\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding data\fR
|
|
.VE "TIP607, TIP656"
|
|
.
|
|
Converts \fIdata\fR, which should be in binary string encoded as per
|
|
\fIencoding\fR, to a Tcl string. If \fIencoding\fR is not specified, the current
|
|
system encoding is used.
|
|
.PP
|
|
.VS "TIP607, TIP656"
|
|
The \fB-profile\fR option determines the command behavior in the presence
|
|
of conversion errors. See the \fBPROFILES\fR section below for details. Any premature
|
|
termination of processing due to errors is reported through an exception if
|
|
the \fB-failindex\fR option is not specified.
|
|
.PP
|
|
If the \fB-failindex\fR is specified, instead of an exception being raised
|
|
on premature termination, the result of the conversion up to the point of the
|
|
error is returned as the result of the command. In addition, the index
|
|
of the source byte triggering the error is stored in \fBvar\fR. If no
|
|
errors are encountered, the entire result of the conversion is returned and
|
|
the value \fB-1\fR is stored in \fBvar\fR.
|
|
.VE "TIP607, TIP656"
|
|
.\" METHOD: convertto
|
|
.TP
|
|
\fBencoding convertto\fR ?\fIencoding\fR? \fIdata\fR
|
|
.TP
|
|
\fBencoding convertto\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding data\fR
|
|
.
|
|
Convert \fIstring\fR to the specified \fIencoding\fR. The result is a Tcl binary
|
|
string that contains the sequence of bytes representing the converted string in
|
|
the specified encoding. If \fIencoding\fR is not specified, the current system
|
|
encoding is used.
|
|
.PP
|
|
.VS "TIP607, TIP656"
|
|
The \fB-profile\fR and \fB-failindex\fR options have the same effect as
|
|
described for the \fBencoding convertfrom\fR command.
|
|
.VE "TIP607, TIP656"
|
|
.\" METHOD: dirs
|
|
.TP
|
|
\fBencoding dirs\fR ?\fIdirectoryList\fR?
|
|
.
|
|
Tcl can load encoding data files from the file system that describe
|
|
additional encodings for it to work with. This command sets the search
|
|
path for \fB*.enc\fR encoding data files to the list of directories
|
|
\fIdirectoryList\fR. If \fIdirectoryList\fR is omitted then the
|
|
command returns the current list of directories that make up the
|
|
search path. It is an error for \fIdirectoryList\fR to not be a valid
|
|
list. If, when a search for an encoding data file is happening, an
|
|
element in \fIdirectoryList\fR does not refer to a readable,
|
|
searchable directory, that element is ignored.
|
|
.\" METHOD: names
|
|
.TP
|
|
\fBencoding names\fR
|
|
.
|
|
Returns a list containing the names of all of the encodings that are
|
|
currently available.
|
|
The encodings
|
|
.QW utf-8
|
|
and
|
|
.QW iso8859-1
|
|
are guaranteed to be present in the list.
|
|
.\" METHOD: profiles
|
|
.TP
|
|
\fBencoding profiles\fR
|
|
.VS TIP656
|
|
Returns a list of the names of encoding profiles. See \fBPROFILES\fR below.
|
|
.VE TIP656
|
|
.\" METHOD: system
|
|
.TP
|
|
\fBencoding system\fR ?\fIencoding\fR?
|
|
.
|
|
Set the system encoding to \fIencoding\fR. If \fIencoding\fR is
|
|
omitted then the command returns the current system encoding. The
|
|
system encoding is used whenever Tcl passes strings to system calls.
|
|
.TP
|
|
\fBencoding user\fR
|
|
.VS TIP716
|
|
Returns the name of encoding as per the user's preferences. On Windows
|
|
systems, this is based on the user's code page settings in the registry.
|
|
On other platforms, the returned value is the same as returned by
|
|
\fBencoding system\fR.
|
|
.VE TIP716
|
|
.\" Do not put .VS on whole section as that messes up the bullet list alignment
|
|
.SH PROFILES
|
|
.PP
|
|
.VS TIP656
|
|
Operations involving encoding transforms may encounter several types of
|
|
errors such as invalid sequences in the source data, characters that
|
|
cannot be encoded in the target encoding and so on.
|
|
A \fIprofile\fR prescribes the strategy for dealing with such errors
|
|
in one of two ways:
|
|
.VE TIP656
|
|
.
|
|
.IP \(bu
|
|
.VS TIP656
|
|
Terminating further processing of the source data. The profile does not
|
|
determine how this premature termination is conveyed to the caller. By default,
|
|
this is signalled by raising an exception. If the \fB-failindex\fR option
|
|
is specified, errors are reported through that mechanism.
|
|
.VE TIP656
|
|
.IP \(bu
|
|
.VS TIP656
|
|
Continue further processing of the source data using a fallback strategy such
|
|
as replacing or discarding the offending bytes in a profile-defined manner.
|
|
.VE TIP656
|
|
.PP
|
|
The following profiles are currently implemented with \fBstrict\fR being
|
|
the default if the \fB-profile\fR is not specified.
|
|
.VS TIP656
|
|
.TP
|
|
\fBstrict\fR
|
|
.
|
|
The \fBstrict\fR profile always stops processing when an conversion error is
|
|
encountered. The error is signalled via an exception or the \fB-failindex\fR
|
|
option mechanism. The \fBstrict\fR profile implements a Unicode standard
|
|
conformant behavior.
|
|
.TP
|
|
\fBtcl8\fR
|
|
.
|
|
The \fBtcl8\fR profile always follows the first strategy above and corresponds
|
|
to the behavior of encoding transforms in Tcl 8.6. When converting from an
|
|
external encoding \fBother than utf-8\fR to Tcl strings with the \fBencoding
|
|
convertfrom\fR command, invalid bytes are mapped to their numerically equivalent
|
|
code points. For example, the byte 0x80 which is invalid in ASCII would be
|
|
mapped to code point U+0080. When converting from \fButf-8\fR, invalid bytes
|
|
that are defined in CP1252 are mapped to their Unicode equivalents while those
|
|
that are not fall back to the numerical equivalents. For example, byte 0x80 is
|
|
defined by CP1252 and is therefore mapped to its Unicode equivalent U+20AC while
|
|
byte 0x81 which is not defined by CP1252 is mapped to U+0081. As an additional
|
|
special case, the sequence 0xC0 0x80 is mapped to U+0000.
|
|
|
|
When converting from Tcl strings to an external encoding format using
|
|
\fBencoding convertto\fR, characters that cannot be represented in the
|
|
target encoding are replaced by an encoding-dependent character, usually
|
|
the question mark \fB?\fR.
|
|
.TP
|
|
\fBreplace\fR
|
|
.
|
|
Like the \fBtcl8\fR profile, the \fBreplace\fR profile always continues
|
|
processing on conversion errors but follows a Unicode standard conformant
|
|
method for substitution of invalid source data.
|
|
|
|
When converting an encoded byte sequence to a Tcl string using
|
|
\fBencoding convertfrom\fR, invalid bytes
|
|
are replaced by the U+FFFD REPLACEMENT CHARACTER code point.
|
|
|
|
When encoding a Tcl string with \fBencoding convertto\fR,
|
|
code points that cannot be represented in the
|
|
target encoding are transformed to an encoding-specific fallback character,
|
|
U+FFFD REPLACEMENT CHARACTER for UTF targets and generally `?` for other
|
|
encodings.
|
|
.VE TIP656
|
|
.SH EXAMPLES
|
|
.PP
|
|
These examples use the utility proc below that prints the Unicode code points
|
|
comprising a Tcl string.
|
|
.PP
|
|
.CS
|
|
proc codepoints s {join [lmap c [split $s {}] {
|
|
string cat U+ [format %.6X [scan $c %c]]}]
|
|
}
|
|
.CE
|
|
.PP
|
|
Example 1: convert a byte sequence in Japanese euc-jp encoding to a TCL string:
|
|
.PP
|
|
.CS
|
|
% codepoints [\fBencoding convertfrom\fR euc-jp "\exA4\exCF"]
|
|
U+00306F
|
|
.CE
|
|
.PP
|
|
The result is the unicode codepoint
|
|
.QW "\eu306F" ,
|
|
which is the Hiragana letter HA.
|
|
.VS "TIP607, TIP656"
|
|
.PP
|
|
Example 2: Error handling based on profiles:
|
|
.PP
|
|
The letter \fBA\fR is Unicode character U+0041 and the byte "\ex80" is invalid
|
|
in ASCII encoding.
|
|
.PP
|
|
.CS
|
|
% codepoints [\fBencoding convertfrom\fR -profile tcl8 ascii A\ex80]
|
|
U+000041 U+000080
|
|
% codepoints [\fBencoding convertfrom\fR -profile replace ascii A\ex80]
|
|
U+000041 U+00FFFD
|
|
% codepoints [\fBencoding convertfrom\fR -profile strict ascii A\ex80]
|
|
unexpected byte sequence starting at index 1: '\ex80'
|
|
.CE
|
|
.PP
|
|
Example 3: Get partial data and the error location:
|
|
.PP
|
|
.CS
|
|
% codepoints [\fBencoding convertfrom\fR -profile strict -failindex idx ascii AB\ex80]
|
|
U+000041 U+000042
|
|
% set idx
|
|
2
|
|
.CE
|
|
.PP
|
|
Example 4: Encode a character that is not representable in ISO8859-1:
|
|
.PP
|
|
.CS
|
|
% \fBencoding convertto\fR iso8859-1 A\eu0141
|
|
A?
|
|
% \fBencoding convertto\fR -profile strict iso8859-1 A\eu0141
|
|
unexpected character at index 1: 'U+000141'
|
|
% \fBencoding convertto\fR -profile strict -failindex idx iso8859-1 A\eu0141
|
|
A
|
|
% set idx
|
|
1
|
|
.CE
|
|
.VE "TIP607, TIP656"
|
|
.PP
|
|
.SH "SEE ALSO"
|
|
Tcl_GetEncoding(3), fconfigure(n)
|
|
.SH KEYWORDS
|
|
encoding, unicode
|
|
.\" Local Variables:
|
|
.\" mode: nroff
|
|
.\" End:
|