atom feed10 messages in org.perl.perl5-porters[PATCH] enable locale-induced UTF-8 I...
FromSent OnAttachments
Jarkko HietaniemiJan 14, 2003 7:48 am 
Rafael Garcia-SuarezJan 14, 2003 8:29 am 
Benjamin GoldbergJan 14, 2003 2:56 pm 
Jarkko HietaniemiJan 14, 2003 3:05 pm 
Jarkko HietaniemiJan 14, 2003 3:49 pm 
Jarkko HietaniemiJan 15, 2003 7:17 pm 
Abe TimmermanJan 16, 2003 12:35 pm 
Jarkko HietaniemiJan 16, 2003 12:40 pm 
Abe TimmermanJan 16, 2003 1:24 pm 
Jarkko HietaniemiJan 16, 2003 1:31 pm 
Subject:[PATCH] enable locale-induced UTF-8 I/O only if explicitly asked (Was: Re: [perl #19743] implicit utf8ification causes action-at-distance bugs)
From:Jarkko Hietaniemi (jh@iki.fi)
Date:Jan 14, 2003 7:48:56 am
List:org.perl.perl5-porters

In our previous episode we found out that there were two problems inherent in the implicit UTF-8-ification:

(1) The UTF-8 kicked in even when the user didn't ask for it. Lots of people using RH 8.0 have been bitten by this because the default locales are UTF-8.

(2) Even when and if the user wanted it, reading in malformed UTF-8 didn't do anything *immediately*. It was only later when and if further operations were attempted on the malformed data that the sad state was detected.

The issue (2) was fixed by Encode 1.84, now the <> (et alia) detect the evil data. (Though some further hacking may be required, a single UTF-8 tr/// test was broken by the Encode 1.84.)

So the issue (1) still would remain but the following patch attempts to rectify the situation, by making the UTF-8-ification explicit instead of implicit.

This patch (inlined since last time something ate my attachment) hijacks the -C switch (as suggested by Sarathy) to do the enabling of UTF-8-fied I/O. So no more implicit UTF-8 based on locale settings. (Use of the locale pragma wouldn't have worked that well since it is lexical in scope, while the UTF-8 decision is rather global in scope.) I added also an alternative way of enabling this feature: setting the $ENV{PERL_UTF8_LOCALE} to true (the -C, if present, wins).

In a perverse way going explicit is bad news since the implicit UTF-8-ification has certainly shaken many evil bugs out of the 5.8.0 tree (the B0B bug comes to mind, for example). Maybe for those platforms that have UTF-8 locales a new column of smoke testing (with env PERL_UTF8_LOCALE=1 LC_ALL=xx_YY.UTF-8) would be in order.

==== //depot/perl/embedvar.h#156 - /u/vieraat/vieraat/jhi/pp4/perl/embedvar.h
==== Index: perl/embedvar.h --- perl/embedvar.h.~1~ Tue Jan 14 17:29:10 2003 +++ perl/embedvar.h Tue Jan 14 17:29:10 2003 @@ -413,10 +413,10 @@ #define PL_utf8_toupper (vTHX->Iutf8_toupper) #define PL_utf8_upper (vTHX->Iutf8_upper) #define PL_utf8_xdigit (vTHX->Iutf8_xdigit) +#define PL_utf8locale (vTHX->Iutf8locale) #define PL_uudmap (vTHX->Iuudmap) #define PL_wantutf8 (vTHX->Iwantutf8) #define PL_warnhook (vTHX->Iwarnhook) -#define PL_widesyscalls (vTHX->Iwidesyscalls) #define PL_xiv_arenaroot (vTHX->Ixiv_arenaroot) #define PL_xiv_root (vTHX->Ixiv_root) #define PL_xnv_arenaroot (vTHX->Ixnv_arenaroot) @@ -702,10 +702,10 @@ #define PL_Iutf8_toupper PL_utf8_toupper #define PL_Iutf8_upper PL_utf8_upper #define PL_Iutf8_xdigit PL_utf8_xdigit +#define PL_Iutf8locale PL_utf8locale #define PL_Iuudmap PL_uudmap #define PL_Iwantutf8 PL_wantutf8 #define PL_Iwarnhook PL_warnhook -#define PL_Iwidesyscalls PL_widesyscalls #define PL_Ixiv_arenaroot PL_xiv_arenaroot #define PL_Ixiv_root PL_xiv_root #define PL_Ixnv_arenaroot PL_xnv_arenaroot ==== //depot/perl/gv.c#178 - /u/vieraat/vieraat/jhi/pp4/perl/gv.c ==== Index: perl/gv.c --- perl/gv.c.~1~ Tue Jan 14 17:29:10 2003 +++ perl/gv.c Tue Jan 14 17:29:10 2003 @@ -974,9 +974,15 @@ goto ro_magicalize; else break; + case '\025': + if (len > 1 && strNE(name, "\025TF8_LOCALE")) + break; + goto ro_magicalize; + case '\027': /* $^W & $^WARNING_BITS */ - if (len > 1 && strNE(name, "\027ARNING_BITS") - && strNE(name, "\027IDE_SYSTEM_CALLS")) + if (len > 1 + && strNE(name, "\027ARNING_BITS") + ) break; goto magicalize;

@@ -1793,10 +1799,13 @@ goto yes; } break; + case '\025': + if (len > 1 && strEQ(name, "\025TF8_LOCALE")) + goto yes; case '\027': /* $^W & $^WARNING_BITS */ if (len == 1 || (len == 12 && strEQ(name, "\027ARNING_BITS")) - || (len == 17 && strEQ(name, "\027IDE_SYSTEM_CALLS"))) + ) { goto yes; } ==== //depot/perl/intrpvar.h#112 - /u/vieraat/vieraat/jhi/pp4/perl/intrpvar.h
==== Index: perl/intrpvar.h --- perl/intrpvar.h.~1~ Tue Jan 14 17:29:10 2003 +++ perl/intrpvar.h Tue Jan 14 17:29:10 2003 @@ -48,7 +48,7 @@ */

PERLVAR(Idowarn, U8) -PERLVAR(Iwidesyscalls, bool) /* wide system calls */ +PERLVAR(Iutf8locale, bool) /* utf8 locale detected */ PERLVAR(Idoextract, bool) PERLVAR(Isawampersand, bool) /* must save all match strings */ PERLVAR(Iunsafe, bool) ==== //depot/perl/locale.c#10 - /u/vieraat/vieraat/jhi/pp4/perl/locale.c ==== Index: perl/locale.c --- perl/locale.c.~1~ Tue Jan 14 17:29:10 2003 +++ perl/locale.c Tue Jan 14 17:29:10 2003 @@ -475,7 +475,7 @@

#ifdef USE_PERLIO { - /* Set PL_wantutf8 to TRUE if using PerlIO _and_ + /* Set PL_utf8locale to TRUE if using PerlIO _and_ any of the following are true: - nl_langinfo(CODESET) contains /^utf-?8/i - $ENV{LC_ALL} contains /^utf-?8/i @@ -487,37 +487,44 @@ it overrides LC_MESSAGES for GNU gettext, and it also can have more than one locale, separated by spaces, in case you need to know.) - If PL_wantutf8 is true, perl.c:S_parse_body() - will turn on the PerlIO :utf8 discipline on STDIN, STDOUT, - STDERR, _and_ the default open discipline. + If PL_utf8locale and PL_wantutf8 (set by -C) are true, + perl.c:S_parse_body() will turn on the PerlIO :utf8 layer + on STDIN, STDOUT, STDERR, _and_ the default open discipline. */ - bool wantutf8 = FALSE; + bool utf8locale = FALSE; char *codeset = NULL; #if defined(HAS_NL_LANGINFO) && defined(CODESET) codeset = nl_langinfo(CODESET); #endif if (codeset) - wantutf8 = (ibcmp(codeset, "UTF-8", 5) == 0 || - ibcmp(codeset, "UTF8", 4) == 0); + utf8locale = (ibcmp(codeset, "UTF-8", 5) == 0 || + ibcmp(codeset, "UTF8", 4) == 0); #if defined(USE_LOCALE) else { /* nl_langinfo(CODESET) is supposed to correctly * interpret the locale environment variables, * but just in case it fails, let's do this manually. */ if (lang) - wantutf8 = (ibcmp(lang, "UTF-8", 5) == 0 || - ibcmp(lang, "UTF8", 4) == 0); + utf8locale = (ibcmp(lang, "UTF-8", 5) == 0 || + ibcmp(lang, "UTF8", 4) == 0); #ifdef USE_LOCALE_CTYPE if (curctype) - wantutf8 = (ibcmp(curctype, "UTF-8", 5) == 0 || - ibcmp(curctype, "UTF8", 4) == 0); + utf8locale = (ibcmp(curctype, "UTF-8", 5) == 0 || + ibcmp(curctype, "UTF8", 4) == 0); #endif if (lc_all) - wantutf8 = (ibcmp(lc_all, "UTF-8", 5) == 0 || - ibcmp(lc_all, "UTF8", 4) == 0); + utf8locale = (ibcmp(lc_all, "UTF-8", 5) == 0 || + ibcmp(lc_all, "UTF8", 4) == 0); #endif /* USE_LOCALE */ } - if (wantutf8) - PL_wantutf8 = TRUE; + if (utf8locale) + PL_utf8locale = TRUE; + } + /* Set PL_wantutf8 to $ENV{PERL_UTF8_LOCALE} if using PerlIO. + This is an alternative to using the -C command line switch + (the -C if present will override this). */ + { + char *p = PerlEnv_getenv("PERL_UTF8_LOCALE"); + PL_wantutf8 = p ? (bool) atoi(p) : FALSE; } #endif

==== //depot/perl/mg.c#246 - /u/vieraat/vieraat/jhi/pp4/perl/mg.c ==== Index: perl/mg.c --- perl/mg.c.~1~ Tue Jan 14 17:29:10 2003 +++ perl/mg.c Tue Jan 14 17:29:10 2003 @@ -662,7 +662,11 @@ ? (PL_taint_warn || PL_unsafe ? -1 : 1) : 0); break; - case '\027': /* ^W & $^WARNING_BITS & ^WIDE_SYSTEM_CALLS */ + case '\025': /* $^UTF8_LOCALE */ + if (strEQ(mg->mg_ptr, "\025TF8_LOCALE")) + sv_setiv(sv, (IV) (PL_wantutf8 && PL_utf8locale)); + break; + case '\027': /* ^W & $^WARNING_BITS */ if (*(mg->mg_ptr+1) == '\0') sv_setiv(sv, (IV)((PL_dowarn & G_WARN_ON) ? TRUE : FALSE)); else if (strEQ(mg->mg_ptr+1, "ARNING_BITS")) { @@ -679,8 +683,6 @@ } SvPOK_only(sv); } - else if (strEQ(mg->mg_ptr+1, "IDE_SYSTEM_CALLS")) - sv_setiv(sv, (IV)PL_widesyscalls); break; case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': case '&': @@ -1925,7 +1927,13 @@ PL_basetime = (Time_t)(SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv)); #endif break; - case '\027': /* ^W & $^WARNING_BITS & ^WIDE_SYSTEM_CALLS */ + case '\025': /* $^UTF8_LOCALE */ + if (SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv)) + PL_wantutf8 = PL_utf8locale; + else + PL_wantutf8 = FALSE; + break; + case '\027': /* ^W & $^WARNING_BITS */ if (*(mg->mg_ptr+1) == '\0') { if ( ! (PL_dowarn & G_WARN_ALL_MASK)) { i = SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv); @@ -1967,8 +1975,6 @@ } } } - else if (strEQ(mg->mg_ptr+1, "IDE_SYSTEM_CALLS")) - PL_widesyscalls = (bool)SvTRUE(sv); break; case '.': if (PL_localizing) { ==== //depot/perl/perl.c#461 - /u/vieraat/vieraat/jhi/pp4/perl/perl.c ==== Index: perl/perl.c --- perl/perl.c.~1~ Tue Jan 14 17:29:10 2003 +++ perl/perl.c Tue Jan 14 17:29:10 2003 @@ -1355,10 +1355,11 @@ if (!PL_do_undump) init_postdump_symbols(argc,argv,env);

- /* PL_wantutf8 is conditionally turned on by + /* PL_utf8locale is conditionally turned on by * locale.c:Perl_init_i18nl10n() if the environment - * look like the user wants to use UTF-8. */ - if (PL_wantutf8) { /* Requires init_predump_symbols(). */ + * look like the user wants to use UTF-8. + * PL_wantutf8 is turned on by -C or by $ENV{PERL_UTF8_LOCALE}. */ + if (PL_utf8locale && PL_wantutf8) { /* Requires init_predump_symbols(). */ IO* io; PerlIO* fp; SV* sv; @@ -2156,7 +2157,7 @@ return s + numlen; } case 'C': - PL_widesyscalls = TRUE; + PL_wantutf8 = TRUE; /* Can be set earlier by $ENV{PERL_UTF8_LOCALE}. */ s++; return s; case 'F': @@ -3397,7 +3398,7 @@ for (; argc > 0; argc--,argv++) { SV *sv = newSVpv(argv[0],0); av_push(GvAVn(PL_argvgv),sv); - if (PL_widesyscalls) + if (PL_wantutf8) (void)sv_utf8_decode(sv); } } ==== //depot/perl/perlapi.h#78 - /u/vieraat/vieraat/jhi/pp4/perl/perlapi.h ==== Index: perl/perlapi.h --- perl/perlapi.h.~1~ Tue Jan 14 17:29:10 2003 +++ perl/perlapi.h Tue Jan 14 17:29:10 2003 @@ -584,14 +584,14 @@ #define PL_utf8_upper (*Perl_Iutf8_upper_ptr(aTHX)) #undef PL_utf8_xdigit #define PL_utf8_xdigit (*Perl_Iutf8_xdigit_ptr(aTHX)) +#undef PL_utf8locale +#define PL_utf8locale (*Perl_Iutf8locale_ptr(aTHX)) #undef PL_uudmap #define PL_uudmap (*Perl_Iuudmap_ptr(aTHX)) #undef PL_wantutf8 #define PL_wantutf8 (*Perl_Iwantutf8_ptr(aTHX)) #undef PL_warnhook #define PL_warnhook (*Perl_Iwarnhook_ptr(aTHX)) -#undef PL_widesyscalls -#define PL_widesyscalls (*Perl_Iwidesyscalls_ptr(aTHX)) #undef PL_xiv_arenaroot #define PL_xiv_arenaroot (*Perl_Ixiv_arenaroot_ptr(aTHX)) #undef PL_xiv_root ==== //depot/perl/pod/perlrun.pod#67 -
/u/vieraat/vieraat/jhi/pp4/perl/pod/perlrun.pod ==== Index: perl/pod/perlrun.pod --- perl/pod/perlrun.pod.~1~ Tue Jan 14 17:29:10 2003 +++ perl/pod/perlrun.pod Tue Jan 14 17:29:10 2003 @@ -266,11 +266,21 @@

=item B<-C>

-enables Perl to use the native wide character APIs on the target system. -The magic variable C<${^WIDE_SYSTEM_CALLS}> reflects the state of -this switch. See L<perlvar/"${^WIDE_SYSTEM_CALLS}">. +enables Perl to use the Unicode APIs on the target system.

-This feature is currently only implemented on the Win32 platform. +As of Perl 5.8.1, if C<-C> is used and the locale settings (the LC_ALL, +LC_CTYPE, and LANG environment variables) indicate a UTF-8 locale, +the STDIN is expected to be in UTF-8, the STDOUT and STDERR are +expected to be in UTF-8, and C<:utf8> is the default file open layer. +See L<perluniintro>, L<perlfunc/open>, and L<open> for more information. +The magic variable C<${^UTF8_LOCALE}> reflects this state, +see L<perlvar/"${^UTF8_LOCALE}">. (Another way of setting this +variable is to set the environment variable PERL_UTF8_LOCALE.) + +(In Perls earlier than 5.8.1 the C<-C> switch was a Win32-only switch +that enabled the use of Unicode-aware "wide system call" Win32 APIs. +This feature was practically unused, however, and the command line +switch was therefore "recycled".)

=item B<-c>

==== //depot/perl/pod/perlunicode.pod#113 -
/u/vieraat/vieraat/jhi/pp4/perl/pod/perlunicode.pod ==== Index: perl/pod/perlunicode.pod --- perl/pod/perlunicode.pod.~1~ Tue Jan 14 17:29:10 2003 +++ perl/pod/perlunicode.pod Tue Jan 14 17:29:10 2003 @@ -67,13 +67,6 @@ external programs, from information provided by the system (such as %ENV), or from literals and constants in the source text.

-On Windows platforms, if the C<-C> command line switch is used or the -${^WIDE_SYSTEM_CALLS} global flag is set to C<1>, all system calls -will use the corresponding wide-character APIs. This feature is -available only on Windows to conform to the API standard already -established for that platform--and there are very few non-Windows -platforms that have Unicode-aware APIs. - The C<bytes> pragma will always, regardless of platform, force byte semantics in a particular lexical scope. See L<bytes>.

@@ -1050,10 +1043,14 @@

=item *

-If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG) -contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching), -the default encodings of your STDIN, STDOUT, and STDERR, and of -B<any subsequent file open>, are considered to be UTF-8. +If your locale environment variables (LC_ALL, LC_CTYPE, LANG) +contain the strings 'UTF-8' or 'UTF8' (matched case-insensitively) +B<and> you enable using UTF-8 either by using the C<-C> command line +switch or setting the PERL_UTF8_LOCALE environment variable to a true +value, then the default encodings of your STDIN, STDOUT, and STDERR, +and of B<any subsequent file open>, are considered to be UTF-8. +See L<perluniintro>, L<perlfunc/open>, and L<open> for more +information. The magic variable C<${^UTF8_LOCALE}> will also be set.

=item *

@@ -1410,6 +1407,6 @@ =head1 SEE ALSO

L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>, -L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}"> +L<perlretut>, L<perlvar/"${^UTF8_LOCALE}">

=cut ==== //depot/perl/pod/perluniintro.pod#44 -
/u/vieraat/vieraat/jhi/pp4/perl/pod/perluniintro.pod ==== Index: perl/pod/perluniintro.pod --- perl/pod/perluniintro.pod.~1~ Tue Jan 14 17:29:10 2003 +++ perl/pod/perluniintro.pod Tue Jan 14 17:29:10 2003 @@ -172,13 +172,15 @@ to this sample program ensures that the output is completely UTF-8, and removes the program's warning.

-If your locale environment variables (C<LANGUAGE>, C<LC_ALL>, -C<LC_CTYPE>, C<LANG>) contain the strings 'UTF-8' or 'UTF8', -regardless of case, then the default encoding of your STDIN, STDOUT, -and STDERR and of B<any subsequent file open>, is UTF-8. Note that -this means that Perl expects other software to work, too: if Perl has -been led to believe that STDIN should be UTF-8, but then STDIN coming -in from another command is not UTF-8, Perl will complain about the +If your locale environment variables (C<LC_ALL>, C<LC_CTYPE>, C<LANG>) +contain the strings 'UTF-8' or 'UTF8' (matched case-insensitively) +B<and> you enable using UTF-8 either by using the C<-C> command line +switch or by setting the PERL_UTF8_LOCALE environment variable to +a true value, then the default encoding of your STDIN, STDOUT, and +STDERR, and of B<any subsequent file open>, is UTF-8. Note that this +means that Perl expects other software to work, too: if Perl has been +led to believe that STDIN should be UTF-8, but then STDIN coming in +from another command is not UTF-8, Perl will complain about the malformed UTF-8.

All features that combine Unicode and I/O also require using the new ==== //depot/perl/pod/perlvar.pod#111 -
/u/vieraat/vieraat/jhi/pp4/perl/pod/perlvar.pod ==== Index: perl/pod/perlvar.pod --- perl/pod/perlvar.pod.~1~ Tue Jan 14 17:29:10 2003 +++ perl/pod/perlvar.pod Tue Jan 14 17:29:10 2003 @@ -1109,6 +1109,16 @@ B<-T>), 0 for off, -1 when only taint warnings are enabled (i.e. with B<-t> or B<-TU>). This variable is read-only.

+=item ${^UTF8_LOCALE} + +Reflects whether the locale settings indicated the use of UTF-8 and that +the use of UTF-8 was enabled either by the C<-C> command line switch or +by setting the PERL_UTF8_LOCALE environment variable to a true value. +This variable is read-only. If true, the STDIN is expected to be in +UTF-8, the STDOUT and STDERR are in UTF-8, and C<:utf8> is the default +file open layer. See L<perluniintro>, L<perlfunc/open>, and L<open> +for more information. + =item $PERL_VERSION

=item $^V @@ -1148,21 +1158,6 @@ The current set of warning checks enabled by the C<use warnings> pragma. See the documentation of C<warnings> for more details.

-=item ${^WIDE_SYSTEM_CALLS} - -Global flag that enables system calls made by Perl to use wide character -APIs native to the system, if available. This is currently only implemented -on the Windows platform. - -This can also be enabled from the command line using the C<-C> switch. - -The initial value is typically C<0> for compatibility with Perl versions -earlier than 5.6, but may be automatically set to C<1> by Perl if the system -provides a user-settable default (e.g., C<$ENV{LC_CTYPE}>). - -The C<bytes> pragma always overrides the effect of this flag in the current -lexical scope. See L<bytes>. - =item $EXECUTABLE_NAME

=item $^X End of Patch.