| From | Sent On | Attachments |
|---|---|---|
| Jarkko Hietaniemi | Jan 14, 2003 7:48 am | |
| Rafael Garcia-Suarez | Jan 14, 2003 8:29 am | |
| Benjamin Goldberg | Jan 14, 2003 2:56 pm | |
| Jarkko Hietaniemi | Jan 14, 2003 3:05 pm | |
| Jarkko Hietaniemi | Jan 14, 2003 3:49 pm | |
| Jarkko Hietaniemi | Jan 15, 2003 7:17 pm | |
| Abe Timmerman | Jan 16, 2003 12:35 pm | |
| Jarkko Hietaniemi | Jan 16, 2003 12:40 pm | |
| Abe Timmerman | Jan 16, 2003 1:24 pm | |
| Jarkko Hietaniemi | Jan 16, 2003 1:31 pm | |

| Subject: | [PATCH] enable locale-induced UTF-8 I/O only if explicitly asked (Was: Re: [perl #19743] implicit utf8ification causes action-at-distance bugs) | |
|---|---|---|
| From: | Jarkko Hietaniemi (jh...@iki.fi) | |
| Date: | Jan 14, 2003 7:48:56 am | |
| List: | org.perl.perl5-porters | |
In our previous episode we found out that there were two problems inherent in the implicit UTF-8-ification:
(1) The UTF-8 kicked in even when the user didn't ask for it. Lots of people using RH 8.0 have been bitten by this because the default locales are UTF-8.
(2) Even when and if the user wanted it, reading in malformed UTF-8 didn't do anything *immediately*. It was only later when and if further operations were attempted on the malformed data that the sad state was detected.
Issue (2) was fixed by Encode 1.84: <> (et alia) now detect the evil data as it is read. (Some further hacking may be required, though, since Encode 1.84 broke a single UTF-8 tr/// test.)
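To illustrate what "detecting the evil data immediately" means, here is a minimal C sketch of a well-formedness check that could be run on each chunk of input as it arrives. It is not the Encode 1.84 code; the function name `utf8_is_well_formed` is hypothetical, and it applies the strict Unicode definition (no overlong forms, no surrogates, nothing above U+10FFFF), which may be stricter than what Perl itself accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of Perl or Encode: returns true if
 * buf[0..len) is well-formed UTF-8 under the strict Unicode rules. */
static bool utf8_is_well_formed(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;            /* number of continuation bytes expected */
        unsigned long cp, min;  /* decoded value and smallest legal value */

        if (c < 0x80) {         /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;       /* stray continuation byte or invalid lead */
        }

        if (len - i - 1 < need)
            return false;       /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;       /* overlong, out of range, or surrogate */

        i += need + 1;
    }
    return true;
}
```

A reader loop would call this on every buffer and complain right away, rather than leaving the bad bytes to blow up in some later operation. A real implementation would also have to cope with a multi-byte sequence split across two reads, which this sketch treats as truncation.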
That leaves issue (1), which the following patch attempts to rectify by making the UTF-8-ification explicit instead of implicit.
This patch (inlined, since last time something ate my attachment) hijacks the -C switch (as suggested by Sarathy) to enable UTF-8-fied I/O, so there is no more implicit UTF-8 based on locale settings alone. (The locale pragma wouldn't have worked that well, since it is lexical in scope while the UTF-8 decision is rather global.) I also added an alternative way of enabling this feature: setting $ENV{PERL_UTF8_LOCALE} to a true value (the -C, if present, wins).
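The resulting decision can be summarized in a small standalone C sketch. This is not the patch's code: the helper names are hypothetical, `strncasecmp` stands in for Perl's `ibcmp`, and the codeset string is passed in as a parameter (in practice it would come from something like `nl_langinfo(CODESET)`). The point is only that both conditions must hold: the locale must look like UTF-8, and the user must have asked.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <strings.h>   /* strncasecmp (POSIX), standing in for ibcmp */

/* Hypothetical helper: does the codeset name start with "UTF-8" or
 * "UTF8", compared case-insensitively? */
static bool starts_with_utf8(const char *s)
{
    return s != NULL &&
           (strncasecmp(s, "UTF-8", 5) == 0 ||
            strncasecmp(s, "UTF8", 4) == 0);
}

/* Hypothetical helper: the explicit request. The -C switch wins if it
 * was seen; otherwise the value of PERL_UTF8_LOCALE is read as an
 * integer, in the same spirit as the atoi() call in the patch. */
static bool want_utf8(bool dash_c_seen, const char *env_value)
{
    if (dash_c_seen)
        return true;
    return env_value != NULL && atoi(env_value) != 0;
}

/* UTF-8 I/O is turned on only if the locale indicates UTF-8 AND the
 * user explicitly asked for it. */
static bool utf8_io_enabled(const char *codeset, bool dash_c_seen,
                            const char *env_value)
{
    return starts_with_utf8(codeset) && want_utf8(dash_c_seen, env_value);
}
```

For example, `utf8_io_enabled("UTF-8", false, NULL)` is false (a UTF-8 locale alone no longer triggers anything), while `utf8_io_enabled("UTF-8", true, NULL)` and `utf8_io_enabled("utf8", false, "1")` are both true. Note that with `atoi()` a value such as "yes" counts as false; only a nonzero number enables the feature.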
In a perverse way going explicit is bad news since the implicit UTF-8-ification has certainly shaken many evil bugs out of the 5.8.0 tree (the B0B bug comes to mind, for example). Maybe for those platforms that have UTF-8 locales a new column of smoke testing (with env PERL_UTF8_LOCALE=1 LC_ALL=xx_YY.UTF-8) would be in order.
==== //depot/perl/embedvar.h#156 - /u/vieraat/vieraat/jhi/pp4/perl/embedvar.h
====
Index: perl/embedvar.h
--- perl/embedvar.h.~1~ Tue Jan 14 17:29:10 2003
+++ perl/embedvar.h Tue Jan 14 17:29:10 2003
@@ -413,10 +413,10 @@
#define PL_utf8_toupper (vTHX->Iutf8_toupper)
#define PL_utf8_upper (vTHX->Iutf8_upper)
#define PL_utf8_xdigit (vTHX->Iutf8_xdigit)
+#define PL_utf8locale (vTHX->Iutf8locale)
#define PL_uudmap (vTHX->Iuudmap)
#define PL_wantutf8 (vTHX->Iwantutf8)
#define PL_warnhook (vTHX->Iwarnhook)
-#define PL_widesyscalls (vTHX->Iwidesyscalls)
#define PL_xiv_arenaroot (vTHX->Ixiv_arenaroot)
#define PL_xiv_root (vTHX->Ixiv_root)
#define PL_xnv_arenaroot (vTHX->Ixnv_arenaroot)
@@ -702,10 +702,10 @@
#define PL_Iutf8_toupper PL_utf8_toupper
#define PL_Iutf8_upper PL_utf8_upper
#define PL_Iutf8_xdigit PL_utf8_xdigit
+#define PL_Iutf8locale PL_utf8locale
#define PL_Iuudmap PL_uudmap
#define PL_Iwantutf8 PL_wantutf8
#define PL_Iwarnhook PL_warnhook
-#define PL_Iwidesyscalls PL_widesyscalls
#define PL_Ixiv_arenaroot PL_xiv_arenaroot
#define PL_Ixiv_root PL_xiv_root
#define PL_Ixnv_arenaroot PL_xnv_arenaroot
==== //depot/perl/gv.c#178 - /u/vieraat/vieraat/jhi/pp4/perl/gv.c ====
Index: perl/gv.c
--- perl/gv.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/gv.c Tue Jan 14 17:29:10 2003
@@ -974,9 +974,15 @@
goto ro_magicalize;
else
break;
+ case '\025':
+ if (len > 1 && strNE(name, "\025TF8_LOCALE"))
+ break;
+ goto ro_magicalize;
+
case '\027': /* $^W & $^WARNING_BITS */
- if (len > 1 && strNE(name, "\027ARNING_BITS")
- && strNE(name, "\027IDE_SYSTEM_CALLS"))
+ if (len > 1
+ && strNE(name, "\027ARNING_BITS")
+ )
break;
goto magicalize;
@@ -1793,10 +1799,13 @@
goto yes;
}
break;
+ case '\025':
+ if (len > 1 && strEQ(name, "\025TF8_LOCALE"))
+ goto yes;
case '\027': /* $^W & $^WARNING_BITS */
if (len == 1
|| (len == 12 && strEQ(name, "\027ARNING_BITS"))
- || (len == 17 && strEQ(name, "\027IDE_SYSTEM_CALLS")))
+ )
{
goto yes;
}
==== //depot/perl/intrpvar.h#112 - /u/vieraat/vieraat/jhi/pp4/perl/intrpvar.h
====
Index: perl/intrpvar.h
--- perl/intrpvar.h.~1~ Tue Jan 14 17:29:10 2003
+++ perl/intrpvar.h Tue Jan 14 17:29:10 2003
@@ -48,7 +48,7 @@
*/
PERLVAR(Idowarn, U8)
-PERLVAR(Iwidesyscalls, bool) /* wide system calls */
+PERLVAR(Iutf8locale, bool) /* utf8 locale detected */
PERLVAR(Idoextract, bool)
PERLVAR(Isawampersand, bool) /* must save all match strings */
PERLVAR(Iunsafe, bool)
==== //depot/perl/locale.c#10 - /u/vieraat/vieraat/jhi/pp4/perl/locale.c ====
Index: perl/locale.c
--- perl/locale.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/locale.c Tue Jan 14 17:29:10 2003
@@ -475,7 +475,7 @@
#ifdef USE_PERLIO
{
- /* Set PL_wantutf8 to TRUE if using PerlIO _and_
+ /* Set PL_utf8locale to TRUE if using PerlIO _and_
any of the following are true:
  - nl_langinfo(CODESET) contains /^utf-?8/i
  - $ENV{LC_ALL} contains /^utf-?8/i
@@ -487,37 +487,44 @@
it overrides LC_MESSAGES for GNU gettext, and it also
can have more than one locale, separated by spaces,
in case you need to know.)
- If PL_wantutf8 is true, perl.c:S_parse_body()
- will turn on the PerlIO :utf8 discipline on STDIN, STDOUT,
- STDERR, _and_ the default open discipline.
+ If PL_utf8locale and PL_wantutf8 (set by -C) are true,
+ perl.c:S_parse_body() will turn on the PerlIO :utf8 layer
+ on STDIN, STDOUT, STDERR, _and_ the default open discipline.
*/
- bool wantutf8 = FALSE;
+ bool utf8locale = FALSE;
char *codeset = NULL;
#if defined(HAS_NL_LANGINFO) && defined(CODESET)
codeset = nl_langinfo(CODESET);
#endif
if (codeset)
- wantutf8 = (ibcmp(codeset, "UTF-8", 5) == 0 ||
- ibcmp(codeset, "UTF8", 4) == 0);
+ utf8locale = (ibcmp(codeset, "UTF-8", 5) == 0 ||
+ ibcmp(codeset, "UTF8", 4) == 0);
#if defined(USE_LOCALE)
else { /* nl_langinfo(CODESET) is supposed to correctly
* interpret the locale environment variables,
* but just in case it fails, let's do this manually. */
if (lang)
- wantutf8 = (ibcmp(lang, "UTF-8", 5) == 0 ||
- ibcmp(lang, "UTF8", 4) == 0);
+ utf8locale = (ibcmp(lang, "UTF-8", 5) == 0 ||
+ ibcmp(lang, "UTF8", 4) == 0);
#ifdef USE_LOCALE_CTYPE
if (curctype)
- wantutf8 = (ibcmp(curctype, "UTF-8", 5) == 0 ||
- ibcmp(curctype, "UTF8", 4) == 0);
+ utf8locale = (ibcmp(curctype, "UTF-8", 5) == 0 ||
+ ibcmp(curctype, "UTF8", 4) == 0);
#endif
if (lc_all)
- wantutf8 = (ibcmp(lc_all, "UTF-8", 5) == 0 ||
- ibcmp(lc_all, "UTF8", 4) == 0);
+ utf8locale = (ibcmp(lc_all, "UTF-8", 5) == 0 ||
+ ibcmp(lc_all, "UTF8", 4) == 0);
#endif /* USE_LOCALE */
}
- if (wantutf8)
- PL_wantutf8 = TRUE;
+ if (utf8locale)
+ PL_utf8locale = TRUE;
+ }
+ /* Set PL_wantutf8 to $ENV{PERL_UTF8_LOCALE} if using PerlIO.
+ This is an alternative to using the -C command line switch
+ (the -C if present will override this). */
+ {
+ char *p = PerlEnv_getenv("PERL_UTF8_LOCALE");
+ PL_wantutf8 = p ? (bool) atoi(p) : FALSE;
}
#endif
==== //depot/perl/mg.c#246 - /u/vieraat/vieraat/jhi/pp4/perl/mg.c ====
Index: perl/mg.c
--- perl/mg.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/mg.c Tue Jan 14 17:29:10 2003
@@ -662,7 +662,11 @@
? (PL_taint_warn || PL_unsafe ? -1 : 1)
: 0);
break;
- case '\027': /* ^W & $^WARNING_BITS & ^WIDE_SYSTEM_CALLS */
+ case '\025': /* $^UTF8_LOCALE */
+ if (strEQ(mg->mg_ptr, "\025TF8_LOCALE"))
+ sv_setiv(sv, (IV) (PL_wantutf8 && PL_utf8locale));
+ break;
+ case '\027': /* ^W & $^WARNING_BITS */
if (*(mg->mg_ptr+1) == '\0')
sv_setiv(sv, (IV)((PL_dowarn & G_WARN_ON) ? TRUE : FALSE));
else if (strEQ(mg->mg_ptr+1, "ARNING_BITS")) {
@@ -679,8 +683,6 @@
}
SvPOK_only(sv);
}
- else if (strEQ(mg->mg_ptr+1, "IDE_SYSTEM_CALLS"))
- sv_setiv(sv, (IV)PL_widesyscalls);
break;
case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9': case '&':
@@ -1925,7 +1927,13 @@
PL_basetime = (Time_t)(SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv));
#endif
break;
- case '\027': /* ^W & $^WARNING_BITS & ^WIDE_SYSTEM_CALLS */
+ case '\025': /* $^UTF8_LOCALE */
+ if (SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv))
+ PL_wantutf8 = PL_utf8locale;
+ else
+ PL_wantutf8 = FALSE;
+ break;
+ case '\027': /* ^W & $^WARNING_BITS */
if (*(mg->mg_ptr+1) == '\0') {
if ( ! (PL_dowarn & G_WARN_ALL_MASK)) {
i = SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv);
@@ -1967,8 +1975,6 @@
}
}
}
- else if (strEQ(mg->mg_ptr+1, "IDE_SYSTEM_CALLS"))
- PL_widesyscalls = (bool)SvTRUE(sv);
break;
case '.':
if (PL_localizing) {
==== //depot/perl/perl.c#461 - /u/vieraat/vieraat/jhi/pp4/perl/perl.c ====
Index: perl/perl.c
--- perl/perl.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/perl.c Tue Jan 14 17:29:10 2003
@@ -1355,10 +1355,11 @@
if (!PL_do_undump)
init_postdump_symbols(argc,argv,env);
- /* PL_wantutf8 is conditionally turned on by
+ /* PL_utf8locale is conditionally turned on by
* locale.c:Perl_init_i18nl10n() if the environment
- * look like the user wants to use UTF-8. */
- if (PL_wantutf8) { /* Requires init_predump_symbols(). */
+ * look like the user wants to use UTF-8.
+ * PL_wantutf8 is turned on by -C or by $ENV{PERL_UTF8_LOCALE}. */
+ if (PL_utf8locale && PL_wantutf8) { /* Requires init_predump_symbols(). */
IO* io;
PerlIO* fp;
SV* sv;
@@ -2156,7 +2157,7 @@
return s + numlen;
}
case 'C':
- PL_widesyscalls = TRUE;
+ PL_wantutf8 = TRUE; /* Can be set earlier by $ENV{PERL_UTF8_LOCALE}. */
s++;
return s;
case 'F':
@@ -3397,7 +3398,7 @@
for (; argc > 0; argc--,argv++) {
SV *sv = newSVpv(argv[0],0);
av_push(GvAVn(PL_argvgv),sv);
- if (PL_widesyscalls)
+ if (PL_wantutf8)
(void)sv_utf8_decode(sv);
}
}
==== //depot/perl/perlapi.h#78 - /u/vieraat/vieraat/jhi/pp4/perl/perlapi.h ====
Index: perl/perlapi.h
--- perl/perlapi.h.~1~ Tue Jan 14 17:29:10 2003
+++ perl/perlapi.h Tue Jan 14 17:29:10 2003
@@ -584,14 +584,14 @@
#define PL_utf8_upper (*Perl_Iutf8_upper_ptr(aTHX))
#undef PL_utf8_xdigit
#define PL_utf8_xdigit (*Perl_Iutf8_xdigit_ptr(aTHX))
+#undef PL_utf8locale
+#define PL_utf8locale (*Perl_Iutf8locale_ptr(aTHX))
#undef PL_uudmap
#define PL_uudmap (*Perl_Iuudmap_ptr(aTHX))
#undef PL_wantutf8
#define PL_wantutf8 (*Perl_Iwantutf8_ptr(aTHX))
#undef PL_warnhook
#define PL_warnhook (*Perl_Iwarnhook_ptr(aTHX))
-#undef PL_widesyscalls
-#define PL_widesyscalls (*Perl_Iwidesyscalls_ptr(aTHX))
#undef PL_xiv_arenaroot
#define PL_xiv_arenaroot (*Perl_Ixiv_arenaroot_ptr(aTHX))
#undef PL_xiv_root
==== //depot/perl/pod/perlrun.pod#67 -
/u/vieraat/vieraat/jhi/pp4/perl/pod/perlrun.pod ====
Index: perl/pod/perlrun.pod
--- perl/pod/perlrun.pod.~1~ Tue Jan 14 17:29:10 2003
+++ perl/pod/perlrun.pod Tue Jan 14 17:29:10 2003
@@ -266,11 +266,21 @@
=item B<-C>
-enables Perl to use the native wide character APIs on the target system.
-The magic variable C<${^WIDE_SYSTEM_CALLS}> reflects the state of
-this switch. See L<perlvar/"${^WIDE_SYSTEM_CALLS}">.
+enables Perl to use the Unicode APIs on the target system.
-This feature is currently only implemented on the Win32 platform.
+As of Perl 5.8.1, if C<-C> is used and the locale settings (the LC_ALL,
+LC_CTYPE, and LANG environment variables) indicate a UTF-8 locale,
+the STDIN is expected to be in UTF-8, the STDOUT and STDERR are
+expected to be in UTF-8, and C<:utf8> is the default file open layer.
+See L<perluniintro>, L<perlfunc/open>, and L<open> for more information.
+The magic variable C<${^UTF8_LOCALE}> reflects this state,
+see L<perlvar/"${^UTF8_LOCALE}">. (Another way of setting this
+variable is to set the environment variable PERL_UTF8_LOCALE.)
+
+(In Perls earlier than 5.8.1 the C<-C> switch was a Win32-only switch
+that enabled the use of Unicode-aware "wide system call" Win32 APIs.
+This feature was practically unused, however, and the command line
+switch was therefore "recycled".)
=item B<-c>
==== //depot/perl/pod/perlunicode.pod#113 -
/u/vieraat/vieraat/jhi/pp4/perl/pod/perlunicode.pod ====
Index: perl/pod/perlunicode.pod
--- perl/pod/perlunicode.pod.~1~ Tue Jan 14 17:29:10 2003
+++ perl/pod/perlunicode.pod Tue Jan 14 17:29:10 2003
@@ -67,13 +67,6 @@
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
-On Windows platforms, if the C<-C> command line switch is used or the
-${^WIDE_SYSTEM_CALLS} global flag is set to C<1>, all system calls
-will use the corresponding wide-character APIs. This feature is
-available only on Windows to conform to the API standard already
-established for that platform--and there are very few non-Windows
-platforms that have Unicode-aware APIs.
-
The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.
@@ -1050,10 +1043,14 @@
=item *
-If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
-contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
-the default encodings of your STDIN, STDOUT, and STDERR, and of
-B<any subsequent file open>, are considered to be UTF-8.
+If your locale environment variables (LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (matched case-insensitively)
+B<and> you enable using UTF-8 either by using the C<-C> command line
+switch or setting the PERL_UTF8_LOCALE environment variable to a true
+value, then the default encodings of your STDIN, STDOUT, and STDERR,
+and of B<any subsequent file open>, are considered to be UTF-8.
+See L<perluniintro>, L<perlfunc/open>, and L<open> for more
+information. The magic variable C<${^UTF8_LOCALE}> will also be set.
=item *
@@ -1410,6 +1407,6 @@
=head1 SEE ALSO
L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
-L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
+L<perlretut>, L<perlvar/"${^UTF8_LOCALE}">
=cut
==== //depot/perl/pod/perluniintro.pod#44 -
/u/vieraat/vieraat/jhi/pp4/perl/pod/perluniintro.pod ====
Index: perl/pod/perluniintro.pod
--- perl/pod/perluniintro.pod.~1~ Tue Jan 14 17:29:10 2003
+++ perl/pod/perluniintro.pod Tue Jan 14 17:29:10 2003
@@ -172,13 +172,15 @@
to this sample program ensures that the output is completely UTF-8,
and removes the program's warning.
-If your locale environment variables (C<LANGUAGE>, C<LC_ALL>,
-C<LC_CTYPE>, C<LANG>) contain the strings 'UTF-8' or 'UTF8',
-regardless of case, then the default encoding of your STDIN, STDOUT,
-and STDERR and of B<any subsequent file open>, is UTF-8. Note that
-this means that Perl expects other software to work, too: if Perl has
-been led to believe that STDIN should be UTF-8, but then STDIN coming
-in from another command is not UTF-8, Perl will complain about the
+If your locale environment variables (C<LC_ALL>, C<LC_CTYPE>, C<LANG>)
+contain the strings 'UTF-8' or 'UTF8' (matched case-insensitively)
+B<and> you enable using UTF-8 either by using the C<-C> command line
+switch or by setting the PERL_UTF8_LOCALE environment variable to
+a true value, then the default encoding of your STDIN, STDOUT, and
+STDERR, and of B<any subsequent file open>, is UTF-8. Note that this
+means that Perl expects other software to work, too: if Perl has been
+led to believe that STDIN should be UTF-8, but then STDIN coming in
+from another command is not UTF-8, Perl will complain about the
malformed UTF-8.
All features that combine Unicode and I/O also require using the new
==== //depot/perl/pod/perlvar.pod#111 -
/u/vieraat/vieraat/jhi/pp4/perl/pod/perlvar.pod ====
Index: perl/pod/perlvar.pod
--- perl/pod/perlvar.pod.~1~ Tue Jan 14 17:29:10 2003
+++ perl/pod/perlvar.pod Tue Jan 14 17:29:10 2003
@@ -1109,6 +1109,16 @@
B<-T>), 0 for off, -1 when only taint warnings are enabled (i.e. with
B<-t> or B<-TU>). This variable is read-only.
+=item ${^UTF8_LOCALE}
+
+Reflects whether the locale settings indicated the use of UTF-8 and that
+the use of UTF-8 was enabled either by the C<-C> command line switch or
+by setting the PERL_UTF8_LOCALE environment variable to a true value.
+This variable is read-only. If true, the STDIN is expected to be in
+UTF-8, the STDOUT and STDERR are in UTF-8, and C<:utf8> is the default
+file open layer. See L<perluniintro>, L<perlfunc/open>, and L<open>
+for more information.
+
=item $PERL_VERSION
=item $^V
@@ -1148,21 +1158,6 @@
The current set of warning checks enabled by the C<use warnings> pragma.
See the documentation of C<warnings> for more details.
-=item ${^WIDE_SYSTEM_CALLS}
-
-Global flag that enables system calls made by Perl to use wide character
-APIs native to the system, if available. This is currently only implemented
-on the Windows platform.
-
-This can also be enabled from the command line using the C<-C> switch.
-
-The initial value is typically C<0> for compatibility with Perl versions
-earlier than 5.6, but may be automatically set to C<1> by Perl if the system
-provides a user-settable default (e.g., C<$ENV{LC_CTYPE}>).
-
-The C<bytes> pragma always overrides the effect of this flag in the current
-lexical scope. See L<bytes>.
-
=item $EXECUTABLE_NAME
=item $^X
End of Patch.
--
Jarkko Hietaniemi <jh...@iki.fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen





