2 messages in org.perl.perl5-porters: Extended Unicode materials now available
Subject: Extended Unicode materials now available
From: Tom Christiansen
Date: Jul 8, 2011 8:41:30 am

(ALSO: Please feel free to pass this message along to whomever you think it might help.)

Anyone interested is welcome to fetch the materials, including both slides and scripts, for two of my OSCON talks on Perl and Unicode from here:

My two talks are:

ⅰ. Perl Unicode Essentials

Nᴏᴛᴀ Bᴇɴᴇ: I’d proposed a 6‐hour talk, but they gave me just 3 hours.

ⅱ. Unicode in Perl Regexes

Nᴏᴛᴀ Bᴇɴᴇ: I’d proposed a 6‐hour talk, but they gave me a mere 40 minutes.

Both come in three equivalent forms, with the first canonical and the other two derived from the first:

    ⒜ source doc in regular old Perl pod.
    ⒝ HTML slideshow derived from ⒜.
    ⒞ PDF to print multislides per page, also derived from ⒜.

Also included is a directory loaded with scripts. All are about Unicode. Many I use every day, as some of you have seen. Most have Unicode in them, and some even use verboten Unicode identifiers:

    hypertest:  my @ὑπέρμεγας = (
    leo:        my $ʇndʇno = uʍopəpᴉƨdn($input);
    nunez:      my $SI_IMPORTAN_MARCAS_DIACRÍTICAS = 0;
    nunez:      next unless @resultados || $INCLUÍR_NINGUNOS;
    nunez:      $cmáx => !$déjà_imprimée++ && encomillar($aldea),
    uniquote:   sub commaʼd_list {

Please note that the last one is not “comma'd_list”!!
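To illustrate the point about identifiers, here is a minimal sketch (not taken from the scripts themselves). It assumes the source file is saved as UTF-8 and relies on `use utf8`. The character in the sub name is U+02BC MODIFIER LETTER APOSTROPHE, which counts as a letter, whereas a plain ASCII apostrophe in that position would be read as the old-style package separator. The helper `apostrophe_name` is hypothetical and exists only for the demonstration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;            # the source code itself contains UTF-8 identifiers
use charnames ();    # for charnames::viacode()

# U+02BC is a modifier letter, so it is legal inside an identifier
# once "use utf8" is in effect.
sub commaʼd_list {
    my @items = @_;
    return join(", ", @items);
}

# A Greek-script array name, in the spirit of the hypertest example.
my @ὑπέρμεγας = (1, 2, 3);

# Hypothetical helper: report the official name of the character
# that makes the sub name above legal.
sub apostrophe_name {
    return charnames::viacode(ord "ʼ");
}

print commaʼd_list(@ὑπέρμεγας), "\n";
print apostrophe_name(), "\n";
```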

A whatis(1)‐style description of the contents of the scripts directory follows. It’s divided into 9 sections, with the more important sections toward the top.

Comments, corrections, kvetches, complaints, catcalls, cris‐de‐cœur, cool‐beans, krikeys, carambas, kyries, and copriloquies all welcome. :)


Contents of tchrist-unicode-scripts directory, grouped:

1. GENERAL:
    unichars – show which code points match arbitrary criteria
    uniprops – show which props a code point has (by number or name, etc)
    uninames – intelligrep the now‐excised NameList.txt (included)
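The idea behind a tool like unichars can be sketched in a few lines of core Perl. This is only an illustration under stated assumptions, not the real program: the function names `code_points_matching` and `describe` are made up for the example, and the criterion is passed in as a code reference.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ();    # for charnames::viacode()

# Return the code points in [$lo, $hi] whose character satisfies $test.
sub code_points_matching {
    my ($test, $lo, $hi) = @_;
    return grep { $test->(chr $_) } $lo .. $hi;
}

# Format one code point as "U+XXXX NAME".
sub describe {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    return sprintf("U+%04X %s", $cp, defined $name ? $name : "<unnamed>");
}

# Example criterion: decimal digits in the first 256 code points.
my @digits = code_points_matching(sub { $_[0] =~ /\p{Nd}/ }, 0x00, 0xFF);
print describe($_), "\n" for @digits;
```

Any property test that fits in a regex (`\p{Lu}`, `\p{Script=Greek}`, and so on) can be substituted for the digit test.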

2. REWRITES OF CRITICAL UNIX PROGRAMS:
    uniquote – replacement for od(1) or -v option to cat(1), but for Unicode
    tcgrep   – very ancient grep(1) replacement, needs rewrite but now supports
               named character
    unilook  – look(1) rewrite but with grep and agrep support; requires included
               words.utf8 file
    ucsort   – sort(1) rewrite using the UCA, includes Unicode locales, and
               intelligent --pre stuff
    unifmt   – fmt(1) rewrite
    rename   – ancient rewrite of Larry’s old rename(1) rewrite; might help
               Unicode filesyssues
    uniwc    – wc(1) rewrite for Unicode, includes \R support, graphemes, etc;
               needs refactoring
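For readers who have not used the UCA from Perl, the following sketch shows the core of what a UCA-based sort does differently from a plain code point sort. It uses the core Unicode::Collate module with default settings; it is not the ucsort script, and the function names are invented for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Collate;

# Sort strings with the default Unicode Collation Algorithm tailoring.
sub uca_sort {
    my @lines = @_;
    my $collator = Unicode::Collate->new;
    return $collator->sort(@lines);
}

# Sort strings by raw code point, which is what the built-in sort does.
sub codepoint_sort {
    my @lines = @_;
    return sort @lines;
}

binmode(STDOUT, ":encoding(UTF-8)");
my @words = ("rope", "r\x{E9}sum\x{E9}", "resume");
print "UCA:        @{[ uca_sort(@words) ]}\n";
print "code point: @{[ codepoint_sort(@words) ]}\n";
```

The UCA ordering keeps the accented and unaccented spellings next to each other, while the code point ordering pushes the accented word past "rope".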

3. PROGRAMS FOR NORMALIZATION FILTERS, CHECKER:
    nfd, nfc, nfkd, nfkc – Unicode normalization filters
    nfcheck – report which of NF{,K}[DC] apply to any given file
        % nfcheck leo hantest nunez tc macroman
        leo: NFC NFD
        hantest: NFC
        nunez: NFC NFKC
        tc: NFC NFKC NFD NFKD
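A check in the style of nfcheck can be sketched with the core Unicode::Normalize module: a text is in a given form if normalizing it to that form leaves it unchanged. This is an assumption about the general approach, not the code of the real script, and `normalization_forms` is a name chosen for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD);

# Return the names of the normalization forms that $text already satisfies.
sub normalization_forms {
    my ($text) = @_;
    my %normalize = (
        NFC  => \&NFC,
        NFKC => \&NFKC,
        NFD  => \&NFD,
        NFKD => \&NFKD,
    );
    return grep { $normalize{$_}->($text) eq $text } qw(NFC NFKC NFD NFKD);
}

# A precomposed e-acute is in the composed forms only.
print join(" ", normalization_forms("caf\x{E9}")), "\n";
# A ligature has only a compatibility decomposition.
print join(" ", normalization_forms("\x{FB01}")), "\n";
```

Pure ASCII text satisfies all four forms, which is why a script with no non-ASCII characters would report every form.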

4. (RE)CASING FILTER PROGRAMS:
    lc – filter to do the Unicode toLower casemapping
        % echo "Filter to Convert a Title's Words to the Right Case" | lc
        filter to convert a title's words to the right case
    tc – filter to do the Unicode toTitle casemapping (intelligently)
        % echo "filter to convert a title's words to the right case" | tc
        Filter To Convert A Title's Words To The Right Case
    titulate – converts string args to English **HEADLINE** case (NB: headline != titlecase)
        % titulate "filter to convert a title's words to the right case"
        Filter to Convert a Title's Words to the Right Case
    uc – filter to do the Unicode toUpper casemapping
        % echo "filter to convert a title's words to the right case" | uc
        FILTER TO CONVERT A TITLE'S WORDS TO THE RIGHT CASE
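The per-word titlecasing shown for tc can be approximated with Perl's built-in case escapes. The sketch below is an illustration only and makes one assumption of its own: a "word" is a run of word characters that may contain internal apostrophes, so that "title's" is not turned into "Title'S". The function name `to_title` is hypothetical, and the headline rules used by titulate are not attempted here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Titlecase each word: lowercase the whole word (\L), then apply the
# titlecase mapping to its first character (\u).
sub to_title {
    my ($text) = @_;
    # A word may contain an ASCII or typographic apostrophe internally.
    $text =~ s/(\w+(?:['\x{2019}]\w+)*)/\u\L$1/g;
    return $text;
}

print to_title("filter to convert a title's words to the right case"), "\n";
```

Because `\u` uses the titlecase mapping rather than the uppercase one, a digraph such as U+01C6 becomes U+01C5 (capital D with small z) and not U+01C4.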

5. FONT GAME PROGRAMS:
    leo – uʍopəpᴉsdn sƃuᴉɥʇ əʇᴉɹʍ oʇ ɹəʇlᴉɟ
    unifont – filter for showing all Unicode “alternate font” letters
        % echo hic sunt data unicodica | unifont
        Double‐Struck: 𝕙𝕚𝕔 𝕤𝕦𝕟𝕥 𝕕𝕒𝕥𝕒 𝕦𝕟𝕚𝕔𝕠𝕕𝕚𝕔𝕒
        Monospace: 𝚑𝚒𝚌 𝚜𝚞𝚗𝚝 𝚍𝚊𝚝𝚊 𝚞𝚗𝚒𝚌𝚘𝚍𝚒𝚌𝚊
        Sans‐Serif: 𝗁𝗂𝖼 𝗌𝗎𝗇𝗍 𝖽𝖺𝗍𝖺 𝗎𝗇𝗂𝖼𝗈𝖽𝗂𝖼𝖺
        Sans‐Serif Italic: 𝘩𝘪𝘤 𝘴𝘶𝘯𝘵 𝘥𝘢𝘵𝘢 𝘶𝘯𝘪𝘤𝘰𝘥𝘪𝘤𝘢
        Sans‐Serif Bold: 𝗵𝗶𝗰 𝘀𝘂𝗻𝘁 𝗱𝗮𝘁𝗮 𝘂𝗻𝗶𝗰𝗼𝗱𝗶𝗰𝗮
        Sans‐Serif Bold Italic: 𝙝𝙞𝙘 𝙨𝙪𝙣𝙩 𝙙𝙖𝙩𝙖 𝙪𝙣𝙞𝙘𝙤𝙙𝙞𝙘𝙖
        Script: 𝒽𝒾𝒸 𝓈𝓊𝓃𝓉 𝒹𝒶𝓉𝒶 𝓊𝓃𝒾𝒸ℴ𝒹𝒾𝒸𝒶
        Italic: h𝑖𝑐 𝑠𝑢𝑛𝑡 𝑑𝑎𝑡𝑎 𝑢𝑛𝑖𝑐𝑜𝑑𝑖𝑐𝑎
        Bold: 𝐡𝐢𝐜 𝐬𝐮𝐧𝐭 𝐝𝐚𝐭𝐚 𝐮𝐧𝐢𝐜𝐨𝐝𝐢𝐜𝐚
        Bold Italic: 𝒉𝒊𝒄 𝒔𝒖𝒏𝒕 𝒅𝒂𝒕𝒂 𝒖𝒏𝒊𝒄𝒐𝒅𝒊𝒄𝒂
        Fraktur: 𝔥𝔦𝔠 𝔰𝔲𝔫𝔱 𝔡𝔞𝔱𝔞 𝔲𝔫𝔦𝔠𝔬𝔡𝔦𝔠𝔞
        Bold Fraktur: 𝖍𝖎𝖈 𝖘𝖚𝖓𝖙 𝖉𝖆𝖙𝖆 𝖚𝖓𝖎𝖈𝖔𝖉𝖎𝖈𝖆
    unicaps – Fɪʟᴛᴇʀ ᴛᴏ ᴄᴏɴᴠᴇʀᴛ ᴛᴏ sᴍᴀʟʟ ᴄᴀᴘs
    unisubs, unisupers – filter to show subscripted₁₉₈₇ and ˢᵘᵖᵉʳˢᶜʳⁱᵖᵗᵉᵈ versions
    unititle – prototype to over/underline things (real version in progress)
    uniwide, uninarrow – reversible filters for converting to FULLWIDTH equivs
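The "alternate font" letters are ordinary code points in the Mathematical Alphanumeric Symbols block, so a filter of this kind reduces to offset arithmetic. The sketch below covers only the Bold and Monospace alphabets, which are contiguous; it is not the unifont script, and the `restyle` function and its tables are assumptions made for illustration. Several other alphabets have gaps (for example, the italic small h lives at U+210E rather than in the block), so they would need an exception table.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Code points of "a" and "A" in two contiguous mathematical alphabets.
my %SMALL_A   = (Bold => 0x1D41A, Monospace => 0x1D68A);
my %CAPITAL_A = (Bold => 0x1D400, Monospace => 0x1D670);

# Map ASCII letters in $text to the given mathematical style.
sub restyle {
    my ($style, $text) = @_;
    die "unknown style: $style\n" unless exists $SMALL_A{$style};
    $text =~ s/([a-z])/chr($SMALL_A{$style}   + ord($1) - ord("a"))/ge;
    $text =~ s/([A-Z])/chr($CAPITAL_A{$style} + ord($1) - ord("A"))/ge;
    return $text;
}

binmode(STDOUT, ":encoding(UTF-8)");
print "$_: ", restyle($_, "hic sunt data unicodica"), "\n" for sort keys %SMALL_A;
```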

6. TEST AND DEMO PROGRAMS:
    macroman   – show mapping between MacRoman and Unicode
    byte2uni   – early prototype of general‐purpose version of the macroman
                 DEMO: byte2uni -a -ecp1252
    es-sort    – how to do fancy UCA sorts, using Spanish city names
    hantest    – demo of Unihan stuff and Unicode::{LineBreak, GCString}
    havshpx    – vs lbh unir gb nfx, lbh qb abg jnag gb xabj
    hypertest  – demo of support for trans‐Unicode code points
    nunez      – demo accent‐insensitive searches; very well commented
    vowel-sigs – show how to create your own properties; also, regex
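One common way to do accent-insensitive matching in Perl is to compare strings at the primary strength of the UCA, which ignores both diacritics and case. Whether the nunez script works this way is not stated above, so treat the following as a generic sketch; `grep_ignoring_accents` is a hypothetical name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Collate;

# A collator restricted to primary differences: base letters only,
# so accents and case are ignored when comparing.
my $PRIMARY = Unicode::Collate->new(level => 1);

# Return the candidates that equal $wanted at primary strength.
sub grep_ignoring_accents {
    my ($wanted, @candidates) = @_;
    return grep { $PRIMARY->eq($_, $wanted) } @candidates;
}

binmode(STDOUT, ":encoding(UTF-8)");
my @names = ("N\x{FA}\x{F1}ez", "Nu\x{F1}ez", "Nunes");
print "$_\n" for grep_ignoring_accents("nunez", @names);
```

The same collator also treats precomposed and decomposed spellings alike, so no separate normalization step is needed for the comparison.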

7. MODULES:
    – C<no Underscore> forbids unlocalized $_ access
    – tries to sort text items with numbers, including Roman, intelligently,
      includes support for Unicode Romans, and for Romans written in Latin
      script, but requires module for the latter. Falls back to the UCA.
    – EGAD! I talked them into making most of this functionality part of JDK7.

8. LIBRARIES: unicore/{all,html,uwords} – a forgotten charnames facility

9. FILES: words.utf8 – sorted dictionary list of UTF-8 words for unilook

WARNING: I *always* have my PERL_UNICODE envariable set to "S" (and only turn that off on a rare, one‐shot basis), so many of these malfunction otherwise. Some may sometimes also need "A", and tcgrep may sometimes also need "D" if not reading from a pipe.
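For anyone who would rather not set PERL_UNICODE globally, the effect of the "S" and "A" flags can be approximated inside a script. Per perlrun, "S" covers STDIN, STDOUT and STDERR, "A" covers @ARGV, and "D" makes UTF-8 the default layer for newly opened streams. The sketch below is an approximation under those definitions, not a patch for the scripts; the function names are invented, and it uses the strict `:encoding(UTF-8)` layer where the flag itself applies the laxer `:utf8`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Roughly what the "S" flag does: treat the three standard streams as UTF-8.
sub utf8_std_streams {
    binmode($_, ":encoding(UTF-8)") for \*STDIN, \*STDOUT, \*STDERR;
    return;
}

# Roughly what the "A" flag does: decode command-line arguments as UTF-8.
sub decode_args {
    return map { decode("UTF-8", $_) } @_;
}

utf8_std_streams();
my @args = decode_args(@ARGV);
print length($_), " characters: $_\n" for @args;
```

The "D" flag has a lexically scoped counterpart in `use open qw(:utf8);`, which is not shown here.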


On Licences

I basically want to get all this stuff out there so people understand all these things better. It’s really important. So please consider all the tchrist_unicode_scripts/ files to carry, even if not stated:


Copyright 2011 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

The slides licence I haven’t thought about. I think online redistribution is pretty much fine so long as you don’t pretend you wrote them instead of me. :)

*However*, be warned that the imminent 4ᵗʰ edition of the Camel contains some of these specific examples and wordings, so you probably should please ask before inserting them verbatim and uncredited into your own books.

In other words: Just be cool, ok?