rjbs forgot what he was saying

not logged in (root) | by date | tagcloud | help | login

I rewrote uni

by rjbs, created 2014-10-21 10:49
last modified 2014-10-25 08:36
tagged with: @markup:md journal perl programming unicode

For years, I've used Audrey Tang's uni program for stupid things. It helps you find Unicode characters:

$ uni ☺
263A ☺ WHITE SMILING FACE

# Only on Perl 5.14+
$ uni wry
1F63C 😼 CAT FACE WITH WRY SMILE

I've never been super happy with searching. All of the args to the program are joined on space and searched for. That means that uni roman five won't find ROMAN NUMERAL FIVE. I also used to like using Carl Masak's little HTML+JavaScript unicode deconstructor. This thing would let you type in a string, and it would display each codepoint. I've long since lost it… and anyway, I didn't want to use a web browser. I thought that maybe Tom Christiansen's Unicode-Tussle tools would have the answer, but nothing quite did what I wanted.

After fidgeting unhappily for about ten minutes, I realized that I could've used those ten minutes to write my own solution. I'm sure it's awful in some way, but I'm very pleased with it, and maybe someone else will be, too.

It has four modes:

Single Character Mode

$ uni SINGLE-CHAR

This will print out the name and codepoint of the character.

$ uni ¿
¿ - U+000BF - INVERTED QUESTION MARK

Name Search Mode

$ uni SOME /SEARCH/ TERMS

This one will look for codepoints where each term appears as a (\b-bounded) word in the name. If the term is bounded by slashes, it's treated as a regular expression and is used to filter candidate codepoints by name.

$ uni roman five
Ⅴ - U+02164 - ROMAN NUMERAL FIVE
Ⅾ - U+0216E - ROMAN NUMERAL FIVE HUNDRED
ⅴ - U+02174 - SMALL ROMAN NUMERAL FIVE
ⅾ - U+0217E - SMALL ROMAN NUMERAL FIVE HUNDRED
ↁ - U+02181 - ROMAN NUMERAL FIVE THOUSAND

String Decomposition

$ uni -c SOME STRINGS

This prints out the codepoints in each string, with a blank line between each argument's codepoints.

$ uni -c Hey リコ
H - U+00048 - LATIN CAPITAL LETTER H
e - U+00065 - LATIN SMALL LETTER E
y - U+00079 - LATIN SMALL LETTER Y

リ- U+030EA - KATAKANA LETTER RI
コ- U+030B3 - KATAKANA LETTER KO

Lookup By Codepoint

$ uni -u NUMBERS IN HEX

This prints out the codepoint for each given hex value.

$ uni -u FF 1FF 10FF
ÿ - U+000FF - LATIN SMALL LETTER Y WITH DIAERESIS
ǿ - U+001FF - LATIN SMALL LETTER O WITH STROKE AND ACUTE
ჿ - U+010FF - GEORGIAN LETTER LABIAL SIGN

My uni program is now on GitHub (update: and now on the CPAN) or, for those who are curious, but not curious enough to click a link, it's right here:

#!perl
use 5.12.0;
use warnings;

use charnames ();
use Encode qw(decode);
use Unicode::GCString;

binmode STDOUT, ':encoding(utf-8)';

my $todo;
$todo = \&split_string if @ARGV && $ARGV[0] eq '-c';
$todo = \&codepoints   if @ARGV && $ARGV[0] eq '-u';

shift @ARGV if $todo;

die "only one swich allowed!\n" if grep /\A-/, @ARGV;

@ARGV = map {; decode('UTF-8', $_) } @ARGV;

$todo //= @ARGV == 1 && length $ARGV[0] == 1
        ? \&one_char
        : \&search_chars;

$todo->(@ARGV);

sub one_char {
  print_chars(@_);
}

sub split_string {
  my (@args) = @_;

  while (my $str = shift @args) {
    my @chars = split '', $str;
    print_chars(@chars);

    say '' if @args;
  }
}

sub print_chars {
  my (@chars) = @_;
  for my $c (@chars) {
    my $c2 = Unicode::GCString->new($c);
    my $l  = $c2->columns;

    # I'm not 100% sure why I need this in all cases.  It would make sense in
    # some, since for example COMBINING GRAVE beginning a line becomes its
    # own extended grapheme cluster (right?), but why does INVISIBLE TIMES at
    # the beginning of a line take up a column despite being printing width
    # zero?  The world may never know.  Until Tom tells me.
    # -- rjbs, 2014-10-04
    $l = 1 if $l == 0; # ???

    # Yeah, probably there's some insane %*0s$ invocation of printf to use
    # here, but... just no. -- rjbs, 2014-10-04
    my $p = $c . (' ' x (2 - $l));

    my $chr  = ord($c);
    my $name = charnames::viacode($chr);
    printf "%s- U+%05X - %s\n", $p, $chr, $name;
  }
}

sub codepoints {
  my (@points) = @_;

  my @chars = map {; chr hex s/\Au\+//r } @points;
  print_chars(@chars);
}

sub search_chars {
  my @terms = map {; s{\A/(.+)/\z}{$1} ? qr/$_/i : qr/\b$_\b/i } @_;

  my $corpus = require 'unicore/Name.pl';
  die "somebody beat us here" if $corpus eq '1';

  my @lines = split /\cJ/, $corpus;
  my @chars;
  LINE: for my $line (@lines) {
    my $i = index($line, "\t");
    next if rindex($line, " ", $i) >= 0; # no sequences

    $line =~ $_ || next LINE for @terms;

    push @chars, chr hex substr $line, 0, $i;
  }

  print_chars(@chars);
}