Meet udump: the dumper with a difference.
Sometimes you just need to see what characters are lurking inside a
Unicode-encoded text file. Your garden-variety dump utility (like
the venerable od in UNIX systems and the Windows standard hex dump -
oh, wait, there is no hex dump utility included with Windows, sorry)
only shows you the plain bytes, so you have to head over to unicode.org
to find out what they mean. But first you need to decode UTF-8 to get the
actual code points, or grok UTF-16 LE or BE, and so on. We won't have that,
no matter how fun it might be.
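For a sense of what that manual step involves, here is a minimal sketch that decodes a single UTF-8 sequence by hand. It is not part of udump; the function name decode_utf8_sequence is made up for this illustration, it assumes a present-day Python 3 (where indexing a bytes object yields integers), and it skips the stricter checks a real decoder would do, such as rejecting overlong encodings.

```python
def decode_utf8_sequence(data):
    """Decode the UTF-8 sequence at the start of data.

    Returns (code_point, number_of_bytes_used). Illustrative only:
    overlong forms and out-of-range values are not rejected.
    """
    first = data[0]
    if first < 0x80:                      # 0xxxxxxx: plain ASCII
        return first, 1
    elif first >> 5 == 0b110:             # 110xxxxx: 2-byte sequence
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:            # 1110xxxx: 3-byte sequence
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:           # 11110xxx: 4-byte sequence
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte: 0x%02X" % first)
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:             # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte: 0x%02X" % byte)
        cp = (cp << 6) | (byte & 0x3F)    # append the low six bits
    return cp, length


# The euro sign is stored as the three bytes E2 82 AC.
cp, length = decode_utf8_sequence(b"\xe2\x82\xac")
print("U+%06X, %d bytes" % (cp, length))
```

And after all that you would still have to look up what U+0020AC is called.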
The udump utility shows you a nice list of character names, together with their offsets in the file. Currently udump only handles UTF-8, so the offset is calculated based on the UTF-8 length of the character.
To get udump, get the source code. (Old versions: none at the moment)
Here is an example of the udump display:
$ python udump.py testfile2
Using Unicode 3.2.0 data
Read 15 characters
00000000: U+000030 DIGIT ZERO
00000001: U+000020 SPACE
00000002: U+0020AC EURO SIGN
00000005: U+00003A COLON
00000006: U+000020 SPACE
00000007: U+00006E LATIN SMALL LETTER N
00000008: U+00006F LATIN SMALL LETTER O
00000009: U+000074 LATIN SMALL LETTER T
0000000A: U+000020 SPACE
0000000B: U+000062 LATIN SMALL LETTER B
0000000C: U+000061 LATIN SMALL LETTER A
0000000D: U+000064 LATIN SMALL LETTER D
0000000E: U+000021 EXCLAMATION MARK
0000000F: U+000020 SPACE
00000010: U+00000A (unnamed character)
Well, that also serves as a usage example. As you can see, udump is a Python script. You will need Python 2.3 or later to run it.
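If you are curious how a listing like that can be produced, the following is a rough sketch of the idea, not the actual udump source. It assumes a present-day Python 3 rather than the 2.3 that udump targets, and the function name dump_lines is invented for the illustration. It decodes the bytes as UTF-8, looks up each character's name with the standard unicodedata module, and advances the offset by the number of bytes the character occupies in UTF-8.

```python
import unicodedata


def dump_lines(data):
    """Return one 'offset: code point name' line per character in data.

    data is a bytes object holding UTF-8 text. The offset of each
    character is its byte position in the UTF-8 encoded input.
    """
    text = data.decode("utf-8")
    lines = []
    offset = 0
    for ch in text:
        # Control characters such as LINE FEED have no name in the
        # Unicode database, so fall back to a placeholder.
        name = unicodedata.name(ch, "(unnamed character)")
        lines.append("%08X: U+%06X %s" % (offset, ord(ch), name))
        # Advance by the UTF-8 length of this character.
        offset += len(ch.encode("utf-8"))
    return lines


# Sample input guessed from the display above (hypothetical file contents).
sample = "0 \u20ac: not bad! \n".encode("utf-8")
for line in dump_lines(sample):
    print(line)
```

Note how the offset jumps from 2 to 5 after the euro sign, because that one character takes three bytes in UTF-8.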
You will be interested to know that udump was developed using jEdit 4.2 and Python 2.3.5 running in Mac OS X 10.4.4 on a Mac mini 1.42 GHz. Why? Because Mac OS X is so splendid, jEdit is such a workhorse, and Python just plain rocks. There, I've said it. But I'm still a Java guy at work.
Of course udump isn't perfect. It only handles UTF-8, and does not know about surrogate characters, because Python doesn't (yet). But that will do for now. Improvement ideas include:
More on the subject matter:
unicodedata module
The idea had been kicking around for a couple of years. I did an initial sketch with Java, complete with different encodings (Java SE is very good at that), but I felt it was too complicated. At some point I had a similar Python sketch that even attempted to handle surrogates, but I seem to have lost the source code, so I started from scratch. Now that I know Python better it turned out to be almost trivial.
While looking for something else entirely I discovered John Walker's
unum utility. It is a handy Unicode and HTML entity lookup tool,
highly recommended.
As usual, udump is provided with NO WARRANTY for any purpose whatsoever. Share and enjoy. Unicode is a trademark of Unicode, Inc.
Feedback & suggestions to <cone (at) iki dot fi>.
Last updated: 2006-03-25