udump

Meet udump: the dumper with a difference.

Sometimes you just need to see what characters are lurking inside a Unicode-encoded text file. Your garden-variety dump utility (like the venerable od on UNIX systems and the Windows standard hex dump - oh, wait, there is no hex dump utility included with Windows, sorry) only shows you the plain bytes, so you have to head over to unicode.org to find out what they mean. But first you need to decode UTF-8 to get the actual code points, or grok UTF-16 LE or BE, and so on. We won't have that, no matter how fun it might be.

The udump utility shows you a nice list of character names, together with their offsets in the file. Currently udump only handles UTF-8, so each offset is a byte offset: it advances by the number of bytes the previous character takes up in UTF-8.
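To make the offset rule concrete, here is a minimal sketch of the idea. It is not the udump source: it assumes a modern Python 3 interpreter rather than the Python 2.3 that udump targets, and the function names dump_utf8 and format_line are made up for illustration. The output format simply imitates the udump display.

```python
import unicodedata


def dump_utf8(data):
    """Yield (byte_offset, code_point, name) for each character in UTF-8 bytes.

    Hypothetical helper, not part of udump itself.
    """
    offset = 0
    for ch in data.decode("utf-8"):
        # Control characters such as LINE FEED have no name in the
        # Unicode database, so fall back to a placeholder.
        name = unicodedata.name(ch, "(unnamed character)")
        yield offset, ord(ch), name
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(ch.encode("utf-8"))


def format_line(offset, code_point, name):
    """Format one entry in a udump-like layout (assumed, copied by eye)."""
    return "%08X: U+%06X %s" % (offset, code_point, name)


if __name__ == "__main__":
    sample = "0 \u20ac: not bad! \n".encode("utf-8")
    for entry in dump_utf8(sample):
        print(format_line(*entry))
```

The EURO SIGN takes three bytes in UTF-8, which is why the character after it lands at offset 5 instead of 3.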

Get it

To get udump, get the source code. (Old versions: none at the moment)

Use it

Here is an example of the udump display:

$ python udump.py testfile2
Using Unicode 3.2.0 data
Read 15 characters
00000000: U+000030 DIGIT ZERO
00000001: U+000020 SPACE
00000002: U+0020AC EURO SIGN
00000005: U+00003A COLON
00000006: U+000020 SPACE
00000007: U+00006E LATIN SMALL LETTER N
00000008: U+00006F LATIN SMALL LETTER O
00000009: U+000074 LATIN SMALL LETTER T
0000000A: U+000020 SPACE
0000000B: U+000062 LATIN SMALL LETTER B
0000000C: U+000061 LATIN SMALL LETTER A
0000000D: U+000064 LATIN SMALL LETTER D
0000000E: U+000021 EXCLAMATION MARK
0000000F: U+000020 SPACE
00000010: U+00000A (unnamed character)

Well, that also doubles as a usage example. As you can see, udump is a Python script. You will need Python 2.3 or later to run it.

Do it

You will be interested to know that udump was developed using jEdit 4.2 and Python 2.3.5 running on Mac OS X 10.4.4 on a 1.42 GHz Mac mini. Why? Because Mac OS X is so splendid, jEdit is such a workhorse, and Python just plain rocks. There, I've said it. But I'm still a Java guy at work.

Improve it

Of course udump isn't perfect. It only handles UTF-8, and does not know about surrogate characters, because Python doesn't (yet). But that will do for now. Improvement ideas include:

Learn it

More on the subject matter:

History of it

The idea had been kicking around for a couple of years. I did an initial sketch in Java, complete with different encodings (Java SE is very good at that), but I felt it was too complicated. At some point I had a similar Python sketch that even attempted to handle surrogates, but I seem to have lost the source code, so I started from scratch. Now that I know Python better, it turned out to be almost trivial.

While looking for something else entirely I discovered John Walker's unum utility. It is a handy Unicode and HTML entity lookup tool, highly recommended.


As usual, udump is provided with NO WARRANTY for any purpose whatsoever. Share and enjoy. Unicode is a trademark of Unicode, Inc.

Feedback & suggestions to <cone (at) iki dot fi>.

Last updated: 2006-03-25