
How did you do it?

Mar 29, 2014 at 2:52 AM
Edited Mar 29, 2014 at 2:54 AM
Hi Gavin,

I'm amazed at your decomposition data. However, I can't even begin to think how you have gone about it. I'm a programmer too, but my field of expertise (operating systems and high-frequency trading) is a long, long way from this area. I've changed jobs now, though. I'm currently in Taiwan learning Mandarin to follow my interest in Chinese Buddhism.

So, I'm a complete beginner when it comes to Unicode and the like.

I don't understand how you go from a Unicode codepoint, which doesn't have a graphical representation, to a graphical decomposition. There has to be some intermediate graphical information that you analyse somehow. What is that graphical information? The only candidate that I know of is a font but that has obvious problems if you're trying to extract composition data.

I hope you don't mind me asking about your methods.

Anders.
Mar 29, 2014 at 3:37 PM
I had to visually inspect every glyph to decide how to decompose it. Of course, I had to write lots of short scripts to analyse the data and ensure its continuing integrity. There are about 10,000 intermediate codes (represented by numbers) that are not themselves in Unicode, but I suspect there are lots of duplicates and cyclic repeats which I can eliminate with further analysis.
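
The duplicate check mentioned above can be illustrated with a minimal sketch. This is not the author's script, and the data layout is an assumption: a hypothetical dict mapping each code to its list of components, with made-up numeric strings standing in for intermediate codes. Two codes count as duplicates here when they list the same components in the same order.

```python
from collections import defaultdict

def find_duplicates(decomp):
    """Group codes whose component lists are identical (same parts, same order)."""
    groups = defaultdict(list)
    for code, parts in decomp.items():
        groups[tuple(parts)].append(code)
    # Keep only the groups that contain more than one code.
    return [sorted(codes) for codes in groups.values() if len(codes) > 1]

# Hypothetical toy data, not taken from the real decomposition file.
decomp = {
    "10001": ["木", "木"],
    "10002": ["木", "木"],  # same components as 10001
    "10003": ["日", "月"],
}

print(find_duplicates(decomp))  # [['10001', '10002']]
```

A real check would probably also compare the type of composition (side by side, stacked, and so on), which this sketch leaves out.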
Mar 30, 2014 at 4:07 AM
Wow. That's incredible.

Thank you.

May 27, 2014 at 2:39 AM
This is great, thanks for the work. Do you plan on publishing the scripts you wrote to generate the data? I'm thinking of writing some analyses on it in Haskell and thought I'd see what code you'd written to process it.
May 31, 2014 at 3:24 AM
I don't have those scripts anymore. Most of them were very short and quick to write. Usually a script generated a file which I'd quickly edit, often just by deleting lines, and that file was then read by another script which updated the main file. It was much less work to write scripts from scratch than to find pre-written programs to do what I wanted, which was the original purpose of scripting languages like Perl, Python, and Ruby anyway.

I've just posted, however, a script in Clojure which reads in the data, checks its integrity in various ways, then outputs some reports. You can see it at https://github.com/gavingroovygrover/cjkdecomp/blob/master/decomp.clj
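
For readers who want a feel for one such integrity check without reading the Clojure, here is a minimal Python sketch of detecting a cyclic repeat, where a code ends up containing itself through a chain of components. It is not a port of decomp.clj. The dict layout and the function name `find_cycle` are assumptions made for illustration, and any component without an entry of its own is treated as a leaf.

```python
def find_cycle(decomp):
    """Return one cyclic chain of codes as a list, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {code: WHITE for code in decomp}

    def visit(code, path):
        state[code] = GREY
        path.append(code)
        for part in decomp[code]:
            if part not in decomp:  # leaf: no decomposition of its own
                continue
            if state[part] == GREY:  # reached a code already on this path
                return path[path.index(part):] + [part]
            if state[part] == WHITE:
                found = visit(part, path)
                if found:
                    return found
        path.pop()
        state[code] = BLACK
        return None

    for code in decomp:
        if state[code] == WHITE:
            found = visit(code, [])
            if found:
                return found
    return None

# Hypothetical toy data: 10001 -> 10002 -> 10003 -> 10001 is a cycle.
cyclic = {"10001": ["10002", "木"], "10002": ["10003"], "10003": ["10001"]}
print(find_cycle(cyclic))  # ['10001', '10002', '10003', '10001']
```

The recursion is fine for shallow decomposition chains. Very deep chains would need an explicit stack instead.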