Some questions about the data...

Aug 27, 2012 at 7:09 AM

I posted this a few days ago as a comment on the page describing the data format. Maybe this is a better place for it?


Hello Gavin,

First, I have to say I am very impressed with all this data; I find it hard to imagine how you put it together and how many hours you must have spent on it.

I have a few questions.
- This latest version seems to have at least one new code ("built"). Is that any different from "lock"? Are there any other notable changes from the previous version?
- It seems that you have replaced a lot of the characters defined as "bd" or "ba" with something else, especially just "w"... I thought the previous way carried somewhat more information about how the character was constructed. (?)
- It seems to me that the definitions of "s" and "w" are actually very similar to each other...?

I am not really an expert in any of this, so this is more out of curiosity than anything else.

If you are interested, I am trying to use this to find characters that look (very) similar to each other. I'm having only mild success so far... but I still have quite a few things to try.

Thanks!

Coordinator
Oct 8, 2012 at 9:34 AM

First, I have to say I am very impressed with all this data; I find it hard to imagine how you put it together and how many hours you must have spent on it.

djense: Sorry, but I only just noticed your comment and this discussion. The email notifications for this new website didn't go out.

Most of the time I spent generating this data was while I was watching movies. Either activity would have been boring on its own, but doing them together let me spend long stretches of time on it. It involved frequent text munging in Groovy or Ruby.

I have a few questions.
- This latest version seems to have at least one new code ("built"). Is that any different from "lock"? Are there any other notable changes from the previous version?

I put in the "built" code to handle the ㇂ stem. The best way to analyze characters like 戈, 戋, 曳, 𢎯, etc. seemed to be to hang stuff off the ㇂ stem. "lock" was just a temporary code for characters I wanted to come back to.

- It seems that you have replaced a lot of the characters defined as "bd" or "ba" with something else, especially just "w"... I thought the previous way carried somewhat more information about how the character was constructed. (?)
- It seems to me that the definitions of "s" and "w" are actually very similar to each other...?

I don't remember merging any "bd" and "ba" characters, but I agree with you, so I'll investigate. The only difference between "s" and "w" is which constituent is listed first, so perhaps I'll merge them.

I am not really an expert in any of this, so this is more out of curiosity than anything else.

If you are interested, I am trying to use this to find characters that look (very) similar to each other. I'm having only mild success so far... but I still have quite a few things to try.

Thanks!

There weren't really too many changes overall, besides fixing up the extension C and D decomps. I really just wanted another release to put the data on a new website and separate this project from anything related to "groovy". I do intend to come back to these decomps very soon and will look at cleaning them all up. I notice the font I recommended has more than 8,000 extra characters encoded via variation selectors, so maybe I'll encode those too.

As for finding similar-looking characters, I've had to do that as part of the data-entry process, but the (Groovy) code for it was all one-off quickies that would be harder to find than to recode. Good luck.
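
Such a quickie needs very little code. Here is a minimal Python sketch; it assumes the data sits in a file named cjk-decomp.txt (a hypothetical name) with one entry per line of the form item:code(component,component), e.g. 𧲇:a(23969,豕), and scores similarity as the overlap between two characters' transitive component sets:

    import re

    # One entry per line, e.g.  𧲇:a(23969,豕)  -- the file name and exact
    # line format are assumptions; adjust the regex if the real file differs.
    ENTRY = re.compile(r'^(.+?):([a-z0-9/]+)\((.*)\)\s*$')

    def load(path='cjk-decomp.txt'):
        """Parse the data into {item: (code, [components])}."""
        decomp = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                m = ENTRY.match(line)
                if m:
                    item, code, args = m.groups()
                    decomp[item] = (code, args.split(',') if args else [])
        return decomp

    def components(item, decomp, acc=None):
        """Collect an item's transitive component set (cycle-safe)."""
        if acc is None:
            acc = set()
        for part in decomp.get(item, ('', []))[1]:
            if part not in acc:
                acc.add(part)
                components(part, decomp, acc)
        return acc

    def similarity(a, b, decomp):
        """Jaccard overlap of component sets: 1.0 identical, 0.0 disjoint."""
        ca, cb = components(a, decomp), components(b, decomp)
        return len(ca & cb) / len(ca | cb) if (ca or cb) else 0.0

Scoring every pair of 75,000+ characters this way is quadratic, so in practice you would index characters by component first; the point is only how little code such a one-off quickie takes.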

Mar 1, 2014 at 7:32 AM
Hello,
I have had a look at your decomposition file and would like to make some remarks about your very impressive work.
First of all, you have replaced the 80,000+ CJK characters with a private 80,000+ coding system with nested definitions. If I am looking for a character of which I only know a few components, diving through your multiple layers of code numbers is very time-consuming and not user-friendly.

For example, you have decomposed 𧲇 as 23969 + 豕 (with 23969 being 33361 + 殳, 33361 being 33840 + 口, and finally 33840 = 士 + ⺆): why not simply 嗀 + 豕, with 嗀 as 士冖口殳殼𡉉吺, as Wenlin does? Besides, the ⺆ in your decomposition is not as accurate as the 冖 in Wenlin's components.

Another remark: there already exists a CDL for Chinese characters (used by Wenlin), and U+2FF0-2FFF (the 12 Ideographic Description Characters) are also already used to describe Chinese characters (used by BabelMap). Why reinvent the wheel with your functions?

Excuse me for being so critical.

Thank you for your approach to a not-so-easy task.
Coordinator
Mar 3, 2014 at 9:28 AM
Hi cpngnt
First of all, you have replaced the 80,000+ CJK characters with a private 80,000+ coding system with nested definitions. If I am looking for a character of which I only know a few components, diving through your multiple layers of code numbers is very time-consuming and not user-friendly.
You really need to write some short scripts in a language such as Python or Ruby to query the data in a more time-efficient and user-friendly way. I used Groovy and later Clojure a lot for this when I created the data, but I didn't provide any scripts online because it's quite easy for anyone to whip up a recursive function in a scripting language to better query or view the data. Even a visual tool like Access can do it easily.
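
As an illustration, such a recursive function can be a few lines of Python. This sketch reuses the hypothetical load() helper from the earlier reply (the same caveats about file name and line format apply):

    def expand(item, decomp, depth=0, seen=frozenset()):
        """Print an item and its nested components, one indent per layer."""
        code, parts = decomp.get(item, ('?', []))
        print('  ' * depth + f'{item} [{code}]')
        for part in parts:
            if part not in seen:  # skip cyclic repetitions in the data
                expand(part, decomp, depth + 1, seen | {item})

    # expand('𧲇', load()) would print 23969, then 33361, then 33840,
    # then 士 and ⺆, with no manual diving through the layers.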
For example, you have decomposed 𧲇 as 23969 + 豕 (with 23969 being 33361 + 殳, 33361 being 33840 + 口, and finally 33840 = 士 + ⺆): why not simply 嗀 + 豕, with 嗀 as 士冖口殳殼𡉉吺, as Wenlin does?
Besides the 75,000 Unihan characters from Unicode, there are another 10,000 or so intermediate components I've represented with numbers (though I suspect that if I removed duplicates and cyclic repetition, that number would be a lot lower).

Representing 嗀 as 士冖口殳殼𡉉吺 loses information, such as the fact that the left component and 殳 are structured acrosswards, and a simple script can query which other characters that left component appears in. When creating the data, I wanted to preserve that type of information.
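
To make that concrete, both queries mentioned in this reply, which items a given component appears in and which entries are the numbered intermediates, are one-liners over the dictionary built by the earlier load() sketch (33361 below is the numeric ID of that left component in the example above):

    def containing(component, decomp):
        """Items that list `component` directly in their decomposition."""
        return [item for item, (_, parts) in decomp.items() if component in parts]

    def intermediates(decomp):
        """The numbered intermediate components, as opposed to Unihan chars."""
        return [item for item in decomp if item.isdigit()]

    # d = load()
    # containing('33361', d) lists 23969 and anything else built on 嗀's
    # left component; len(intermediates(d)) counts the numbered entries.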
Besides, the ⺆ in your decomposition is not as accurate as the 冖 in Wenlin's components.
These decompositions are a first pass through the data, which took me a few months to do about three years ago; maybe someday I'll go through it again, checking and correcting things, but for now it's available for anyone to use. The licenses you can choose from have all been added as a result of email requests from various people over the years.
Another remark: there already exists a CDL for Chinese characters (used by Wenlin), and U+2FF0-2FFF (the 12 Ideographic Description Characters) are also already used to describe Chinese characters (used by BabelMap). Why reinvent the wheel with your functions?
I listed the decomposition codes I used at http://cjkdecomp.codeplex.com/wikipage?title=cjk-decomp, which you may have missed. The Unicode IDCs don't represent some codes I have, such as those listed at http://groovy.codeplex.com/wikipage?title=Reflection%20and%20repetitions%20in%20the%20CJK%20decomposition%20data

Besides, the Unicode IDCs are intended by the Unicode Consortium as pictorial descriptor characters, not the control descriptors that my codes really function as.

Hope you find use for the data.