[xgws-user] UTF8 Problems and Utf8Reader

Aleksander Slominski aslom_at_cs.indiana.edu
Tue, 22 May 2007 09:40:21 -0400


Martin D. Pedersen wrote:
> Hi 
>
> I have a simple SOAP function taking a string as input.
> I try to parse UTF-8 encoded input and it works in most cases.
> But with some strings the decoding on the server side gets it wrong.
>
> E.g. the string: メーリングリスト  (I don't know if it will show properly, so here are the bytes: \u00e3\u0083\u00a1\u00e3\u0083\u00bc\u00e3\u0083\u00aa\u00e3\u0083\u00b3\u00e3\u0082\u00b0\u00e3\u0083\u00aa\u00e3\u0082\u00b9\u00e3\u0083\u0088)
> Will show up as : ッッッッもッもッ (something different :) )
>
> I traced it down to xsul.xservo_soap_http.HttpBasedServices::serviceXml
> Here a custom Utf8Reader is used, and it seems to be the problem.
> If I remove the Utf8Reader and always go with the "standard" approach of passing the encoding name to the pull parser, it works fine.
> I don't know why it is implemented this way. Is it because of performance?
>
> As a hunch, I thought it had something to do with the fact that each character in the string takes 3 bytes in UTF-8.
>
> So I dived into Utf8Reader::read :)
> Without a full understanding of UTF-8 and its cousins, I managed to fix it by replacing some of the code in the read function.
>
> [...]
> else if ( (bb & 0xF0) == 0xE0 ) { // enter 3-byte encoding
> [lots of bit masking going on; I don't know how much of it is still needed]
>   // 1110xxxx 10xxxxxx 10xxxxxx
>   //                value = bb & 0x0F;
>   //                value = (value << 6) | temp1;
>   //                value = (value << 6) | temp2;
>   value = ((bb & 0x0F) << 12) + ((b1 & 0x3f) << 6) + ((b2 & 0x3f));
>   cbuf[ off++ ] = (char) value;
> }
> [...]
>
>
> Maybe I should file a bug report instead of posting here, but I thought I would start here :)
>   
hi Martin,

i will be checking into it.

and thanks for reporting the problem.

best,

alek
BTW: the reason to use Utf8Reader is that it is (or was in the past) faster
than the built-in UTF-8 decoder (and i had some problems in the past with the JDK
default encoding getting in the way, but that seems to be fixed)
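
For reference, the 3-byte case quoted above (1110xxxx 10xxxxxx 10xxxxxx) can be
sketched on its own and checked against the JDK decoder. This is only a minimal
illustration, not the xsul Utf8Reader code: the class and method names are
made up for the example, and it assumes the three bytes are already available
as unsigned ints. It does not reject overlong forms or surrogate code points.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ThreeByteSketch {

    // Hypothetical helper (not part of xsul): decodes one 3-byte UTF-8
    // sequence, given as unsigned values 0..255, into a single char.
    static char decodeThreeBytes(int b0, int b1, int b2) {
        // Lead byte must be 1110xxxx, both continuation bytes 10xxxxxx.
        if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a 3-byte UTF-8 sequence");
        }
        // 4 payload bits from the lead byte, 6 from each continuation byte.
        int value = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
        return (char) value;
    }

    public static void main(String[] args) {
        // First character of the reported string: E3 83 A1 (U+30E1).
        byte[] bytes = { (byte) 0xE3, (byte) 0x83, (byte) 0xA1 };
        // Java bytes are signed, so mask with 0xFF before bit twiddling.
        char mine = decodeThreeBytes(bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
        char jdk = new String(bytes, StandardCharsets.UTF_8).charAt(0);
        System.out.println("matches JDK decoder: " + (mine == jdk));
    }
}
```

The point of the sketch is that each continuation byte has to be masked with
0x3F before it is shifted in, as in the replacement line quoted above. Whether
the missing mask is what actually goes wrong inside Utf8Reader is not confirmed
in this thread.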