[xgws-user] UTF8 Problems and Utf8Reader
Martin D. Pedersen
mdp_at_visanti.com
Tue, 22 May 2007 14:41:00 +0200
Hi
I have a simple soap function taking a string as input.
I am trying to parse UTF-8 encoded input, and it works in most cases.
But with some strings the decoding on the server side gets it wrong.
E.g. the string: メーリングリスト (I don't know if it will display properly, so here are the bytes: \u00e3\u0083\u00a1\u00e3\u0083\u00bc\u00e3\u0083\u00aa\u00e3\u0083\u00b3\u00e3\u0082\u00b0\u00e3\u0083\u00aa\u00e3\u0082\u00b9\u00e3\u0083\u0088)
shows up as: ッッッッもッもッ (something different :) )
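To double-check the byte dump above, here is a minimal sketch that feeds those 24 bytes to the JDK's own UTF-8 decoder. The class and method names (Utf8SampleCheck, decodeSample) are made up for illustration and are not part of XSUL.

```java
import java.nio.charset.StandardCharsets;

public class Utf8SampleCheck {
    // The 24 bytes quoted in the message, written as hex values.
    static final int[] SAMPLE = {
        0xe3, 0x83, 0xa1, 0xe3, 0x83, 0xbc, 0xe3, 0x83, 0xaa, 0xe3, 0x83, 0xb3,
        0xe3, 0x82, 0xb0, 0xe3, 0x83, 0xaa, 0xe3, 0x82, 0xb9, 0xe3, 0x83, 0x88
    };

    // Decode the sample with the standard library decoder.
    static String decodeSample() {
        byte[] bytes = new byte[SAMPLE.length];
        for (int i = 0; i < SAMPLE.length; i++) {
            bytes[i] = (byte) SAMPLE[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeSample();
        // Print code points as hex so the result does not depend on the console encoding.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```

Each group of three bytes becomes one katakana character, so the 24 bytes decode to the 8 characters of the original string.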
I traced it down to xsul.xservo_soap_http.HttpBasedServices::serviceXml
Here a custom Utf8Reader is used and it seems to be the problem.
If I remove the Utf8Reader and always take the "standard" approach of passing the encoding name to the pull parser, it works fine.
I don't know why it is implemented this way; perhaps for performance?
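For illustration, here is a rough sketch of the "pass the encoding name to the parser" approach. It uses the JDK's StAX pull parser (javax.xml.stream) as a stand-in, which is an assumption on my part: XSUL uses its own pull parser, and the class name, method name and the <in> element here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PullParserEncodingSketch {
    // Parse raw bytes, letting the parser do the decoding from the encoding name,
    // and return all character data found in the document.
    static String textOf(byte[] xmlBytes, String encoding) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xmlBytes), encoding);
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The sample string, written with escapes to keep this source file ASCII-only.
        String sample = "\u30e1\u30fc\u30ea\u30f3\u30b0\u30ea\u30b9\u30c8";
        byte[] xml = ("<in>" + sample + "</in>").getBytes(StandardCharsets.UTF_8);
        System.out.println(textOf(xml, "UTF-8").equals(sample));
    }
}
```

With this approach no hand-written UTF-8 reader is involved, which is why the result no longer depends on Utf8Reader.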
As a hunch, I thought it had something to do with the fact that each character of this string takes 3 bytes in UTF-8.
So I dived into Utf8Reader::read :)
Without fully understanding UTF-8 and its cousins, I managed to fix it by replacing some of the code in the read function.
[...]
else if ( (bb & 0xF0) == 0xE0 ) { // enter 3 byte encoding
[lots of bit masking going on; I don't know how much of it is still needed]
// 1110xxxx 10xxxxxx 10xxxxxx
// value = bb & 0x0F;
// value = (value << 6) | temp1;
// value = (value << 6) | temp2;
value = ((bb & 0x0F) << 12) + ((b1 & 0x3f) << 6) + ((b2 & 0x3f));
cbuf[ off++ ] = (char) value;
}
[...]
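The 3-byte case can be tried in isolation with a small sketch like the one below. decode3 uses the same expression as the replacement line above. decode3ReusingFirst is purely hypothetical: it is not taken from the Utf8Reader source, it only shows that using the first continuation byte twice would produce exactly the garbled output reported above (U+30C3 and U+3082), which may or may not be what the original code does. All names here are made up for illustration.

```java
public class ThreeByteDecodeSketch {
    // The sample bytes quoted earlier in this message.
    static final int[] SAMPLE = {
        0xe3, 0x83, 0xa1, 0xe3, 0x83, 0xbc, 0xe3, 0x83, 0xaa, 0xe3, 0x83, 0xb3,
        0xe3, 0x82, 0xb0, 0xe3, 0x83, 0xaa, 0xe3, 0x82, 0xb9, 0xe3, 0x83, 0x88
    };

    // Combine one sequence 1110xxxx 10xxxxxx 10xxxxxx into a single char:
    // 4 payload bits from the lead byte, 6 from each continuation byte.
    static char decode3(int bb, int b1, int b2) {
        return (char) (((bb & 0x0F) << 12) + ((b1 & 0x3F) << 6) + (b2 & 0x3F));
    }

    // HYPOTHETICAL faulty variant: the second continuation byte is ignored
    // and the first one is used in its place.
    static char decode3ReusingFirst(int bb, int b1, int b2) {
        return decode3(bb, b1, b1);
    }

    // Decode the whole sample, three bytes at a time (no validation is done).
    static String decodeAll(boolean reuseFirst) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 2 < SAMPLE.length; i += 3) {
            int bb = SAMPLE[i], b1 = SAMPLE[i + 1], b2 = SAMPLE[i + 2];
            out.append(reuseFirst ? decode3ReusingFirst(bb, b1, b2) : decode3(bb, b1, b2));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Print code points as hex so the result does not depend on the console encoding.
        for (char c : decodeAll(false).toCharArray()) {
            System.out.printf("correct: U+%04X%n", (int) c);
        }
        for (char c : decodeAll(true).toCharArray()) {
            System.out.printf("faulty:  U+%04X%n", (int) c);
        }
    }
}
```

This sketch skips everything a real reader has to handle, such as checking that the continuation bytes start with the bits 10, rejecting overlong forms, and dealing with sequences split across buffer boundaries.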
Maybe I should file a bug report instead of posting here, but I thought I would start here :)
Best regards
Martin Pedersen
--
Mobil: +45 27 28 5314 | mdp_at_visanti.com | www.visanti.com
Visanti A/S | Håndværkervej 1 | DK-9700 Brønderslev | Tel: +45 70 23 0304