[xgws-user] UTF8 Problems and Utf8Reader

Martin D. Pedersen mdp_at_visanti.com
Tue, 22 May 2007 14:41:00 +0200


Hi 

I have a simple SOAP function taking a string as input.
I send it UTF-8 encoded input, and in most cases it is parsed correctly.
But with some strings the decoding on the server side gets it wrong.

E.g. the string: メーリングリスト  (I don't know if it will show properly, so here are the bytes: \u00e3\u0083\u00a1\u00e3\u0083\u00bc\u00e3\u0083\u00aa\u00e3\u0083\u00b3\u00e3\u0082\u00b0\u00e3\u0083\u00aa\u00e3\u0082\u00b9\u00e3\u0083\u0088)
Will show up as : ッッッッもッもッ (something different :) )

I traced it down to xsul.xservo_soap_http.HttpBasedServices::serviceXml
Here a custom Utf8Reader is used, and it seems to be the problem.
If I remove the Utf8Reader and always go with the "standard" approach of passing the encoding name to the pull parser, it works fine.
I don't know why it is implemented this way. Is it for performance reasons?
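
To have a reference result to compare the Utf8Reader output against, the bytes listed above can be run through the JDK's own UTF-8 decoder. This is only a sketch: the class name JdkUtf8Check and its helpers are made up for illustration, and it uses java.io.InputStreamReader as a stand-in for the "standard" path, not the actual XSUL or pull parser code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XSUL: decodes the sample bytes with the JDK decoder.
final class JdkUtf8Check {

    // The 24 bytes quoted in the mail (8 characters, 3 bytes each).
    static final byte[] SAMPLE = bytes(
        0xE3, 0x83, 0xA1, 0xE3, 0x83, 0xBC, 0xE3, 0x83, 0xAA, 0xE3, 0x83, 0xB3,
        0xE3, 0x82, 0xB0, 0xE3, 0x83, 0xAA, 0xE3, 0x82, 0xB9, 0xE3, 0x83, 0x88);

    // Convert int literals to a byte array so values above 0x7F need no casts.
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    // Read all characters through an InputStreamReader configured for UTF-8.
    static String decode(byte[] utf8) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            char[] buf = new char[16];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = decode(SAMPLE);
        // Print the code points in hex so the result does not depend on the console encoding.
        for (int i = 0; i < s.length(); i++) {
            System.out.print(Integer.toHexString(s.charAt(i)) + " ");
        }
        System.out.println();
    }
}
```

Whatever Utf8Reader produces for the same bytes should match this character for character.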

My hunch was that it has something to do with the fact that every character in this string takes 3 bytes in UTF-8.

So I dived into Utf8Reader::read :)
Without fully understanding UTF-8 and its cousins, I managed to fix it by replacing some of the code in the read function.

[...]
else if  ( (bb & 0xF0) == 0xE0 ) { // enter 3 byte encoding
[lots of bit masking going on; I don't know how much of it is still needed]
  // 1110xxxx 10xxxxxx 10xxxxxx
  //                value = bb & 0x0F;
  //                value = (value << 6) | temp1;
  //                value = (value << 6) | temp2;
  value = ((bb & 0x0F) << 12) + ((b1 & 0x3f) << 6) + ((b2 & 0x3f));
  cbuf[ off++ ] = (char) value;
}
[...]
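
For anyone who wants to check the bit arithmetic in isolation, here is a minimal standalone sketch of the same 3-byte combination. The class ThreeByteUtf8Sketch and the method decode3 are hypothetical names and not part of Utf8Reader. The sketch checks only the lead and continuation bit patterns and does not reject overlong forms or surrogate values.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XSUL: decodes one 3-byte UTF-8 sequence.
final class ThreeByteUtf8Sketch {

    // Combine 1110xxxx 10yyyyyy 10zzzzzz into the 16-bit value xxxxyyyyyyzzzzzz.
    static char decode3(int bb, int b1, int b2) {
        if ((bb & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a 3-byte UTF-8 sequence");
        }
        // 4 payload bits from the lead byte, 6 from each continuation byte.
        return (char) (((bb & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F));
    }

    public static void main(String[] args) {
        // First character of the test string: bytes E3 83 A1.
        char c = decode3(0xE3, 0x83, 0xA1);
        String viaJdk = new String(
            new byte[] {(byte) 0xE3, (byte) 0x83, (byte) 0xA1}, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(c));   // prints "30e1"
        System.out.println(viaJdk.charAt(0) == c);    // prints "true"
    }
}
```

The three masked fields never overlap, so combining them with `|` here gives the same value as the `+` used in the fix above.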


Maybe I should file a bug report instead of posting here, but I thought I would start here :)



Best regards
  Martin Pedersen

--
Mobil: +45 27 28 5314 | mdp_at_visanti.com | www.visanti.com 

Visanti A/S | Håndværkervej 1 | DK-9700 Brønderslev | Tel: +45 70 23 0304