[SoapRMI] Setting parsing position in PullParser
Aleksander Slominski
aslom_at_cs.indiana.edu
Tue, 05 Feb 2002 09:05:26 -0500
John Morrow wrote:
> Instead of printing each <Member>'s caption on every row,
> common ones are printed in a single cell spanning the appropriate number of
> rows. To produce this kind of table in html, when I'm outputting the
> headings for row no1 I need to know that the "Bachelors Degree" cell's
> rowspan is 4 and that the "Female" cell has rowspan 2 and "Married" has
> rowspan 1. However, it's not until I get to the 5th tuple and see that the
> Education Level member has changed from "Bachelors Degree" to "Graduate
> Degree" that I can determine that the rowspan for "Bachelors Degree" is 4,
> so I need to read ahead a few tuples. Then when I want to output the second
> row which contains only 1 new cell ("Single"), I need to access this data
> from tuple no 2 but I've already read past it.
hi,
as far as i understand your design, you will not be able to output HTML until you
have seen enough tuples to know the row span for _each_ interesting category (like
single/married). so i would say that the simplest way to do it is to use an XPP2
XmlNode to represent each row tuple and keep a list of them in a Vector (or List).
you will also need an additional data structure such as a hash table: category ->
(value, counter) to know how many times a given value has been seen, and keep
increasing the counter. now when you see that the value for a category is the same
as in the previous tuple, you can replace that part of the current tuple with null
in the XmlNode and modify the first tuple that begins the run to include the row
spanning information.
for example if initial tuples are:
<tuple><a>1</a><b>1</b></tuple>
<tuple><a>1</a><b>2</b></tuple>
after a pass through this algorithm you would get something that is very easy to
output as an HTML table:
<tuple><a rowspan="2">1</a><b>1</b></tuple>
<tuple>NULL<b>2</b></tuple>
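to make the idea concrete, here is a minimal sketch of that pass in plain Java. it does not use the XPP2 API at all: a String[] stands in for each tuple's XmlNode, and the names RowSpanSketch, Cell, merge and toHtml are made up for illustration. it also assumes that a run in one column ends whenever a column to its left changes, which is what nested OLAP headings seem to need.

```java
import java.util.ArrayList;
import java.util.List;

class RowSpanSketch {
    /** One table cell: a value plus the number of rows it spans. */
    static final class Cell {
        final String value;
        int rowspan = 1;
        Cell(String value) { this.value = value; }
    }

    /**
     * Converts rows of category values into rows of cells. A repeated value
     * becomes null and the cell that started the run gets a larger rowspan.
     * Assumption: a run in column c also ends when a column left of c changes.
     */
    static List<Cell[]> merge(List<String[]> tuples) {
        List<Cell[]> rows = new ArrayList<>();
        Cell[] open = null; // per column: the cell that owns the current run
        for (String[] tuple : tuples) {
            if (open == null) open = new Cell[tuple.length];
            Cell[] row = new Cell[tuple.length];
            boolean parentChanged = false;
            for (int c = 0; c < tuple.length; c++) {
                boolean same = !parentChanged && open[c] != null
                        && open[c].value.equals(tuple[c]);
                if (same) {
                    open[c].rowspan++;          // extend the run, row[c] stays null
                } else {
                    row[c] = open[c] = new Cell(tuple[c]);
                    parentChanged = true;       // columns to the right start new runs
                }
            }
            rows.add(row);
        }
        return rows;
    }

    /** Renders the merged rows; null cells are skipped. Values are not escaped. */
    static String toHtml(List<Cell[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (Cell[] row : rows) {
            sb.append("<tr>");
            for (Cell cell : row) {
                if (cell == null) continue;
                sb.append("<td");
                if (cell.rowspan > 1) sb.append(" rowspan=\"").append(cell.rowspan).append('"');
                sb.append('>').append(cell.value).append("</td>");
            }
            sb.append("</tr>\n");
        }
        return sb.toString();
    }
}
```

for the two tuples above, merge() leaves the first row with an "a" cell of rowspan 2 and the second row with null in the "a" position, so toHtml() emits only the "b" cell for row two.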
this is possible to do with XmlNode as it allows you to keep any Java objects inside
and to easily modify nodes (like adding a rowspan attribute). to read the node for
one tuple you need to position the parser at the start tag of the tuple and then call:
XmlNode tuple = factory.newNode();
pp.readNode(tuple);
if you do use a Vector of XmlNode then there will be no need to do double XML parsing
or to buffer huge amounts of XML - to do rewindPosition to parkedPosition you need
to keep all the XML in between and then do a _second_ XML parse - that *is* a big
overhead!
> One option I have is to save the information about tuples 2 3 and 4 when I'm
> reading ahead to tuple no 5 and then for row 2 just use my saved data. In
> the above case that's not a big overhead, however, some olap tables can be
> extremely large and have cells with very big spans and many levels so this
> could take up a lot of memory.
so putting null in place of each unused value should allow the unused data to be
garbage collected.
> A second option I was trying to get working was to rewind back in the String
> after I've done my read ahead. When I create my PullParser, I call
> setInput( reader ), passing in a StringReader object. What I've tried to do
> (and this is what my question is about!) is to call mark() on the
> StringReader at the end of reading the tuple for the current row and then
> read ahead however many tuples necessary to figure out the rowspan and then
> call reset() on the StringReader so that it's ready for the following row.
> This probably isn't the most stable thing to do as I get errors later on
> saying </Tuple> tags were found where </Member> tags were expected etc. I
> then tried also calling reset() on the pull parser but I then get errors:
> org.gjt.xpp.XmlPullParserException: only whitespace content allowed
> outside root element at line 2 and column 17 seen ">\n "...
> (parser state CONTENT)
that will not work, as the parser buffers its input: by the time you call mark() on
the input, the parser will typically have read at least a few hundred characters ahead.
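the effect is easy to reproduce without any XML parser. in the sketch below a java.io.BufferedReader with a deliberately tiny buffer plays the role of the parser's internal buffering (an analogy only, not XPP2's actual tokenizer); ReadAheadDemo and underlyingAfterOneRead are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadAheadDemo {
    /**
     * Lets a buffering consumer read ONE character, then asks the underlying
     * reader for its next character. The result shows how far ahead of the
     * consumer's logical position the underlying reader already is.
     */
    static char underlyingAfterOneRead(String text, int bufferSize) {
        try {
            Reader underlying = new StringReader(text);
            BufferedReader buffering = new BufferedReader(underlying, bufferSize);
            buffering.read();                 // consumer has logically seen 1 char
            return (char) underlying.read();  // source is already a buffer-full ahead
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

with the text "abcdefgh" and a buffer of 4 characters, the consumer has seen only 'a', yet the underlying reader hands back 'e'. a mark() placed on the underlying reader at that moment would remember the position of 'e', not of 'b', which is why a later reset() leaves the parser looking at the wrong part of the document.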
> Does anyone have any experience of dealing with a similar parsing
> situation?, or know if reset()ting / rewinding can be done in this way. Or,
> any ideas on a better way of solving this would be greatly appreciated.
if it is required, there are two things you can try. first, run it with two pull
parsers on the same input (you will need to fork the input reader) - one for reading
ahead and gathering information and another to produce the actual HTML output.
second, you can modify XmlTokenizer to add a markPosition() method that records a
request to remember the current position and keeps buffering the input from there,
and a rewindPosition() method to restore both tokenizer and parser position.
however, please try the two-parsers approach first and see how much more overhead it
has compared to using XmlNode. as the example below shows, you will need to remember
information about all previously seen tuples to determine row spans; i think it is
not possible with just two-pass parsing and no memory of seen tuples, as you need to
remember where to add rowspans, e.g.:
<tuple><a>1</a><b>1</b></tuple>
<tuple><a>1</a><b>1</b></tuple>
<tuple><a>1</a><b>2</b></tuple>
<tuple><a>1</a><b>2</b></tuple>
<tuple><a>1</a><b>2</b></tuple>
<tuple><a>1</a><b>1</b></tuple>
<tuple><a>1</a><b>2</b></tuple>
<tuple><a>1</a><b>2</b></tuple>
in this example a has rowspan 8 but b has rowspans 2,3,1,2 - you would need to
remember this information in some data structure/hash table, and this can get complex
as you also need to remember the position where each rowspan starts. so basically,
after the first parsing pass you must maintain, for each row and each category, how
much row span it has (1 by default, and - means do not show it). then you do the
second pass and actually output the data - so you roughly double the output time, as
XML parsing is done twice, but it is a constant-factor overhead and it may be fine
for your application.
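a rough sketch of such a two-pass run follows. it uses the JDK's StAX pull parser (javax.xml.stream) in place of XPP2, so none of the calls are XPP2 API. it also assumes the tuples are wrapped in a single root element, that each child of <tuple> contains only text, that 0 (rather than -) marks a hidden cell, and that a run ends when a column to its left changes. TwoPassSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TwoPassSketch {
    /** Returns the child values of the next <tuple>, or null at end of input. */
    static List<String> nextTuple(XMLStreamReader in) throws XMLStreamException {
        while (in.hasNext()) {
            if (in.next() == XMLStreamConstants.START_ELEMENT
                    && in.getLocalName().equals("tuple")) {
                List<String> values = new ArrayList<>();
                while (in.nextTag() == XMLStreamConstants.START_ELEMENT) {
                    values.add(in.getElementText()); // text-only children assumed
                }
                return values;
            }
        }
        return null;
    }

    /** Pass 1: spans.get(r)[c] is the rowspan of the cell, or 0 if it is hidden. */
    static List<int[]> computeSpans(String xml) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            List<int[]> spans = new ArrayList<>();
            List<String> previous = null;   // only the previous tuple's values are kept
            int[] runStart = null;          // per column: row where the current run began
            List<String> values;
            for (int r = 0; (values = nextTuple(in)) != null; r++) {
                if (runStart == null) runStart = new int[values.size()];
                int[] row = new int[values.size()];
                boolean parentChanged = false;
                for (int c = 0; c < row.length; c++) {
                    if (!parentChanged && previous != null
                            && previous.get(c).equals(values.get(c))) {
                        spans.get(runStart[c])[c]++;   // extend run; row[c] stays 0
                    } else {
                        row[c] = 1;
                        runStart[c] = r;
                        parentChanged = true;
                    }
                }
                spans.add(row);
                previous = values;
            }
            return spans;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Pass 2: parses the same XML again and writes rows using the stored spans. */
    static String render(String xml, List<int[]> spans) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringBuilder html = new StringBuilder();
            List<String> values;
            for (int r = 0; (values = nextTuple(in)) != null; r++) {
                html.append("<tr>");
                for (int c = 0; c < values.size(); c++) {
                    int span = spans.get(r)[c];
                    if (span == 0) continue;           // covered by a cell above
                    html.append("<td");
                    if (span > 1) html.append(" rowspan=\"").append(span).append('"');
                    html.append('>').append(values.get(c)).append("</td>"); // no escaping
                }
                html.append("</tr>\n");
            }
            return html.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

the span table still needs one int per row and category, so memory grows with the number of rows even though the tuple values themselves are not kept between passes.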
i would compare the approach that uses XmlNode/Tuple with the one that uses two
parsers and see the differences in speed and memory utilization (try really hard
cases, such as rowspans of tens of thousands of rows). results may (or may not) be
surprising :-)
hope it helps,
alek