Deep SIXX with XMLPullParser

At our company, we develop our GLASS apps in Pharo and then deploy to a GLASS repository on one of our servers, so we sometimes need to copy model objects from one environment to the other. One of our applications also performs regular imports from a third-party database, so we fetch the data into a 32-bit Squeak image via ODBC and push it up into the GLASS repository from there.

We needed a platform-independent serialization format, and SIXX fit the bill. It works in Squeak/Pharo right out of the box, and there’s an official GemStone port courtesy of Dale Henrichs and Norbert Hartl.

The only problem we’ve had is that SIXX reading consumes a lot of memory. In Pharo, we have sometimes had to raise the maximum VM heap size. In GemStone, we were bumping up against the default VM temporary object memory ceiling. Dale deals with this issue in general in an excellent blog post. The size limit on temporary object memory is configurable, but the real solution is … well, to not use so much temporary memory.

For SIXX in particular, Dale modified the SIXX reader to use a persistent root for storage of SIXX objects during read, and he posted a script[1] to the mailing list that auto-commits when you approach the ceiling. This moves your temporary objects to permanent storage, kind of like using swap space. You’re out of RAM? The OS will save some of your pages to disk and load them on demand.

OK, I know it’s more complicated than that, but that’s the basic idea.

Even using this approach, our ODBC import process was still hitting temporary memory limits. I confess that I didn’t spend much time analyzing the situation. Instead, I decided to throw a new tool at the problem: XMLPullParser.

XMLPullParser

Now before I go further, I should mention that XML is not exactly a passion of mine. When I have spare brain cycles, I don’t spend them on this sort of thing. There are XML-related acronyms that I couldn’t even define for you, much less explain. So if I get some details wrong here, please correct me in case somebody else cares about them.

Antony Blakey built an XML parser for VisualWorks with a pull-based API. He describes it in detail in his blog post, so I won’t go into much detail here. Essentially, your application drives the parsing process (that’s the “pull”) rather than having the parser try to notify you of what it found (“push”). The application pulls events from the parser, where events are things like “tag opened”, “text”, “tag closed”, like having a kid read the XML to you one piece at a time.

“What’s next, Johnny?”
“Uh, </person>”
“OK, if the next is a <politician>, skewer it.”

It’s a depth-first traversal, and it can be done on the fly without first loading the entire DOM. This means that you can read arbitrarily large XML files without high parser overhead.
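To make the pull idea concrete without depending on any particular parser, here is a minimal Squeak/Pharo workspace sketch. The event list is a hand-built stand-in for what a tokenizer would produce; the event names and the association-per-event shape are my own assumptions for illustration, not the XMLPullParser API. The point is only that the application asks for the next event when it is ready for it.

```smalltalk
"Toy pull loop: the application asks for each event in turn.
 The events stream is a hypothetical stand-in for a real tokenizer."
| events names |
events := ReadStream on: {
    #open -> 'people'.
    #open -> 'person'.
    #text -> 'Johnny'.
    #close -> 'person'.
    #close -> 'people' }.
names := OrderedCollection new.
[events atEnd] whileFalse: [
    | event |
    event := events next.  "pull exactly one event"
    (event key = #open and: [event value = 'person'])
        ifTrue: [
            "We know the person's text comes next, so we pull it ourselves."
            names add: events next value]].
names  "an OrderedCollection('Johnny')"
```

A real pull parser would produce each event lazily from the underlying character stream rather than from a prebuilt Array, which is what keeps the memory footprint small.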

Now, in Antony’s implementation, he simply wrapped the VisualWorks SAX parser’s output with a stream. This got him the API he wanted, but his hope was to eventually “really pull, without the SAX hack”.

With his permission, I ported XMLPullParser to Squeak, and it’s now available on SqueakSource. In my port, I mashed his work together with the XMLTokenizer class from YAXO, so the Squeak version really does pull.

The implementation is probably incomplete, but it’s parsed everything I’ve thrown at it so far. If you find a missing capability, you can probably just copy a method from XMLTokenizer — simply change senders of “next” to “nextChar”.

There are a few simple test cases in the package, but please don’t look to them for a good example of how to drive the parser. They use the lowest-level “what’s next” API to test the tokenizing only. Real-world usage of the parser involves higher-level operators like match:take:, if:peek:, etc.

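parseResponseFrom: stream
  "Assumes errors is an instance variable holding a collection."
  | parser |
  parser := XMLPullParser parse: stream.
  parser next.
  parser match: 'Response'
    take:
      [parser if: 'Errors'
        take:
          ["Collect the text of each Error element, then stop."
          parser while: 'Error' take: [errors add: parser text].
          ^self].
      "No Errors element: handle whatever other tags the response contains."
      parser whileAnyTake: [:tag | ... ]].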

There are better examples out there, but hopefully this gives a little taste of what the pull parsing API feels like.

Adapting to SIXX

Back to the problem at hand: how to attach this to SIXX. The SIXX code is fairly indifferent to the actual XML parser used, with all parser-specific details handled through a subclass of SixxXmlParserAdapter. But the entire SIXX framework expects that you’ll be dealing with fleshed-out DOM nodes, so I had no choice but to modify some core parts of SIXX itself.

My goal was to keep the SIXX modifications to a minimum, so I had to make some tradeoffs. But with the changes described below, I was able to get all of the SIXX features working except one: truncated XML recovery. And the unit tests indicate that it still works when running against YAXO.

The current version of this SIXX fork is on SqueakSource in the XMLPullParser project.

Initial Results

Let’s get pathological for a few minutes here. I have a 98MB SIXX file (standard SIXX mode, not compact) representing model objects from a small application in Pharo. If we log free space at several points during a simple read, we can tell a little about the actual memory used:

| rs root |
"1" rs := SixxReadStream readOnlyFileNamed: 'models.sixx'.
"2" [root := rs next] ensure: [rs close].
"3" rs := nil. "4"

If we take the free space at “1” as our baseline, then we can use the following rough interpretations:

  • Baseline minus free space at “2” is the DOM and stream overhead
  • Baseline minus free space at “3” is the space used by the DOM, root model and stream
  • Baseline minus free space at “4” is the actual memory consumed by the root model we loaded.
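One way the logging could be done is sketched below. It assumes that Smalltalk garbageCollect answers the number of free bytes, as it does in Squeak and in Pharo images of this vintage; the variable names are mine. If the VM grows the heap during the run, the differences become less meaningful, so treat the numbers as rough.

```smalltalk
"Sketch only: assumes Smalltalk garbageCollect answers the free bytes
 after a full collection. free1..free4 are illustrative names."
| free1 free2 free3 free4 rs root |
free1 := Smalltalk garbageCollect.    "point 1: baseline"
rs := SixxReadStream readOnlyFileNamed: 'models.sixx'.
free2 := Smalltalk garbageCollect.    "point 2"
[root := rs next] ensure: [rs close].
free3 := Smalltalk garbageCollect.    "point 3"
rs := nil.
free4 := Smalltalk garbageCollect.    "point 4"
Transcript
    show: 'DOM and stream overhead: ', (free1 - free2) printString; cr;
    show: 'DOM, root model and stream: ', (free1 - free3) printString; cr;
    show: 'Root model alone: ', (free1 - free4) printString; cr.
```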

Run 1: Pharo/YAXO: DOM and stream overhead is 441 MB (!), and the load took 14 minutes on my 2.33GHz Core 2 Duo laptop. It turns out that the root model consumes about 18MB in Pharo. Yes, SIXX in standard mode turns this into 98MB, which is a pretty low signal-to-noise ratio.

Run 2: Pharo/XMLPullParser: DOM and stream overhead is 2KB, and the load took 17 minutes. We took 3 minutes longer, which may come from the more spotty I/O (we’re not reading the entire file at once) and the extra compare/become phase (see below). But it saved us 440 MB of memory.

One other note on Run 2: The root model consumed a little over 24MB instead of the 18MB it took in Run 1. This is a consequence of the way we build collections in my tweaked version of SIXX; each growable collection has more empty space. More details below.

In GemStone, the test isn’t quite so simple.

Run 3: GemStone/YAXO: Dale’s script loads the XML string on the server, then launches the SIXX reader. I ran it and analyzed the memory usage using statmonitor/vsd.

[Figure: VSD graph of the SIXX load with YAXO]

What you see here is a graph of the VM temporary memory usage (in red) and auto-commit occurrences (spikes in cyan). The process took almost exactly 10 minutes.

Run 4: GemStone/XMLPullParser: With the XMLPullParser, we can use the same XML string load and auto-commit handler from Dale’s script but replace the guts of the SIXX load with the following: 

rs := SixxReadStream on: (ReadStream on: (UserGlobals at: #SIXX_LOAD_STRING)).
(UserGlobals at: #SIXX_PERSISTENCE_ARRAY) add: rs contextDictionary.
System commitTransaction ifFalse: [ nil error: 'Failed commit - persisting cached objects' ].
rootObject := rs next.

(Putting the SIXX context dictionary in a persistent root is the same trick the current GemStone port uses when you use Object class>>readSixFrom:persistentRoot:. The object graph gets saved 

The statmonitor/vsd analysis now looks like this:

[Figure: Graph of memory usage with SIXX/XMLPullParser]

Things started out similarly while we loaded the file, but then memory usage climbed in a much tamer pattern, just as we expected. Auto-commits occur when the size of the model itself is too large to hold entirely in temporary memory. Also, the whole load happened in 9 minutes instead of 10. Why is this? Somebody who knows more about GemStone internals will have to answer specifically, but it no doubt involves the overhead of moving objects back and forth.

Conclusion

The benefits of using this sort of parsing approach are pretty obvious. In both environments, you can load a much larger object graph using SIXX this way without either raising memory ceilings or “swapping” to permanent storage. For my pathological case, the swapping was still necessary but far less of it was needed.

If anyone is interested in the GemStone port of this work, I’ll put it up on GemSource. Since all of my initial work was done in Pharo, and the GemStone port of SIXX has departed from SIXX 0.3 in several key ways, bringing my branch into GemStone has been an adventure. It works for me, but it has a couple of key test failures that I haven’t had a chance to fix yet.

Gory Details

As I mentioned above, SIXX delegates the actual XML element interpretation to a subclass of SixxXmlParserAdapter. Messages sent to SixxXmlUtil class forward to the parser adapter as needed.

This is a good start, but it assumes that you’ve already got fleshed-out DOM element nodes in hand. In fact, the entire SIXX architecture expects this, with the parser adapters doing little more than returning sub-elements from them, fetching attributes from them, etc.

All of the SIXX methods for instance creation and population take an argument, called “sixxElement”, representing the DOM element in whatever parser framework you use. In my case, I chose to use the entire parser as the sixxElement. The parser knows the current element, so implementing the forwarders for element name and attribute access was easy enough.

Next, I had to add hooks for tag consumption, essentially letting the SIXX framework indicate when it was done processing a particular tag event. Other parser adapters do nothing with these, but the XMLPullParser adapter advances its stream upon receipt of these messages. There were only a couple of places in the core SIXX framework where I had to hook these in.

SixxReadStream expected to stream over a whole collection of top-level DOM elements, so it had to be replaced. I built a custom SixxXppReadStream and augmented the parser adapter framework to allow for custom read stream classes. SixxXppReadStream allows every operation that SixxReadStream does except #size. Many streams can’t tell you their size anyway, so I didn’t consider this a major loss.

Next, I had to get rid of any place where SIXX asked for all sub-elements of the current node. In most cases, the pattern was something like:

(SixxXmlUtil subElementsFrom: sixxElement)
  do: [:each | ... ]

This was converted to a more stream-friendly pattern of #subElementsFrom:do:, which the XMLPullParser could implement as a further traversal, but other cases weren’t so straightforward.

When SIXX creates an object, it first instantiates it, registers it in a dictionary by ID (for later reference by other objects), then populates it. This lets SIXX deal with circular references, but it creates a problem for on-the-fly creation of collections. In the happy world of fully-populated DOM elements, the creation step can create a collection that’s the proper size by counting sub-elements. Then during the population step, it uses #add: or #at:put: to fill it in.

We don’t have the luxury of being able to look down the DOM tree twice, so in this case I have the instantiation step return an empty collection. If we’re dealing with a growable collection (Set, OrderedCollection, etc) then all is good. But if this is an Array, for example, the population step can optionally return a different object — the real object. If we detect that it’s different from the original, we use #become: to convert references from the empty object to the fully populated one.
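The placeholder-then-#become: idea can be sketched with plain collections in a Squeak/Pharo workspace. The registry Dictionary, the ID 42 and the holder collection are hypothetical stand-ins for SIXX’s internal bookkeeping, not its actual code.

```smalltalk
"Sketch of instantiate, register, populate, then become:.
 registry, the id 42 and holder are illustrative assumptions."
| registry placeholder holder populated |
registry := Dictionary new.
placeholder := Array new.            "instantiation step: size not yet known"
registry at: 42 put: placeholder.    "registered by ID before population"
holder := OrderedCollection with: placeholder.  "some other object already refers to it"

"Population step: elements arrive one at a time, so build the real Array."
populated := (OrderedCollection new add: 'a'; add: 'b'; add: 'c'; yourself) asArray.

"The caller sees a different object come back and swaps identities."
populated == placeholder
    ifFalse: [placeholder become: populated].

(registry at: 42) size.   "3"
holder first size.        "3"
```

After the swap, every existing reference to the empty placeholder (the registry entry and the holder here) sees the populated Array, which is what lets earlier back-references stay valid.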

Why do it this way? In GemStone 2.3, self become: other is not allowed, which is why the #become: is triggered based on the return value instead of being implemented in the collection population method itself. This means that every populating method needs to return self, and we pay performance penalties for the identity check and #become:.

The other consequence is that our collections aren’t created with perfectly-tuned sizes (e.g. Set new: 25). Instead, they grow like normal, so they will inevitably have more internal “empty space”. In my Pharo tests, the model was 38% bigger. To me, this isn’t a very big deal; these collections will likely grow in the future anyway. We could solve it by more complex creation (e.g. store all elements in a temporary collection, then create final objects using #withAll: and such), but the extra code doesn’t seem worth it.
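For comparison, the more complex creation strategy might look roughly like this in a workspace. This is a sketch of the idea only, not code from the SIXX fork; the literal Array stands in for the sub-element traversal.

```smalltalk
"Sketch: gather elements first, then build the final collection once
 the element count is known. Names here are illustrative."
| gathered finalSet finalArray |
gathered := OrderedCollection new.
#('x' 'y' 'z') do: [:each | gathered add: each].  "stands in for sub-element traversal"
finalSet := (Set new: gathered size) addAll: gathered; yourself.
finalArray := gathered asArray.
finalSet size.     "3"
finalArray size.   "3"
```

The catch is that the final object does not exist until all of its elements have been read, so it cannot be registered by ID up front, and circular references would need extra handling. That is the additional code referred to above.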

Credits

Everything useful that I’ve ever learned about GemStone has come from the documentation (which is excellent) or has been spoon-fed to me by Dale Henrichs and Joseph Bacanskas. Thanks everyone.


Filed under GLASS, Seaside, Smalltalk

7 responses to “Deep SIXX with XMLPullParser”

  1. Pingback: Deep SIXX with XMLPullParser « (gem)Stone Soup

  2. Ken, If you have a copy of one (or both) of the vsd files, I could poke around and see if there is anything obvious going on when doing the pull…

  3. You may also want to look at vtd-xml, the latest and most advanced XML processing API.

  4. Bernhard Pieber

    I wanted to try your pull parser and loaded the latest version kdt.9 into the latest Squeak trunk image and also the latest Pharo image.

    When I run the test I get a doesNotUnderstand upToAndSkipThroughAll:. I just wanted to let you know.

    • Good catch. That’s a Seaside extension method (at least in Seaside 2.8), since platforms have different semantics for upToAll: (some leave you positioned before the “all”, some after). It was a handy way to make this work on both Squeak and GemStone (where I mainly do Seaside development).

      You can replace #upToAndSkipThroughAll: with #upToAll: in Squeak, since Squeak’s implementation leaves you positioned after the “all”.

  5. This is a fantastic post on dumping an arbitrarily sized graph in a human-readable fashion.

    For the record, I’m quoting Ken from an offline discussion we had about how to deal with cases that expect the xml element and get the parser instead:

    “The primary strategy is to convert the code to subElementsFrom:do:, but it doesn’t work for everything. I described the challenges and strategies in the blog post under the “Gory Details” heading.

    In general, you’ll probably need to find a way to split instantiation from state restoration. Instantiate the object first (when you first see the tag), then restore its state as you traverse sub-elements of the XML.”

  6. Pingback: Serializing large graphs with SIXX in GemStone | Mariano Martinez Peck
