Output XPaths for XML grep

A small improvement to the xmlgrep.py program. It now has an option for specifying output XPaths separately from the search XPath.

The option is -v (or --value). In traditional grep, -v already has a meaning: it specifies that only non-matching lines should be output. But XML has no concept of lines, and its hierarchical nature means that the concept of non-matching nodes is probably useless. A search for elements not matching “/thing/test” would logically return the whole document (the root node), the element “thing”, and any children of “test”, but not “test” itself. It’s hard to see how that would be useful, especially when the entire contents of “test” will be included as children of the root regardless.

The --value option takes a single XPath argument, which is evaluated relative to the matching node. The option can be repeated. For each matching node, a line is output with the results of the value XPaths separated by commas.
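To make that behaviour concrete, here is a minimal Python sketch. It is not the code of xmlgrep.py: the function name grep_values and the sample document are hypothetical, and it assumes the standard library's ElementTree, which only supports a small XPath subset (no text() steps or string functions), so the value paths here are plain child paths.

```python
import xml.etree.ElementTree as ET


def grep_values(xml_text, search_path, value_paths):
    """Return one comma-separated line per node matching search_path.

    Hypothetical helper for illustration only. Each value path is
    evaluated relative to the matching node, not the document root.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.findall(search_path):
        # findtext returns the text of the first matching child,
        # or the default if nothing matches.
        values = [node.findtext(path, default="") for path in value_paths]
        lines.append(", ".join(values))
    return lines


SAMPLE = (
    "<rss>"
    "<item><title>First</title><link>http://a.example/</link></item>"
    "<item><title>Second</title><link>http://b.example/</link></item>"
    "</rss>"
)

for line in grep_values(SAMPLE, ".//item", ["link", "title"]):
    print(line)
```

Repeating a value path in the list corresponds to repeating -v on the command line: each one adds a column to the output line.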

If no -v options are specified, the matching node is output just as in the original program.

As an example, here are my published XML-related posts and their approximate word counts:

xmlgrep.py -h "//item[post_type = 'post' and status = 'publish' and count(category[@nicename = 'xml']) > 0]" ejrh.wordpress.2012-05-04.xml -v "link/text()" -v "title/text()" -v "string-length(normalize-space(encoded)) - string-length(translate(normalize-space(encoded), ' ', ''))"

https://ejrh.wordpress.com/2011/05/10/xml-grep/, XML grep, 651.0
https://ejrh.wordpress.com/2012/02/27/xml-viewing-and-diffing/, XML viewing and diffing, 2077.0
https://ejrh.wordpress.com/2012/04/23/xml-in-the-database/, XML in the database, 1079.0

The query is a bit messy, but that’s XPath’s fault more than it is xmlgrep.py’s. ;-)  The formula for approximate word count assumes words are separated by spaces, and works by:

  1. Normalising the white space in the text, replacing each block of white space with a single space and trimming it from both ends.
  2. Removing all space characters (translating each one to nothing).
  3. Taking the difference in text length before and after the previous step, i.e. counting how many blocks of white space were removed. This is one fewer than the number of words.
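The same three steps can be sketched in plain Python. This is only an illustration of the formula, not part of xmlgrep.py; the function names are hypothetical, and normalize_space is a hand-written stand-in for the XPath function of the same name.

```python
import re


def normalize_space(text):
    # XPath treats space, tab, CR and LF as white space: collapse each
    # run to a single space, then trim both ends.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


def approx_word_count(text):
    normalized = normalize_space(text)            # step 1
    without_spaces = normalized.replace(" ", "")  # step 2
    return len(normalized) - len(without_spaces)  # step 3


print(approx_word_count("  XML   grep is\n neat "))
```

The sample text normalises to "XML grep is neat", which contains three spaces, so the function returns 3 for four words. For posts of several hundred words the off-by-one is negligible.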

Although it’s a mess, it’s pretty neat that you can do that with something as limited as the XPath string functions.

The program xmlgrep.py is a standalone file hosted on Google Code. The latest version is at http://code.google.com/p/ejrh/source/browse/trunk/utils/xmlgrep.py.
