XML grep

Do you occasionally have to search for particular needles in a haystack of XML files? Here’s a little script that might help.

It’s an “XML grep”: give it an XPath expression and a set of files and it will print all occurrences of the expression in each file.

There are some output control options:

-l print only the name of each file with at least one match
-L print only the name of each file without a match
-h don’t print the file name.
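
As a rough illustration of how these three modes differ, here is a minimal sketch. It is not the script itself: it uses the standard library's ElementTree (which supports only a limited subset of XPath) rather than libxml2, and the function name `grep_document` and its keyword arguments are made up for this example.

```python
import xml.etree.ElementTree as ET

def grep_document(name, xml_text, path,
                  names_only=False, invert=False, no_filename=False):
    """Return the output lines for one document.

    name     : label printed in front of each match (normally the file name)
    xml_text : the document as a string
    path     : an ElementTree path, evaluated relative to the root element
    """
    root = ET.fromstring(xml_text)
    matches = root.findall(path)
    if invert:                      # like -L: name only when nothing matched
        return [] if matches else [name]
    if names_only:                  # like -l: name only when something matched
        return [name] if matches else []
    lines = []
    for element in matches:
        text = (element.text or '').strip()
        # like -h: leave the file name off the front of the line
        lines.append(text if no_filename else '%s: %s' % (name, text))
    return lines

doc = "<items><item><name>apple</name></item><item><name>pear</name></item></items>"
print(grep_document('fruit.xml', doc, './item/name'))
print(grep_document('fruit.xml', doc, './item/name', no_filename=True))
print(grep_document('fruit.xml', doc, './item/colour', invert=True))
```

Unlike the real script, this sketch does not report line numbers, because ElementTree does not keep them.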

There is also a --parallel N option (default 4), which specifies how many parallel processes to use. XML parsing and searching can be CPU intensive, so the benefit of parallelism is significant. Each process will be given a subset of the files to search; results are then sent back to the master for output.
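
The worker arrangement can be sketched roughly as follows. This is a simplified Python 3 illustration under my own assumptions, not the script's code: it parses strings with ElementTree instead of files with libxml2, and the names `_shared`, `worker_count` and `count_all` are invented for the example. The shape is the same, though: an initialiser stores the search path once per process, and `imap` hands out the work and returns results in input order.

```python
import xml.etree.ElementTree as ET
from multiprocessing import Pool

_shared = {}  # per-process settings, filled in by worker_init

def worker_init(path):
    """Run once in each worker: remember the path every task will search for."""
    _shared['path'] = path

def worker_count(item):
    """Parse one (name, xml_text) pair and count the elements matching the path."""
    name, xml_text = item
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return name, None   # report the failure instead of crashing the pool
    return name, len(root.findall(_shared['path']))

def count_all(items, path, processes=4):
    """Return (name, count) pairs in input order, using `processes` workers."""
    if processes <= 1:
        # Serial fallback: same functions, no extra processes.
        worker_init(path)
        return [worker_count(item) for item in items]
    with Pool(processes, worker_init, (path,)) as pool:
        # imap distributes the items and yields the results in input order.
        return list(pool.imap(worker_count, items))
```

On platforms that start workers by spawning a fresh interpreter, a call such as `count_all(docs, './x')` needs to sit under an `if __name__ == "__main__":` guard in the calling script.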

It requires the libxml2 Python bindings. Parallelism requires the multiprocessing module (Python 2.6 or later); without it, the files are searched one at a time in a single process.

Examples

Text of all comments left by me on this blog:

xmlgrep.py "//comment_content[../comment_author = 'ejrh']/text()" wordpress.2011-05-10.xml

Unique names in a set of documents:

xmlgrep.py -h "/item/name/text()" *.xml | sort | uniq

All documents that don’t contain the word “fringe”:

xmlgrep.py -L "contains(., 'fringe')" *.xml

Possible improvements

  • Allow input to be treated as a set of individual XML documents, one per line.
  • Shorthands for common XPath operations (like concatenation).
  • Separate patterns for searching and output control (allowing, for instance, concatenation on each match, not merely on the first in each document).
  • Allow input to be translated with a style sheet before searching.
  • Optional removal or normalisation of white space; one line per match.
  • Namespace handling (namespaces are currently ignored).
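
To give an idea of what the white space item might involve, here is a minimal sketch. The helper name `one_line` is hypothetical and nothing like it exists in the script yet; it simply collapses each run of white space so that a multi-line match fits on one output line.

```python
def one_line(text):
    """Collapse every run of white space (newlines and tabs included) to a
    single space, and drop leading and trailing white space."""
    return ' '.join(text.split())

print(one_line("  First line\n      second line\t end  "))
```

Whether this should be the default or an option is an open question, since it changes the text of a match.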

Source

import sys
import libxml2
from glob import glob
from optparse import OptionParser

try:
    s = set()
except NameError:
    from sets import Set as set

try:
    from multiprocessing import Pool
except ImportError:
    import itertools
    class Pool(object):
        def __init__(self, num, init_func, args):
            init_func(*args)
        def imap(self, work_func, l):
            return itertools.imap(work_func, l)

def get_file(filename):
    doc = libxml2.readFile(filename, None, 0)

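    # Strip the namespace from every node so that XPath expressions can use
    # plain, unprefixed element names.
    def remove_ns(c):
        c.setNs(None)
        c = c.get_children()
        while c is not None:
            remove_ns(c)
            c = c.next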

    remove_ns(doc)

    return doc

def grep_file(filename, path, options):
    r = []
    doc1 = get_file(filename)

    nodes = doc1.xpathEval(path)
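    # Treat any empty or false result (no nodes, 0, '') as "no match", so that
    # scalar expressions such as contains() also work with -l and -L.
    if options.matching:
        if nodes:
            r.append('%s' % filename)
    elif options.unmatching:
        if not nodes:
            r.append('%s' % filename)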
    else:
        if type(nodes) in [int, float, str]:
            l = ''
            if not options.no_filename:
                l = l + '%s: ' % filename
            l = l + '%s' % nodes
            r.append(l)
        else:
            for n in nodes:
                l = ''
                if not options.no_filename:
                    l = l + '%s:%d: ' % (filename, n.lineNo())
                l = l + '%s' % n
                r.append(l)

    doc1.freeDoc()
    return r

def worker_init(path, options):
    globals()['path'] = path
    globals()['options'] = options

def worker_grep_file(f):
    try:
        return grep_file(f, path, options)
    except Exception, e:
        print >>sys.stderr, 'Failed on %s with %s' % (f, e)
        return False

def parse_command_line(argv = None):
    parser = OptionParser(usage="%prog [options] PATTERN FILENAME...\n       %prog -? (for help)", add_help_option=False)
    parser.add_option("-?", "--help", action='help',
                      help="display info about program usage, including options")
    parser.add_option("-l", "--matching", default=False, action='store_true',
                      help="print the name of each matching file")
    parser.add_option("-L", "--unmatching", default=False, action='store_true',
                      help="print the name of each unmatching file")
    parser.add_option("-h", "--no-filename", default=False, action='store_true',
                      help="suppress the prefixing filename on output")
    parser.add_option("--parallel", default=4, action='store',
                      help="number of grep processes to run in parallel")
    (options, args) = parser.parse_args(argv[1:])

    if len(args) == 0:
        parser.error('No pattern specified')

    if options.no_filename and (options.matching or options.unmatching):
        parser.error('-h cannot be specified with -l or -L')

    options.parallel = int(options.parallel)

    return options, args

def main(argv=None):
    if argv is None:
        argv = sys.argv

    options, args = parse_command_line(argv)

    path = args[0]
    filenames = [b for a in args[1:] for b in glob(a)]

    pool = Pool(options.parallel, worker_init, (path, options))
    results = pool.imap(worker_grep_file, filenames)

    for r in results:
        if r is not False:
            for l in r:
                print l

if __name__ == "__main__":
    sys.exit(main())