
I have files encoded in Markdown, MediaWiki syntax, Creole, source code, and plain text.

These files may contain a stray XML element. By stray, I mean an XML element embedded in a file that is not itself XML, for example:

  • QUnit has <reference path=""/> in unit tests
  • Javadoc comments contain XML elements

How do I extract this element in the most reliable way? The file is not an XML document, but the XML element itself is well-formed.

I have been playing with sed to extract the contents of the element:

gsed -n '/<myelement>/,/<\/myelement>/p' < test.txt > output.txt

This simply removes all the non-XML from the file and leaves my custom elements behind, but it doesn't let me process each one individually. I could then run xmlstarlet on the resulting file, but that doesn't tell me where each element appeared in the source document.

What is the best way to do this? How can I modify the sed command to match one element at a time (so I can perform the replacement myself)?

Would it be better to read the whole file into a root element, process it as semi-structured XML with XML tools, and handle the replacement during the XML parsing?

  • The question is not clear. Do you want to extract all XML text with corresponding positions within the input? Commented Dec 15, 2012 at 15:47
  • Yes. I want to know at what line and column the original XML element appears. Having worked on this problem, it now seems better to wrap the entire file as XML and let XML tooling do the replacement rather than trying to do it myself. That doesn't really help with diffs, though. Commented Dec 16, 2012 at 0:26
  • If you just want to know where it appears, using grep is probably what you want to do. Or do you need to do something with the contents? Commented Dec 16, 2012 at 0:41
  • Just post some sample input and expected output so we're not guessing. Commented Dec 16, 2012 at 20:07
  • @EdMorton: imagine a post or comment such as this one where I suddenly decide I want to include a file, I write <include file="path">. That's what I mean. Commented Jan 1, 2013 at 20:15
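For the line-and-column requirement raised in the comments above, a minimal Python sketch can convert regex match offsets into positions. This assumes, as in the answers below, that `<myelement>` is not nested; the tag name and sample text are illustrative only:

```python
import re

def find_elements(text, tag="myelement"):
    """Yield (line, column, xml_text) for each non-nested <tag>...</tag>."""
    pattern = re.compile(r"<{0}\b.*?</{0}>".format(tag), re.S)
    for m in pattern.finditer(text):
        start = m.start()
        # Count newlines before the match for the 1-based line number,
        # and measure the distance to the previous newline for the column.
        line = text.count("\n", 0, start) + 1
        column = start - (text.rfind("\n", 0, start) + 1) + 1
        yield line, column, m.group()

sample = "intro text\nsee <myelement>x</myelement> here\n"
print(list(find_elements(sample)))
# prints: [(2, 5, '<myelement>x</myelement>')]
```

This gives each occurrence individually, together with where it appears in the source, which the plain gsed range-print cannot do.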

2 Answers


If the gsed (regex-based) solution extracts the correct XML text, then you could extend it to report start/end positions, assuming <myelement> is not nested:

$ perl -0777 -ne 'print "start: $-[0], end: $+[0], xml: {{{$&}}}\n" while /<myelement>.*?<\/myelement>/gs' < input > output

Input

some arbitrary text
A well-formed xml:

<myelement>
... xml here
</myelement>

some arbitrary text follows more elements: <myelement>... xml</myelement> the end

Output

start: 40, end: 77, xml: {{{<myelement>
... xml here
</myelement>}}}
start: 122, end: 152, xml: {{{<myelement>... xml</myelement>}}}

Here's a Python solution that builds a regex matching some XML elements in plain text, assuming each root element is not nested and does not appear inside comments or CDATA. It is based on Matching patterns in Python:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import sys
from xml.etree import ElementTree as etree

# build regex that matches xml element
# xml_element = start_tag <anything> end_tag
#             | self_close_tag
xml_element = '(?xs) {start_tag} (?(self_close) |.*? {end_tag})'

# start_tag = '<' name  *attr '>'
# self_close_tag = '<' name *attr '/>'
ws = r'[ \t\r\n]*'  # whitespace
start_tag = '< (?P<name>{name}) {ws} (?:{attr} {ws})* (?P<self_close> / )? >'
end_tag = '</ (?P=name) >'
name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'  # match attribute
                                     #  - fragile against missing '"'
                                     #  - no “'” support
assert '{{' not in xml_element
while '{' in xml_element: # unwrap definitions
    xml_element = xml_element.format(**vars())

# extract xml from stdin
all_text = sys.stdin.read()
for m in re.finditer(xml_element, all_text):
    print("start: {span[0]}, end: {span[1]}, xml: {begin}{xml}{end}".format(
            span=m.span(), xml=m.group(), begin="{{{", end="}}}"))
    # assert well-formness of the matched xml text by parsing it
    etree.XML(m.group())

There is a trade-off between matching a larger variety of XML elements and avoiding false positives.

A more robust solution would take the input format into account, i.e., QUnit or Javadoc lexers/parsers could help extract XML fragments that can then be fed into an XML parser.

Beware:

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

Can you provide some examples of why it is hard to parse XML and HTML with a regex?


There is no need to extract the elements manually. You can take advantage of the comprehensive XML ecosystem by wrapping your data in a root node during processing.

For example, a Java source file or a JavaScript file is technically XML once it is wrapped in a root element.

You can then use tools designed for the purpose such as XPath or SAX. I used xmlstarlet.
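A minimal sketch of this wrap-in-root approach, using Python's `ElementTree` instead of xmlstarlet. The `<include>` element is the hypothetical one from the question's comments, and this assumes the surrounding text contains no bare `<`, `>`, or `&`, a limitation the comments below discuss:

```python
from xml.etree import ElementTree as etree

source = 'function f() {}\n<include file="header.js"/>\nmore code\n'

# Wrap the whole file in a synthetic root so an XML parser accepts it;
# the non-XML text becomes text/tail content of the root element.
root = etree.fromstring("<root>" + source + "</root>")

# Now ordinary XML tooling applies: find every <include> element.
for el in root.iter("include"):
    print(el.tag, el.attrib)
# prints: include {'file': 'header.js'}
```

With the tree in hand, replacement can be done by mutating elements and re-serializing, rather than by text substitution.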

5 Comments

You can't wrap arbitrary text with '<root>%s</root>' % text and expect it to be well-formed XML. For example, a Java source file or a JavaScript file can contain text (e.g., a standalone <) that makes it ill-formed XML when wrapped in a root element.
Sebastian: you are correct. This is why I escape these characters. You can either encode them as entities (XML escaping) or wrap the block in CDATA.
If you escape the text or use CDATA, then you lose the internal XML elements: you'll just have the root element and a chunk of unstructured text in it.
That's only if you create a <root>%s</root> and then add the text to it in a CDATA node, or escape it as you insert it. If you do a dumb append and prepend of the tags, the elements remain internal XML nodes, and that's fine. In my case, I generally add my internal nodes after doing the wrapping, so I usually put in the CDATA nodes myself (&lt; everywhere is ugly). That is, it's impossible or unlikely for there to be special nodes when I first do the dumb wrapping. I guess it would be neat to simply encode the text, insert it, then look for the parseable 'special' nodes and unencode them.
I have a project in the pipeline called 'markupcontrol': a markup language aggregator; you can use whatever generator you want (txt2tags, pandoc, etc.) along with your custom substitutions. Will post a GitHub link soon.
