If the gsed (regex-based) solution extracts the correct XML text, you could extend it to report start/end positions as well, assuming <myelement> is not nested:
$ perl -0777 -ne 'print "start: $-[0], end: $+[0], xml: {{{$&}}}\n" while /<myelement>.*?<\/myelement>/gs' < input > output
Input
some arbitrary text
A well-formed xml:
<myelement>
... xml here
</myelement>
some arbitrary text follows more elements: <myelement>... xml</myelement> the end
Output
start: 40, end: 77, xml: {{{<myelement>
... xml here
</myelement>}}}
start: 122, end: 152, xml: {{{<myelement>... xml</myelement>}}}
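The same fixed-tag search can be sketched in Python with re.finditer. This is only an approximation of the one-liner above, under the same no-nesting assumption; the function name find_myelements and the sample string are made up for illustration. Note that Python reports offsets in characters of a decoded str, which can differ from Perl's offsets on undecoded input when the text contains non-ASCII characters.

```python
import re

# Same pattern as the Perl one-liner: non-greedy match, DOTALL so '.' spans newlines.
MYELEMENT = re.compile(r"<myelement>.*?</myelement>", re.DOTALL)


def find_myelements(text):
    """Return a list of (start, end, xml) for each non-nested <myelement>.

    Hypothetical helper, not part of the original answer.
    """
    return [(m.start(), m.end(), m.group()) for m in MYELEMENT.finditer(text)]


# Illustrative input; any text containing <myelement>...</myelement> works.
sample = "intro <myelement>\n... xml here\n</myelement> tail <myelement>x</myelement>"
for start, end, xml in find_myelements(sample):
    print("start: %d, end: %d, xml: {{{%s}}}" % (start, end, xml))
```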
Here's a Python solution that builds a regex matching some XML elements in plain text, assuming each root element is not nested and does not appear inside comments or CDATA. It is based on "Matching patterns in Python":
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import sys
from xml.etree import ElementTree as etree
# build regex that matches xml element
# xml_element = start_tag <anything> end_tag
#             | self_close_tag
xml_element = '(?xs) {start_tag} (?(self_close) |.*? {end_tag})'
# start_tag = '<' name *attr '>'
# self_close_tag = '<' name *attr '/>'
ws = r'[ \t\r\n]*' # whitespace
start_tag = '< (?P<name>{name}) {ws} (?:{attr} {ws})* (?P<self_close> / )? >'
end_tag = '</ (?P=name) >'
name = '[a-zA-Z]+' # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"' # match a double-quoted attribute
# - fragile against a missing '"'
# - single-quoted values ('...') are not supported
assert '{{' not in xml_element
while '{' in xml_element: # unwrap definitions
    xml_element = xml_element.format(**vars())
# extract xml from stdin
all_text = sys.stdin.read()
for m in re.finditer(xml_element, all_text):
    print("start: {span[0]}, end: {span[1]}, xml: {begin}{xml}{end}".format(
        span=m.span(), xml=m.group(), begin="{{{", end="}}}"))
    # check well-formedness of the matched xml text by parsing it
    # (etree.XML raises ParseError if the text is not well-formed)
    etree.XML(m.group())
There is a trade-off between matching a larger variety of XML elements and avoiding false positives.
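To make the trade-off concrete, here is a minimal sketch that uses a pre-expanded form of the pattern built above and reports, instead of raising, whether each match parses. The helper name extract and the sample strings are illustrative assumptions. It shows one miss (a nested element with the same name is cut at the first closing tag, so the match is not well-formed) and one false positive (an element inside a comment is still matched).

```python
import re
from xml.etree import ElementTree as etree

# Pre-expanded form of the pattern above (same assumptions: ASCII names,
# double-quoted attributes, no nesting of an element inside itself).
XML_ELEMENT = re.compile(r'''(?xs)
    < (?P<name>[a-zA-Z]+) [ \t\r\n]*
      (?: [a-zA-Z]+ [ \t\r\n]* = [ \t\r\n]* "[^"]*" [ \t\r\n]* )*
      (?P<self_close> / )? >
    (?(self_close) | .*? </ (?P=name) > )
''')


def extract(text):
    """Yield (start, end, xml, well_formed) for each candidate element.

    Hypothetical helper: well_formed is False when the parser rejects the match.
    """
    for m in XML_ELEMENT.finditer(text):
        try:
            etree.XML(m.group())
            well_formed = True
        except etree.ParseError:
            well_formed = False
        yield m.start(), m.end(), m.group(), well_formed


samples = (
    'see <img src="a.png"/> and <b>bold</b>',  # both elements are matched
    '<a><a>x</a></a>',      # nested: the match stops at the first </a>
    '<!-- <b>old</b> -->',  # inside a comment: still matched
)
for text in samples:
    print(list(extract(text)))
```

Parsing each match, as the original script does, catches the nested case; it cannot catch the comment case, because the matched fragment is itself well-formed.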
A more robust solution should take the format of the input into account, i.e., lexers/parsers for QUnit or Javadoc could help extract the XML fragments, which could then be fed into an XML parser.
Beware:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
Can you provide some examples of why it is hard to parse XML and HTML with a regex?