Regex Contains in the XML element

Question

How can I do a "contains" match in a regex (like Contains in C# or %LIKE% in SQL)?

I have a regex to match the XML node with exact text:

<([\w]+)[^>]*>sample<\/\1>

It returns the node name only when the node text matches exactly, but I want a substring match, like Contains in C# or LIKE '%sample%' in SQL.

Text:

    <Part>this is sample part</Part>
    <Remarks>this is sample remark</Remarks>
    <Notes>this is sample notes</Notes>
    <Desc>sample</Desc>

The regex should return all of the nodes above, but currently it returns only the last one.

I created a sample here to test.

Wrong tool for the job. Regex is not an XML parser, nor can it ever be.
– spender, Commented Jun 6, 2017 at 10:33
Why don't you use XPath? //*[contains(text(), "sample")]/local-name()
– Wiktor Stribiżew, Commented Jun 6, 2017 at 10:33
Another note on the XML part: consider a file where the XML is not nicely formatted over multiple lines but has all nodes on a single line, or, similarly, XML node content spanning multiple lines. If you think you have a working regex for both cases, let's do some nesting: "<Notes>this is <SubNote>i'm a hacky sample</SubNote> sample notes</Notes>".
– grek40, Commented Jun 6, 2017 at 12:03
Use LINQ to XML with a Where that does a string Contains for your search. Always try a string method before reaching for Regex. Always parse XML with the XmlDocument class, the XDocument class, XmlReader, or XmlSerializer.
– jdweng, Commented Jun 6, 2017 at 12:24

Wiktor Stribiżew · Accepted Answer · 2017-06-06 11:30:05Z

2

You may use XDocument to parse XML like this:

using System;
using System.Linq;
using System.Xml.Linq;

var s = @"<?xml version=""1.0""?>
  <root>
    <Part>this is sample part</Part>
    <Remarks>this is sample remark</Remarks>
    <Notes>this is sample notes</Notes>
    <Desc>sample</Desc>
  </root>";
var document = XDocument.Parse(s);
var names = document.Descendants()
               .Elements() // child elements of every element, i.e. everything except the root
               .Where(x => x.Value.Contains("sample")) // all nodes whose text contains "sample"
               .Select(a => a.Name.LocalName); // return the local names of the nodes
Console.WriteLine(string.Join("\n", names));

It prints:

Part
Remarks
Notes
Desc

The same can be achieved with an XPath:

// XPathSelectElements is an extension method from the System.Xml.XPath namespace
var names2 = document.Root.XPathSelectElements("//*[contains(text(), \"sample\")]");
var results = names2.Select(x => x.Name.LocalName);
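
One thing worth checking before treating the two approaches as interchangeable: with nested content like grek40's example from the comments, x.Value and contains(text(), ...) can disagree. The sketch below assumes XPath 1.0 semantics, which is what System.Xml.XPath implements, and uses a shortened version of that nested sample. The variable names are only for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

var doc = XDocument.Parse(
    "<root><Notes>this is <SubNote>a hacky sample</SubNote> sample notes</Notes>" +
    "<Desc>sample</Desc></root>");
var root = doc.Root ?? throw new InvalidOperationException("no root element");

// LINQ: Value concatenates the text of the element and all of its descendants.
var byValue = root.Descendants()
    .Where(e => e.Value.Contains("sample"))
    .Select(e => e.Name.LocalName)
    .ToList();

// XPath 1.0: contains(text(), ...) converts the node-set to a string using only
// the FIRST text child, so Notes ("this is ") is not selected here.
var byFirstText = doc.XPathSelectElements("//*[contains(text(), 'sample')]")
    .Select(e => e.Name.LocalName)
    .ToList();

// Testing every direct text child instead picks Notes up again.
var byAnyText = doc.XPathSelectElements("//*[text()[contains(., 'sample')]]")
    .Select(e => e.Name.LocalName)
    .ToList();

Console.WriteLine("Value:       " + string.Join(", ", byValue));
Console.WriteLine("first text:  " + string.Join(", ", byFirstText));
Console.WriteLine("any text:    " + string.Join(", ", byAnyText));
```

For the flat sample in the question all three variants select the same four elements; the difference only shows up once elements contain mixed text and child elements.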

To fall back to regex in case the XML is not valid, use

<(?:\w+:)?(\w+)[^<]*>[^<]*?sample[^<]*</(?:\w+:)?\1>

See the regex demo. Note that (?:\w+:)? matches an optional namespace prefix in the opening and closing tags, and [^<] matches any character except <, so a match cannot run over into the next node.
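
In C#, the pattern above could be applied roughly as follows. This is a minimal sketch: the input is the text from the question plus two extra lines I added for illustration (a prefixed element and a non-matching one), and group 1 is read as the local element name.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input: the nodes from the question plus two illustrative extras.
// There is no single root element, so this is not a well-formed XML document.
var text = @"<Part>this is sample part</Part>
<Remarks>this is sample remark</Remarks>
<Notes>this is sample notes</Notes>
<Desc>sample</Desc>
<ns:Item>a sample with a prefix</ns:Item>
<Other>no match here</Other>";

// Pattern from the answer; group 1 captures the element name without its prefix.
var pattern = @"<(?:\w+:)?(\w+)[^<]*>[^<]*?sample[^<]*</(?:\w+:)?\1>";

var names = Regex.Matches(text, pattern)
                 .Cast<Match>()
                 .Select(m => m.Groups[1].Value)
                 .ToList();

Console.WriteLine(string.Join("\n", names));
// Part, Remarks, Notes, Desc, Item (one per line)
```

Because [^<] cannot cross a < character, an element whose text is interrupted by a child element will not match as a whole, which is the nesting limitation raised in the comments on the question.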

edited Jun 6, 2017 at 11:30

answered Jun 6, 2017 at 11:20

Wiktor Stribiżew


4 Comments

grek40 Over a year ago

Too bad the question is specifically about regex... still, this approach is so much more suitable for the job that I have to +1 it anyway :)

Spen D Over a year ago

@wiktor just a quick question: performance-wise, which is the best option, LINQ, regex, or XPath? I am handling a huge set of XML files to search for the text.

Wiktor Stribiżew Over a year ago

When you deal with valid XML files, I'd rather use an XML parser with LINQ. If you have to deal with XML files that can be valid or invalid, regex can help and the speed will depend on the contents, XML size, and luck. Note I have to deal with invalid XML every day and I use regex with XML - but it is not a regular XML, it is TMX file format, and I have a special parser built manually for them. And the performance is fine.
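
The "parser when the XML is valid, regex when it is not" approach described here could be wired together like this. It is only a sketch: FindElementNames is a hypothetical helper name, not something from this thread or from any library, and it assumes that an XmlException from XDocument.Parse is the signal to fall back to the regex from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: try LINQ to XML first, fall back to the regex if parsing fails.
List<string> FindElementNames(string xml, string needle)
{
    try
    {
        var root = XDocument.Parse(xml).Root;
        if (root == null) return new List<string>();
        return root.Descendants()
                   .Where(e => e.Value.Contains(needle))
                   .Select(e => e.Name.LocalName)
                   .ToList();
    }
    catch (XmlException)
    {
        // Not well-formed: use the fallback pattern, escaping the search text.
        var pattern = @"<(?:\w+:)?(\w+)[^<]*>[^<]*?" + Regex.Escape(needle)
                    + @"[^<]*</(?:\w+:)?\1>";
        return Regex.Matches(xml, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }
}

var wellFormed = "<root><Part>this is sample part</Part><Desc>other</Desc></root>";
// Two top-level elements and no single root, so XDocument.Parse throws XmlException.
var fragment = "<Part>this is sample part</Part><Desc>sample</Desc>";

Console.WriteLine(string.Join(", ", FindElementNames(wellFormed, "sample"))); // Part
Console.WriteLine(string.Join(", ", FindElementNames(fragment, "sample")));   // Part, Desc
```

Whether this is fast enough for a large set of files is not something the sketch answers; it would need to be measured on the real data.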

Spen D Over a year ago

@WiktorStribiżew same here, sometimes we receive invalid XML, which is the reason I chose regex to match the search string. Let me go ahead with your regex. Thanks a ton.

Anton Sorokin · Accepted Answer · 2017-06-06 10:39:36Z

1

Your expression looks for an exact match of the string "sample" inside a tag, not for text that contains "sample" as a substring. You can fix it as follows to match all of the lines:

<([\w]+)[^>]*>[a-zA-Z ]*sample[a-zA-Z ]*<\/\1>

answered Jun 6, 2017 at 10:39

Anton Sorokin


4 Comments

grek40 Over a year ago

I'd rather use [^<] instead of the [a-zA-Z ] placeholders... or just accept anything non-greedily. Still, that's just a fix for the given examples. With arbitrary XML, any regex will fail somewhere.

Wiktor Stribiżew Over a year ago

Once there is a digit, or punctuation before sample, there won't be any match due to [a-zA-Z ]*.

Anton Sorokin Over a year ago

I agree with you; of course it doesn't cover all the cases (for instance, there could also be punctuation symbols etc.), but it gives an idea of where the problem is and how to cover the particular input provided in the question.

Spen D Over a year ago

@grek40 it did the trick for any characters, thanks for the input: <([\w]+)[^>]*>[^<]*sample[^<]*<\/\1>
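
To see the difference between the two variants discussed in these comments, here is a small C# comparison. The Qty line is an input I made up to include a digit and punctuation; the other two lines are from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = @"<Part>this is sample part</Part>
<Qty>2 sample(s), checked.</Qty>
<Desc>sample</Desc>";

// Variant from the answer: only letters and spaces allowed around "sample".
var lettersOnly = @"<([\w]+)[^>]*>[a-zA-Z ]*sample[a-zA-Z ]*<\/\1>";
// Variant from the comment: anything except '<' allowed around "sample".
var anyButLt = @"<([\w]+)[^>]*>[^<]*sample[^<]*<\/\1>";

// Returns the captured element names for a given pattern.
string[] Names(string pattern) =>
    Regex.Matches(text, pattern)
         .Cast<Match>()
         .Select(m => m.Groups[1].Value)
         .ToArray();

Console.WriteLine(string.Join(", ", Names(lettersOnly))); // Part, Desc
Console.WriteLine(string.Join(", ", Names(anyButLt)));    // Part, Qty, Desc
```

The letters-only version skips Qty because of the leading digit, as noted above, while the [^<] version accepts it.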
