2

How can I use "contains" in the regex ("Contains" or "%like%")?

I have a regex to match the XML node with exact text:

<([\w]+)[^>]*>sample<\/\1>

It matches only nodes whose text is exactly "sample", but I want the regex to behave like Contains in C# or LIKE '%sample%' in SQL.

Text:

    <Part>this is sample part</Part>
    <Remarks>this is sample remark</Remarks>
    <Notes>this is sample notes</Notes>
    <Desc>sample</Desc>

The regex should return all of the nodes above, but it currently returns only the last one.

I created a sample here to test.

5
  • 2
    Wrong tool for the job. Regex is not an XML parser, nor can it ever be. Commented Jun 6, 2017 at 10:33
  • 3
    Why don't you use XPath? //*[contains(text(), "sample")]/local-name() Commented Jun 6, 2017 at 10:33
  • @WiktorStribiżew thanks, am trying with Xpath Commented Jun 6, 2017 at 11:27
  • Another note on the XML part: consider a file where the XML is not nicely formatted across multiple lines but instead has all nodes on a single line... or similarly, an XML node whose content spans multiple lines. If you think you have a working regex for both cases, let's do some nesting: "<Notes>this is <SubNote>i'm a hacky sample</SubNote> sample notes</Notes>". Commented Jun 6, 2017 at 12:03
  • Use LINQ to XML with a Where clause that calls string.Contains to do your search. Always try a string method before reaching for regex. Always parse XML with the XmlDocument class, the XDocument class, XmlReader, or XML serialization. Commented Jun 6, 2017 at 12:24
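The nested case raised in the comments can be checked directly. The following is a minimal sketch, not code from the thread: the class and method names are made up for illustration, and the regex is the `[^<]*` variant discussed further down the page. It compares what a flat regex and a parser each report for the nested input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical demo class; the names are illustrative only.
public static class NestedNodeDemo
{
    // The nested example quoted in the comment.
    public const string Input =
        "<Notes>this is <SubNote>i'm a hacky sample</SubNote> sample notes</Notes>";

    // Flat regex: [^<]* cannot step over the child element,
    // so the outer element is never matched.
    public static string[] ByRegex() =>
        Regex.Matches(Input, @"<(\w+)[^>]*>[^<]*sample[^<]*</\1>")
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToArray();

    // Parser: an element counts if any of its own text nodes contains the word.
    public static string[] ByParser() =>
        XDocument.Parse(Input)
                 .Descendants()
                 .Where(e => e.Nodes().OfType<XText>()
                              .Any(t => t.Value.Contains("sample")))
                 .Select(e => e.Name.LocalName)
                 .ToArray();

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ByRegex()));  // SubNote
        Console.WriteLine(string.Join(", ", ByParser())); // Notes, SubNote
    }
}
```

The regex reports only the inner element, while the parser also finds the outer one, whose second text node contains the word.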

2 Answers

2

You may use XDocument to parse XML like this:

using System;
using System.Linq;
using System.Xml.Linq;

var s = @"<?xml version=""1.0""?>
  <root>
    <Part>this is sample part</Part>
    <Remarks>this is sample remark</Remarks>
    <Notes>this is sample notes</Notes>
    <Desc>sample</Desc>
  </root>";
var document = XDocument.Parse(s);
var names = document.Root
               .Descendants()                          // every element below the root
               .Where(x => x.Value.Contains("sample")) // Value also includes text of child elements
               .Select(a => a.Name.LocalName);         // return the local names of the nodes
Console.WriteLine(string.Join("\n", names));

It prints:

Part
Remarks
Notes
Desc

The same can be achieved with an XPath:

// requires: using System.Xml.XPath;
var names2 = document.Root.XPathSelectElements("//*[contains(text(), \"sample\")]");
var results = names2.Select(x => x.Name.LocalName);
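For reference, here is a self-contained sketch of the XPath route. It is an illustration under assumptions rather than the answer's exact code: the class and method names are invented, and the predicate is written as `text()[contains(., ...)]` so that every text node of an element is tested. In XPath 1.0, `contains(text(), ...)` only looks at the first text node, which matters for mixed content.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath; // brings the XPathSelectElements extension method into scope

// Hypothetical helper; the names are illustrative only.
public static class XPathContainsDemo
{
    public static string[] NamesContaining(string xml, string needle)
    {
        // Assumption: the search term has no single quote, because an
        // XPath 1.0 string literal cannot escape its own delimiter.
        if (needle.Contains("'"))
            throw new ArgumentException("needle must not contain a single quote");

        var document = XDocument.Parse(xml);
        // Select every element that has at least one text node containing the term.
        var xpath = $"//*[text()[contains(., '{needle}')]]";
        return document.XPathSelectElements(xpath)
                       .Select(e => e.Name.LocalName)
                       .ToArray();
    }

    public static void Main()
    {
        var xml = "<root><Part>this is sample part</Part><Desc>sample</Desc><Other>x</Other></root>";
        Console.WriteLine(string.Join(", ", NamesContaining(xml, "sample"))); // Part, Desc
    }
}
```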

To fall back to regex in case the XML is not valid, use

<(?:\w+:)?(\w+)[^<]*>[^<]*?sample[^<]*</(?:\w+:)?\1>

See the regex demo. Note that (?:\w+:)? matches an optional namespace prefix in the opening and closing tags. [^<] matches any character except <, so the match cannot run over into the next node.
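As an illustration of how that pattern might be applied from C#, the sketch below runs it over input that a parser would reject (no single root, one unclosed tag). The class name and the sample input are assumptions made for the example; only the pattern comes from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names and input are illustrative only.
public static class RegexFallbackDemo
{
    // The pattern from the answer: optional namespace prefix, captured local name.
    private static readonly Regex NodePattern =
        new Regex(@"<(?:\w+:)?(\w+)[^<]*>[^<]*?sample[^<]*</(?:\w+:)?\1>");

    public static string[] NodeNames(string input) =>
        NodePattern.Matches(input)
                   .Cast<Match>()
                   .Select(m => m.Groups[1].Value) // group 1 is the local name
                   .ToArray();

    public static void Main()
    {
        // Not well-formed XML: several top-level nodes and an unclosed <Broken>.
        var input = "<ns:Part>this is sample part</ns:Part>\n"
                  + "<Broken>no closing tag\n"
                  + "<Desc>sample</Desc>";
        Console.WriteLine(string.Join(", ", NodeNames(input))); // Part, Desc
    }
}
```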


4 Comments

Too bad the question is specifically about regex... still, this approach is so much more suitable for the job that I have to +1 it anyway :)
@wiktor just a quick question: performance-wise, which is the best option, LINQ, regex, or XPath? I am handling a huge set of XML files to search for the text.
When you deal with valid XML files, I'd rather use an XML parser with LINQ. If you have to deal with XML files that can be valid or invalid, regex can help, and the speed will depend on the contents, XML size, and luck. Note I have to deal with invalid XML every day and I use regex with XML, but it is not regular XML: it is the TMX file format, and I have a special parser built manually for it. And the performance is fine.
@WiktorStribiżew same here, sometimes we receive invalid XML, and that's the reason for choosing regex to match the search string. Let me go ahead with your regex. Thanks a ton.
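The "valid or invalid" situation described in these comments could be handled by trying the parser first and using the regex only when parsing fails. This is a rough sketch of that idea, not something proposed verbatim in the thread; `NodeSearch` and `FindNodeNames` are hypothetical names, and no performance claim is made for it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
public static class NodeSearch
{
    public static List<string> FindNodeNames(string content, string needle)
    {
        try
        {
            // Preferred path: well-formed XML goes through the parser.
            return XDocument.Parse(content)
                .Descendants()
                .Where(e => e.Nodes().OfType<XText>()
                             .Any(t => t.Value.Contains(needle)))
                .Select(e => e.Name.LocalName)
                .ToList();
        }
        catch (XmlException)
        {
            // Fallback path: the content is not well-formed, so use the
            // regex from the answer with the search term escaped.
            var pattern = @"<(?:\w+:)?(\w+)[^<]*>[^<]*?"
                        + Regex.Escape(needle)
                        + @"[^<]*</(?:\w+:)?\1>";
            return Regex.Matches(content, pattern)
                .Cast<Match>()
                .Select(m => m.Groups[1].Value)
                .ToList();
        }
    }

    public static void Main()
    {
        var broken = "<root><Part>this is sample part</Part><Desc>sample</Desc>"; // <root> never closed
        Console.WriteLine(string.Join(", ", FindNodeNames(broken, "sample"))); // Part, Desc
    }
}
```

Keeping both paths behind one method means callers do not need to know in advance whether a given file is well-formed.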
1

Your expression looks for an exact match of the string "sample" inside a tag, not for text that contains "sample" as a substring. You can fix your expression as follows to get all the lines:

<([\w]+)[^>]*>[a-zA-Z ]*sample[a-zA-Z ]*<\/\1>

4 Comments

I'd rather use [^<] instead of the [a-zA-Z ] placeholders... or just non-greedy accept anything. Still that's just a fix for the given examples. With arbitrary XML, any regex will fail somewhere.
Once there is a digit, or punctuation before sample, there won't be any match due to [a-zA-Z ]*.
I agree with you, it of course doesn't cover all the cases - for instance there could also be punctuation symbols etc. - but it gives an idea where the problem is and how to cover particular input provided in a question.
@grek40 it did the trick for any characters, thanks for the input: <([\w]+)[^>]*>[^<]*sample[^<]*<\/\1>
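To connect this back to the SQL LIKE analogy in the question, the `[^<]*` pattern can be generated for an arbitrary search term, with the wildcard placed before, after, or on both sides of it. The sketch below is an illustration only: `LikeMode` and `LikeRegex` are invented names, and it inherits every limitation of matching XML with regex that the comments point out.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical names; rough analogues of 'x', 'x%', '%x' and '%x%' in SQL LIKE.
public enum LikeMode { Exact, StartsWith, EndsWith, Contains }

public static class LikeRegex
{
    public static Regex Build(string needle, LikeMode mode)
    {
        var text = Regex.Escape(needle); // treat the search term literally
        const string any = "[^<]*";      // any run of characters that stays inside the node
        string body;
        switch (mode)
        {
            case LikeMode.StartsWith: body = text + any; break;
            case LikeMode.EndsWith:   body = any + text; break;
            case LikeMode.Contains:   body = any + text + any; break;
            default:                  body = text; break; // Exact
        }
        return new Regex(@"<(\w+)[^>]*>" + body + @"</\1>");
    }

    public static string[] NodeNames(string input, string needle, LikeMode mode) =>
        Build(needle, mode).Matches(input)
                           .Cast<Match>()
                           .Select(m => m.Groups[1].Value)
                           .ToArray();

    public static void Main()
    {
        var input = "<Part>this is sample part</Part>\n<Desc>sample</Desc>";
        Console.WriteLine(string.Join(", ", NodeNames(input, "sample", LikeMode.Exact)));    // Desc
        Console.WriteLine(string.Join(", ", NodeNames(input, "sample", LikeMode.Contains))); // Part, Desc
    }
}
```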
