0

I want a regex to find the string between two characters, but only from the start delimiter to the first occurrence of the end delimiter

I want to extract the story from lines of the following format:

<metadata name="user" story="{some_text_here}" \/>

So I want to extract only: {some_text_here}

For that I am using the following regex:

<metadata name="user" story="(.*)" \/>

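And the Java code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main { // class name is arbitrary
    public static void main(String[] args) throws IOException {
        // Quotes and the backslash must be escaped inside a Java string literal
        String regexString = "<metadata name=\"user\" story=\"(.*)\" \\/>";
        String filePath = "C:\\Desktop\\temp\\test.txt";
        Pattern p = Pattern.compile(regexString);
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = p.matcher(line);
                if (m.find()) {
                    // Print whatever the (.*) group captured
                    System.out.println(m.group(1));
                }
            }
        }
    }
}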

This regex mostly works, but surprisingly, if the line is:

<metadata name="user" story="My name is Nick" extraStory="something" />

Running the code prints My name is Nick" extraStory="something, whereas I only want to get My name is Nick.

I also want to make sure that there is nothing between story="My name is Nick" and the closing />.

12
  • 2
    Compulsory link. Commented Jan 25, 2017 at 14:17
  • 2
    You want to make the quantifier non-greedy, or exclude the ending character. Commented Jan 25, 2017 at 14:18
  • 3
    What you need is a context-aware parser, which regex isn't. Commented Jan 25, 2017 at 14:18
  • 1
    (?<=story=")[^"]++(?=") ought to work. But see my comment above, regex cannot parse XML in the general case. Commented Jan 25, 2017 at 14:18
  • 1
    You really, really, really should use a parser for this. But given the specificity of your regex, you can just change . to [^"]: <metadata name="user" story="([^"]*)" \/> That will fix the issue you've mentioned, but I bet it will break in other situations. (Hence, parser.) Commented Jan 25, 2017 at 14:19
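
The suggestions in these comments behave differently on the problematic line, and a small sketch may make that concrete. The class name QuantifierDemo and the helper firstMatch are made up for illustration; the patterns are the ones from the question and the comments, with a plain / instead of \/. One thing it appears to show: as long as the pattern still requires " /> right after the capture, a reluctant .*? simply keeps growing until it finds that suffix, so it does not fix the problem on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: returns the requested group of the first match, or null if nothing matches.
    public static String firstMatch(String regex, String line, int group) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String line = "<metadata name=\"user\" story=\"My name is Nick\" extraStory=\"something\" />";

        // Greedy: .* runs to the end of the line, then backs off to the last '" />'.
        System.out.println(firstMatch("story=\"(.*)\" />", line, 1));
        // prints: My name is Nick" extraStory="something

        // Reluctant: .*? grows until '" />' follows, so it also crosses the first closing quote.
        System.out.println(firstMatch("story=\"(.*?)\" />", line, 1));
        // prints: My name is Nick" extraStory="something

        // Lookaround from the comments: stops at the first quote, but ignores whatever follows it.
        System.out.println(firstMatch("(?<=story=\")[^\"]++(?=\")", line, 0));
        // prints: My name is Nick
    }
}
```

The lookaround variant extracts the value but accepts the line even though it carries an extra attribute, so it would not cover the validation requirement in the question.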

3 Answers

1
<metadata name="user" story="([^"]*)" \/>

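[^"]* matches any run of characters other than ", so the capture stops at the first closing quote. In this case the string

<metadata name="user" story="My name is Nick" extraStory="something" />

will not be matched, because extraStory="something" sits between the closing quote and />.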
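
To check how the negated character class behaves on both kinds of line, here is a minimal sketch. StoryExtractor and extractStory are hypothetical names; the pattern is the one above, written with Java string escaping and a plain / (the slash needs no escaping in a Java regex).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StoryExtractor {
    // The pattern from this answer: the capture group cannot cross a double quote.
    private static final Pattern STORY =
            Pattern.compile("<metadata name=\"user\" story=\"([^\"]*)\" />");

    // Hypothetical helper: returns the story text, or null when the line does not match.
    public static String extractStory(String line) {
        Matcher m = STORY.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String valid = "<metadata name=\"user\" story=\"My name is Nick\" />";
        String extra = "<metadata name=\"user\" story=\"My name is Nick\" extraStory=\"something\" />";

        System.out.println(extractStory(valid)); // prints: My name is Nick
        System.out.println(extractStory(extra)); // prints: null
    }
}
```

Because the pattern demands " /> immediately after the captured value, a line with any extra attribute is rejected rather than partially matched, which seems to be what the question asks for.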


1

The following XPath should solve your problem:

//metadata[@name='user' and @story and count(@*) = 2]/@story

It selects the story attribute of any metadata element in the document whose name attribute is user and which has a story attribute but no others (the attribute count is 2).

(Note: //metadata[@name='user' and count(@*)=2]/@story would be enough, since the trailing /@story step selects nothing when the element's second attribute isn't story.)

In Java, assuming you have an org.w3c.dom.Document and an XPath instance available, the code would be:

xPath.evaluate("//metadata[@name='user' and @story and count(@*) = 2]/@story", xmlDoc);

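
For readers who do not yet have a Document or an XPath instance, the following is a rough end-to-end sketch using only the JDK's built-in XML APIs. The class name XPathStoryDemo, the method story and the <root> wrapper element are assumptions made for the example; the XPath expression is the one from this answer. It also assumes the input is well-formed XML, so the element must end in /> rather than \/>.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathStoryDemo {
    private static final String EXPR =
            "//metadata[@name='user' and @story and count(@*) = 2]/@story";

    // Hypothetical helper: returns the story value, or "" when no element qualifies.
    public static String story(String xml) {
        try {
            Document xmlDoc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the result,
            // which is the empty string for an empty node set.
            return xPath.evaluate(EXPR, xmlDoc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String valid = "<root><metadata name=\"user\" story=\"My name is Nick\" /></root>";
        String extra = "<root><metadata name=\"user\" story=\"My name is Nick\" extraStory=\"something\" /></root>";

        System.out.println(story(valid));           // prints: My name is Nick
        System.out.println(story(extra).isEmpty()); // prints: true
    }
}
```

Note that an empty result cannot distinguish "no valid element" from a valid element whose story is the empty string; if that matters, evaluating to a node set instead would be one option.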

3 Comments

'extraStory' was just an example. Sorry if I was not clear. It is invalid if it has anything apart from 'name' and 'story' so 'extraStory' tag would make the line invalid, 'extraStory1' would make it invalid, 'xyz' would also make it invalid.
@NickDiv I've updated the XPath expression to make sure the only two attributes are name and story.
Thanks a lot, I appreciate the help. I will definitely try this out.
0

Just use Jsoup. It's the right tool for the problem :)

It's this easy:

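import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sample input taken from the question; in practice, read the HTML file into this string
String html = "<metadata name=\"user\" story=\"My name is Nick\" />";
Document document = Jsoup.parse(html);
// Select the metadata element whose name attribute is "user" and read its story attribute
String story = document.select("metadata[name=user]").attr("story");
System.out.println(story);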

9 Comments

I'm not sure it's the right tool. I think it's overkill if 1) the source is well-formed XML data and 2) the user isn't already familiar with CSS / jQuery selector queries.
But wouldn't it also read a line containing an invalid attribute, i.e. a line containing extraStory? That is a limitation for me, since the line should not contain anything but the name and story attributes.
@Aaron It might be ever so slightly slower, but its simplicity is worth it. It's a one-liner; you can't get any simpler.
@NickDiv It only extracts the data within the "story" attribute, nothing more, mate. That's why it's the right tool for the job. :)
@nafas The XPath for this would be //metadata[@name="user"]/@story, which is slightly shorter than your DOM manipulation because it includes the attribute selection. Jsoup is good for parsing malformed HTML and because it gives access to the DOM through the popular CSS selector queries. If you don't need either of those two capabilities, I just wouldn't call it the right tool for the problem.