Regex pattern for finding string between two characters - but first occurrence of the second character

Question

I want a regex that finds the string between two delimiters, but only from the start delimiter to the first occurrence of the end delimiter.

I want to extract the story from lines of the following format:

<metadata name="user" story="{some_text_here}" \/>

So I want to extract only : {some_text_here}

For that I am using the following regex:

<metadata name="user" story="(.*)" \/>

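And Java code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) throws IOException {
        // Double quotes and backslashes must be escaped inside a Java string literal.
        String regexString = "<metadata name=\"user\" story=\"(.*)\" \\/>";
        String filePath = "C:\\Desktop\\temp\\test.txt";
        Pattern p = Pattern.compile(regexString);
        Matcher m;
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                m = p.matcher(line);
                if (m.find()) {
                    // Print the captured story text.
                    System.out.println(m.group(1));
                }
            }
        }
    }
}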

This regex mostly works fine, but surprisingly, if the line is:

<metadata name="user" story="My name is Nick" extraStory="something" />

running the code prints My name is Nick" extraStory="something, whereas I only want to get My name is Nick.

I also want to make sure that there is nothing between story="My name is Nick" and the closing />.

You want to make the quantifier non-greedy, or exclude the ending character.
– T.J. Crowder, Commented Jan 25, 2017 at 14:18
(?<=story=")[^"]++(?=") ought to work. But see my comment above, regex cannot parse XML in the general case.
– Boris the Spider, Commented Jan 25, 2017 at 14:18
You really, really, really should use a parser for this. But given the specificity of your regex, you can just change . to [^"]: <metadata name="user" story="([^"]*)" \/> That will fix the issue you've mentioned, but I bet it will break in other situations. (Hence, parser.)
– T.J. Crowder, Commented Jan 25, 2017 at 14:19
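
To make the greedy versus non-greedy distinction from the comments concrete, here is a minimal sketch using the question's sample line. The class name QuantifierDemo and the helper firstGroup are made up for illustration, and the closing tag is written as a plain /> rather than \/>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    public static String firstGroup(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<metadata name=\"user\" story=\"My name is Nick\" extraStory=\"something\" />";

        // Greedy: .* runs to the end of the line, then backtracks to the LAST quote.
        // prints: My name is Nick" extraStory="something
        System.out.println(firstGroup("story=\"(.*)\"", line));

        // Reluctant: .*? stops at the FIRST quote that lets the rest of the pattern match.
        // prints: My name is Nick
        System.out.println(firstGroup("story=\"(.*?)\"", line));

        // Reluctant, but with the closing " />" required: the match has to stretch
        // across the extra attribute to reach it, so the capture overshoots again.
        // prints: My name is Nick" extraStory="something
        System.out.println(firstGroup("story=\"(.*?)\" />", line));
    }
}
```

The last case suggests that a non-greedy quantifier alone would not satisfy the second requirement in the question (nothing between the story value and />), which is one reason to prefer excluding the quote character.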

radicarl · Accepted Answer · 2017-01-25 14:20:13Z

1

<metadata name="user" story="([^"]*)" \/>

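[^"]* matches any run of characters other than ", so the capture group cannot run past the closing quote of the story attribute. With this pattern, the string

<metadata name="user" story="My name is Nick" extraStory="something" />

will not be matched at all, because the closing quote is not immediately followed by " />".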
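
As a rough check of that behaviour, the pattern can be wrapped in a small helper. This is only a sketch: StoryExtractor and extractStory are illustrative names, not part of any library, and the closing tag is written as a plain />.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StoryExtractor {
    // [^"]* cannot cross a double quote, so the capture ends at the first closing quote.
    private static final Pattern STORY =
            Pattern.compile("<metadata name=\"user\" story=\"([^\"]*)\" />");

    // Returns the story text, or an empty Optional when the line does not have
    // exactly the expected shape (for example, when it carries an extra attribute).
    public static Optional<String> extractStory(String line) {
        Matcher m = STORY.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String valid = "<metadata name=\"user\" story=\"My name is Nick\" />";
        String extra = "<metadata name=\"user\" story=\"My name is Nick\" extraStory=\"something\" />";

        System.out.println(extractStory(valid)); // prints: Optional[My name is Nick]
        System.out.println(extractStory(extra)); // prints: Optional.empty
    }
}
```

Note that this still depends on exact spacing and attribute order, which is the kind of fragility the comments warn about.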

answered Jan 25, 2017 at 14:20

radicarl


Aaron · Accepted Answer · 2017-01-25 15:24:32Z

1

The following XPath should solve your problem :

//metadata[@name='user' and @story and count(@*) = 2]/@story

It addresses the story attribute of any metadata node in the document whose name attribute is user and which has a story attribute but no others (the attribute count is 2).

(Note : //metadata[@name='user' and count(@*)=2]/@story would be enough since it would be impossible to address the story attribute of a metadata node whose second attribute isn't story)

In Java code, supposing you are handling an instance of org.w3c.dom.Document and already have an instance of XPath available, the code would be the following :

xPath.evaluate("//metadata[@name='user' and @story and count(@*) = 2]/@story", xmlDoc);

You can try the XPath here or the Java code here.
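
For reference, a self-contained sketch of that approach using only the JDK's built-in XML and XPath classes. The class name XPathStoryDemo, the method stories, and the sample XML with its root element are assumptions made for illustration. It requests a node set so that every matching attribute is returned; the two-argument evaluate call above returns a single string instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathStoryDemo {
    static final String EXPR =
            "//metadata[@name='user' and @story and count(@*) = 2]/@story";

    // Parses the XML text and returns the value of every matching story attribute.
    public static List<String> stories(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(EXPR, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Wrapped so callers need not handle the parser's checked exceptions.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: one valid element and one with an extra attribute.
        String xml = "<root>"
                + "<metadata name=\"user\" story=\"My name is Nick\" />"
                + "<metadata name=\"user\" story=\"Other\" extraStory=\"something\" />"
                + "</root>";
        System.out.println(stories(xml)); // prints: [My name is Nick]
    }
}
```

Unlike the regex, this does not depend on attribute order or whitespace, but it does assume the input is well-formed XML.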

edited Jan 25, 2017 at 15:24

answered Jan 25, 2017 at 15:12

Aaron


3 Comments

Nick Div Over a year ago

'extraStory' was just an example. Sorry if I was not clear. It is invalid if it has anything apart from 'name' and 'story' so 'extraStory' tag would make the line invalid, 'extraStory1' would make it invalid, 'xyz' would also make it invalid.

Aaron Over a year ago

@NickDiv I've updated the XPath expression to make sure the only two attributes are name and story.

Nick Div Over a year ago

thanks a lot. Appreciate the help. Would definitely try this out.

nafas · Accepted Answer · 2017-01-25 14:31:35Z

0

Just use Jsoup. It's the right tool for the problem :).

It's this easy:

String html; // read the HTML file into this variable
Document document = Jsoup.parse(html);
String story = document.select("metadata[name=user]").attr("story");
System.out.println(story);

answered Jan 25, 2017 at 14:31

nafas


9 Comments

Aaron Over a year ago

I'm not sure it's the right tool, I think it's overkill 1) if the source is well-formed XML data and 2) the user isn't familiar already with CSS / jquery selector queries.

Nick Div Over a year ago

But wouldn't it also read a line containing an invalid attribute, i.e. a line containing extraStory? That is a limitation for me: the line should not contain anything but the name and story attributes.

nafas Over a year ago

@Aaron It might be ever so slightly slower, but its simplicity is worth it. It's one line of code; you can't get any simpler.

nafas Over a year ago

@NickDiv It only extracts the data within the "story" attribute, nothing more, mate. That's why it's the right tool for the job. :)

Aaron Over a year ago

@nafas The XPath for this would be //metadata[@name="user"]/@story, which is slightly shorter than your DOM manipulation because it includes the attribute selection. Jsoup is good for parsing malformed HTML and because it enables access to the DOM through the popular CSS selector queries. If you don't need either of those two capabilities, I just wouldn't call it the right tool for the problem.
