Regex: Pattern matching a Multiline Input

Question

I'm looking for a regex pattern to verify that my HTML input has the right structure and (probably in a second step) to extract some information from it.

Example input text:

<title>Example Title</title><br />
<link>Download:</link> <a href="URL">hier</a> | hoster1 <br />
<link>Download:</link> <a href="URL">hier</a> | hoster2 <br />
<link>Download:</link> <a href="URL">hier</a> | hoster3

The title, hoster, and URL can of course change, and they are the parts I want to capture, so my attempt was something like this:

<title>([^<]+?)</title><br />\s<link>Download:</link> <a href="([^"]+?)">hier</a> \| ([^<]+?)<br />\s

These groups might seem a bit silly, but I also tried (.*?), and even in lazy mode it would just match whole lines.

Right now the second part (the <link> part) matches on its own, but not in combination with the <title> part. I'm guessing my whitespace character (\s) doesn't match a newline? How can I check ONLY for a newline character?
The number of available links is dynamic, so I have no idea how many <link> tags there are. How can I make the second half of the pattern repeatable? I'd like to do something like this (which obviously doesn't work that way):

[ <link>Download:</link> <a href="([^"]+?)">hier</a> \| ([^<]+?)<br />\s ]*

All of this is done with the MULTILINE option set (although I'm not sure it is needed for what I want to do).

I have been trying different things for a few days now and am not getting anywhere. I'd really appreciate a few pointers in the right direction, thank you.

maerics · Accepted Answer · 2012-02-06 19:38:31Z

2

Use a proper HTML parser such as jsoup for this sort of task; regular expressions are fine for very simple cases but will quickly become unwieldy. An HTML parser will be much faster, easier, and more correct to implement, especially as you start doing more advanced testing.

answered Feb 6, 2012 at 19:38

maerics

Kalle Richter · Accepted Answer · 2015-06-02 19:31:11Z

0

Just use \r\n wherever you need to match a newline on Windows; otherwise use \n.
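To illustrate matching the line break explicitly and coping with a variable number of link lines, here is a minimal Java sketch. Java is an assumption (the question does not name its language), and the class name, method names, and sample URLs are hypothetical. It validates the whole input in one step with a repeated non-capturing group, using \r?\n for a single Unix or Windows line break, and then extracts each link in a second step with find(), because a capturing group inside a repetition only keeps its last match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answers.
class DownloadListParser {

    // One link line. Group 1 is the URL, group 2 the hoster without trailing blanks.
    // The closing <br /> is optional because the last line in the sample has none.
    static final String LINK =
        "<link>Download:</link> <a href=\"([^\"]+)\">hier</a> \\| ([^<\\r\\n]*[^<\\s])[ \\t]*(?:<br />)?";

    // Whole input: a title line followed by one or more link lines.
    // "\\r?\\n" matches exactly one line break, with or without a carriage return.
    static final Pattern STRUCTURE = Pattern.compile(
        "<title>([^<]+)</title><br />(?:\\r?\\n" + LINK + ")+\\s*");

    static final Pattern LINK_PATTERN = Pattern.compile(LINK);

    /** Returns the title, or null if the input does not have the expected structure. */
    static String parseTitle(String input) {
        Matcher m = STRUCTURE.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    /** Returns one {url, hoster} pair per link line found in the input. */
    static List<String[]> parseLinks(String input) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK_PATTERN.matcher(input);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input modeled on the question; the URLs are placeholders.
        String input = "<title>Example Title</title><br />\r\n"
            + "<link>Download:</link> <a href=\"http://example.com/1\">hier</a> | hoster1 <br />\r\n"
            + "<link>Download:</link> <a href=\"http://example.com/2\">hier</a> | hoster2";

        String title = parseTitle(input);
        if (title == null) {
            System.out.println("Input does not have the expected structure");
            return;
        }
        System.out.println("Title: " + title);
        for (String[] link : parseLinks(input)) {
            System.out.println(link[1] + " -> " + link[0]);
        }
    }
}
```

Neither pattern needs the MULTILINE flag: in Java it only changes where ^ and $ match, and neither anchor is used here. Note also that \s does match a newline, but a single \s consumes only one character, so it cannot cover a two-character \r\n line break.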

edited Jun 2, 2015 at 19:31

Kalle Richter

answered Feb 6, 2012 at 19:36

MozenRath
