Extracting data from HTML files using regular expressions

Question

I am trying to extract specific data using a regular expression, but I haven't been able to achieve what I want. For example,

in this page

http://mnemonicdictionary.com/wordlist/GREwordlist/startingwith/A

I have to keep only the data which is between,

<div class="row-fluid">

and

<br /> <br /><i class="icon-user"></i>

So I copied the HTML code into Notepad++, enabled "Regular expression" in the Replace dialog, and tried replacing everything that matches

.*<div class="row-fluid">

to delete everything before <div class="row-fluid">

but it is not working at all.

Does anyone know why?

P.S. I am not using any programming language. I just need to do this on HTML code in Notepad++, not on an actual HTML file.

Do I have any other option available? I am doing this just to learn regex.
– Sufiyan Ghori, Commented Jan 29, 2014 at 14:50
I am not using any language. I am just running general regex commands on different texts in Notepad++. I just need to extract particular data from an HTML file that is open in Notepad++ as source code.
– Sufiyan Ghori, Commented Jan 29, 2014 at 14:51

MyDogTom · Accepted Answer · 2014-01-30 12:13:44Z

2

I would achieve this in several steps.

Step 1.

Transform the document into a single line. Find

 \r\n

and replace with nothing. (Make sure to select the "Extended (\n, \r,..)" option in the Replace dialog.)

Step 2.

find

<div class="row-fluid">

and replace with

\r\n~<div class="row-fluid">

Make sure that the character "~" is not used anywhere in the document. This character will help us delete the unnecessary lines later.

Step 3.

find

<br /> <br /><i class="icon-user"></i>

and replace with

<br /> <br /><i class="icon-user"></i>\r\n

Step 4.

Delete the unnecessary lines. Check "Regular expression". Find

^[^~].+$\r\n

and replace with nothing

Step 5.

Now you have only lines that start with

~<div class="row-fluid">

and end with

<br /> <br /><i class="icon-user"></i>

All that remains is to delete these tags.

PS. You can try to record a macro, if you need to do the same task several times.
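
For readers who later move from Notepad++ to a script, the steps above can be approximated in Python. This is only a sketch under assumptions: the sample HTML is made up, and `extract_blocks` is a hypothetical helper name. Instead of joining the document into one line, it uses `re.DOTALL` so that `.` can cross line breaks.

```python
import re

START = '<div class="row-fluid">'
END = '<br /> <br /><i class="icon-user"></i>'

def extract_blocks(html):
    """Return the text between each START marker and the next END marker."""
    # re.escape treats the markers as literal text; re.DOTALL lets "." match
    # line breaks, which stands in for the "make it one line" step.
    pattern = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)
    return [chunk.strip() for chunk in pattern.findall(html)]

# Made-up sample input, not the real page.
sample = (
    "<html><body>\n"
    '<div class="row-fluid">\n'
    "abase: to lower in rank\n"
    '<br /> <br /><i class="icon-user"></i> someone\n'
    "<p>noise</p>\n"
    '<div class="row-fluid">abate: to lessen<br /> <br /><i class="icon-user"></i>\n'
    "</body></html>\n"
)
blocks = extract_blocks(sample)
print(blocks)
```

The non-greedy `(.*?)` stops at the first end marker after each start marker, which plays the same role as splitting the text into one line per block.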


1 Comment

AdrianHHH Over a year ago

If the original line breaks must be retained, you can modify step 1. First choose a string such as qzq, where zq and qz do not occur anywhere in the document. Then replace the line breaks (as in step 1) with qzq. Add a new final step 6 that converts every qzq back to a line break.
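
The placeholder idea in this comment can be illustrated with a short Python sketch. The function names `join_lines` and `restore_lines` are hypothetical, and the check simply enforces the comment's assumption that neither zq nor qz appears in the text.

```python
PLACEHOLDER = "qzq"

def join_lines(text):
    """Replace every CRLF line break with the placeholder."""
    # Refuse to continue if the placeholder could be confused with real content.
    if "zq" in text or "qz" in text:
        raise ValueError("placeholder fragments already occur in the text")
    return text.replace("\r\n", PLACEHOLDER)

def restore_lines(text):
    """Turn every placeholder back into a CRLF line break."""
    return text.replace(PLACEHOLDER, "\r\n")

original = "first line\r\nsecond line\r\nthird line"
one_line = join_lines(original)
restored = restore_lines(one_line)
print(one_line)
```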

Leo · Accepted Answer · 2014-01-29 15:01:49Z

1

You should consider retrieving the data using XPath. Most languages support it.

There's a great Firefox plugin called XPather that infers the XPath expression when you select a page item.

There's a hacked version that works with newer Firefox versions here:

http://jassage.com/xpather-1.4.5b.xpi

To use XPath with Python, consider using http://xmlsoft.org/python.html

Note that XPath may have problems with malformed HTML, so you may also find tidy an interesting option to "clean up" the HTML and get parseable XML.

http://tidy.sourceforge.net/
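
As a rough illustration of the XPath approach, here is a sketch that uses Python's standard `xml.etree.ElementTree` rather than the libxml2 bindings linked above. ElementTree supports only a limited subset of XPath, and the snippet assumes the markup is already well-formed (for example after running tidy). The sample document is invented and the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample; the real page would need cleaning up first.
xhtml = """<html><body>
<div class="header">menu</div>
<div class="row-fluid">abase: to lower in rank<br /><i class="icon-user"></i></div>
<div class="row-fluid">abate: to lessen<br /><i class="icon-user"></i></div>
</body></html>"""

root = ET.fromstring(xhtml)

# XPath-style query: every <div> at any depth whose class attribute is exactly "row-fluid".
rows = root.findall(".//div[@class='row-fluid']")

# .text holds the text that comes before the first child element of each <div>.
texts = [(row.text or "").strip() for row in rows]
print(texts)
```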


Laur Ivan · Accepted Answer · 2014-01-29 15:11:40Z

IMHO doing it with Notepad++ is difficult. According to this, you need to:

remove all line breaks (since regexps execute on each line of text)
perform the regexp on the whole (1-line) HTML
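
The line-by-line behaviour described in these two points can be reproduced in Python's `re` module, where `.` does not match a line break unless `re.DOTALL` is set. This is a sketch with made-up input, shown in Python only for illustration and not as a statement about Notepad++ internals.

```python
import re

marker = '<div class="row-fluid">'
html = 'header line\nmore header\n<div class="row-fluid">data'

# Default mode: "." stops at line breaks, so ".*" can only consume text on
# the marker's own line and the earlier lines survive.
per_line = re.sub(r".*" + re.escape(marker), "", html)

# DOTALL mode: "." also matches line breaks, so everything up to and
# including the marker is removed.
whole = re.sub(r".*" + re.escape(marker), "", html, flags=re.DOTALL)

print(repr(per_line))
print(repr(whole))
```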

Either you want to learn regexps, or you want to parse the HTML. The solution differs depending on which.

If you want to learn regular expressions, this is (again IMHO) the wrong problem to solve.

If you want to solve the problem (keep the data between <div> and <i>), then have a look at how to parse HTML/XML. In Python you have some great libraries like BeautifulSoup (which can deal with broken HTML). You can do it with DOM parsing, or a more interesting solution (and arguably better for your problem) is to use SAX and per-event processing. Since you know that after every <div> you'll get an <i>, you could use a simple stack to push all the content between the two events...
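
A minimal sketch of the per-event idea follows. It uses the standard library's `html.parser.HTMLParser` instead of a true SAX parser or BeautifulSoup, and a single buffer instead of a stack, so it is a simplification of what is described above. The class name `RowCollector` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect text between <div class="row-fluid"> and the next <i class="icon-user">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished text chunks
        self._buffer = None  # None outside a row, a list of strings inside one

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "row-fluid" in classes:
            self._buffer = []  # start event: begin collecting text
        elif tag == "i" and "icon-user" in classes and self._buffer is not None:
            # End event: normalise whitespace, store the chunk, stop collecting.
            self.rows.append(" ".join("".join(self._buffer).split()))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

# Invented sample input, not the real page.
sample = (
    "<p>intro</p>\n"
    '<div class="row-fluid">abase:\n  to lower in rank<br /> <br />'
    '<i class="icon-user"></i> user1</div>\n'
    '<div class="row-fluid">abate: to lessen<br /> <br />'
    '<i class="icon-user"></i> user2</div>\n'
)
parser = RowCollector()
parser.feed(sample)
parser.close()
rows = parser.rows
print(rows)
```

Because the parser reacts to tags rather than to raw characters, line breaks and spacing inside the markup do not affect which text is collected.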
