0

I am trying to extract specific data using a regular expression, but I can't get the result I want. For example,

on this page

http://mnemonicdictionary.com/wordlist/GREwordlist/startingwith/A

I need to keep only the data that is between

<div class="row-fluid">

and

<br /> <br /><i class="icon-user"></i>

So I copied the HTML code into Notepad++, enabled regular expressions in the Replace dialog, and tried replacing everything that matches

.*<div class="row-fluid">

to delete everything before <div class="row-fluid">,

but it does not work at all.

Does anyone know why?

P.S. I am not using any programming language. I just need to do this on HTML source code in Notepad++, not on an actual HTML file.

7
  • Parsing HTML with regex is a bad idea. Commented Jan 29, 2014 at 14:49
  • Do I have any other option? I am doing this just to learn regex. Commented Jan 29, 2014 at 14:50
  • What language are you using? Commented Jan 29, 2014 at 14:50
  • I am not using any language. I am just running general regex commands on different texts in Notepad++. I just need to extract particular data from an HTML file that is open in Notepad++ as source code. Commented Jan 29, 2014 at 14:51
  • Could you use a different language? Commented Jan 29, 2014 at 14:54

3 Answers

2

I would achieve this in several steps.

Step 1.

Transform the document into one line. Find

 \r\n 

and replace with nothing. (Make sure to select the "Extended (\n, \r,..)" option in the Replace dialog.)

Step 2.

find

<div class="row-fluid">

and replace with

\r\n~<div class="row-fluid">

Make sure that the character "~" is not used anywhere in the document. This character will help us delete the unnecessary lines later.

Step 3.

find

<br /> <br /><i class="icon-user"></i>

and replace with

<br /> <br /><i class="icon-user"></i>\r\n

Step 4.

Delete the unnecessary lines. Select "Regular expression" and find

^[^~].+$\r\n

and replace with nothing

Step 5.

Now you have only lines that start with

~<div class="row-fluid">

and end with

<br /> <br /><i class="icon-user"></i>

All that remains is to delete these tags.

P.S. You can record a macro if you need to repeat this task several times.
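
For comparison, here is a minimal Python sketch that gets the same result in a single pass instead of the five Notepad++ steps. It is not the procedure above, just a regex equivalent of it. The `extract_blocks` helper and the sample HTML are made up for illustration; only the two marker strings come from the question.

```python
import re

# The two markers from the question.
START = '<div class="row-fluid">'
END = '<br /> <br /><i class="icon-user"></i>'


def extract_blocks(html):
    """Return the text between each START marker and the next END marker."""
    # re.escape keeps the markers literal, (.*?) is non-greedy so each match
    # stops at the first END, and re.DOTALL lets '.' cross line breaks.
    pattern = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)
    return [block.strip() for block in pattern.findall(html)]


# Invented sample input, shaped like the page described in the question.
sample = (
    "<html><body>header\n"
    '<div class="row-fluid">\nabase: to lower\n'
    '<br /> <br /><i class="icon-user"></i> by bob\n'
    '<div class="row-fluid">abash: to embarrass'
    '<br /> <br /><i class="icon-user"></i>\n'
    "footer</body></html>"
)

print(extract_blocks(sample))
```

Because the match is non-greedy and spans line breaks, there is no need to flatten the document first or to mark lines with "~".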


1 Comment

If the original line breaks must be retained, you can modify step 1: first choose a string such as qzq, where zq and qz do not occur anywhere in the document. Then replace the line breaks (as in step 1) with qzq. Add a new final step 6 that converts every qzq back to a line break.
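
The placeholder idea from this comment can be sketched in Python as a round trip. This is only an illustration: the `flatten` and `restore` names are hypothetical, and Windows line endings (`\r\n`) are assumed, as in step 1.

```python
PLACEHOLDER = "qzq"  # the example string suggested in the comment


def flatten(text):
    """Replace Windows line breaks with the placeholder (modified step 1)."""
    # Refuse to run if the placeholder could be confused with real content.
    if "qz" in text or "zq" in text:
        raise ValueError("placeholder fragments already occur in the document")
    return text.replace("\r\n", PLACEHOLDER)


def restore(text):
    """Turn every placeholder back into a line break (new step 6)."""
    return text.replace(PLACEHOLDER, "\r\n")


original = "first line\r\nsecond line\r\nthird line"
one_line = flatten(original)
print(one_line)
print(restore(one_line) == original)
```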
1

You should consider retrieving the data using XPath. Most languages support it.

There's a great Firefox plugin called XPather that infers the XPath expression when you select a page item.

There's a hacked version that works with newer Firefox versions here:

http://jassage.com/xpather-1.4.5b.xpi

To use XPath with Python, consider using http://xmlsoft.org/python.html

Note that XPath may have problems with malformed HTML, so you may also find tidy an interesting option to "clean up" the HTML and get parseable XML.

http://tidy.sourceforge.net/
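
As a rough illustration of the XPath approach, here is a sketch using Python's standard `xml.etree.ElementTree` rather than the libxml2 bindings linked above. ElementTree only supports a subset of XPath and needs well-formed input, so this assumes the page has already been cleaned up (for example with tidy). The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample shaped like the page in the question.
xhtml = """<html><body>
<div class="nav">menu</div>
<div class="row-fluid">abase: to lower<br /> <br /><i class="icon-user"></i> bob</div>
<div class="row-fluid">abash: to embarrass<br /> <br /><i class="icon-user"></i> amy</div>
</body></html>"""

root = ET.fromstring(xhtml)

# Select every <div> whose class attribute is exactly "row-fluid".
rows = root.findall(".//div[@class='row-fluid']")

# .text holds the text before the first child element, i.e. before <br />.
words = [(div.text or "").strip() for div in rows]
print(words)
```

Note that the `[@class='row-fluid']` predicate compares the whole attribute value, so a div with several classes would not be selected.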


0

IMHO, doing this with Notepad++ is difficult. According to this, you need to:

  • remove all line breaks (since regexps execute on each line of text)
  • run the regexp on the whole (one-line) HTML
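
To illustrate the line-by-line behaviour, here is a small Python sketch using the pattern from the question. By default `.` does not match a line break, so `.*` cannot reach back past the start of the marker's own line. Python's `re.DOTALL` flag lifts that limit; the ". matches newline" checkbox in the Notepad++ Replace dialog plays a similar role in versions that have it. The sample text is made up.

```python
import re

text = 'line one\nline two\n<div class="row-fluid">keep me'
pattern = r'.*<div class="row-fluid">'

# Default mode: '.' stops at line breaks, so only the marker itself
# (plus anything before it on the same line) is removed.
per_line = re.sub(pattern, "", text)

# DOTALL mode: '.' also matches line breaks, so '.*' swallows
# everything from the start of the text up to the marker.
whole_text = re.sub(pattern, "", text, flags=re.DOTALL)

print(repr(per_line))
print(repr(whole_text))
```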

Either you want to learn regexps, or you want to parse the HTML. Depending on which, the solution differs.

If you want to learn regular expressions, this is (again IMHO) the wrong problem to solve.

If you want to solve the problem (keep the data between <div> and <i>), then have a look at how to parse HTML/XML. In Python there are some great libraries such as BeautifulSoup (which can deal with broken HTML). You can do it with DOM parsing, or a more interesting solution (and arguably a better one for your problem) is to use SAX and per-event processing. Since you know that after every <div> you'll get an <i>, you could use a simple stack to push all the content between the two events...
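
Here is a minimal sketch of the per-event idea, using the standard library's `html.parser.HTMLParser` instead of a real SAX parser, and a single buffer instead of a stack. The `RowExtractor` class and the sample HTML are hypothetical. It assumes the class attributes are exactly "row-fluid" and "icon-user", as in the question.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text between <div class="row-fluid"> and <i class="icon-user">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._buffer = None  # None means "not inside a wanted block"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "row-fluid":
            # Opening event: start collecting text.
            self._buffer = []
        elif (tag == "i" and attrs.get("class") == "icon-user"
              and self._buffer is not None):
            # Closing event: store what was collected and stop.
            self.rows.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# Invented sample input.
parser = RowExtractor()
parser.feed(
    '<p>intro</p>'
    '<div class="row-fluid">abase: to lower<br /> <br />'
    '<i class="icon-user"></i> bob</div>'
)
parser.close()
print(parser.rows)
```

Text outside the two events ("intro" and "bob" here) is ignored because the buffer is only active between them.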
