16

Using find and replace, what regex would remove the tags surrounding something like this:

<option value="863">Viticulture and Enology</option>

Note: the option value differs from line to line, but a regular expression that also removes the numbers is acceptable.

I am still trying to learn but I can't get it to work.

I'm not using it to parse HTML. I have data from one of our company websites that we need in Excel, but our designer deleted the original data file and we need it back. I have a list of the options and need to remove the HTML tags using find and replace in Notepad++.

5 Answers

20

This works for me in Notepad++ 5.8.6 (UNICODE):

search : <option value="\d+">(.*?)</option>

replace : $1

Be sure to select "Regular expression" and ". matches newline".
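For anyone who wants to check the pattern outside the editor, here is a minimal sketch of the same search and replace in Python. It is only an illustration, not Notepad++ itself: Python writes the backreference as \1 rather than $1, `re.DOTALL` stands in for the ". matches newline" checkbox, and the function name `strip_option_tags` is made up for this example.

```python
import re

# Same pattern as the search field above. re.DOTALL plays the role of
# the ". matches newline" option.
OPTION_RE = re.compile(r'<option value="\d+">(.*?)</option>', re.DOTALL)


def strip_option_tags(text):
    """Replace each <option value="N">label</option> with just the label."""
    # \1 is Python's spelling of the $1 backreference used in the editor.
    return OPTION_RE.sub(r"\1", text)


sample = '<option value="863">Viticulture and Enology</option>'
print(strip_option_tags(sample))
```

Because `.*?` is lazy, several options on one line are handled one at a time instead of being swallowed as a single match.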

13

I did it using the following regular expression:

Find this : <.*?>|</.*?>

and

replace with : \r\n (this inserts a new line)

This regular expression (<.*?>|</.*?>) matches every HTML tag, so after the replacement only the values between the tags remain.

I have input:

<option value="123">1</option><option value="1234">2</option><option value="1235">3</option><option value="1236">4</option><option value="1237">5</option>

I need the values between the option tags, i.e. 1, 2, 3, 4, 5.

After the replacement, each value sits on its own line.
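The same idea can be sketched in Python as a rough cross-check. Note that `<.*?>` on its own already matches closing tags such as </option>, so the sketch uses only that half of the alternation. The helper name `values_between_tags` is hypothetical, and the sample input is a shortened version of the one above.

```python
import re

# Lazy match for any tag, opening or closing.
TAG_RE = re.compile(r"<.*?>")


def values_between_tags(html):
    """Replace every tag with a newline, then keep the non-empty lines."""
    lines = TAG_RE.sub("\n", html).splitlines()
    return [line.strip() for line in lines if line.strip()]


sample = (
    '<option value="123">1</option>'
    '<option value="1234">2</option>'
    '<option value="1235">3</option>'
)
print(values_between_tags(sample))
```

Replacing with a newline leaves blank lines between the values, which is why the sketch filters out empty lines afterwards.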

8

This works perfectly for me:

  • Select "Regular Expression" in "Find" Mode.
  • Enter [<].*?> in "Find What" field and leave the "Replace With" field empty.
  • Note that you need to have version 5.9 of Notepad++ for the ? operator to work.

as found here: digoCOdigo - strip html tags in notepad++

2

Something like this would work (as long as you know the format of the HTML won't change):

<option value="(\d+)">(.+)</option>
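The pattern only describes what to find; what stays behind depends on the replacement text. The following Python sketch (an illustration, not Notepad++ behavior) shows the difference between an empty replacement and one that refers back to the captured groups. It also shows what the greedy `.+` does when two options share a line. The variable names are made up for the example.

```python
import re

PATTERN = r'<option value="(\d+)">(.+)</option>'
line = '<option value="863">Viticulture and Enology</option>'

# Empty replacement: the whole match, including the name, disappears.
erased = re.sub(PATTERN, "", line)

# Refer to group 2 to keep only the name.
name_only = re.sub(PATTERN, r"\2", line)

# Refer to both groups to keep the number and the name.
number_and_name = re.sub(PATTERN, r"\1|\2", line)

# Greedy .+ runs to the LAST </option> on the line.
two_on_one_line = '<option value="1">A</option><option value="2">B</option>'
greedy_result = re.sub(PATTERN, r"\2", two_on_one_line)

print(repr(erased), name_only, number_and_name, greedy_result, sep="\n")
```

If several options can appear on one line, a lazy `.+?` or a negated class such as `[^<]+` avoids the greedy overshoot.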

3 Comments

Hm, this erased the entire line, but looks close.
I will do two find and replaces: one for <option value="(\d+)"> and then one for </option>. Works beautifully, thank you.
If you're using Notepad++ find/replace, it's not going to work because the regex uses backreferences to capture the fields you want to keep. For find/replace, just replace everything before the numbers with nothing, then replace "> with a delimiter (like | but not commas, since there may be commas in the name), then finally replace the </option> with nothing. Import the result into Excel.
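The three literal replacements described in the last comment can be sketched in Python with plain string replacement and no regex at all. This assumes every line has exactly the format shown in the question. The function name `to_delimited` and the default `|` delimiter are choices made for this illustration.

```python
def to_delimited(line, delimiter="|"):
    """Turn <option value="N">Name</option> into N<delimiter>Name."""
    # Step 1: drop everything before the number.
    line = line.replace('<option value="', "")
    # Step 2: turn the "> after the number into the delimiter.
    line = line.replace('">', delimiter)
    # Step 3: drop the closing tag.
    return line.replace("</option>", "")


print(to_delimited('<option value="863">Viticulture and Enology</option>'))
```

Using `|` rather than a comma keeps names that contain commas in a single column when the result is imported into Excel with a custom delimiter.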
1
val s = "<option value=\"863\">Viticulture and Enology</option>"
s.replaceAll("(<option value=\"[0-9]+\">)([^<]+)</option>", "$2")
res1: java.lang.String = Viticulture and Enology

(Tested with Scala, hence the res1:)

With sed, you would use a slightly different syntax:

echo '<option value="863">Viticulture and Enology</option>'|sed -re 's|(<option value="[0-9]+">)([^<]+)</option>|\2|'

For Notepad++, I don't know the details, but "[0-9]+" should mean "at least one digit", and "[^<]+" means "anything but an opening less-than sign, one or more times". Escaping and backreference syntax may differ. Regexes are problematic if the markup spans multiple lines or is hidden inside a comment: a regex will not recognize that.

However, a lot of HTML is generated in a regex-friendly way, always fitting on one line and never commented out. Or you use the regex in throwaway code and can check your input beforehand.
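As a rough illustration of those two caveats, the Python sketch below runs the same pattern against a commented-out option and against a tag broken across two lines. Both inputs are invented for this example.

```python
import re

PATTERN = re.compile(r'<option value="[0-9]+">([^<]+)</option>')

# A commented-out option still matches: the regex has no notion of comments.
commented = '<!-- <option value="1">Old entry</option> -->'
found_in_comment = PATTERN.findall(commented)

# A line break inside the tag defeats the literal space in the pattern.
split_tag = '<option\n value="863">Viticulture and Enology</option>'
found_in_split = PATTERN.findall(split_tag)

print(found_in_comment, found_in_split)
```

Checking the input for cases like these before running the replacement is usually enough for one-off cleanup jobs.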

1 Comment

this is really helpful, just gonna loop through them all now :D TY!
