I am trying to remove everything inside angle brackets (HTML tags) from some text and write the result to a new file. For example, one line of text could be:

< asdf> Text <here>more text< /asdf >

The program should then write "Text more text" to the output file, leaving out everything inside the HTML tags.

This is my attempt so far:

import re
import urllib.request

data = urllib.request.urlopen("some website").read()
text1 = data.decode("utf-8")

def asd(text1):
    x = re.compile("<>")
    y = re.sub(x, "", text1)
    file1 = open("textfileoutput.txt", "w")
    file1.write(y)
    return y

asd(text1)

The output file still contains the tags instead of the cleaned text. Thank you for your help.

  • Your regular expression will only match the literal string "<>". I suggest a solution like the one in "BeautifulSoup Grab Visible Webpage Text". Commented Dec 14, 2017 at 2:24
  • You are right. I fixed it by replacing that line with x=re.compile(r"<[^>]+>") and the program works now. Thank you. Commented Dec 14, 2017 at 2:28
  • What if the tag contains a > somewhere in it? As alecxe pointed out, trying to parse HTML with regular expressions is usually not the best approach. Commented Dec 14, 2017 at 2:33

2 Answers

x=re.compile("<>")

I am not sure why you think this expression would match < asdf> or < /asdf >. It matches only the literal two-character string "<>".
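A quick way to check what a pattern matches is re.findall. The sketch below runs both the original pattern and the one from the comments against the sample line from the question; the variable names are only for illustration.

```python
import re

line = "< asdf> Text <here>more text< /asdf >"

# "<>" has no special characters, so it matches only "<" immediately followed by ">".
literal_hits = re.findall("<>", line)

# "<[^>]+>" means: "<", then one or more characters that are not ">", then ">".
tag_hits = re.findall(r"<[^>]+>", line)

print(literal_hits)  # []
print(tag_hits)      # ['< asdf>', '<here>', '< /asdf >']
```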

In any case, approaching HTML with regular expressions can rarely be justified. Use a more appropriate tool for the task - an HTML parser.

Example using BeautifulSoup and its unwrap() method:

In [1]: from bs4 import BeautifulSoup

In [2]: html = "<asdf>Text more text</asdf>"

In [3]: soup = BeautifulSoup(html, "html.parser")

In [4]: soup.asdf.unwrap()
Out[4]: <asdf></asdf>

In [5]: print(soup)
Text more text
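If adding a dependency is not an option, roughly the same result can be had with the standard library's html.parser module. This is a minimal sketch, not a drop-in replacement for BeautifulSoup: TextExtractor and strip_tags are hypothetical names, and the code assumes the input is reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves never reach here.
        self.chunks.append(data)


def strip_tags(html):
    """Return only the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<asdf>Text <b>more</b> text</asdf>"))  # Text more text
```

Note that this also keeps the contents of script and style elements, since the parser reports them as data too; a real page would probably need those filtered out.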

1 Comment

For anyone who cares about performance: BeautifulSoup is really slow, even with lxml as the parser. If your HTML is definitely well-formed and you trust your regular expression, there is no problem with using a regex.

Simply replacing re.compile("<>") with re.compile(r"<[^<>]*>") is enough.
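Put together, the corrected function might look like the sketch below. The names strip_tags and write_clean are hypothetical, and the with-block is an addition not in the question; it closes the output file, which the original code never does.

```python
import re

# Matches "<", any run of characters other than "<" or ">", then ">".
TAG_RE = re.compile(r"<[^<>]*>")


def strip_tags(text):
    """Return text with every <...> span removed (hypothetical helper)."""
    return TAG_RE.sub("", text)


def write_clean(text, path):
    """Write the stripped text to path; the with-block closes the file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(strip_tags(text))


line = "< asdf> Text <here>more text< /asdf >"
print(repr(strip_tags(line)))  # ' Text more text'
```

The leading space comes from the sample line itself; calling .strip() on the result removes it if that matters.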

1 Comment

What if the tag contains a > somewhere in it?
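A small sketch of that failure case, using a made-up snippet in which an attribute value contains ">". The pattern stops at the first ">", so part of the tag leaks into the output:

```python
import re

TAG_RE = re.compile(r"<[^<>]*>")

# Hypothetical input: the title attribute contains a ">" character.
html = '<a title="x > y">link</a>'

# The first match ends at the ">" inside the attribute value,
# so the rest of the opening tag is treated as ordinary text.
leftover = TAG_RE.sub("", html)

print(repr(leftover))  # ' y">link'
```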
