
I have just started learning Python 3. (I was reading a Python book and some questions came up.)

My questions are about reading HTML source and finding a keyword in it.

I wrote code to open the URL. For example:

import urllib.request

page = urllib.request.urlopen(some_url)  # some_url: the address of the page to fetch
text = page.read().decode("utf8")

Therefore, I assume that text contains all of the HTML code and that text is an object.

Question 1. I would like to use some kind of array or list to store the HTML source code. However, I am not sure how to get each line of code from the "text" object and store it in such an array.

Question 2. Does Python have a "contains" function to find a specific keyword, such as "stack overflow", in the array?

Thanks.

  • Use an HTML parser (docs.python.org/2/library/htmlparser.html) and your whole HTML source will turn into an object you can interact with. Commented Apr 10, 2013 at 22:23
  • @Allendar -- only don't use a standard library parser - it is crappy :) BeautifulSoup maybe? Commented Apr 10, 2013 at 22:28
  • This then? :) pypi.python.org/pypi/beautifulsoup4/4.1.3 Commented Apr 10, 2013 at 22:31

1 Answer

If you want to parse the entire HTML document structure, don't try to program it all yourself. Do as Allendar says and use a library for it.
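As a rough illustration of the library approach, here is a minimal sketch using the standard library's html.parser module (the Python 3 name of the module linked in the comments). The class name TextCollector and the sample_html string are made up for this example; sample_html stands in for the decoded text from the question. BeautifulSoup, as suggested in the comments, would likely be more forgiving with messy real-world pages.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # The parser calls this for each run of text outside of tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


# Stand-in for the decoded page text from the question (an assumption).
sample_html = (
    "<html><body><h1>Welcome</h1>"
    "<p>Ask on <b>stack overflow</b> today</p></body></html>"
)

collector = TextCollector()
collector.feed(sample_html)
collector.close()

# The text pieces end up in an ordinary Python list.
print(collector.chunks)
print("stack overflow" in collector.chunks)
```

The text ends up in a plain list, so the usual list operations (in, indexing, loops) apply to it.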

If you just want to search and find specific things inside the text, use regular expressions ("re" module).
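A small sketch of what that search could look like, assuming the page has already been decoded into a string. The sample text below is invented for illustration. For a plain, exact keyword the built-in in operator is the closest thing to a "contains" function; re helps once you need case-insensitive or pattern-based matching.

```python
import re

# Stand-in for the decoded page text from the question (an assumption).
text = "<p>Questions are answered on Stack Overflow every day.</p>"

# Exact, case-sensitive substring test with the "in" operator.
contains_exact = "stack overflow" in text

# Case-insensitive search that also tolerates extra whitespace between the words.
match = re.search(r"stack\s+overflow", text, re.IGNORECASE)

print(contains_exact)
if match:
    # group(0) is the matched text; start() is its position in the string.
    print(match.group(0), match.start())
```

Here the exact test fails because of the capital letters, while the regular expression still finds the keyword.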

It doesn't really make sense to talk about "lines" of HTML in the traditional meaning of lines (CR/LF). The entire page could be in one line. It is tags that structure HTML, not lines.
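If you still want a list of lines, str.splitlines() will give you one, but the result depends entirely on how the page happens to be formatted. A minimal sketch with two invented sample strings:

```python
# Two stand-ins for the decoded page text (both invented for illustration).
multi_line = "<html>\n<body>\n<p>stack overflow</p>\n</body>\n</html>"
one_line = "<html><body><p>stack overflow</p></body></html>"

# splitlines() returns a list of strings, one per line of the source.
lines = multi_line.splitlines()

# Keep only the lines that contain the keyword.
matching = [line for line in lines if "stack overflow" in line]

print(len(lines), matching)

# The same markup written without line breaks yields a single "line".
print(len(one_line.splitlines()))
```

Both strings describe the same page, yet one yields five lines and the other yields one, which is why searching by tag or by pattern is usually more reliable than searching by line.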
