Extract data between html tags using BeautifulSoup in python

Question

I want to extract the text inside the 'title' tag. From the 'meta' tag, I want the URL in its content attribute, and specifically the text just before the '?'.

<html lang="en" id="facebook" class="no_js">
<head>
    <meta charset="utf-8" />
    <script>
        function envFlush(a) {function b(c){for(var d in a)c[d]=a[d];}if(window.requireLazy){window.requireLazy(['Env'],b);}else{window.Env=window.Env||{};b(window.Env);}}envFlush({"ajaxpipe_token":"AXjbmsNXDxPlvhrf","lhsh":"4AQFQfqrV","khsh":"0`sj`e`rm`s-0fdu^gshdoer-0gc^eurf-3gc^eurf;1;enbtldou;fduDmdldourCxO`ld-2YLMIuuqSdptdru;qsnunuxqd;rdoe"});
    </script>
    <script>CavalryLogger=false;</script>
    <noscript>
        <meta http-equiv="refresh" content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" />
    </noscript>
    <meta name="referrer" content="default" id="meta_referrer" />
    <title id="pageTitle">
        &quot; CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN &quot;
    </title>
    <link rel="shortcut icon" href="https://fbstatic-a.akamaihd.net/rsrc.php/yl/r/H3nktOa7ZMg.ico" />

i.e. CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN and 685004288208871.

I tried the following code :

>>> soup.title.contents

output is

[u'" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "']

I don't want the '[]' characters, the 'u' prefix, or the single quotes in this output.

Also, on implementing the following :

>>> soup.meta.contents

I get the output as :

[]

What can I try next? I am new to BeautifulSoup.

soup.title.text is what you want. The u'...' is only there because the interactive shell calls repr on the return value.
– Aran-Fey, Commented Dec 11, 2014 at 15:19
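
To see the difference the comment describes, here is a small Python 3 sketch. The list literal is only a stand-in for what `soup.title.contents` returned in the question (Python 3 no longer shows the `u` prefix), and the variable name `contents` is an assumption for illustration.

```python
# Stand-in for the list returned by soup.title.contents in the question.
contents = ['" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "']

# The interactive shell shows repr(), which adds the enclosing single quotes.
shell_view = repr(contents[0])

# print() uses str(), which is just the characters of the string itself.
printed_view = str(contents[0])

print(shell_view)
print(printed_view)
```

The brackets come from displaying the list, and the single quotes come from `repr` of the string inside it. Neither is part of the title text.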

holdenweb · Accepted Answer · 2014-12-12 21:24:30Z

2

The .contents attribute of a Beautiful Soup tag is a list of the tag's children (it is an attribute, not a method). In this case it has only one element, which is a Unicode string. You should find that the expression you want is actually

>>> soup.title.contents[0]

Note that the single quotes only appear because you are asking the interactive interpreter to display a string value. You will find that

>>> print(soup.title.contents[0])

displays

" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "

and that is actually the contents of the title tag. You will observe that Beautiful Soup has converted the &quot; HTML entities into the required double-quote characters. To lose the quotes and adjacent spaces you can use

soup.title.contents[0][2:-2]
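
The fixed slice works only when the string begins and ends with exactly one quote and one space. A slightly more forgiving sketch, assuming the title string may also carry the newlines and indentation seen in the question's HTML, is shown below. The helper name `clean_title` and the sample value `title_text` are hypothetical, and the code is Python 3.

```python
# Hypothetical sample: the title string with the surrounding whitespace
# that appears in the question's HTML source.
title_text = '\n        " CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "\n    '


def clean_title(text):
    """Strip outer whitespace, then outer double quotes, then whitespace again."""
    return text.strip().strip('"').strip()


print(clean_title(title_text))
```

Unlike the slice, this leaves a title without quotes untouched instead of cutting off its first and last two characters.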

The meta tag is a little trickier. I make the assumption that there is only one <meta> tag with an http-equiv attribute whose value is "refresh", so the retrieval returns a list of one element. You retrieve that element like so:

>>> meta = soup.findAll("meta", {"http-equiv": "refresh"})[0]
>>> meta
<meta content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" http-equiv="refresh"/>

Note, by the way, that meta isn't a string but a soup element:

>>> type(meta)
<class 'bs4.element.Tag'>

You can retrieve attributes of a soup element using indexing just like Python dicts, so you can get the value of the content attribute as follows:

>>> content = meta["content"]
>>> content
u'0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'

In order to extract the URL value you could just look for the first equals sign and take the rest of the string. I prefer to use a rather more disciplined approach, splitting at the semicolon and then splitting the right-hand element of that split on (only one) equals sign.

>>> url = content.split(";")[1].split("=", 1)[1]
>>> url
u'/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
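
The question also asks for the piece after the last '/' and before the '?'. One way to sketch that step with the Python 3 standard library, assuming `url` holds the string shown above: `urllib.parse.urlsplit` separates the path from the query string, and `rsplit` then takes the last path segment. The name `note_id` is only illustrative.

```python
from urllib.parse import urlsplit

# The URL value extracted from the meta tag's content attribute.
url = "/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"

# urlsplit puts everything before the '?' into .path and the rest into .query.
path = urlsplit(url).path

# The last path segment is whatever follows the final '/'.
note_id = path.rstrip("/").rsplit("/", 1)[-1]

print(note_id)  # 685004288208871
```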

edited Dec 12, 2014 at 21:24

answered Dec 11, 2014 at 15:33

holdenweb


3 Comments

POOJA GUPTA Over a year ago

Thank you for your response. But can you tell me how to get the second part, i.e. the text in the meta content's URL that comes after the last '/' and just before the '?'?

holdenweb Over a year ago

I've updated the answer to show you how to extract the URL. Let me know if this doesn't give you a clue as to how to extract the piece before the question mark.

holdenweb Over a year ago

By the way, note that many of the techniques I use above are quite "brittle" (that is, unanticipated data will break the code in unanticipated ways). So more validation may be in order before blindly using it ...
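
As one illustration of the extra validation mentioned in that last comment, here is a minimal Python 3 sketch that parses the content string defensively and returns None instead of raising when the format is not what was expected. The function name `refresh_url` is hypothetical, and the sketch assumes the content value has the form "delay; URL=target" as in the question.

```python
def refresh_url(content):
    """Return the target of a 'delay; URL=target' string, or None if malformed."""
    # Split only at the first semicolon so a ';' inside the URL survives.
    _delay, semicolon, rest = content.partition(";")
    if not semicolon:
        return None  # no URL part at all, e.g. content="5"

    # Split only at the first '=' so query strings like '?a=1' stay intact.
    name, equals, value = rest.strip().partition("=")
    if not equals or name.strip().lower() != "url":
        return None

    return value.strip() or None


print(refresh_url("0; URL=/notes/x/685004288208871?_fb_noscript=1"))
print(refresh_url("5"))
```

Callers can then test the result against None before slicing or splitting it further.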

Yogesh · Accepted Answer · 2014-12-11 16:44:46Z

1

To get a substring from the URL in the meta tag, you can use a regex. You can try this:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html is your HTML string
meta_url = soup.noscript.meta['content']
url = re.search(r'-/(.*)\?', meta_url).group(1)  # text between '-/' and '?'
print(url)
print(soup.title.text)

Hope the above code solves your problem.

answered Dec 11, 2014 at 16:44

Yogesh


2 Comments

POOJA GUPTA Over a year ago

it did not work. It gave the error : AttributeError: 'NoneType' object has no attribute 'group'

holdenweb Over a year ago

Note that parsing HTML with regexen is not recommended and is always going to lead to trouble ...
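
The AttributeError reported in the first comment means that `re.search` returned None, i.e. the pattern did not match the string it was given. A minimal Python 3 sketch of guarding against that follows. It applies the regex only to the attribute string, not to the HTML itself, and it assumes the wanted piece is a run of digits. The helper name `extract_id` is hypothetical.

```python
import re


def extract_id(meta_url):
    """Return the first run of digits that sits between a '/' and a '?', or None."""
    match = re.search(r"/(\d+)\?", meta_url)
    # re.search returns None when nothing matches, so check before .group().
    return match.group(1) if match else None


meta_url = "0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"
print(extract_id(meta_url))       # 685004288208871
print(extract_id("no url here"))  # None
```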
