I want to extract the text inside the HTML 'title' tag. From the 'meta' tag, I want the URL value held in its content attribute, and from that only the text just before the '?'.

<html lang="en" id="facebook" class="no_js">
<head>
    <meta charset="utf-8" />
    <script>
        function envFlush(a) {function b(c){for(var d in)c[d]=a[d];}if(window.requireLazy){window.requireLazy(['Env'],b);}else{window.Env=window.Env||{};b(window.Env);}}envFlush({"ajaxpipe_token":"AXjbmsNXDxPlvhrf","lhsh":"4AQFQfqrV","khsh":"0`sj`e`rm`s-0fdu^gshdoer-0gc^eurf-3gc^eurf;1;enbtldou;fduDmdldourCxO`ld-2YLMIuuqSdptdru;qsnunuxqd;rdoe"});
    </script>
    <script>CavalryLogger=false;</script>
    <noscript>
        <meta http-equiv="refresh" content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" />
    </noscript>
    <meta name="referrer" content="default" id="meta_referrer" />
    <title id="pageTitle">
        &quot; CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN &quot;
    </title>
    <link rel="shortcut icon" href="https://fbstatic-a.akamaihd.net/rsrc.php/yl/r/H3nktOa7ZMg.ico" />

That is, I want CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN and 685004288208871.

I tried the following code:

>>> soup.title.contents

The output is:

[u'" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "']

I don't want the '[]' brackets, the 'u' prefix, or the single quotes in this result.

Also, when I run the following:

>>> soup.meta.contents

I get this output:

[]

What can I try next? I am new to BeautifulSoup.

soup.title.text is what you want. The u'...' is only there because the interactive shell calls repr on the return value. Commented Dec 11, 2014 at 15:19

2 Answers

The .contents attribute of a Beautiful Soup tag is a list of the tag's children (it is an attribute, not a method, so there are no parentheses). In this case it has only one element, which is a Unicode string. You should find that the expression you want is actually

>>> soup.title.contents[0]

Note that the single quotes only appear because you are asking the interactive interpreter to display a string value. You will find that

>>> print(soup.title.contents[0])

displays

" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "

and that is actually the contents of the title tag. You will observe that Beautiful Soup has converted the &quot; HTML entities into the required double-quote characters. To lose the quotes and adjacent spaces you can use

soup.title.contents[0][2:-2]
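The slice above assumes exactly one quote and one space on each side of the title. As a rough alternative, here is a minimal sketch that uses get_text() and str.strip() instead, so it also copes with the newlines and indentation in the question's markup. The html string is a trimmed copy of the question's page, and the choice of the "html.parser" backend is my assumption.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the real page
# has the same <title> structure).
html = """<html><head>
<title id="pageTitle">
    &quot; CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN &quot;
</title>
</head></html>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() returns the tag's text as a single string, with the &quot;
# entities already decoded to double-quote characters.
raw_title = soup.title.get_text()

# strip() with an argument removes any of the listed characters from both
# ends, so whitespace, newlines and the double quotes all go in one call.
title = raw_title.strip(' \n\t"')

print(title)  # CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN
```

Note that strip() would also remove a quote that legitimately starts or ends a title, so this is only suitable when the wrapping quotes are known to be decoration.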

The meta tag is a little trickier. I make the assumption that there is only one <meta> tag with an http-equiv attribute whose value is "refresh", so the retrieval returns a list of one element. You retrieve that element like so:

>>> meta = soup.findAll("meta", {"http-equiv": "refresh"})[0]
>>> meta
<meta content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" http-equiv="refresh"/>

Note, by the way, that meta isn't a string but a soup element:

>>> type(meta)
<class 'bs4.element.Tag'>

You can retrieve attributes of a soup element using indexing just like Python dicts, so you can get the value of the content attribute as follows:

>>> content = meta["content"]
>>> content
u'0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
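This also explains the empty list in the question: soup.meta returns the first <meta> tag in the document (here the charset one), and a <meta> tag has no children, so its .contents is empty. The data lives in the attributes. A small sketch to illustrate, using a trimmed copy of the question's markup and assuming the "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <head> from the question.
html = """<head>
<meta charset="utf-8" />
<noscript>
<meta http-equiv="refresh" content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" />
</noscript>
</head>"""

soup = BeautifulSoup(html, "html.parser")

# soup.meta is shorthand for "the first <meta> tag in the document".
first_meta = soup.meta
print(first_meta.attrs)     # the charset attribute lives here
print(first_meta.contents)  # [] because <meta> has no child nodes

# To reach a specific <meta>, filter on its attributes instead.
refresh = soup.find("meta", attrs={"http-equiv": "refresh"})
print(refresh["content"])
```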

In order to extract the URL value you could just look for the first equals sign and take the rest of the string. I prefer to use a rather more disciplined approach, splitting at the semicolon and then splitting the right-hand element of that split on (only one) equals sign.

>>> url = content.split(";")[1].split("=", 1)[1]
>>> url
u'/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
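The question asks for only the piece after the last '/' and before the '?'. Continuing with the same splitting approach, a minimal sketch could look like this. The names path and note_id are mine, and the content string is copied from the output above so the snippet runs without a soup object.

```python
# Value of the content attribute, copied from the output above.
content = "0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"

# Same step as above: drop the "0;" delay, then drop the "URL=" prefix.
url = content.split(";")[1].split("=", 1)[1]

# Keep everything before the first "?" (the query string is discarded).
path = url.split("?", 1)[0]

# Keep everything after the last "/".
note_id = path.rsplit("/", 1)[-1]

print(note_id)  # 685004288208871
```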

3 Comments

Thank you for your response. Can you also tell me how to get the second part, i.e. the text in the meta tag's URL value that comes after the last '/' and just before the '?'?
I've updated the answer to show you how to extract the URL. Let me know if this doesn't give you a clue as to how to extract the piece before the question mark.
By the way, note that many of the techniques I use above are quite "brittle" (that is, unanticipated data will break the code in unanticipated ways). So more validation may be in order before blindly using it ...
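To make the brittleness point concrete, here is one possible way to add validation, sketched for Python 3. The helper name note_id_from_refresh is hypothetical, and the expected "<delay>; URL=<target>" shape is an assumption based on the single example in the question. It returns None instead of raising when the value does not look as expected.

```python
from urllib.parse import urlsplit


def note_id_from_refresh(content):
    """Return the last path segment of the URL in a meta refresh value.

    Returns None when the value does not have the assumed
    "<delay>; URL=<target>" shape.
    """
    # Split off the delay; partition() never raises, it just reports
    # an empty separator when ";" is missing.
    _, sep, rest = content.partition(";")
    if not sep:
        return None

    # The remainder should be a "URL=<target>" pair.
    key, eq, target = rest.strip().partition("=")
    if not eq or key.strip().lower() != "url":
        return None

    # urlsplit separates the path from the query string for us.
    path = urlsplit(target.strip()).path
    segment = path.rstrip("/").rsplit("/", 1)[-1]
    return segment or None


good = "0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"
print(note_id_from_refresh(good))  # 685004288208871
print(note_id_from_refresh("0"))   # None
```

The caller still has to decide what a None result means, but at least unexpected input no longer surfaces as an IndexError deep inside a chain of split() calls.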
To get the substring from the URL in the meta tag, you can use a regular expression. I think you can try this:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(<your html string>)
meta_url = soup.noscript.meta['content']
# Capture the text between "-/" and "?"
url = re.search(r'-/(.*)\?', meta_url).group(1)
print(url)
print(soup.title.text)

I hope the above code solves your problem.

2 Comments

It did not work. It gave the error: AttributeError: 'NoneType' object has no attribute 'group'
Note that parsing HTML with regexen is not recommended and is always going to lead to trouble ...
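That AttributeError means re.search() found no match and returned None, so the string being searched did not contain the sequence the pattern expects. A hedged sketch of a guarded version follows. It applies the regex only to the attribute value (not to raw HTML), the pattern is my own variation rather than the one from the answer, and the content string is copied from the question.

```python
import re

# Value of the content attribute, copied from the question.
content = "0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"

# Match a "/" followed by one or more characters that are neither "/" nor
# "?", followed by a literal "?". This captures the last path segment.
match = re.search(r"/([^/?]+)\?", content)

# re.search returns None when nothing matches, so check before .group().
if match is None:
    note_id = None
    print("no match; inspect the content value:", repr(content))
else:
    note_id = match.group(1)
    print(note_id)  # 685004288208871
```

Printing the repr of the value on failure is usually the quickest way to see how the real page differs from the sample.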
