Python: scrape a part of source code and save it as html

Question

I need to save a web page's source code as an HTML file. The page has many sections that I don't need; I only want to save the source code of the article itself.

code:

from urllib.request import urlopen

page = urlopen('http://www.abcde.com')
page_content = page.read()

with open('page_content.html', 'wb') as f:
    f.write(page_content)

My code saves the whole page source, but how can I save only the part I want?

Explain:

<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>

I need to save the source code of this tag, including the tag itself and everything inside it, not just extract the sentences inside the tags.

The result I want is to save like this:

<div itemscope itemtype="http://schema.org/MedicalWebPage">

                    <div class="col-md-12 col-xs-12" style="padding-left:10px;">
                        <h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
                    </div>
                    <!--Article Start-->
                    <section class="page_article_div" id="print">
                        <article itemprop="text" class="page_article_content">
<p>
    <img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
    The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
    It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
    <strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
    <li>
        Germanic paganism</li>
    <li>
        Greek mythology</li>
</ol>
<p style="text-align: right;">
    【Jane】</p>
<p style="text-align: right;">
    Credit : Wiki</p>

                        </article>
                            <div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
                        <br />                  
                        <div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
                    </section>
                    <!--Article End-->
</div>

@andrew_reece I explained it wrong, sorry. I know I can use BeautifulSoup to extract the sentences I need, but now I need to save the whole source code inside the tag I wrote above (including the opening and closing tags as well).
– Makiyo, Commented Oct 23, 2017 at 5:39
Assign string1 to the opening div tag line and string2 to the closing tag, then append string1, the extracted string, and string2 into a single string and save it as a file.
– Vijayabhaskar J, Commented Oct 23, 2017 at 5:40
Use pyquery, which works like jQuery and makes DOM querying and manipulation easy.
– Garfield, Commented Oct 23, 2017 at 5:52

Makiyo · Accepted Answer · 2017-10-23 08:10:33Z

1

My own solution here:

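from urllib.request import urlopen
from bs4 import BeautifulSoup  # the "lxml" parser also needs the lxml package

page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")

# Collect the full HTML (tags included) of every matching div.
fragments = []
for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]'):
    fragments.append(str(tag))
# Join with newlines; a ', ' separator would put stray commas between fragments.
html_out = '\n'.join(fragments)

with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(html_out)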

I am a beginner, so I tried to keep it as simple as possible. This is my answer, and it's working quite well at the moment :)
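For readers who want to try the idea without a network connection, here is a minimal sketch of the same approach. It assumes a single matching div and uses the built-in "html.parser" backend so that lxml is not required. The sample markup, the helper name extract_outer_html, and the output file name are made up for illustration and do not come from the original post.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
SAMPLE_HTML = """<html><body>
<div class="nav">Menu</div>
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<h1 itemprop="name">Apple</h1>
<p>The apple tree is a deciduous tree.</p>
</div>
<div class="footer">Footer</div>
</body></html>"""


def extract_outer_html(html, itemtype):
    """Return the first matching <div>, tags included, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", attrs={"itemtype": itemtype})
    return None if tag is None else str(tag)


fragment = extract_outer_html(SAMPLE_HTML, "http://schema.org/MedicalWebPage")

# Write to a temporary location (an assumption; any writable path works).
out_path = Path(tempfile.gettempdir()) / "article_only.html"
if fragment is not None:
    out_path.write_text(fragment, encoding="utf-8")
```

Checking for None before writing avoids saving the literal string "None" when the page layout changes and the div is no longer found. Note that Beautiful Soup may serialize a valueless attribute such as itemscope as itemscope="", which browsers treat the same way.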

edited Oct 23, 2017 at 8:10

answered Oct 23, 2017 at 8:03

Makiyo



Shaamuji · Accepted Answer · 2017-10-23 05:39:18Z

0

You can search for the tag by one of its properties, such as its class, tag name, or id, and save it in whatever format you want, as in the example below.

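from bs4 import BeautifulSoup
# find_all() is the Beautiful Soup call; find_elements_by_class_name belongs to Selenium.
soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')  # yoursavedfile: an open file object
tag_for_me = soup.find_all(class_='class_name_of_your_tag')
print(tag_for_me)

tag_for_me will hold a list of the matching tags, each with its full HTML.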

answered Oct 23, 2017 at 5:39

Shaamuji


2 Comments

Makiyo Over a year ago

I tried:

driver = BeautifulSoup(page_content, 'html.parser')
tag_for_me = driver.find_elements_by_class_name('.page_article_title')
print (tag_for_me)

An error occurred on the second line: 'NoneType' object is not callable

Shaamuji Over a year ago

Probably the tag is not found through that property. Try searching with another property, or share your tag here.
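A likely cause of the 'NoneType' object is not callable error reported above is that Beautiful Soup treats an unknown attribute name as a search for a child tag with that name. Since find_elements_by_class_name is a Selenium method and not a Beautiful Soup one, the lookup returns None, and calling None fails. The sketch below reproduces this on a small hypothetical snippet and shows find_all(class_=...) as the Beautiful Soup equivalent.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the title element from the question.
html = '<div><h1 class="page_article_title" id="mask">Apple</h1></div>'
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute names are treated as a search for a child tag of that
# name, so this evaluates to None instead of raising AttributeError.
missing_method = soup.find_elements_by_class_name

try:
    soup.find_elements_by_class_name("page_article_title")
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# The Beautiful Soup equivalent; the class name takes no leading dot.
matches = soup.find_all(class_="page_article_title")
print(error_message)
print(matches)
```

The leading dot in '.page_article_title' is CSS selector syntax. It belongs in soup.select('.page_article_title'), not in a class_ filter.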

andrew_reece · Accepted Answer · 2017-10-23 05:44:05Z

0

You can use Beautiful Soup to get any HTML source you need.

import requests
from bs4 import BeautifulSoup

target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")

for elem in soup.find_all(attrs={"class":target_class}):
    if elem.text == target_text:
        print(elem)

Output:

<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>

answered Oct 23, 2017 at 5:44

andrew_reece


2 Comments

Makiyo Over a year ago

I tried yours and it didn't print anything. I uploaded the output I expect above. Is there no way to get all of that source code together? Does that mean I need to save the article HTML first and then the content HTML (the article title and content are in different classes)?

andrew_reece Over a year ago

You can do what you want with BeautifulSoup. Read through the docs; they're pretty good and will show you how to do it. This answer addresses your original request: "I need to save the source code with and inside this tag, not extract the sentences in the tags."
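To make the distinction in this thread concrete, the sketch below contrasts three ways of reading one element with Beautiful Soup: the text only, the inner HTML, and the outer HTML including the wrapper tag. The markup is a shortened, hypothetical version of the page in the question. Because the title and the article both sit inside the same wrapper div, converting the wrapper to a string captures both at once.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question.
html = (
    '<div itemscope itemtype="http://schema.org/MedicalWebPage">'
    '<h1 class="page_article_title">Apple</h1>'
    '<article class="page_article_content">'
    '<p>It is cultivated worldwide.</p>'
    '</article>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find("div", attrs={"itemtype": "http://schema.org/MedicalWebPage"})

text_only = wrapper.get_text()          # sentences only, no markup
inner_html = wrapper.decode_contents()  # children as HTML, without the wrapper
outer_html = str(wrapper)               # wrapper tags plus everything inside

print(text_only)
print(inner_html)
print(outer_html)
```

Writing outer_html to a file gives the "with and inside this tag" result the question asks for.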

Anubhav Singh · Accepted Answer · 2017-10-23 06:54:20Z

Use BeautifulSoup to get the HTML element where you want to insert, then get the HTML you want to insert. Use insert() to add the new tag, then overwrite the original file.

from bs4 import BeautifulSoup
import requests

# Use Beautiful Soup to find the place where you want to insert.
# div_tag is the extracted div.
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'itemtype': 'http://schema.org/MedicalWebPage'})
# e.g.
# div_tag = <div itemscope itemtype="http://schema.org/MedicalWebPage">
# </div>

res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text, 'lxml')
insert_data = soup1.find('your div/data to insert')
# This inserts the tag into div_tag at child position 3.
div_tag.insert(3, insert_data)
# div_tag now contains the desired output. Write str(div_tag) over the original page_content.html.
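The template above relies on placeholders, so here is a small offline sketch of the same insert() idea. It uses two inline HTML strings in place of the saved file and the fetched page. The ids, tag names, and content are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical target document: the div that should receive new content.
target_soup = BeautifulSoup('<div id="target"><p>a</p></div>', "html.parser")
div_tag = target_soup.find("div", id="target")

# Hypothetical source document: where the content to insert comes from.
source_soup = BeautifulSoup("<section><p>new</p></section>", "html.parser")
insert_data = source_soup.find("p")

# insert(position, new_child) places the tag at that index among the
# children; using the current child count appends it at the end.
div_tag.insert(len(div_tag.contents), insert_data)

result_html = str(div_tag)
print(result_html)
```

Inserting a tag that belongs to another soup moves it, so source_soup no longer contains that paragraph afterwards.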
