Here is the case: I need to save a web page's source code as an HTML file. The page has many sections I don't need; I only want to save the source code of the article itself.

code:

from urllib.request import urlopen

page = urlopen('http://www.abcde.com')
page_content = page.read()

with open('page_content.html', 'wb') as f:
    f.write(page_content)

My code saves the whole page source, but how can I save only the part I want?

Explain:

<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>

I need to save the source code of this tag, including the tag itself and everything inside it, rather than extracting only the text within the tags.

The result I want is to save like this:

<div itemscope itemtype="http://schema.org/MedicalWebPage">

                    <div class="col-md-12 col-xs-12" style="padding-left:10px;">
                        <h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
                    </div>
                    <!--Article Start-->
                    <section class="page_article_div" id="print">
                        <article itemprop="text" class="page_article_content">
<p>
    <img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
    The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
    It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
    <strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
    <li>
        Germanic paganism</li>
    <li>
        Greek mythology</li>
</ol>
<p style="text-align: right;">
    【Jane】</p>
<p style="text-align: right;">
    Credit : Wiki</p>

                        </article>
                            <div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
                        <br />                  
                        <div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
                    </section>
                    <!--Article End-->
</div>
  • Use BeautifulSoup. Commented Oct 23, 2017 at 5:33
  • @andrew_reece I explained it wrong, sorry. I know I can use BeautifulSoup to extract the sentences I need, but now I need to save the whole source code inside the tag I wrote above (including the opening and closing tags as well). Commented Oct 23, 2017 at 5:39
  • Assign the opening div tag line to string1 and the closing tag to string2, then concatenate string1, the extracted string, and string2 into a single string and save it as a file. Commented Oct 23, 2017 at 5:40
  • Use bs4 to select the tag and save tag.prettify() to file Commented Oct 23, 2017 at 5:43
  • Use pyquery, which works like jQuery and makes DOM queries and manipulation easy. Commented Oct 23, 2017 at 5:52
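
The bs4 suggestion in the comments above can be sketched roughly as follows. This is a minimal illustration, not the asker's final code: the inline `html` string stands in for the downloaded page, and the output file name `article.html` is an assumption. It uses the built-in `html.parser` so no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page (assumption); the real markup would
# come from urlopen(...).read() as in the question.
html = """
<html><body>
<nav>site menu</nav>
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<h1 itemprop="name">Apple</h1>
<article itemprop="text"><p>Apple is red</p></article>
</div>
<footer>site footer</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first matching element, or None if nothing matches.
article = soup.select_one('div[itemtype="http://schema.org/MedicalWebPage"]')
if article is None:
    raise SystemExit("article container not found")

# str(tag) is the outer HTML: the opening tag, everything nested inside it,
# and the closing tag. article.get_text() would return only the text.
outer_html = str(article)

with open("article.html", "w", encoding="utf-8") as f:
    f.write(outer_html)
```

`article.prettify()` could be written instead of `str(article)` if re-indented output is preferred. Either way the parser may normalize the markup slightly; for example, a valueless attribute such as `itemscope` can come back as `itemscope=""`.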

4 Answers

1

My own solution here:

from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")
# str(tag) keeps the tag itself plus everything nested inside it
matches = [str(tag) for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]')]
# join with newlines; a ', ' separator would put stray commas into the HTML
html = '\n'.join(matches)
with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(html)

I am a beginner, so I tried to keep it as simple as possible. This is my answer, and it is working quite well at the moment :)


0

You can search for the tag by one of its properties, such as its class, tag name, or id, and save it in whatever format you want, as in the example below.

from bs4 import BeautifulSoup

soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')
# BeautifulSoup uses find_all(); find_elements_by_class_name is a Selenium method
tag_for_me = soup.find_all(class_='class_name_of_your_tag')
print(tag_for_me)

tag_for_me will have your required code.
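
A minimal sketch of looking a tag up by class, id, or another attribute with BeautifulSoup's own methods (`find`, `find_all`, `select_one`). The inline markup is a trimmed copy of the structure shown in the question, used here as an assumption in place of the real page.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page in the question (assumption).
html = """
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<h1 class="page_article_title" id="mask">Apple</h1>
<section class="page_article_div" id="print">
<article class="page_article_content"><p>Apple is red</p></article>
</section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword lookups take the bare class name, with no leading dot.
by_class = soup.find(class_="page_article_title")
by_id = soup.find(id="print")
# Any other attribute can be matched through attrs.
by_attr = soup.find("div", attrs={"itemtype": "http://schema.org/MedicalWebPage"})
# CSS selectors are where the leading dot belongs.
by_css = soup.select_one("h1.page_article_title")

print(by_class)
```

Calling a method that BeautifulSoup does not define, such as `find_elements_by_class_name`, is treated as a lookup for a child tag of that name, which returns None; that is where an error like 'NoneType' object is not callable comes from.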

2 Comments

I tried this: driver = BeautifulSoup(page_content, 'html.parser'), then tag_for_me = driver.find_elements_by_class_name('.page_article_title'), then print(tag_for_me). An error occurred on the second line: 'NoneType' object is not callable
The tag was probably not found through that property. Try searching with another property, or share your tag here.
0

You can use Beautiful Soup to get any HTML source you need.

import requests
from bs4 import BeautifulSoup

target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")

for elem in soup.find_all(attrs={"class":target_class}):
    if elem.text == target_text:
        print(elem)

Output:

<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>

2 Comments

I tried yours and it didn't print anything. I uploaded the output I expected above; is there no way to get all of that source code together? Does that mean I need to save the article title's HTML first and then the content's HTML (the article title and content are in different classes)?
You can do what you want to do with BeautifulSoup. Read through the docs; they're pretty good and they'll show you how to do what you want. This answer addresses your original request: "I need to save the source code with and inside this tag, not extract the sentences in the tags."
0

Use BeautifulSoup to find the place in the HTML where you want to insert, and get the HTML you want to insert. Use insert() to add the new tag, then overwrite the original file.

from bs4 import BeautifulSoup
import requests

# Use Beautiful Soup to find the place where you want to insert.
# div_tag is the extracted div
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'itemtype': 'http://schema.org/MedicalWebPage'})
# e.g.
# div_tag = <div itemscope itemtype="http://schema.org/MedicalWebPage">
# </div>

res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text, 'lxml')
insert_data = soup1.find('your div/data to insert')
# This inserts the tag into div_tag at position 3.
div_tag.insert(3, insert_data)
# div_tag now contains your desired output. Write it back to the original page_content.html.
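
A self-contained sketch of the insert() step, using two small inline documents in place of the real pages. The markup and the variable names (`target_soup`, `source_soup`, `container`) are illustrative assumptions, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Document that should receive the new content (assumption).
target_soup = BeautifulSoup(
    '<div itemscope itemtype="http://schema.org/MedicalWebPage"><h1>Apple</h1></div>',
    "html.parser",
)
# Document that holds the content to copy over (assumption).
source_soup = BeautifulSoup(
    "<body><nav>menu</nav><article><p>Apple is red</p></article></body>",
    "html.parser",
)

container = target_soup.find("div", attrs={"itemtype": "http://schema.org/MedicalWebPage"})
article = source_soup.find("article")

# insert(position, element) places the element at that index among the
# container's children. extract() first detaches it from its original tree.
container.insert(1, article.extract())

print(container)
```

The position counts every child node, including whitespace-only text nodes, so on real pages `container.append(...)` is often the simpler choice when the new content should just go last.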
