
I have a basic Python script that stores its scraped output to a file, but the file is difficult to parse. Is there another way to write scraped data to a file so that it can easily be read back into Python for analysis?

import requests
from bs4 import BeautifulSoup as BS
import json

data = 'C:/test.json'
url = "http://sfbay.craigslist.org/search/sby/sss?sort=rel&query=baby"

r = requests.get(url)
soup = BS(r.content, "html.parser")  # name a parser explicitly to avoid a warning
links = soup.find_all("p")
#print soup.prettify()

with open(data, 'a') as f:  # open the file once instead of once per link
    for link in links:
        connections = link.text
        f.write(json.dumps(connections, indent=1))

The output file contains this: " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "" $7500 Sep 5 GEORGE STECK BABY GRAND PLAYER PIANO $7500 (morgan hill) map musical instruments - by

4 Answers


If you want to write it from Python to a file and read it back into Python later, you can use pickle (see the Pickle Tutorial).
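As a rough sketch of the pickle approach this answer mentions (the sample `listings` list and the `listings.pkl` filename are made up for illustration):

```python
import pickle

# Hypothetical scraped data: a list of listing strings
listings = ["$25 Sep 5 Porcelain Baby Deer (sunnyvale)",
            "$7500 Sep 5 GEORGE STECK BABY GRAND PLAYER PIANO (morgan hill)"]

# Serialize the object to a binary file
with open("listings.pkl", "wb") as f:
    pickle.dump(listings, f)

# Deserialize later; you get an equal Python object back
with open("listings.pkl", "rb") as f:
    restored = pickle.load(f)
```

Because pickle restores real Python objects, there is no parsing step at all when you read the data back.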

Pickle files are binary and will not be human-readable; if that's important to you, look at YAML, which I'll admit has a bit of a learning curve, but which produces nicely formatted files.

import yaml

with open(filename, 'w') as f:
    f.write(yaml.dump(data))

...


with open(filename, 'r') as stream:
    data = yaml.safe_load(stream)  # safe_load won't execute arbitrary YAML tags

2 Comments

This will allow @Amrita Sawant to store object states, but I don't think it gets to the heart of the question, which was a good way of writing data so that it may be easily parsed by Python later. The scope of this question is a bit broad, admittedly.
Good point, I interpreted "parsed by Python" to mean "read from file", but you're right that may not have been the intent of the question. In that case it's a string manipulation question, not a file IO question.

It sounds like your question is more about how to parse the scraped data you get from craigslist, rather than how to deal with files. One way is to take each <p> element and tokenize the string by spaces. For example, tokenizing the string

"$25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner"

can be done using split:

s = " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "
L = s.split()  # split() with no argument strips the ends and splits on any run of whitespace

L is now a list with the values

['$25', 'Sep', '5', 'Porcelain', 'Baby', 'Deer', '$25', '(sunnyvale)', 'pic', 'household', 'items', '-', 'by', 'owner']

From here you could try to determine the meanings of the list elements by the order they appear. L[0] might always hold the price, L[1] the month, L[2] the day of month, etcetera. If you are interested in writing these values to file and parsing again later, consider reading up on the csv module.
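As a sketch of that csv suggestion (the `rows` data and `listings.csv` filename are invented for the example), the module round-trips lists of fields cleanly:

```python
import csv

# Hypothetical tokenized listings: [price, month, day, description]
rows = [["$25", "Sep", "5", "Porcelain Baby Deer (sunnyvale)"],
        ["$7500", "Sep", "5", "GEORGE STECK BABY GRAND PLAYER PIANO (morgan hill)"]]

# Write the rows out; csv handles quoting of embedded commas for you
with open("listings.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read them back as lists of strings
with open("listings.csv", newline="") as f:
    parsed = list(csv.reader(f))
```

Unlike hand-rolled delimiters, csv takes care of quoting, so a comma inside a description won't break the format.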

Comments

  1. Decide on what data you actually want. Prices? Descriptions? List dates?
  2. Decide on a good data structure to hold this information. I recommend a class containing pertinent fields, or lists.
  3. Scrape the data you NEED using regular expressions or one of many other methods.
  4. Throw away what you DON'T NEED.
  5. Either:
     a. write the list contents to a file in a format you can easily use later (XML, comma-delimited, etc.), or
     b. pickle the objects as recommended by Mike Ounsworth above.

If you aren't comfortable with XML parsing yet, just write a single line per link and delimit the fields you want with a character you can use later to split. e.g.:

import re  # I'm going to use regular expressions here

# Note [0-9], not [1-9], so prices containing zeros (like $50) still match
link_content_matcher = re.compile(r"""\$(?P<price>[0-9]{1,4})\s+(?P<list_date>[A-Z][a-z]{2}\s+[0-9]{1,2})\s+(?P<description>.*)\((?P<location>.*)\)""")

some_link = "$50    Sep 5 Baby Carrier - Black/Silver (san jose)"

# Grab the matches
matched_fields = link_content_matcher.search(some_link)

# Write what you want to a file using a delimiter that
# probably won't exist in the description. This is risky,
# but will do in a pinch.
with open('results.txt', 'w') as output_file:
    output_file.write("{price}^{date}^{desc}^{location}\n".format(price=matched_fields.group('price'),
        date=matched_fields.group('list_date'),
        desc=matched_fields.group('description'),
        location=matched_fields.group('location')))

When you want to revisit this data, read it back line by line from the file and parse each line using split.

with open('results.txt', 'r') as input_file:
    for line in input_file:
        price, date, desc, location = line.rstrip('\n').split('^')
        # Do something with this data or add it to a list

Comments

import re
import requests
from bs4 import BeautifulSoup as bs

url = "http://sfbay.craigslist.org/baa/"
r = requests.get(url)
soup = bs(r.content, "html.parser")

s = soup.find_all('a', class_=re.compile("hdrlnk"))  ### Titles
for i in s:
    scol = str(i.text)
    print(scol)

s1 = soup.find_all('span', class_=re.compile("price"))  ### Price
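Building on this answer's two `find_all` result sets, the titles and prices could be paired up with `zip` and written one listing per line (a sketch; the sample lists below are made-up stand-ins for the scraped `.text` values):

```python
# Hypothetical stand-ins for the .text values from the two find_all calls
titles = ["Porcelain Baby Deer", "Baby Carrier - Black/Silver"]
prices = ["$25", "$50"]

# Pair each title with its price; zip stops at the shorter list,
# so a mismatched result count truncates rather than raising
listings = [f"{price}^{title}" for title, price in zip(titles, prices)]

with open("results.txt", "w") as f:
    f.write("\n".join(listings))
```

The `^` delimiter follows the earlier answer's convention, so the same `split('^')` readback works here too.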
