
I have a basic Python script that stores its scraped output to a file, but the file is difficult to parse. Is there another way to write scraped data to a file so that it can easily be read back into Python for analysis?

import requests
from bs4 import BeautifulSoup as BS
import json

data = 'C:/test.json'
url = "http://sfbay.craigslist.org/search/sby/sss?sort=rel&query=baby"

r = requests.get(url)
soup = BS(r.content, "html.parser")  # name a parser explicitly to avoid a warning
links = soup.find_all("p")
#print soup.prettify()

with open(data, 'a') as f:  # open the file once instead of once per link
    for link in links:
        connections = link.text
        f.write(json.dumps(connections, indent=1))

The output file contains this: " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "" $7500 Sep 5 GEORGE STECK BABY GRAND PLAYER PIANO $7500 (morgan hill) map musical instruments - by

4 Answers


If you want to write it from Python to a file and read it back into Python later, you can use pickle (see the Pickle Tutorial).
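As a rough sketch of the pickle approach this answer mentions (the sample `listings` list and the `listings.pkl` filename are made up for illustration):

```python
import pickle

# Hypothetical scraped data: a list of listing strings
listings = ["$25 Sep 5 Porcelain Baby Deer (sunnyvale)",
            "$7500 Sep 5 GEORGE STECK BABY GRAND PLAYER PIANO (morgan hill)"]

# Serialize the object to a binary file
with open("listings.pkl", "wb") as f:
    pickle.dump(listings, f)

# Deserialize later; you get an equal Python object back
with open("listings.pkl", "rb") as f:
    restored = pickle.load(f)
```

Because pickle restores real Python objects, there is no parsing step at all when you read the data back.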

Pickle files are binary and will not be human-readable; if that's important to you, look at YAML, which I'll admit has a bit of a learning curve, but which produces nicely formatted files.

import yaml

with open(filename, 'w') as f:
    f.write(yaml.dump(data))

...


with open(filename, 'r') as stream:
    data = yaml.safe_load(stream)  # safe_load won't execute arbitrary YAML tags

2 Comments

This will allow @Amrita Sawant to store object states, but I don't think it gets to the heart of the question, which was a good way of writing data so that it may be easily parsed by Python later. The scope of this question is a bit broad, admittedly.
Good point, I interpreted "parsed by Python" to mean "read from file", but you're right that may not have been the intent of the question. In that case it's a string manipulation question, not a file IO question.

It sounds like your question is more about how to parse the scraped data you get from craigslist, rather than how to deal with files. One way is to take each <p> element and tokenize the string by spaces. For example, tokenizing the string

"$25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner"

can be done using split:

s = " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "
L = s.split()  # split() with no argument strips the ends and splits on any run of whitespace

L is now a list with the values

['$25', 'Sep', '5', 'Porcelain', 'Baby', 'Deer', '$25', '(sunnyvale)', 'pic', 'household', 'items', '-', 'by', 'owner']

From here you could try to determine the meanings of the list elements by the order they appear. L[0] might always hold the price, L[1] the month, L[2] the day of month, etcetera. If you are interested in writing these values to file and parsing again later, consider reading up on the csv module.
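As a sketch of that csv suggestion (the `rows` data and `listings.csv` filename are invented for the example), the module round-trips lists of fields cleanly:

```python
import csv

# Hypothetical tokenized listings: [price, month, day, description]
rows = [["$25", "Sep", "5", "Porcelain Baby Deer (sunnyvale)"],
        ["$7500", "Sep", "5", "GEORGE STECK BABY GRAND PLAYER PIANO (morgan hill)"]]

# Write the rows out; csv handles quoting of embedded commas for you
with open("listings.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read them back as lists of strings
with open("listings.csv", newline="") as f:
    parsed = list(csv.reader(f))
```

Unlike hand-rolled delimiters, csv takes care of quoting, so a comma inside a description won't break the format.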

Comments

  1. Decide on what data you actually want. Prices? Descriptions? List dates?
  2. Decide on a good data structure to hold this information. I recommend a class containing pertinent fields, or lists.
  3. Scrape the data you NEED using regular expressions or one of many other methods.
  4. Throw away what you DON'T NEED.
  5. Either:
     a. write the list contents to a file in a format you can easily use later (XML, comma-delimited, etc.), or
     b. pickle the objects as recommended by Mike Ounsworth above.

If you aren't comfortable with XML parsing yet, just write a single line per link and delimit the fields you want with a character you can use later to split. e.g.:

import re  # I'm going to use regular expressions here

# Note [0-9], not [1-9], so prices containing zeros (like $50) still match
link_content_matcher = re.compile(r"""\$(?P<price>[0-9]{1,4})\s+(?P<list_date>[A-Z][a-z]{2}\s+[0-9]{1,2})\s+(?P<description>.*)\((?P<location>.*)\)""")

some_link = "$50    Sep 5 Baby Carrier - Black/Silver (san jose)"

# Grab the matches
matched_fields = link_content_matcher.search(some_link)

# Write what you want to a file using a delimiter that
# probably won't exist in the description. This is risky,
# but will do in a pinch.
with open('results.txt', 'w') as output_file:
    output_file.write("{price}^{date}^{desc}^{location}\n".format(price=matched_fields.group('price'),
        date=matched_fields.group('list_date'),
        desc=matched_fields.group('description'),
        location=matched_fields.group('location')))

When you want to revisit this data, read it back line by line from the file and parse each line using split.

with open('results.txt', 'r') as input_file:
    for line in input_file:
        price, date, desc, location = line.rstrip('\n').split('^')
        # Do something with this data or add it to a list

Comments

import re
import requests
from bs4 import BeautifulSoup as bs

url = "http://sfbay.craigslist.org/baa/"
r = requests.get(url)
soup = bs(r.content, "html.parser")

s = soup.find_all('a', class_=re.compile("hdrlnk"))  ### Titles
for i in s:
    scol = str(i.text)
    print(scol)

s1 = soup.find_all('span', class_=re.compile("price"))  ### Price
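Building on this answer's two `find_all` result sets, the titles and prices could be paired up with `zip` and written one listing per line (a sketch; the sample lists below are made-up stand-ins for the scraped `.text` values):

```python
# Hypothetical stand-ins for the .text values from the two find_all calls
titles = ["Porcelain Baby Deer", "Baby Carrier - Black/Silver"]
prices = ["$25", "$50"]

# Pair each title with its price; zip stops at the shorter list,
# so a mismatched result count truncates rather than raising
listings = [f"{price}^{title}" for title, price in zip(titles, prices)]

with open("results.txt", "w") as f:
    f.write("\n".join(listings))
```

The `^` delimiter follows the earlier answer's convention, so the same `split('^')` readback works here too.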
