
I'm writing a Python scraper for OpenData and I have one question: how can I check whether every value is actually filled in on the site and, where a value is missing, store null instead?

My scraper is here.

I'm currently working on optimizing it.

My variables now look like:

    evcisloval = soup.find_all('td')[3].text.strip()
    prinalezival = soup.find_all('td')[5].text.strip()
    popisfaplnenia = soup.find_all('td')[7].text.replace('\"', '')
    hodnotafaplnenia = soup.find_all('td')[9].text[:-1].replace(",", ".").replace(" ", "")
    datumdfa = soup.find_all('td')[11].text
    datumzfa = soup.find_all('td')[13].text
    formazaplatenia = soup.find_all('td')[15].text
    obchmenonazov = soup.find_all('td')[17].text
    sidlofirmy = soup.find_all('td')[19].text
    pravnaforma = soup.find_all('td')[21].text
    sudregistracie = soup.find_all('td')[23].text
    ico = soup.find_all('td')[25].text
    dic = soup.find_all('td')[27].text
    cislouctu = soup.find_all('td')[29].text

And the output:

scraperwiki.sqlite.save(unique_keys=["invoice_id"],
                                    data={  "invoice_id":number,
                                            "invoice_price":hodnotafaplnenia,
                                            "evidence_no":evcisloval,
                                            "paired_with":prinalezival,
                                            "invoice_desc":popisfaplnenia,
                                            "date_received":datumdfa,
                                            "date_payment":datumzfa,
                                            "pay_form":formazaplatenia,
                                            "trade_name":obchmenonazov,
                                            "trade_form":pravnaforma,
                                            "company_location":sidlofirmy,
                                            "court":sudregistracie,
                                            "ico":ico,
                                            "dic":dic,
                                            "accout_no":cislouctu,
                                            "invoice_attachment":urlfa,
                                            "invoice_url":url})

I googled it but without success.

  • "if it is null change value to null": If it is null, then it is already null, so you do nothing. Do you mean "null" as a string? Commented Feb 11, 2015 at 9:53
  • If the values are not filled in, what will they be? Commented Feb 11, 2015 at 10:06
  • If you run a bulk upload, every field needs some value: a real date or null. The scraped site has two date values, one for when the invoice was accepted and one for when it was paid. Once the data is in Elastic, you can then select by these dates. Another use case is to select the Elastic fields with null values and then inform the site's system administrator that the data is incomplete and in conflict with the law. Commented Feb 11, 2015 at 11:47

2 Answers


First, write a configuration dict of your variables in the form:

conf = {'evidence_no': (3, str.strip),
        'trade_form': (21, None),
         ...}

That is, each key is the output key, and each value is a tuple of the cell index into soup.find_all('td') and an optional function to apply to the result (None if no function is needed). You don't need those Slavic variable names that may confuse other SO members.

Then iterate over conf and fill the data dict.

Also, run soup.find_all('td') before the loop.

tds = soup.find_all('td')

data = {}
for name, (num, func) in conf.items():  # .items() works on both Python 2 and 3
    text = tds[num].text

    # replace text with None or "NULL" or whatever if needed
    ...

    if func is None:
        data[name] = text
    else:
        data[name] = func(text)

This removes a lot of duplicated code and makes the scraper easier to maintain.

Also, I am not sure that the string "NULL" is the best way to represent missing data. Doesn't sqlite support Python's real None objects?
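To make the idea concrete, here is a minimal sketch of the loop with the missing-value step filled in. It assumes Python 3, uses a plain list of strings as a stand-in for `[td.text for td in soup.find_all('td')]`, and the helper name `build_record`, the cell indices, and the sample values are all made up for illustration. Blank cells become None, which `json.dumps` writes as `null`, so this may also suit a JSON export.

```python
import json

# Hypothetical subset of the configuration:
# output key -> (cell index, optional cleanup function)
conf = {
    "evidence_no": (0, str.strip),
    "date_payment": (1, None),
    "trade_form": (2, None),
}


def build_record(cell_texts, conf):
    """Map cell texts to output keys, storing None for blank or missing cells."""
    data = {}
    for name, (num, func) in conf.items():
        # A cell index past the end of the list is treated like an empty cell
        text = cell_texts[num] if num < len(cell_texts) else ""
        if func is not None:
            text = func(text)
        # Whitespace-only text counts as missing
        data[name] = text if text.strip() else None
    return data


# Stand-in for [td.text for td in soup.find_all('td')]
cells = ["  2015/001 ", "   ", "s.r.o."]
record = build_record(cells, conf)
print(json.dumps(record, sort_keys=True))
# {"date_payment": null, "evidence_no": "2015/001", "trade_form": "s.r.o."}
```

Whether a whitespace-only cell should really count as missing depends on the scraped site, so treat that check as an assumption to adjust.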


4 Comments

Definitely more elaborate than my suggestion. It might be overkill if the function is the same for every variable, though.
Many thanks, I'm going to rewrite the code without Slovak variable names. I need null because I plan to export the data as JSON and import it into Elastic, and if you need to do operations on a date field, it must be either set or null.
Yes, use None, not Null. SQLite and Postgres both support None.
Yes, but I'm planning to import the data into Elastic, and then you need a correct date value or null. The easiest way is to change the strings once the scrape finishes.

Just read your attached link, and it seems what you want is

evcisloval = soup.find_all('td')[3].text.strip() or "NULL"

But be careful. You should only do this with strings. If the expression before `or` is falsy (an empty string, False, None, or 0), it will be replaced with "NULL".
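If you ever need the same fallback for values that are not strings, an explicit check avoids that pitfall. This is only a sketch assuming Python 3, and `text_or_null` is a hypothetical helper name, not something from your scraper:

```python
def text_or_null(value, placeholder="NULL"):
    """Return placeholder only when value is None or a blank string."""
    if value is None:
        return placeholder
    if isinstance(value, str) and not value.strip():
        return placeholder
    return value


print(0 or "NULL")         # the plain `or` idiom discards the 0
print(text_or_null(0))     # the explicit check keeps it
print(text_or_null("  "))  # blank strings still become the placeholder
```

Passing `placeholder=None` instead would give you a real None rather than the string "NULL".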

2 Comments

This is my Python scraper. Maybe it helps someone: pastebin.com/j911H5qZ
