1

I'm scraping a website in order to store data in a database that has 3 columns. The part of the website I'm scraping looks like one of the three examples below:

# Example 1:
<div>
<a href="sample1">text1</a>
</div>

# Example 2:
<div>
<a href="sample1">text1</a>
<a href="sample2">text2</a>
</div>

# Example 3:
<div>
<a href="sample1">text1</a>
<a href="sample2">text2</a>
<a href="sample3">text3</a>
</div>

I'm trying to assign

  • "text1" to var1,
  • either an empty string or "text2" to var2,
  • either an empty string or "text3" to var3.

What is the best way to do this?

A few things I've tried:

### FIRST ATTEMPT
var1, var2, var3 = '','',''
# could also do var1, var2, var3 = ('',)*3
all = soup.find_all('a')

var1 = all[0].text

try:
    var2 = all[1].text
except IndexError:
    pass

try:
    var3 = all[2].text
except IndexError:
    pass

#### SECOND ATTEMPT
all = [s.text for s in soup.find_all('a')]
# This is where I get stuck... This could return a list of length 1, 2, or 3,
# and I need the output to be a list of length 3 so I can use the following
# line to assign variables
var1, var2, var3 = all

#### THIRD ATTEMPT
all = [s.text for s in soup.find_all('a')]
var1, var2, var3 = '','',''
n = len(all)
var1 = all[0]
if n == 2:
    var2 = all[1]
elif n >= 3:
    var2 = all[1]
    var3 = all[2]

EDIT: The reason I'm trying to have three fields in my db is that I want to be able to filter by each of these variables. var1 is the most accurate label, var2 is slightly less accurate, and var3 is accurate only at a high level. Think of it like clothing: var1 could be grey-slacks, var2 could be business-slacks, and var3 could be pants.

3
  • Might I ask why you're trying to do it this way? Instead, just assign the results of find_all to a list, and then you can use the objects within the list. Commented Nov 14, 2015 at 1:54
  • I'm trying to build out a database. The database will ultimately have three columns, as this section of code can have up to three values and I would like to capture them all. At some point I'll have to set a field to be blank, so I'm trying to figure out a neat way to do so rather than the brute-force if...elif... method. I might come across a time where there are more than 3 fields I'd like to capture, and I'd like to have a smoother way to handle that case, too. Commented Nov 14, 2015 at 1:56
  • OK, you can capture them all and still use other logic to control how/when they are written to your db. I'll write an answer below to help out if I can. Commented Nov 14, 2015 at 1:57

4 Answers

2

Your second attempt is probably more Pythonic. Of course, you don't know in advance whether .find_all will return a list of length 3 (or more, or fewer), so you should use try/except or other logic to control how and when the results are written to your database.

# create a dictionary of your database column names:
dbColumns = {0:'column1', 1:'column2', 2:'column3'}

# get all the results; there might be 0 or 3 or any number really,
#     we'll deal with that later
results = [s.text if s.text else "" for s in soup.find_all('a')]

# iterate the items in the list, and put each in its corresponding DB column
# (anything beyond the known columns is ignored)
for col in range(min(len(results), len(dbColumns))):
    # use the dbColumns dict to insert to the desired column
    # (string concatenation is for illustration only; prefer parameterized queries)
    query = "INSERT INTO [db_name] ([" + dbColumns[col] + "]) "
    query += "VALUES ('" + results[col] + "')"

    """
    db.insert(query)  # assumes a db object that has an "insert" function; modify as needed
    """

The point of this approach is that there seems to be nothing about this problem that technically would require hardcoding exactly three objects (var1, var2, var3) and trying to assign to these. Instead, just return the results of find_all and deal with them by their index within that resulting list.
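As a rough sketch of the same index-to-column idea with parameter binding instead of string concatenation: this assumes the standard-library sqlite3 module and a hypothetical `items` table, neither of which comes from the question, and it uses a plain list to stand in for the scraped texts.

```python
import sqlite3

# Hypothetical column names, mirroring the dbColumns idea above.
COLUMNS = ("column1", "column2", "column3")

def insert_labels(conn, labels):
    """Insert up to len(COLUMNS) labels as one row; omitted columns keep their default."""
    labels = list(labels)[:len(COLUMNS)]   # ignore any extra links
    cols = COLUMNS[:len(labels)]
    if not cols:
        return                             # nothing scraped, nothing to insert
    placeholders = ", ".join("?" for _ in cols)
    # Column names come from the fixed COLUMNS tuple, never from scraped text;
    # the scraped values themselves are passed as bound parameters.
    sql = "INSERT INTO items ({}) VALUES ({})".format(", ".join(cols), placeholders)
    conn.execute(sql, labels)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "column1 TEXT DEFAULT '', column2 TEXT DEFAULT '', column3 TEXT DEFAULT '')"
)

# Stands in for [s.text for s in soup.find_all('a')] on a page with two links.
insert_labels(conn, ["grey-slacks", "business-slacks"])
rows = conn.execute("SELECT column1, column2, column3 FROM items").fetchall()
```

Here the missing third label is handled by the column default in the table definition rather than by padding the list in Python.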


2

You can use some simple list multiplication:

# use a constant at the top of your script in case the number of columns
# changes in the future
COLUMNS = 3

# ... other code ...

all = [s.text for s in soup.find_all('a')]
all.extend(['']*(COLUMNS-len(all))) # append 1 empty string for each missing text field
var1, var2, var3 = all 

But as David Zemens has mentioned in the comments, there has got to be a better way to do this. I can't make any concrete suggestions without seeing the code that consumes your text variables, but you should seriously reconsider your design. Even if you use the constant like I suggested, having var1, var2, var3 = all is still going to make it difficult to maintain and modify this script in the future.
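One caveat with the extend approach: if the page ever contains more than COLUMNS links, nothing is appended and the three-way unpacking raises a ValueError. A minimal sketch of a helper that both truncates and pads is shown here; `pad` is a hypothetical name, and a plain list stands in for the scraped texts.

```python
COLUMNS = 3

def pad(values, size=COLUMNS, fill=''):
    """Return a new list of exactly `size` items, truncated or padded with `fill`."""
    values = list(values)[:size]                   # drop anything beyond `size`
    values.extend([fill] * (size - len(values)))   # add one `fill` per missing item
    return values

# Stands in for [s.text for s in soup.find_all('a')] on a page with two links.
var1, var2, var3 = pad(['text1', 'text2'])
```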


Based on your edit, I would suggest you use a dictionary instead. This will allow you to reference specific data by name, like you would reference a variable, but retains the flexibility of a list instead of restricting you to the number of variables you have hard coded.

For example:

all = [s.text for s in soup.find_all('a')]

d = {}
for i, field in enumerate(all): 
    d['var{}'.format(i)] = field

# later in your code that consumes this dictionary...

try:
    foo(d['var1']) # function to do something with the scraped string corresponding
                   # to var1
except KeyError:
    # do something else or pass when the expected data doesn't exist
    pass

If all is ['a', 'b'], then this code produces this:

{'var1': 'b', 'var0': 'a'}

Variable assignments are really nothing more than a mapping - your code knows the variable name, and it can look up the corresponding value. A dictionary lets your code build the mapping on the fly instead of you having to hard code it. Now we have built a dictionary where the varX variables are constructed dynamically. If you decide to add another column, you don't have to change this code at all. You just add your code that would use var4 and be ready to catch the exception if var4 doesn't exist in the dictionary. No more adding empty strings - your code is ready to handle the case where the data it's looking for doesn't exist.

Notes:

  1. The enumerate() function iterates over an iterable object and increments a counter for you, starting at 0. In my code, i is the counter (so we can construct the 'var0', 'var1'... strings), and field is each item from the list.
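If catching KeyError at every lookup feels heavy, dict.get with a default is another option. This is only a sketch: the list stands in for the scraped texts, and the names `first` and `missing` are made up for illustration.

```python
texts = ['a', 'b']  # stands in for [s.text for s in soup.find_all('a')]

# Same mapping as the loop above, built with a dict comprehension.
d = {'var{}'.format(i): field for i, field in enumerate(texts)}

# .get returns the second argument when the key is absent, so no try/except is needed.
first = d.get('var0', '')
missing = d.get('var2', '')
```

The trade-off is that a missing field silently becomes an empty string, which suits a blank database column but can hide a page layout change.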

15 Comments

This looks pretty good and solves the immediate problem. Of course, for maintainability, hardcoding the number of elements may not be ideal; the number of columns (and consequently the number of "variables") might need to change in the future.
@DavidZemens I strongly agree. I also agree with your earlier comments that there must be a better way to do this. I've adjusted my answer to suggest at least using a constant that will be easier to change if the OP decides to stick with this approach.
extend is perfect in this situation. As I'm thinking about db functionality, in order to quickly filter the db, I might make 4 columns: var_all, var1, var2, and var3. Since var1 is the most appropriate label while the other labels are secondary (think of clothing: var1 could be grey-slacks, var2 could be business-slacks, and var3 could be pants), being able to quickly filter all labels could be useful, too.
@exhoosier10 What David and I are getting at though, is there is probably a better way to do this than splitting out the list into explicit variables. These variables are being consumed by other code somewhere, right? If you can make that code operate on the list rather than individual variables, your script will be much easier to modify later. You're intentionally subtracting from the flexibility that lists are intended to provide.
@skrrgwasme I understand what you're saying. Assuming a field of "grey-slacks, business-slacks, pants", I should be able to filter the db by any of these values, or with something like where var_all = "grey-slacks%". So I'm starting to think splitting into variables might be a waste of resources.
1

How about:

all = soup.find_all('a')
var1 = all[0].text if len(all) > 0 else ""
var2 = all[1].text if len(all) > 1 else ""
var3 = all[2].text if len(all) > 2 else ""

The conditional expression x if y else z (often called a ternary operator) keeps the code simple and readable. It's not going to win any design awards though.
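The same per-index default can also be written once for all three names with itertools. This is a sketch of an alternative, not what the answer above uses, and the list stands in for the scraped texts.

```python
from itertools import chain, islice, repeat

texts = ['text1', 'text2']  # stands in for [a.text for a in soup.find_all('a')]

# Follow the real values with an endless supply of '' and take exactly three items.
var1, var2, var3 = islice(chain(texts, repeat('')), 3)
```

Because islice stops after three items, extra links are ignored rather than raising an error.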


0

You can try this:

# if list1 has an uncertain number of values and you want to give each one its own variable

# create list2 with the maximum number of possible variable names
list2 = ['var1', 'var2', 'var3', 'var4']  # ... and so on, up to the maximum

for li1, li2 in zip(list1, list2):
    globals()[li2] = li1  # creates a module-level variable named by li2
    print(li2)

I'm not a Python pro; I just figured this out on my own. It might not be very Pythonic, but it solves the problem.
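One thing to watch with the globals() version: zip stops at the shorter list, so names without a matching value are never created, and referring to them later raises NameError. A sketch of the same name-to-value pairing kept in an ordinary dict follows; `names`, `values`, and `labels` are hypothetical names, with `values` playing the role of list1.

```python
names = ['var1', 'var2', 'var3']
values = ['text1', 'text2']  # hypothetical scraped values (list1 above)

# Start every name at '' and then overwrite the ones that have a value.
labels = dict.fromkeys(names, '')
labels.update(zip(names, values))
```

Every expected name is then always present, and nothing is written into the module namespace.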
