0

The code below fetches each page with a GET request and appends every row of the result to the list RESULT:

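import pandas as pd

RESULT = []
for i in url:
    # read_html returns a list of DataFrames; take the first table on the page
    df = pd.read_html(i, header=0)[0]
    # as_matrix() was removed in pandas 1.0; .values.tolist() gives the same rows
    df = df.values.tolist()
    for item in df:
        RESULT.append(item)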

I use the code below to exclude duplicate entries:

def unique_items(RESULT):
    found = set()  # first-column values already yielded
    for item in RESULT:
        if item[0] not in found:
            yield item
            found.add(item[0])

NOT_DUBLICATE = list(unique_items(RESULT))
print(NOT_DUBLICATE)

This seems suboptimal to me, because I first have to collect all the rows in a list before I can exclude the duplicates.

How can I detect duplicates before loading the rows into the list RESULT?

For example, these are the rows I write to the list RESULT:

[[55323602, 'system']
,[55323603, 'system']]
[[55323602, 'system']
,[55323603, 'system']]
5
  • @msanford I don't think that's a suitable dupe - the OP isn't really eliminating duplicates; they're comparing the elements by item[0]. We're gonna need a "eliminate duplicates based on a key function" sort of question Commented May 7, 2018 at 14:04
  • He is asking something which can avoid duplicates before appending to the list. check my answer! Commented May 7, 2018 at 14:12
  • @Aran-Fey Fair observation; I'll retract. Phillip you may wish to rephrase your title. Commented May 7, 2018 at 14:12
  • I don't understand the problem. You say it's "necessary to get a list of all the rows to exclude duplicates", but that's not even true. Instead of building a list RESULT and then removing duplicates from that list, just skip the duplicates in the for item in df: loop. Commented May 7, 2018 at 14:19
  • @phillipwatts344, is it not possible to call drop_duplicates on your df before mapping it into a list? It would automatically drop all duplicates. Commented May 7, 2018 at 14:46

2 Answers

1

Instead of using a separate function to exclude duplicate entries, append an item to RESULT only if it is not already in the list. Then you don't need unique_items() at all.

You can find duplicates before loading a row into the list RESULT using this:

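for i in url:
    df = pd.read_html(i, header=0)[0]
    # as_matrix() was removed in pandas 1.0; .values.tolist() gives the same rows
    df = df.values.tolist()
    for item in df:
        # compares the whole row against everything already stored
        if item not in RESULT:
            RESULT.append(item)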
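
As a possible variation on the loop above: if rows should count as duplicates when only their first element matches (as unique_items() in the question does), one option is to keep a set of the first-column values seen so far and check it before appending. This is only a sketch; the pages list is a made-up stand-in for what pd.read_html(...)[0].values.tolist() might return for two URLs, and seen is a name introduced here for illustration.

```python
# Hypothetical stand-in for the scraped tables: each inner list plays the role
# of df.values.tolist() for one URL (not real data from the question's pages).
pages = [
    [[55323602, 'system'], [55323603, 'system']],
    [[55323602, 'system'], [55323604, 'system']],
]

RESULT = []
seen = set()  # first-column values that are already stored in RESULT

for rows in pages:
    for item in rows:
        key = item[0]
        if key not in seen:  # set lookup instead of scanning the RESULT list
            seen.add(key)
            RESULT.append(item)

print(RESULT)
```

Because membership is tested against a set rather than the list itself, the check should stay cheap as RESULT grows, and the rows keep the order in which they were first seen.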

1

Just use a set instead of a list.

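result = set()
for i in url:
    df = pd.read_html(i, header=0)[0]
    # as_matrix() was removed in pandas 1.0; .values.tolist() gives the same rows
    df_list = df.values.tolist()
    for item in df_list:
        # lists are not hashable, so store each row as a tuple
        result.add(tuple(item))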

The code above excludes any duplicate rows. The only difference from your case is that the elements of result are tuples instead of lists.

At the end, you can convert the set back to a list (keep in mind that a set does not preserve the original row order):

result = list(result)
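
A set of tuples treats two rows as duplicates only when every column matches. If the goal is instead one row per first element, a dict keyed on item[0] might be a closer fit. The following is a rough sketch with made-up rows; the 'admin' row is hypothetical and is there only to show the difference between the two approaches.

```python
# Hypothetical rows; the third one shares its id with the first but differs
# in the second column.
rows = [
    [55323602, 'system'],
    [55323603, 'system'],
    [55323602, 'admin'],
]

# Whole-row deduplication: all three tuples are distinct, so nothing is dropped.
whole_row = set(tuple(r) for r in rows)

# Key-based deduplication: keep the first row seen for each id.
# Dicts preserve insertion order in Python 3.7+, so the row order is kept.
by_key = {}
for r in rows:
    by_key.setdefault(r[0], r)

result = list(by_key.values())
print(len(whole_row), result)
```

With setdefault the first row for each key wins; assigning by_key[r[0]] = r instead would keep the last one.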

6 Comments

1) Your result is a dict, not a set. 2) item isn't defined in your loop. 3) item seems to be a list, and lists can't be stored in sets.
@Aran-Fey Thanks for first two points, I corrected them. Regarding #3, you are wrong. A set can be updated with iterables: docs.python.org/3/library/stdtypes.html#frozenset.update.
Yes, a set can be updated with an iterable, but that's not what we're trying to do here. We're trying to detect duplicate rows based on the first element, i.e. item[0]. Your code doesn't do that; it just tosses all the values in a row into a set. You end up with a list of values, not a list of rows.
If that is the case, the last edit should work fine.
Given OP's example, the second element always seems to be 'system' so my code technically compares based on the first element. @phillipwatts344, correct me if I am wrong.