I'm trying to filter the rows of a dataframe according to whether there is a certain value in one column:

def get_rid_of(data, not_in=None):
  '''Deletes rows from a dataframe if the addresses are foreign'''
  if not_in is None: 
    not_in = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON', 'PE', 'QC', 'SK', 'YT', 'MEX']
  not_in_precise = [", " + i + "," for i in not_in]
  
  for place in not_in_precise:
    boolean_data = data.drop([data['Address'].str.contains(place)], axis=0)

  array = data.to_numpy()
  mask = ma.masked_where(array, mask=boolean_data)
  mask_df = pd.DataFrame(mask)

  return mask_df

The dataset contains the locations of many Walmarts in North America. The "Address" column holds each store's address, including the state or province code (e.g., "CA" for California, "NY" for New York, "AB" for Alberta, or "QC" for Quebec). I'm trying to filter out the rows (the Walmarts) that are not in the US, i.e., the rows whose codes appear in the list "not_in". To make the match more precise within each row, I added the "not_in_precise" list, but I don't think it's strictly necessary.

Then with the for-loop, I tried to create a boolean object that represented which rows had foreign state codes (i.e., "AB", "BC", etc.) and should therefore be dropped.

I then tried to pass this boolean object to numpy's masked-array module ("ma") and, with the "pd.DataFrame" call that follows, create a new dataframe with the foreign state codes filtered out.

This raises the following KeyError:

KeyError: '[(True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False...

Could someone show me how I might be able to do this?

For reference, here is a subset of my dataframe below:

index   Latitude    Longitude   Location    Address
309 -94.171438  36.352592   Sam's Club; #4969,  Gas/Diesel,3500 SE Club Blvd; I-540 Exit 86, Bentonville, AR,72712
2993    -70.716625  41.954876   Walmart SC; #2336,  300 Colony Place Rd, Plymouth, MA,02360
905 -119.576351 36.70575    Walmart SC; #4238,  2761 Jensen Ave, Sanger, CA,93657
2711    -87.121413  37.720832   Walmart SC; #0701,  5031 Frederica St, Owensboro, KY,42301
5315    -80.081003  32.822709   Walmart SC; #1748,  Gas/Diesel,3951 W Ashley Circle, Charleston, SC,29414
2796    -89.984047  29.958078   Walmart SC; #0909,  Gas/Diesel,8101 W Judge Perez Dr, Chalmette, LA,70043
3624    -88.7491766 32.3821297  Murphy USA; #7452,  Gas/Diesel,2336 Hwy 19 N, Meridian, MS,39307
119 -87.9055252 30.6250114  Murphy USA; #6619,  Gas/Diesel,27521 WalMart Dr, Daphne, AL,36526
1813    -83.84777   34.2928 Walmart SC; #0510,  Gas/Diesel,400 Shallowford Rd, Gainesville, GA,30504
1443    -80.82825   27.222426   Walmart SC; #0814,  Gas/Diesel,2101 S Parrott Ave, Okeechobee, FL,34974
817 -117.098304 32.674348   Walmart SC; #5023,  1200 Highland Ave; I-5 Exit 11, National City, CA,91950
5271    -72.749455  46.553894   Walmart SC; #3647,  1600 Boul Royal, Shawinigan, QC,G9N 8S8
2865    -93.2695149 31.1197414  Murphy USA; #7534,  Gas/Diesel,2208 S 5th St, Leesville, LA,71446
6615    -97.414152  31.116911   Walmart SC; #6929,  Gas/Diesel,6801 W Adams Ave, Temple, TX,76502
6940    -117.047849 46.422968   Walmart SC; #2006,  Gas,306 5th St, Clarkston, WA,99403
6373    -95.569196  29.573397   Walmart SC; #2505,  Gas/Diesel,5501 Hwy 6, Missouri City, TX,77459
1829    -84.504404  34.181813   Walmart SC; #5814,  2200 Holly Springs Pkwy; I-575 Exit 14, Holly Springs, GA,30115
5572    -89.664482  35.545041   Walmart SC; #0093,  Gas/Diesel,201 Lanny Bridges Ave, Covington, TN,38019
4846    -95.3952406 35.9573273  Murphy USA; #6569,  Gas/Diesel,416 S Dewey Ave, Wagoner, OK,74467
5915    -95.6510462 30.3821669  Murphy USA; #6916,  Gas/Diesel,18702 Hwy 105W, Conroe, TX,77356

Ideally, after applying this function to my data, I'd like to get rid of the row with index 5271 (and potentially others, but I can't tell from this sample).

1 Answer

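Your problem is simple enough to solve with str.contains and boolean indexing: join the foreign codes into one regex, then keep only the rows whose Address does not match it.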

foreign_states = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON', 'PE', 'QC', 'SK', 'YT', 'MEX']
foreign_states_precise = [", " + i + "," for i in foreign_states]

df = df[~df.Address.str.contains('|'.join(foreign_states_precise), regex=True)]
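
To try the approach above in isolation, here is a minimal sketch built on three addresses copied from the question. The small sample frame, the extra row with a missing address, and the names pattern, is_foreign, and us_only are illustrative assumptions, not part of the original data. The na=False argument is an optional addition that treats missing addresses as "not foreign" so the mask stays purely boolean. For context, the KeyError in the question comes from drop, which expects index labels rather than a boolean Series.

```python
import pandas as pd

# Hypothetical sample: three addresses from the question plus one
# made-up row with a missing address (index 42 is an assumption).
df = pd.DataFrame(
    {
        "Address": [
            "Gas/Diesel,3500 SE Club Blvd; I-540 Exit 86, Bentonville, AR,72712",
            "1600 Boul Royal, Shawinigan, QC,G9N 8S8",
            "2761 Jensen Ave, Sanger, CA,93657",
            None,
        ]
    },
    index=[309, 5271, 905, 42],
)

foreign_states = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON',
                  'PE', 'QC', 'SK', 'YT', 'MEX']

# One alternation pattern such as ", AB,|, BC,|..." so a single pass
# over the column checks every code.
pattern = "|".join(", " + code + "," for code in foreign_states)

# True where the address contains a foreign code; missing addresses
# count as False because of na=False.
is_foreign = df["Address"].str.contains(pattern, regex=True, na=False)

# Boolean indexing keeps the rows where the mask is False.
us_only = df[~is_foreign]
print(us_only.index.tolist())
```

Because the mask is an ordinary boolean Series aligned on the index, no numpy masking or drop call is needed.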