I'm trying to filter the rows of a dataframe according to whether there is a certain value in one column:
import numpy.ma as ma
import pandas as pd

def get_rid_of(data, not_in=None):
    '''Deletes rows from a dataframe if the addresses are foreign'''
    if not_in is None:
        not_in = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON', 'PE', 'QC', 'SK', 'YT', 'MEX']
    not_in_precise = [", " + i + "," for i in not_in]
    for place in not_in_precise:
        boolean_data = data.drop([data['Address'].str.contains(place)], axis=0)
    array = data.to_numpy()
    mask = ma.masked_where(array, mask=boolean_data)
    mask_df = pd.DataFrame(mask)
    return mask_df
My dataset has the locations of many Walmarts in North America. The "Address" column holds each store's address, including the state or province code (e.g., "CA" for California, "NY" for New York, "AB" for Alberta, or "QC" for Quebec). I'm trying to filter out the rows (the Walmarts) that are not in the US, i.e., the rows whose codes appear in the list "not_in". To make the match more precise within each address, I added the "not_in_precise" list, which wraps each code as ", XX,", but I don't think it's strictly necessary.
Then, with the for-loop, I tried to create a boolean object representing which rows have foreign codes (e.g., "AB", "BC") and should therefore be dropped.
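To make the goal of that loop concrete, here is a minimal sketch of accumulating one boolean Series across several codes. The three-row frame `sample` and the name `is_foreign` are hypothetical, and I'm assuming the whole string shown in the sample rows is the value of the "Address" column.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real data (addresses copied from the sample rows).
sample = pd.DataFrame(
    {
        "Address": [
            "Walmart SC; #4238, 2761 Jensen Ave, Sanger, CA,93657",
            "Walmart SC; #3647, 1600 Boul Royal, Shawinigan, QC,G9N 8S8",
            "Walmart SC; #2006, Gas,306 5th St, Clarkston, WA,99403",
        ]
    },
    index=[905, 5271, 6940],
)

# Start with "no row is foreign" and OR in the matches for each code.
is_foreign = pd.Series(False, index=sample.index)
for code in ["AB", "QC"]:
    # regex=False treats ", QC," as a literal substring rather than a pattern.
    is_foreign |= sample["Address"].str.contains(f", {code},", regex=False)

print(is_foreign.tolist())  # → [False, True, False]
```

The result is a Series of True/False values aligned with the dataframe's index, which is the kind of object boolean indexing expects.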
I then tried to pass this boolean object to NumPy's masked-array module to (eventually, with the "pd.DataFrame" call that follows) create a new dataframe with the foreign rows filtered out.
This raises the following KeyError:
KeyError: '[(True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False...
Could someone show me how I might be able to do this?
For reference, here is a subset of my dataframe:
index Latitude Longitude Location Address
309 -94.171438 36.352592 Sam's Club; #4969, Gas/Diesel,3500 SE Club Blvd; I-540 Exit 86, Bentonville, AR,72712
2993 -70.716625 41.954876 Walmart SC; #2336, 300 Colony Place Rd, Plymouth, MA,02360
905 -119.576351 36.70575 Walmart SC; #4238, 2761 Jensen Ave, Sanger, CA,93657
2711 -87.121413 37.720832 Walmart SC; #0701, 5031 Frederica St, Owensboro, KY,42301
5315 -80.081003 32.822709 Walmart SC; #1748, Gas/Diesel,3951 W Ashley Circle, Charleston, SC,29414
2796 -89.984047 29.958078 Walmart SC; #0909, Gas/Diesel,8101 W Judge Perez Dr, Chalmette, LA,70043
3624 -88.7491766 32.3821297 Murphy USA; #7452, Gas/Diesel,2336 Hwy 19 N, Meridian, MS,39307
119 -87.9055252 30.6250114 Murphy USA; #6619, Gas/Diesel,27521 WalMart Dr, Daphne, AL,36526
1813 -83.84777 34.2928 Walmart SC; #0510, Gas/Diesel,400 Shallowford Rd, Gainesville, GA,30504
1443 -80.82825 27.222426 Walmart SC; #0814, Gas/Diesel,2101 S Parrott Ave, Okeechobee, FL,34974
817 -117.098304 32.674348 Walmart SC; #5023, 1200 Highland Ave; I-5 Exit 11, National City, CA,91950
5271 -72.749455 46.553894 Walmart SC; #3647, 1600 Boul Royal, Shawinigan, QC,G9N 8S8
2865 -93.2695149 31.1197414 Murphy USA; #7534, Gas/Diesel,2208 S 5th St, Leesville, LA,71446
6615 -97.414152 31.116911 Walmart SC; #6929, Gas/Diesel,6801 W Adams Ave, Temple, TX,76502
6940 -117.047849 46.422968 Walmart SC; #2006, Gas,306 5th St, Clarkston, WA,99403
6373 -95.569196 29.573397 Walmart SC; #2505, Gas/Diesel,5501 Hwy 6, Missouri City, TX,77459
1829 -84.504404 34.181813 Walmart SC; #5814, 2200 Holly Springs Pkwy; I-575 Exit 14, Holly Springs, GA,30115
5572 -89.664482 35.545041 Walmart SC; #0093, Gas/Diesel,201 Lanny Bridges Ave, Covington, TN,38019
4846 -95.3952406 35.9573273 Murphy USA; #6569, Gas/Diesel,416 S Dewey Ave, Wagoner, OK,74467
5915 -95.6510462 30.3821669 Murphy USA; #6916, Gas/Diesel,18702 Hwy 105W, Conroe, TX,77356
Ideally, after applying this function to my data, I'd like to get rid of the row with index 5271 (and potentially others, but I can't really tell from this subset).
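A minimal sketch of the behaviour I'm after, written with plain boolean indexing instead of `drop` and a masked array. The function name `keep_us_rows`, the `column` parameter, and the small `stores` frame are hypothetical; the code list is the one from my function, and the sketch assumes every foreign address contains its code in the form ", XX,".

```python
import re
import pandas as pd

# Codes from the original function (Canadian provinces/territories plus "MEX").
NOT_IN_DEFAULT = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON', 'PE', 'QC', 'SK', 'YT', 'MEX']

def keep_us_rows(data, not_in=None, column="Address"):
    """Return the rows whose address contains none of the ', XX,' codes in not_in."""
    if not_in is None:
        not_in = NOT_IN_DEFAULT
    # Build one alternation pattern such as ", AB,|, BC,|..." with each piece escaped.
    pattern = "|".join(re.escape(f", {code},") for code in not_in)
    # na=False treats missing addresses as "not foreign" (an assumption of this sketch).
    is_foreign = data[column].str.contains(pattern, regex=True, na=False)
    # Keep only the rows where the mask is False; the original index is preserved.
    return data[~is_foreign]

# Hypothetical stand-in for the real data, using three of the sample rows.
stores = pd.DataFrame(
    {
        "Address": [
            "Walmart SC; #4238, 2761 Jensen Ave, Sanger, CA,93657",
            "Walmart SC; #3647, 1600 Boul Royal, Shawinigan, QC,G9N 8S8",
            "Walmart SC; #2006, Gas,306 5th St, Clarkston, WA,99403",
        ]
    },
    index=[905, 5271, 6940],
)

us_only = keep_us_rows(stores)
print(us_only.index.tolist())  # → [905, 6940]
```

On this stand-in data, the row with index 5271 is removed and the other rows keep their original index labels.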