1

I have pandas dataframe

import pandas as pd
data = {"Column1": ["258 E SONORA ST SAN",
                    "57474 SAXONY WAY APT 223 WESLEY",  
                    "62748 CALIFORNIA ST APT 2 SAN",    
                    "3211 LONGLAKE DR FERN",    
                    "420 S PLYMOUTH CT APT 265",    
                    "AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG", 
                    "224-22 141 STREET RICHMOND",   
                    "15624 274TH ST CAMBRIA",   
                    "778 SANTO DOMINGO AVE SW PALM",    
                    "261 BROADMOOR DR SOUTH SIOUX"],    

        "Colum2" : ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO", "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"]}

df = pd.DataFrame(data)

df

Output


                  Column1                                 Colum2
0   258 E SONORA ST SAN                                 BERNARDINO
1   57474 SAXONY WAY APT 223 WESLEY                     CHAPEL
2   62748 CALIFORNIA ST APT 2 SAN                       FRANCISCO
3   3211 LONGLAKE DR FERN                               CREEK
4   420 S PLYMOUTH CT APT 265                           CHICAGO
5   AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG    VALLEY
6   224-22 141 STREET RICHMOND                          HILL
7   15624 274TH ST CAMBRIA                              HEIGHTS
8   778 SANTO DOMINGO AVE SW PALM                       BAY
9   261 BROADMOOR DR SOUTH SIOUX                        CITY

and I have a list of values at where I need to split the string in column1

split_city = ["ST","DR", "STREET", "AVE SW"]

I also want to include a split after APT and digit characters.

How can I Split a string of columns based on values in the list and add them to another column in Pandas DataFrame?

Required Output


                  Column1                                 Colum2
0   258 E SONORA ST                                     SAN BERNARDINO
1   57474 SAXONY WAY APT 223                            WESLEY CHAPEL
2   62748 CALIFORNIA ST APT 2                           SAN FRANCISCO
3   3211 LONGLAKE DR                                    FERN CREEK
4   420 S PLYMOUTH CT APT 265                           CHICAGO
5   AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR         LONG VALLEY
6   224-22 141 STREET                                   RICHMOND HILL
7   15624 274TH ST                                      CAMBRIA HEIGHTS
8   778 SANTO DOMINGO AVE SW                            PALM BAY
9   261 BROADMOOR DR                                    SOUTH SIOUX CITY
3
  • How did your data get improperly split in the first place? There may be a nice way to do this with Pandas, but my first instinct would be to use zip on the two lists before putting them into a dataframe. Note that you wouldn't get Wesley Chapel from any of your split choices. Commented Oct 5, 2021 at 4:38
  • For Wesley Chapel, I don't know how to split after APT and digits Commented Oct 5, 2021 at 4:48
  • Using zip and ...?? Commented Oct 5, 2021 at 4:49

1 Answer 1

1

I don't know if there's a nice way in Pandas to do this, but because the edge cases are so numerous here, it'd be easier to combine the addresses and then use regex instead of trying to split on your split choices (while also accounting for apartments):

Solution

import re


pattern = re.compile(r"([-\d\w ]*)\s(ST|WAY|DR|STREET|AVE|N|S|E|W|SW|SE|NW|NE|APT \d*)\s([\w ]*)")


column1 = ["258 E SONORA ST SAN",
           "57474 SAXONY WAY APT 223 WESLEY",
           "62748 CALIFORNIA ST APT 2 SAN",
           "3211 LONGLAKE DR FERN",
           "420 S PLYMOUTH CT APT 265",
           "AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG",
           "224-22 141 STREET RICHMOND",
           "15624 274TH ST CAMBRIA",
           "778 SANTO DOMINGO AVE SW PALM",
           "261 BROADMOOR DR SOUTH SIOUX"]


column2 = ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO", "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"]


combined = [" ".join(t) for t in zip(column1, column2)]
streets = []
cities = []
for t in (pattern.findall(s) for s in combined):
    *street, city = t[0]
    streets.append(" ".join(street))
    cities.append(city)


df = pd.DataFrame({"street": streets, "city": cities})

Output:

In [10]: pd.DataFrame({"street": streets, "city": cities})
Out[10]:
                                        street              city
0                              258 E SONORA ST    SAN BERNARDINO
1                     57474 SAXONY WAY APT 223     WESLEY CHAPEL
2                    62748 CALIFORNIA ST APT 2     SAN FRANCISCO
3                             3211 LONGLAKE DR        FERN CREEK
4                    420 S PLYMOUTH CT APT 265           CHICAGO
5  AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR       LONG VALLEY
6                            224-22 141 STREET     RICHMOND HILL
7                               15624 274TH ST   CAMBRIA HEIGHTS
8                     778 SANTO DOMINGO AVE SW          PALM BAY
9                             261 BROADMOOR DR  SOUTH SIOUX CITY
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.