0

I have a pandas data frame data with several columns. One of these columns is GEN. This column contains german cities as strings. Some of these cities are in a bad format, meaning that they have values like "Frankfurt a.Main". For every element in data['GEN'] I would like to replace every expression of the form "\.[A-ZÄÖÜ]" (i.e. dot followed by upper case letter) by the corresponding expression "\.\b[A-ZÄÖÜ]". For example

  • "Frankfurt a.Main" becomes "Frankfurt a. Main"
  • "Frankfurt a.d.Oder" becomes "Frankfurt a.d. Oder" and so on.

I am pretty sure that pandas.Series.str.contains and pandas.Series.str.replace are helpful here, but one of my problems is that I don't know how to put the replacement task in a form that can be used by the above functions.

2 Answers 2

1

You can use pandas.Series.str.replace to capture the two groups that compose a german city name in your original data and then add a whitespace between them.

Try this :

data['GEN'] = data['GEN'].str.replace(r'(\w+\s.*\.)(\w*)', r'\1 \2', regex=True)

# Output :

0      Frankfurt a. Main
1    Frankfurt a.d. Oder
Sign up to request clarification or add additional context in comments.

Comments

1

You could assert a dot to the left using a positive lookbehind (?<=\.) and match one of [A-ZÄÖÜ]

In the replacement use a space followed by the full match using \g<0>

import pandas as pd

pattern = r"(?<=\.)[A-ZÄÖÜ]"
items = [
    "Frankfurt a.Main",
    "Frankfurt a.d.Oder"
]
data = pd.DataFrame(items, columns=["GEN"])
data['GEN'] = data['GEN'].str.replace(pattern, r' \g<0>')
print(data)

Output

                   GEN
0    Frankfurt a. Main
1  Frankfurt a.d. Oder

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.