0

I have a data set that has columns for number of units sold in a given month - the problem being that the monthly units columns are named in MM/yyyy format, meaning that I have 12 columns of units information per record.

So for instance, my data looks like:

ProductID  |  CustomerID  |  04/2018  |  03/2018  |  02/2018  |  FileDate  |
a1032      |  c1576       |     36    |     12    |     19    | 04/20/2018 |

What causes this to be problematic is that a new file comes in every month, with the same file name, but different column headers for the units information based on the last 12 months.

What I would like to do, is rename the monthly units columns to Month1, Month2, Month3... based on a simple regex such as ([0-9]*)/([0-9]*) that will result in the output:

ProductID  |  CustomerID  |   Month1  |   Month2  |   Month3  |  FileDate  |
a1032      |  c1576       |     36    |     12    |     19    | 04/20/2018 |

I know that this should be possible using Python, but as I have never used Python before (I am an old .Net developer) I honestly have no idea how to achieve this.

I have done a bit of research on renaming columns in Python, but none of them mentioned pattern matching to rename a column, eg:

 df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})

UPDATE: The data that I am showing in my example is only a subset of the columns; total, in my data set I have 120 columns, only 12 of which need to be renamed, this is why I thought that regex might be the simplest way to go.

3 Answers 3

1
import re

# regex pattern
pattern = re.compile("([0-9]*)/([0-9]*)")

# get headers as list
headers = list(df)

# apply regex
months = 1
for index, header in enumerate(headers):
    if pattern.match(header):
        headers[index] = 'Month{}'.format(months)
        months += 1

# set new list as column headers
df.columns = headers
Sign up to request clarification or add additional context in comments.

Comments

1

If you have some set names that you want to convert to, then rather than using rename, it might easier to just pass a new list to the df.columns attribute

df.columns = ['ProductID','CustomerID']+['Month{}'.format(i) for i in range(12)]+['FileDate']

If you want to use rename, if you can write a function find_new_name that does the conversion you want for a single name, you can rename an entire list old_names with

df.rename(columns = {oldname:find_new_name(old_name) for old_name in old_names})

Or if you have a function that takes a new name and figures out what old name corresponds to it, then it would be

df.rename(columns = {find_old_name(new_name):new_name for new_name in new_names})

You can also do

for new_name in new_names:
    old_name = find_new_name(old_name)
    df[new_name] = df[old_name]

This will copy the data into new columns with the new names rather than renaming, so you can then subset to just the columns you want.

1 Comment

Thanks for the answer - one of the reasons that I was considering Regex is because my full data set contains 120 columns. I realized I did not include this information in my question and have updated it accordingly
1

Since rename could take a function as a mapper, we could define a customized function which returns a new column name in the new format if the old column name matches regex; otherwise, returns the same column name. For example,

import re


def mapper(old_name):
    match = re.match(r'([0-9]*)/([0-9]*)', old_name)
    if match:
        return 'Month{}'.format(int(match.group(1)))
    return old_name

df = df.rename(columns=mapper)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.