1

I have a pandas dataframe column (Data Type) which I want to split into three columns

target_table_df = LoadS_A [['Attribute Name',
                              'Data Type',
                              'Primary Key Indicator']]

Example input (target_table_df)

                 Attribute Name      Data Type Primary Key Indicator
0                       ACC_LIM  DECIMAL(18,4)                 False
1                        ACC_NO   NUMBER(11,0)                 False
2                   ACC_OPEN_DT           DATE                 False
3                          ACCB  DECIMAL(18,4)                 False
4                          ACDB  DECIMAL(18,4)                 False
5                     AGRMNT_ID   NUMBER(11,0)                  True
6                     BRNCH_NUM   NUMBER(11,0)                 False
7                      CLRD_BAL  DECIMAL(18,4)                 False
8              CR_INT_ACRD_GRSS  DECIMAL(18,4)                 False
9               CR_INT_ACRD_NET  DECIMAL(18,4)                 False

I aim to:

  • Reassign 'Data Type' to the text preceding the parenthesis

[..if parenthesis exists in 'Data Type']:

  • Create new column 'Precision' and assign to first comma separated value
  • Create new column 'Scale' and assign to second comma separated value

Intended output would therefore become:

    Data Type   Precision   Scale
0   decimal 18  4
1   number  11  0
2   date        
3   decimal 18  4
4   decimal 18  4
5   number  4   0

I have tried in anger to achieve this but i'm new to dataframes....can't work out if I am to iterate over all rows or if there is a way to apply to all values in the dataframe?

Any help much appreciated

2
  • 1
    Please provide sample input and output. Commented May 27, 2016 at 13:30
  • Sample output provided. Sample input is what is marked as 'Data Type' values. Cheers for any help Commented May 27, 2016 at 13:45

1 Answer 1

3

Use target_table_df['Data Type'].str.extract(pattern)

You'll need to assign pattern to be a regular expression that captures each of the components you're looking for.

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'

([^\(]+) says grab as many non-open parenthesis characters you can up to the first open parenthesis.

\(([^,]*, says to grab the first set of non-comma characters after an open parenthesis and stop at the comma.

,(.*)\) says to grab the rest of the characters between the comma and the close parenthesis.

(\(([^,]*),(.*)\))? says the whole parenthesis thing may not even happen, grab it if you can.

Solution

everything together looks like this:

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
df = s.str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]

# Formatting to get it how you wanted
df.columns = ['Data Type', 'Precision', 'Scale']
df.index.name = None
print df

I put a .iloc[:, [0, 2, 3]] at the end because the pattern I used grabs the whole parenthesis in column 1 and I wanted to skip it. Leave it off and see.

  Data Type Precision Scale
0   decimal        18     4
1    number        11     0
2      date       NaN   NaN
3   decimal        18     4
4   decimal        18     4
5    number        11     0
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.