2

Hi I'm trying to create multiple columns in my dataframe based on multiple lines within the [comment] column cell. The source data is a .csv file

This is my dataset sample

+---------+-----------------------------------------+
| id      | comment                                 | 
+---------+-----------------------------------------+
| 123ab12 | DATE: 2/3/21 10:23:42 AM CST            |  
|         | STAGE: 1                                | 
|         | SCORE: 2,321                            |  
|         | NAME: Sally                             |  
|         | HOBBY: Swimming                         |  
|         | NOTES: But she doesn't like: sun, fish  |
+---------+-----------------------------------------+
| 123ab12 | DATE: 4/3/21 8:15:20 AM CST             |  
|         | STAGE: 1                                |  
|         | SCORE: 500                              |  
|         | NAME: Tom                               |  
|         | HOBBY: Running                          |  
|         | AGE: 26                                 |  
|         | NOTES: He needs new pair of sport shoes |
+---------+-----------------------------------------+

This is what I want to get

+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| id      | date                   | stage | score | name  | hobby    | age | notes                            |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 2/3/21 10:23:42 AM CST | 1     | 2,321 | Sally | Swimming |     | But she doesn't like: sun, fish  |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 4/3/21 8:15:20 AM CST  | 1     | 500   | Tom   | Running  | 26  | He needs new pair of sport shoes |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+

Note that :

  • Some comments may have an additional line for AGE
  • Colon : may appear twice in NOTES in the [comment] column, e.g NOTES: bla bla bla : further sentence
  • ID can be duplicated
  • There are different IDs and thousands of rows

My initial thought was to :

  • somehow use a regex to use the line breaks \n before NOTES: as column separator (but the AGE line that sometimes appear seem to mess it up, or my brain just isn't working...)

Your help is much appreciated. Thank you!

4
  • 1
    What is your source data. It is best to solve it before you create the initial dataframe. If we need to work with this dataframe, we can do so. Also, have you tried anything to solve this? If so what haver you trtied Commented Mar 9, 2021 at 2:14
  • Hi joe, the source data is a csv file, and I will be outputting the data into a csv as well after transforming it. I tried regex but didn't make much sense on my end :( Commented Mar 9, 2021 at 9:15
  • can you share sample records from the source csv file please? It will be best to solve it at source than to move it into a dataframe and then rearrange then convert it back to csv Commented Mar 10, 2021 at 17:49
  • 1
    Please don't add your data as images, but instead copy it into the post. This makes it easier for others to work on your problem. There are online table generators that can help you make pretty text or markdown tables in no time. Commented Mar 10, 2021 at 23:45

1 Answer 1

1

You can use str.extract and a regex with named capture groups to directly capture the extracted data into dataframe columns with the respective group names (see this answer to a question on pandas split list into columns with regex).

You can use the fixed parts of your comments column (i.e. the labels and newlines) as anchors and make the AGE: part optional.

Regex:

DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n(?:AGE: )?(?P<age>[\s\S]*?)(\n)?NOTES:(?P<notes>[\s\S]+)

Explanation:

  1. Each column is extracted by the pattern of the form ANCHOR: (?P<groupname>[\s\S]+)\n
  • ANCHOR: - This are just your plain text labels, i.e. DATE: , STAGE: , etc.
  • (?P<groupname>- This starts a named capture group. The <groupname> directly becomes the dataframe column name.
  • [\s\S]+ - greedy match any series of (at least one) characters (including newlines, see this answer)
  1. For the column age, we need some changes, as the AGE: anchor may be present or not:
  • [\s\S]+? - the last group right before the AGE: anchor is matched lazily, else it would greedily include the whole AGE: part that follows in the match
  • (?:AGE: )? - the AGE: anchor itself is enclosed in an optional non-capturing group, since it may be there or not
  • (?P<age>[\s\S]*?) - the named capture group for age is permitted to be empty, unlike the others
  • (?:\n)? - the trailing newline character is of course also optional and should not be captured

Altogether, this will find matches in your string whether the AGE: part is there (https://regex101.com/r/tn6ixo/2/) or not (https://regex101.com/r/tn6ixo/1/).

Full example:

Input CSV file (comments.csv):

id;comments
123ab12;"DATE: 2/3/21 10:23:42 AM CST
STAGE: 1
SCORE: 2,321
NAME: Sally
HOBBY: Swimming
NOTES: But she doesn't like: sun, fish"
123ab12;"DATE: 4/3/21 8:15:20 AM CST
STAGE: 1
SCORE: 500
NAME: Tom
HOBBY: Running
AGE: 26
NOTES: He needs new pair of sport shoes"

Python script:

import pandas as pd

df = pd.read_csv('comments.csv', delimiter=';')

ef =  df['comments'].str.extract('DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n(?:AGE: )?(?P<age>[\s\S]*?)(?:\n)?NOTES:(?P<notes>[\s\S]+)', expand=True)

Outcome:

                     date stage  score   name     hobby age                              notes
0  2/3/21 10:23:42 AM CST     1  2,321  Sally  Swimming        But she doesn't like: sun, fish
1   4/3/21 8:15:20 AM CST     1    500    Tom   Running  26   He needs new pair of sport shoes

Supplemental

Please note that this will yield a dataframe where all columns have dtype: object. You may want to convert some of your columns, e.g.

ef[['stage', 'age']] = ef[['stage', 'age']].apply(pd.to_numeric)
ef['score'] = ef['score'].str.replace(',', '').astype(int)
ef[['name', 'hobby', 'notes']] = ef[['name', 'hobby', 'notes']].astype('string')
ef['date'] = pd.to_datetime(ef['date'])

Beware, the latter command will not be able to automatically identify your timezone correctly, as CST is an ambiguous timezone. Instead, you will end up with naive timestamps.

To add your timezone information, you can e.g. add a timezone with pytz :

import pytz

ef['date'] = ef['date'].apply(lambda x: x.replace(tzinfo=pytz.timezone('America/Chicago')))

You can also create a DatetimeIndex and localize or convert to your timezone.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.