Create a dataframe using specific strings in a column from a parent dataframe

Question

I am facing the following problem. My dataframe is as follows:

I want to create three datasets from this dataframe:

The response column stays and the context keeps only the first string, so Tweet1, Tweet3, Tweet6, Tweet7 and Tweet11.
The response column stays and the context keeps the first and second strings, so Tweet1, Tweet2, Tweet3, Tweet4, Tweet6, Tweet7, Tweet8, Tweet11 and Tweet12.
The response column stays and the context keeps the first, second and third strings, so Tweet1, Tweet2, Tweet3, Tweet4, Tweet5, Tweet6, Tweet7, Tweet8, Tweet9, Tweet11 and Tweet12.

All the tweets in the context column are in a list as shown above and they are separated using a comma.

I appreciate your responses and comments.

Eelco van Vliet · Accepted Answer · 2022-03-19 16:15:27Z


Based on your new information, I will now mimic the reading of the JSON file like this:

import pandas as pd
from io import StringIO

file_as_str="""
[
{"label":1, "response" : "resp_exmaple1", "context": ["tweet1,with comma", "tweet2"]},
{"label":0, "response" : "resp_exmaple2", "context": ["tweet3", "tweet4", "tweet5"]},
{"label":1, "response" : "resp_exmaple3", "context": ["tweet6, with comma"]},
{"label":1, "response" : "resp_exmaple4", "context": ["tweet7", "Tweet8", "Tweet9", "Tweet10"]},
{"label":0, "response" : "resp_exmaple5", "context": ["tweet11", "Tweet12"]}
]
"""
tweets_json = StringIO(file_as_str)

The string above only mimics a file; it can be read like this:

tweets = pd.read_json(tweets_json, orient='records')

If the structure is indeed like my example, you should pass orient='records'; if it is different, you may need to pick another orient. The dataframe now looks like:

   label       response                            context
0      1  resp_exmaple1        [tweet1,with comma, tweet2]
1      0  resp_exmaple2           [tweet3, tweet4, tweet5]
2      1  resp_exmaple3               [tweet6, with comma]
3      1  resp_exmaple4  [tweet7, Tweet8, Tweet9, Tweet10]
4      0  resp_exmaple5                 [tweet11, Tweet12]

The difference is that the context column now contains lists of strings, so the commas inside a tweet don't matter. Now you can easily select at most a given number of tweets per row like this:

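context = tweets["context"]

# maximum number of tweets to keep per row
max_tweets = 2
new_context = list()

for tweet_list in context:
    # never take more tweets than the row actually has
    n_selection = min(len(tweet_list), max_tweets)
    tweets_selection = tweet_list[:n_selection]
    new_context.append(tweets_selection)
# overwrite the context column with the shortened lists
tweets["context"] = new_context

The result looks like this: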

   label       response                      context
0      1  resp_exmaple1  [tweet1,with comma, tweet2]
1      0  resp_exmaple2             [tweet3, tweet4]
2      1  resp_exmaple3         [tweet6, with comma]
3      1  resp_exmaple4             [tweet7, Tweet8]
4      0  resp_exmaple5           [tweet11, Tweet12]
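The question asks for three datasets (one, two and three context tweets). As a minimal sketch, assuming the same sample data as above, the loop can be wrapped in a small helper and called once per limit. The helper name truncate_context and the datasets dictionary are hypothetical names introduced here for illustration; they are not part of pandas.

```python
import pandas as pd
from io import StringIO

# Same sample records as above, standing in for the real JSON file.
file_as_str = """
[
{"label":1, "response" : "resp_exmaple1", "context": ["tweet1,with comma", "tweet2"]},
{"label":0, "response" : "resp_exmaple2", "context": ["tweet3", "tweet4", "tweet5"]},
{"label":1, "response" : "resp_exmaple3", "context": ["tweet6, with comma"]},
{"label":1, "response" : "resp_exmaple4", "context": ["tweet7", "Tweet8", "Tweet9", "Tweet10"]},
{"label":0, "response" : "resp_exmaple5", "context": ["tweet11", "Tweet12"]}
]
"""
tweets = pd.read_json(StringIO(file_as_str), orient="records")


def truncate_context(df, max_tweets):
    """Return a copy of df whose context lists hold at most max_tweets items."""
    out = df.copy()
    # Slicing past the end of a list is safe, so short lists are kept whole.
    out["context"] = out["context"].apply(lambda tweet_list: tweet_list[:max_tweets])
    return out


# One dataframe per limit; the original tweets dataframe is left unchanged.
datasets = {n: truncate_context(tweets, n) for n in (1, 2, 3)}

print(datasets[1])
print(datasets[3])
```

Because each tweet is a separate list element, a tweet such as "tweet1,with comma" stays intact no matter how many commas it contains.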

edited Mar 19, 2022 at 16:15

answered Mar 19, 2022 at 12:49

Eelco van Vliet


8 Comments

CD_NS Over a year ago

Thank you so much for your detailed answer. I appreciate it. It is working as expected. However, I am facing another issue. Let's say tweet 1 is "A minor child deserves privacy and should be kept out of politics . Pamela Karlan , you should be ashamed of your very angry and obviously biased public pandering , and using a child to do it ." As you can see, this tweet has multiple commas, and with your code I am splitting on ",". As a result, tweet 1 is broken into multiple pieces. I want to keep the first tweet intact despite its multiple commas. How can I achieve that?

Eelco van Vliet Over a year ago

Is your context column really a single string, as I have assumed so far? Or are you reading from a csv file, and do you have a real list from the start? In the latter case, it is probably easy to create the context_df. If you show how your data is stored, I can say more about it.

CD_NS Over a year ago

It is basically a JSON file. I am reading it using pandas. The context column is a list of strings, but some strings contain multiple commas. As a result, with your code, some rows work and some do not.

CD_NS Over a year ago

Would converting the JSON file to a CSV be a good idea?

Eelco van Vliet Over a year ago

But in the JSON file, your context column probably looks like [["Tweet1", "Tweet2"], ["Tweet3", "Tweet4", "Tweet5"],...], i.e. a list of lists of strings. Tweet1 and Tweet2 are still separate at this point. Is that correct? If so, you should try to make use of that, because once the different tweets are merged into one string it is more difficult to separate them.
