I am facing the following problem. My Dataframe is as follows,

pic2

I want to create three datasets from this DataFrame:

  1. Keep the response column and only the first string of the context: Tweet1, Tweet3, Tweet6, Tweet7 and Tweet11.
  2. Keep the response column and the first and second strings of the context: Tweet1, Tweet2, Tweet3, Tweet4, Tweet6, Tweet7, Tweet8, Tweet11 and Tweet12.
  3. Keep the response column and the first, second and third strings of the context: Tweet1, Tweet2, Tweet3, Tweet4, Tweet5, Tweet6, Tweet7, Tweet8, Tweet9, Tweet11 and Tweet12.

All the tweets in the context column are in a list as shown above and they are separated using a comma.

I appreciate your responses and comments.

1 Answer

Based on your new information, I will now mimic reading the JSON file like this:

import pandas as pd
from io import StringIO

file_as_str="""
[
{"label":1, "response" : "resp_exmaple1", "context": ["tweet1,with comma", "tweet2"]},
{"label":0, "response" : "resp_exmaple2", "context": ["tweet3", "tweet4", "tweet5"]},
{"label":1, "response" : "resp_exmaple3", "context": ["tweet6, with comma"]},
{"label":1, "response" : "resp_exmaple4", "context": ["tweet7", "Tweet8", "Tweet9", "Tweet10"]},
{"label":0, "response" : "resp_exmaple5", "context": ["tweet11", "Tweet12"]}
]
"""
tweets_json = StringIO(file_as_str)

The string above only stands in for a file on disk. You read it the same way you would read a real file:

tweets = pd.read_json(tweets_json, orient='records')

If the structure of your file is indeed like my example, you should pass orient='records'; if it is different, you may need to pick another orient scheme. The dataframe now looks like:

   label       response                            context
0      1  resp_exmaple1        [tweet1,with comma, tweet2]
1      0  resp_exmaple2           [tweet3, tweet4, tweet5]
2      1  resp_exmaple3               [tweet6, with comma]
3      1  resp_exmaple4  [tweet7, Tweet8, Tweet9, Tweet10]
4      0  resp_exmaple5                 [tweet11, Tweet12]

The difference is that the context column now contains lists of strings, so the commas inside a tweet don't matter. Now you can easily select a maximum number of tweets per row like this:

context = tweets["context"]

max_tweets = 2
new_context = list()

for tweet_list in context:
    n_selection = min(len(tweet_list), max_tweets)
    tweets_selection = tweet_list[:n_selection]
    new_context.append(tweets_selection)
tweets["context"] = new_context

The result looks like

   label       response                      context
0      1  resp_exmaple1  [tweet1,with comma, tweet2]
1      0  resp_exmaple2             [tweet3, tweet4]
2      1  resp_exmaple3         [tweet6, with comma]
3      1  resp_exmaple4             [tweet7, Tweet8]
4      0  resp_exmaple5           [tweet11, Tweet12]
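
To get the three datasets from the question in one go, the same slicing can be wrapped in a small helper and called with 1, 2 and 3. This is only a sketch: the helper name truncate_context and the datasets dictionary are made up for illustration, and the frame is built from a list of dicts here instead of pd.read_json so the snippet runs on its own. It assumes every context cell is already a list of strings.

```python
import pandas as pd

# Same sample records as above, built directly so the sketch is self-contained.
records = [
    {"label": 1, "response": "resp_exmaple1", "context": ["tweet1,with comma", "tweet2"]},
    {"label": 0, "response": "resp_exmaple2", "context": ["tweet3", "tweet4", "tweet5"]},
    {"label": 1, "response": "resp_exmaple3", "context": ["tweet6, with comma"]},
    {"label": 1, "response": "resp_exmaple4", "context": ["tweet7", "Tweet8", "Tweet9", "Tweet10"]},
    {"label": 0, "response": "resp_exmaple5", "context": ["tweet11", "Tweet12"]},
]
tweets = pd.DataFrame(records)


def truncate_context(df, max_tweets):
    """Return a copy of df whose context lists hold at most max_tweets items.

    Hypothetical helper, not part of pandas. Slicing past the end of a
    list is safe in Python, so no explicit min() is needed.
    """
    out = df.copy()
    out["context"] = out["context"].apply(lambda tweet_list: tweet_list[:max_tweets])
    return out


# One dataframe per requested context length; the original stays untouched.
datasets = {n: truncate_context(tweets, n) for n in (1, 2, 3)}

print(datasets[1])
print(datasets[2])
print(datasets[3])
```

Because each call works on a copy, tweets keeps its full context lists and the three results do not affect each other.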

8 Comments

Thank you so much for your detailed answer. I appreciate it. It is working as expected. However, I am facing another issue. Let's say tweet 1 is "A minor child deserves privacy and should be kept out of politics . Pamela Karlan , you should be ashamed of your very angry and obviously biased public pandering , and using a child to do it ." As you can see, this tweet has multiple commas, and with your code I am splitting on ",". As a result, tweet1 is broken into multiple pieces. I want to keep the first tweet intact despite its multiple commas. How can I achieve that?
Is your context column really a list of strings, as I assumed right now? Or are you reading from a csv file and do you have a real list from the start? In the latter case, it is probably easy to create the context_df. You should provide the way you have your data and I can say something about it
It is basically a JSON file. I am reading it using pandas. The context column is a list of strings, but some strings have multiple commas. As a result, with your code, some rows work and some do not.
Would converting the JSON file to a CSV be a good idea?
But in the JSON file, your context column probably looks like [["Tweet1", "Tweet2"], ["Tweet3", "Tweet4", "Tweet5"],...], i.e. a list of lists of strings. Tweet1 and Tweet2 are separate at that point. Is that correct? If so, you should try to make use of that, because once the different tweets are merged into one string it is more difficult to separate them.
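
The point made in the last comment can be seen with a tiny example. This is a minimal sketch using made-up variable names; it assumes a row's context is available as a list of strings, as in the JSON example from the answer.

```python
# One row's context, still a list of separate tweets.
tweet_list = ["tweet1,with comma", "tweet2"]

# Merging into a single string and splitting on "," cuts through
# the comma that belongs to the first tweet.
merged = ",".join(tweet_list)
naive_split = merged.split(",")
print(naive_split)       # three pieces instead of two tweets

# Indexing or slicing the original list keeps every tweet intact.
first_tweet = tweet_list[0]
print(first_tweet)
```

So as long as the list structure from the JSON file is preserved, commas inside a tweet never need special handling.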