I have a CSV file with multiple columns, one of which contains strings.

I start by reading the CSV file and keeping just two columns:

df = pd.read_csv("MyDATA_otherstring.csv", usecols=["describe_file", "data_numbers"])

This is the output

                    describe_file  data_numbers
0  This is the start of the story        7309.0
1  This is the start of the story          35.0
2  This is the start of the story         302.0
3                  Difficult part        7508.5
4                  Difficult part         363.0

In around 10k rows, there are around 150 unique strings. These strings appear multiple times within the file.

My goal: filter by the first string, for example 'This is the start of the story', and replace it with a random string.

I want to run over all the strings in that column and replace each distinct string with its own unique string.

I have looked into the random library and some questions that have been asked here, but unfortunately I have not found anything that helps.

  • Please be more specific about what research you have done, and what you’ve tried. You could at the very least provide the data in a more convenient or practical format. Commented Mar 14, 2020 at 2:23

2 Answers

1

This is your example:

import pandas as pd
import numpy as np
from string import ascii_lowercase

df = pd.DataFrame([['This is the start of the story']*3 + ['Difficult part']*2,
    np.random.rand(5)], index=['describe_file', 'data_numbers']).T

                    describe_file data_numbers
0  This is the start of the story     0.825913
1  This is the start of the story     0.704422
2  This is the start of the story      0.91563
3                  Difficult part     0.192693
4                  Difficult part     0.795088

This is how you can do it:

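# Draw one random 10-letter string per distinct value (one per group),
# then join it back onto every row that holds that value.
df.describe_file = df.join(df.groupby('describe_file')['describe_file'].apply(lambda x:
    ''.join(np.random.choice(list(ascii_lowercase), 10))),
    on='describe_file', rsuffix='_NEW')['describe_file_NEW']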

The result:

  describe_file data_numbers
0    skgfdrsktw     0.204907
1    skgfdrsktw     0.399947
2    skgfdrsktw     0.990196
3    rziuoslpqn     0.930852
4    rziuoslpqn     0.210122
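
An alternative that may be easier to read is to build a dictionary from each distinct string to a random label and apply it with Series.map. The sketch below is only one possible way to do it, assuming the column names from the question. The names random_label, mapping and used, and the fixed seed, are illustrative choices and not part of the answer above. It retries on a collision so that no two original strings end up sharing a label.

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase

# Seeded generator: the seed is an assumption, used only for reproducibility.
rng = np.random.default_rng(0)

# Small stand-in for the CSV data shown in the question.
df = pd.DataFrame({
    "describe_file": ["This is the start of the story"] * 3 + ["Difficult part"] * 2,
    "data_numbers": [7309.0, 35.0, 302.0, 7508.5, 363.0],
})


def random_label(length=10):
    """Return a random lowercase string (hypothetical helper)."""
    return "".join(rng.choice(list(ascii_lowercase), size=length))


# One replacement per distinct original string; redraw in the unlikely
# event that a label has already been handed out.
mapping = {}
used = set()
for original in df["describe_file"].unique():
    label = random_label()
    while label in used:
        label = random_label()
    used.add(label)
    mapping[original] = label

# Every row with the same original string receives the same label.
df["describe_file"] = df["describe_file"].map(mapping)
print(df)
```

Keeping the mapping dictionary around also makes it possible to translate the random labels back to the original strings later.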

7 Comments

Thank you for your answer; however, I was trying to find something that would do this for all the strings. If I have to copy and paste every string into the code, I won't save any more time than just using the find-and-replace option in Excel.
Like this (see answer)? I.e., make random strings for an entire column?
And the string "this is the start of the story" should be replaced by one string, not a different one every time.
Example: replace all the "this is the start of the story" with "kkbim", and replace all the "difficult part" with some other string.
Got it. Hope this is what you're expecting (see edit).
0

The previous answer by @Nicolas Gervais is fine, but after reading the question several times, I interpret it as asking to replace 'This is the start of the story' with a random string while leaving the rest ('Difficult part') as it is. The following command, which uses .replace(), does that.

df['describe_file'].apply(lambda x: x.replace('This is the start of the story', ''.join(np.random.choice(list(ascii_lowercase), 10))))

0        glhrtqwlnl
1        qxrklnxhoj
2        kszgtysptj
3    Difficult part
4    Difficult part
Name: describe_file, dtype: object
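
Note that the lambda above is called once per row, so every matching row gets a different random string, as the output shows. If a single replacement string is wanted for all matching rows, one option is to draw it once and pass it to Series.replace. This is a minimal sketch, assuming the same column names as in the question; new_label is an illustrative name.

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase

# Small stand-in for the CSV data shown in the question.
df = pd.DataFrame({
    "describe_file": ["This is the start of the story"] * 3 + ["Difficult part"] * 2,
    "data_numbers": [7309.0, 35.0, 302.0, 7508.5, 363.0],
})

# Draw the replacement once, outside any per-row function.
new_label = "".join(np.random.choice(list(ascii_lowercase), 10))

# Series.replace swaps cells that equal the target string exactly;
# all other values are left untouched.
df["describe_file"] = df["describe_file"].replace(
    "This is the start of the story", new_label
)
print(df)
```

Unlike str.replace inside a lambda, this matches whole cell values rather than substrings, which seems closer to what the comments on the question describe.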
