I have a DataFrame in which each cell of the txt column contains a list of strings. I want to clean the txt column using the function clean_text().

import re
import pandas

data = {'value': ['abc.txt', 'cda.txt'],
        'txt': [['2019/01/31-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart'],
                ['2019/02/01-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart']]}
df = pandas.DataFrame(data=data)

df
 value    txt
 abc.txt  ['2019/01/31-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart']
 cda.txt  ['2019/02/01-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart']

def clean_text(text):
    """
    :param text:  it is the plain text
    :return: cleaned text
    """
    patterns = [r"^.{53}",
                r"[A-Za-z]+[\d]+[\w]*|[\d]+[A-Za-z]+[\w]*",
                r"[-=/':,?${}\[\]-_()>.~" ";+]"]

    for p in patterns:
        text = re.sub(p, '', text)

    return text

My Solution:

df['txt'] = df['txt'].apply(lambda x: clean_text(x))

But I am getting the error below:

df['txt'] = df['txt'].apply(lambda x: clean_text(x))
AttributeError: 'list' object has no attribute 'apply'

Calling the function on a single cell also fails:

clean_text(df['txt'][1])
TypeError: expected string or bytes-like object

I am not sure how to use numpy.where in this problem.

  • It's different. How can I use np.where in my case? Commented Feb 10, 2019 at 20:19
  • I take it your data-set in this example is incomplete? When running your code with the provided value for data, this runs fine for me and does not produce an attribute error. Commented Feb 10, 2019 at 20:25
  • @SpencerD, I have updated question, basically txt column contains a list of string. Commented Feb 10, 2019 at 20:28
  • Ah, that makes a bit more sense, although the code above is obviously malformed because of where you pasted the output of df. Anyway, I'm not sure what your end goal is for the data, but this does seem to run and does perform replacements: df['txt'].apply(lambda x: [clean_text(z) for z in x]) Commented Feb 10, 2019 at 20:40

1 Answer

Based on the revision to your question, and discussion in the comments, I believe you need to use the following line:

df['txt'] = df['txt'].apply(lambda x: [clean_text(z) for z in x])

In this approach, apply calls the lambda once for each element of the txt Series, and the list comprehension inside the lambda runs clean_text on each string in that element's list. This avoids the TypeError, because re.sub only ever receives a string, never the list itself.
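
The same idea can also be written with a named helper instead of a lambda, which makes it easier to tolerate a cell that holds a plain string rather than a list. The following is only a sketch under assumptions: clean_cell is a hypothetical name that does not appear in the question, and the third pattern is written as one literal that should be equivalent to the two adjacent string literals in the question's clean_text.

```python
import re

import pandas as pd

PATTERNS = [
    r"^.{53}",  # drop the first 53 characters (the fixed-width prefix in the sample rows)
    r"[A-Za-z]+[\d]+[\w]*|[\d]+[A-Za-z]+[\w]*",  # drop tokens where letters meet digits
    r"[-=/':,?${}\[\]-_()>.~;+]",  # drop the listed punctuation characters
]


def clean_text(text):
    """Apply every pattern in turn to a single string."""
    for p in PATTERNS:
        text = re.sub(p, "", text)
    return text


def clean_cell(cell):
    """Hypothetical helper: clean one cell of the txt column.

    A plain string is cleaned directly; anything else is treated as an
    iterable of strings and cleaned item by item.
    """
    if isinstance(cell, str):
        return clean_text(cell)
    return [clean_text(item) for item in cell]


df = pd.DataFrame({
    "value": ["abc.txt", "cda.txt"],
    "txt": [
        ["2019/01/31-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart"],
        ["2019/02/01-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart"],
    ],
})

df["txt"] = df["txt"].apply(clean_cell)
print(df["txt"].tolist())
```

Passing the function itself to apply (rather than wrapping it in a lambda) behaves the same way here; the named helper mainly gives a place to put the type check.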

I have tested the df['txt'].apply(...) line with the following value for data:

data = {
    'value': [
        'abc.txt',
        'cda.txt',
    ],
    'txt':[
        [
            '2019/01/31-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart',
        ],
        [
            '2019/02/01-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart',
        ],
    ]
}

Here is a snippet of console output showing the dataframe before and after transformation:

>>> df

     value                                                txt
0  abc.txt  [2019/01/31-11:56:23.288258 1886     7F0ED4CDC...
1  cda.txt  [2019/02/01-11:56:23.288258 1886     7F0ED4CDC...

>>> df['txt'] = df['txt'].apply(lambda x: [clean_text(z) for z in x])

>>> df

     value                         txt
0  abc.txt  [asfasnfs remove datepart]
1  cda.txt  [asfasnfs remove datepart]
>>> 
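
One related case may be worth checking. If the txt cells are actually strings that merely look like lists (for example after reading the frame back from a CSV file, which is an assumption, since the question does not say where the data comes from), the list comprehension would iterate over single characters instead of whole log lines. A minimal sketch of converting such cells first, using ast.literal_eval from the standard library and a hypothetical helper named to_list:

```python
import ast

import pandas as pd

# Assumed input: each txt cell is a *string* containing a Python list literal.
df = pd.DataFrame({
    "value": ["abc.txt"],
    "txt": ["['2019/01/31-11:56:23.288258 1886     7F0ED4CDC704     asfasnfs: remove datepart']"],
})


def to_list(cell):
    """Hypothetical helper: parse list-like strings, leave real lists untouched.

    ast.literal_eval raises an exception for strings that are not valid
    Python literals, so this is only suitable when every string cell is
    known to hold a list literal.
    """
    return ast.literal_eval(cell) if isinstance(cell, str) else cell


df["txt"] = df["txt"].apply(to_list)
print(type(df["txt"][0]))
```

After this conversion, each cell is a real list and the apply call with the list comprehension can be used as before.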

2 Comments

@user15051990 I will never understand why people accept answers but don't also up vote them. It's of course totally optional, but it doesn't cost anything. (currently the only up vote is mine)
@uhoh, lol I cannot pretend to understand, but oh well. Glad to be of help to someone! 😁
