
I have a dataframe called csv_table that looks like this:

      class                      ID                                               text
0         2  BIeDBg4MrEd1NwWRlFHLQQ  Decent but terribly inconsistent food. I've ha...
1         4  NJHPiW30SKhItD5E2jqpHw  Looks aren't everything.......  This little di...
2         2  nnS89FMpIHz7NPjkvYHmug  Being a creature of habit anytime I want good ...
3         2   FYxSugh9PGrX1PR0BHBIw  I recently told a friend that I cant figure ou...
4         4  ScViKtQ2xq6i5AyN4curYQ  Chevy's five years ago was crisp and fresh and...
5         2  vz8Q37FSlypZlgy5N7Ym0A  Every time I go to this Jack In The Box I get ...
6         4   OJuG2EvItSZXbu8KowI9A  I've been going to Cluckers for years. Every t...
7         4   k9ci6SfI5RZT3smNdnvSg  .                                             ...
8         4  qq6bQbrBZyd4lOBd8KSCoA  Well, after their remodel the place no longer ...
9         4     FldFfwfuk9T8kvkp8iw  Beer selection was good, but they were out of ...
10        4  63ufCUqbPcnl6abC1SBpvQ  Ihop is my favorite breakfast chain, and the s...
11        4   nDYCZDIAvdcx77EcmYz0Q  A very good Jewish deli tucked in and amongst ...
12        4  uoC1llZumwFKgXAMlDbZIg  Went here for lunch with Rand H. and this plac...
13        2   BBs1rbz75dDifvoQyVMDg  Picture the least attractive person you'd sett...
14        4    2t9znjapzhioLqb4Pf1Q  Really really really strong Margaritas!   The ...
15        4  GqLgixGcbWh51IzkwsiswA  I would not have known about this place had it...

[1999 rows x 3 columns]

I am trying to add 2 columns to csv_table: one with the number of words in the text column (where a "word" is anything produced by splitting on whitespace), and one with the number of "clean" words as defined by a custom function.

I can count the total clean and dirty words across the whole column, but how can I apply these functions to each row of the dataframe and append the results as new columns?

Code is below:

import nltk, re, pandas as pd
from nltk.corpus import stopwords
import sklearn, string
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from itertools import islice

# This function removes numbers from an array
def remove_nums(arr): 
    # Declare a regular expression
    pattern = '[0-9]'  
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]    
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):   
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Lower-case the text and split on whitespace
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]

    # Return the tokens
    return tokens 

def main():

    tsv_file = "filepath"
    csv_table=pd.read_csv(tsv_file, sep='\t')
    csv_table.columns = ['class', 'ID', 'text']

    print(csv_table)

    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    clean_words = get_words(new)
    dirty_words = [word for word in new if word.split()]
    clean_length = len(clean_words)
    dirty_length = len(dirty_words)
    print("Clean Length: ", clean_length)
    print("Dirty Length: ", dirty_length)


main()

Which currently produces:

Clean Length:  125823
Dirty Length:  1091370
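(Note that iterating over a string yields single characters, so `dirty_words = [word for word in new if word.split()]` is really counting non-space characters, not words, which is why the dirty count is so large. A minimal illustration, using a made-up snippet of review text:)

```python
text = "Decent but terribly inconsistent food"

# Iterating over a string yields single characters, so this list
# comprehension keeps every non-whitespace character, not words:
chars = [c for c in text if c.split()]

# Splitting on whitespace gives the intended word list:
words = text.split()

print(len(chars))  # 33 non-space characters
print(len(words))  # 5 words
```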

I did try csv_table['clean'] = csv_table['text'].map(get_words(csv_table['text'])) which yielded:

AttributeError: 'Series' object has no attribute 'lower'

How can I apply the dirty / clean logic to each row and append those two columns to the dataframe?

  • Generally if you have a function myFunc that takes one cell's data and return the result you need, you can do df['new_col'] = df['text'].map(myFunc) and it will give you in new_col the result of the function for each row. Commented Oct 22, 2019 at 6:05
  • How can I do that and then add the result back to this dataframe? Is there an example you can post? @Aryerez Commented Oct 22, 2019 at 6:07
  • When you'll do that, you will have a new column named 'new_col' in your dataframe. If you want to replace the existing 'text' column, do df['text'] = df['text'].map(myFunc) Commented Oct 22, 2019 at 6:10
  • I don't want to replace the existing column. I'm also confused as to what I'm supposed to pass to get_words, as passing csv_table['text'] passes the entire column...not just the row Commented Oct 22, 2019 at 6:11
  • In fact, csv_table['clean'] = csv_table['text'].map(get_words(csv_table['text'])) is an error Commented Oct 22, 2019 at 6:12
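(The pattern the comments describe can be sketched on a toy column — hypothetical data, not the real dataset. The key point is to pass the function itself to map, not the result of calling it on the whole Series, which is what caused the AttributeError above:)

```python
import pandas as pd

df = pd.DataFrame({'text': ['hello world', 'one two three']})

# map calls the function once per cell, so the function receives a
# single string; passing get_words(df['text']) instead would call it
# on the whole Series up front and fail.
df['n_words'] = df['text'].map(lambda s: len(s.split()))

print(df['n_words'].tolist())  # [2, 3]
```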

1 Answer


Use apply to apply a function to each row. For the dirty word count, split the strings with pandas and then apply len to get a per-row count. For the clean word count, apply the custom function directly:

csv_table['dirty'] = csv_table['text'].str.split().apply(len)
csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))
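For example, on a small stand-in frame (hypothetical data, with a simplified get_words in place of the NLTK-based one above — the stop-word set here is just an illustrative assumption):

```python
import pandas as pd
import string

# Simplified stand-in for get_words: lower-case, strip punctuation,
# drop a few stop words and single-character tokens
STOP = {'the', 'a', 'i', 'to'}

def get_words(para):
    lower = para.lower().split()
    no_punc = (w.translate(str.maketrans('', '', string.punctuation)) for w in lower)
    return [w for w in no_punc if w.strip() and len(w) > 1 and w not in STOP]

csv_table = pd.DataFrame({'text': ["I went to the diner.", "Good food!"]})

# One count per row: raw whitespace tokens vs. cleaned tokens
csv_table['dirty'] = csv_table['text'].str.split().apply(len)
csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))

print(csv_table[['dirty', 'clean']].values.tolist())  # [[5, 2], [2, 2]]
```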

2 Comments

You have to .apply(len) to the second row as well.
@jorijnsmit: You are correct, I erroneously assumed that was what get_words returned.
