I have a dataframe called csv_table that looks like this:
class ID text
0 2 BIeDBg4MrEd1NwWRlFHLQQ Decent but terribly inconsistent food. I've ha...
1 4 NJHPiW30SKhItD5E2jqpHw Looks aren't everything....... This little di...
2 2 nnS89FMpIHz7NPjkvYHmug Being a creature of habit anytime I want good ...
3 2 FYxSugh9PGrX1PR0BHBIw I recently told a friend that I cant figure ou...
4 4 ScViKtQ2xq6i5AyN4curYQ Chevy's five years ago was crisp and fresh and...
5 2 vz8Q37FSlypZlgy5N7Ym0A Every time I go to this Jack In The Box I get ...
6 4 OJuG2EvItSZXbu8KowI9A I've been going to Cluckers for years. Every t...
7 4 k9ci6SfI5RZT3smNdnvSg . ...
8 4 qq6bQbrBZyd4lOBd8KSCoA Well, after their remodel the place no longer ...
9 4 FldFfwfuk9T8kvkp8iw Beer selection was good, but they were out of ...
10 4 63ufCUqbPcnl6abC1SBpvQ Ihop is my favorite breakfast chain, and the s...
11 4 nDYCZDIAvdcx77EcmYz0Q A very good Jewish deli tucked in and amongst ...
12 4 uoC1llZumwFKgXAMlDbZIg Went here for lunch with Rand H. and this plac...
13 2 BBs1rbz75dDifvoQyVMDg Picture the least attractive person you'd sett...
14 4 2t9znjapzhioLqb4Pf1Q Really really really strong Margaritas! The ...
15 4 GqLgixGcbWh51IzkwsiswA I would not have known about this place had it...
[1999 rows x 3 columns]
I am trying to add two columns to csv_table: one with the number of words in the text column (where a "word" is anything produced by splitting on spaces), and one with the number of "clean" words as defined by a custom function.
I can already count the total clean and dirty words across the whole column, but how can I apply these functions to each row of the dataframe and append the results as two new columns?
Code is below:
import nltk, re, pandas as pd
from nltk.corpus import stopwords
import sklearn, string
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from itertools import islice
# This function removes numbers from an array
def remove_nums(arr):
    # Declare a regular expression
    pattern = '[0-9]'
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]
    # Return the array with numbers removed
    return arr

# This function cleans the passed-in paragraph and parses it
def get_words(para):
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Lower-case the paragraph and split it on spaces
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]
    # Return the tokens
    return tokens

def main():
    tsv_file = "filepath"
    csv_table = pd.read_csv(tsv_file, sep='\t')
    csv_table.columns = ['class', 'ID', 'text']
    print(csv_table)
    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    clean_words = get_words(new)
    dirty_words = [word for word in new if word.split()]
    clean_length = len(clean_words)
    dirty_length = len(dirty_words)
    print("Clean Length: ", clean_length)
    print("Dirty Length: ", dirty_length)

main()
Which currently produces:
Clean Length: 125823
Dirty Length: 1091370
I did try csv_table['clean'] = csv_table['text'].map(get_words(csv_table['text'])) which yielded:
AttributeError: 'Series' object has no attribute 'lower'
How can I apply the dirty / clean logic to each row and append those two columns to the dataframe?
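For reference, here is a minimal, runnable sketch of the per-row approach: pass the function object itself to Series.map so pandas calls it once per cell. A hardcoded stop-word list stands in for nltk's (so the snippet runs without downloads), and the two-row DataFrame is made up for illustration; substitute the real get_words and csv_table.

```python
import string
import pandas as pd

# Hardcoded stop words stand in for nltk's list so this sketch runs
# without downloads; swap in the question's real get_words instead.
STOP_WORDS = {"a", "and", "but", "i", "of", "the", "to"}

def get_words(para):
    # Same shape as the question's function: lower-case, strip
    # punctuation, drop stop words and one-character tokens
    lower = para.lower().split()
    no_punct = (w.translate(str.maketrans('', '', string.punctuation)) for w in lower)
    return [w for w in no_punct if w.strip() and len(w) > 1 and w not in STOP_WORDS]

# Made-up stand-in for csv_table
csv_table = pd.DataFrame({"text": ["Decent but terribly inconsistent food",
                                   "I went to the 2nd best deli"]})

# Total words: split each cell on whitespace and count per row
csv_table["word_count"] = csv_table["text"].str.split().str.len()

# Clean words: map calls the function once per cell (a single string),
# so get_words sees a str and .lower() works as intended
csv_table["clean_count"] = csv_table["text"].map(lambda t: len(get_words(t)))

print(csv_table)
```

The lambda wraps get_words only because we want the count rather than the token list itself; `csv_table["text"].map(get_words)` alone would store each row's list of clean tokens.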
Comments:
If you have a function myFunc that takes one cell's data and returns the result you need, you can do df['new_col'] = df['text'].map(myFunc) and it will give you, in new_col, the result of the function for each row. Likewise, df['text'] = df['text'].map(myFunc) overwrites the column in place.
Don't call get_words yourself: passing csv_table['text'] passes the entire column, not just the row, which is why csv_table['clean'] = csv_table['text'].map(get_words(csv_table['text'])) is an error.
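To make the comments' point concrete, a tiny sketch (with a made-up DataFrame) showing that you hand map the function object, not the result of calling it on the whole column:

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello world", "one two three"]})

# Pass the function itself; pandas calls it once per cell
df["tokens"] = df["text"].map(str.split)   # each row gets its own token list
df["n_words"] = df["tokens"].str.len()     # per-row count

# By contrast, df["text"].map(str.split(df["text"])) would fail for the
# same reason as the question's attempt: it calls the function on the
# whole Series before map ever runs.
```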