
I have a very large text file on which I want to execute multiple regex-based string replacements. Currently I am doing this with Sublime Text's find-and-replace feature. However, on files larger than a GB my system hangs.

These are some of the matches I currently run in Sublime:

\\\n - Remove every backslash followed by a newline.

\n - Remove all newlines.

\=\\\" - Replace all instances of =\" with just ="

In one case, I also want to group the match and use it in the replaced text.
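By "group the match" I mean something like the following (a made-up pattern, not my actual data), where part of the match is captured and reused in the replacement via a backreference:

```python
import re

# reorder a date by capturing its parts and reusing them as \1, \2, \3
s = "created on 01/23/2018"
out = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', s)
print(out)  # created on 2018-01-23
```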

Some experts around me suggested writing a quick Python script for this, saying performance won't be an issue.

With my limited Python knowledge, I tried something like the following:

import pandas as pd
import numpy as np

df = pd.read_csv('story_all.csv')

output = df.str.replace('\n', '')

output.to_csv('story_done.csv', sep='\n', encoding='utf-8')

It isn't working, however, and I suspect I may be overcomplicating things.


Note: the fact that the text file is a CSV doesn't really matter; I just need to execute some string replacements, as long as the newlines that delimit CSV rows are preserved.


The error I am getting is as follows:

Traceback (most recent call last):
  File "replace.py", line 4, in <module>
    df = pd.read_csv('story_all.csv')
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 8058, saw 65

Example of the CSV file's content:

id,title,name_in_english,type,water_directory_term,org_work_area_term,org_type_term,defined_state,org_location_taluka_term,org_location_state_term,org_location_village_term,org_name_term,ha_free_term,org_location_dist_term,fax,samprak_bekti,email,phoneno,website/blog,postal_address,sangathan_ke_bare_main,rajya_state,taluka_sahar,jilla_district,kisi_prakar_kaa_sangathan,name,ID,created,status
"883","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"884","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"885","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"886","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
  • This might not be a regex issue. Rather the field count for each csv entry is clearly wrong. Please provide some input and expected output strings. Additionally, sep='\n' seems odd. Commented Jan 23, 2018 at 19:39
  • Do any of the strings you want to replace span more than one line? Commented Jan 23, 2018 at 19:43
  • Added sample data. I have removed the body column, which is usually very large UTF-8 text (non-English). @wwii No. It's mostly removing some special characters, newlines, etc. Commented Jan 23, 2018 at 19:51

2 Answers


I was finally able to do the required task without the help of pandas. While the approach reads the whole file into memory, it works fairly well for files up to 1-1.5 GB on my MacBook Pro, so it serves my purpose. I found the base code for this here.

# import the modules we need (re is for regex)
import os, re

# set the working directory as a shortcut
os.chdir('/Users/username/Code/python/regex')

# open the source file and read it all into memory
with open('story_all.csv', 'r') as fh:
    thetext = fh.read()

# compile the pattern objects. Note the "r" prefix: it marks the string
# as raw, so we don't have to escape our escape characters.

# match every newline followed by a backslash
p1 = re.compile(r'\n\\')
# match every newline except one followed by a quoted number (a new record)
p2 = re.compile(r'\n+(?!"\d+")')
# match literal \N markers
p3 = re.compile(r'\\N')
# match =\" sequences
p4 = re.compile(r'=\\"')

# do the replacements
result = p1.sub('', thetext)
result = p2.sub('', result)
result = p3.sub('', result)
result = p4.sub('="', result)

# write the result
with open('done.csv', 'w') as f_out:
    f_out.write(result)

It takes around 30-40 seconds when run against files close to 1 GB.
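For files that won't fit comfortably in memory, the same idea can be sketched line by line. This rests on an assumption I'm reading into the sample data (every new record starts with a quoted numeric id like "883", so any other line is a continuation); it is a sketch, not something I've run against the original file:

```python
import re

# Assumption (from the sample rows): every new record begins with a quoted
# numeric id like "883", so any line that doesn't is a continuation line.
RECORD_START = re.compile(r'^"\d+",')

def clean(record):
    record = record.replace('\\N', '')   # drop literal \N markers
    return record.replace('=\\"', '="')  # turn =\" into ="

def merge_records(lines):
    """Join continuation lines, then yield one cleaned line per record."""
    buf = ''
    for line in lines:
        line = line.rstrip('\n')
        if RECORD_START.match(line) and buf:
            yield clean(buf)
            buf = line
        else:
            # continuation: drop the backslash left over from \<newline>
            buf += line.lstrip('\\')
    if buf:
        yield clean(buf)

# tiny demo on in-memory lines (made-up records)
demo = ['"883","line one', '\\line two","1"', '"884","whole","2"']
print(list(merge_records(demo)))
```

Against the real file you would feed it the open file handle instead of a list, writing each yielded record plus a newline to the output file, so only one record is ever held in memory at a time.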




If I understand correctly, you could do as below. This seems to work with the data sample you shared:

import pandas as pd

df = pd.read_csv('story_all.csv', sep=',')

# Chars to replace
chars = [
    '\n',
]

output = df.replace(chars, '', regex=True)
output.to_csv('story_done.csv', sep=',', encoding='utf-8', index=False)
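As a quick sanity check on a toy frame (made-up values, not the real columns), df.replace with regex=True strips embedded newlines like this:

```python
import pandas as pd

# toy frame standing in for the real CSV (hypothetical values)
df = pd.DataFrame({'title': ['some\ntitle', 'other'],
                   'name': ['multi\nline', 'plain']})

out = df.replace(['\n'], '', regex=True)
print(out['title'].tolist())  # ['sometitle', 'other']
```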


Will regex matching work with this? For some reason, I am not able to get a match for some examples I tried.
Sorry, you can try output = df.replace(chars, "", regex=True)
