
I have a very large text file on which I want to execute multiple regex-based string replacements. Currently I am doing this with Sublime Text's find-and-replace feature. However, on files larger than a GB my system hangs.

These are some of the matches I currently run in Sublime:

\\\n - Remove every backslash followed by a newline.

\n - Remove all newlines.

\=\\\" - Replace all instances of =\" with just ="

In one case, I also want to group the match and use it in the replaced text.
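By "group the match" I mean something like the following (a made-up pattern, not my actual data), where part of the match is captured and reused in the replacement via a backreference:

```python
import re

# reorder a date by capturing its parts and reusing them as \1, \2, \3
s = "created on 01/23/2018"
out = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', s)
print(out)  # created on 2018-01-23
```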

Some experts around me suggested writing a quick Python script for this, saying performance won't be an issue.

With my limited Python knowledge, I tried something like the following:

import pandas as pd
import numpy as np

df = pd.read_csv('story_all.csv')

output = df.str.replace('\n', '')

output.to_csv('story_done.csv', sep='\n', encoding='utf-8')

It isn't working, however, and I suspect I may be overcomplicating things.


Note: the fact that the text file is a CSV doesn't really matter; I just need to execute some string replacements, as long as the newlines that delimit CSV rows are preserved.


The error I am getting is as follows:

Traceback (most recent call last):
  File "replace.py", line 4, in <module>
    df = pd.read_csv('story_all.csv')
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 8058, saw 65

Example of the CSV file's content:

id,title,name_in_english,type,water_directory_term,org_work_area_term,org_type_term,defined_state,org_location_taluka_term,org_location_state_term,org_location_village_term,org_name_term,ha_free_term,org_location_dist_term,fax,samprak_bekti,email,phoneno,website/blog,postal_address,sangathan_ke_bare_main,rajya_state,taluka_sahar,jilla_district,kisi_prakar_kaa_sangathan,name,ID,created,status
"883","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"884","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"885","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"886","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
  • This might not be a regex issue. Rather the field count for each csv entry is clearly wrong. Please provide some input and expected output strings. Additionally, sep='\n' seems odd. Commented Jan 23, 2018 at 19:39
  • Do any of the strings you want to replace span more than one line? Commented Jan 23, 2018 at 19:43
  • Added sample data. I have removed the body column, which is usually very large UTF-8 text (non-English). @wwii No. It's mostly removing some special characters, newlines, etc. Commented Jan 23, 2018 at 19:51

2 Answers


I was finally able to do the required task without the help of pandas. While the approach reads the whole file into memory, it works fairly well for files up to 1-1.5 GB on my MacBook Pro, so it serves my purpose. I found the base code for this here.

# import the modules we need (re is for regex)
import os, re

# set the working directory as a shortcut
os.chdir('/Users/username/Code/python/regex')

# open the source file and read it all into memory
with open('story_all.csv', 'r') as fh:
    thetext = fh.read()

# compile the pattern objects. Note the "r" prefix: it marks the string
# as raw, so we don't have to escape our escape characters.

# match every newline followed by a backslash
p1 = re.compile(r'\n\\')
# match every newline except one followed by a quoted number (a new record)
p2 = re.compile(r'\n+(?!"\d+")')
# match literal \N markers
p3 = re.compile(r'\\N')
# match =\" sequences
p4 = re.compile(r'=\\"')

# do the replacements
result = p1.sub('', thetext)
result = p2.sub('', result)
result = p3.sub('', result)
result = p4.sub('="', result)

# write the result
with open('done.csv', 'w') as f_out:
    f_out.write(result)

It takes around 30-40 seconds when run against files close to 1 GB.
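For files that won't fit comfortably in memory, the same idea can be sketched line by line. This rests on an assumption I'm reading into the sample data (every new record starts with a quoted numeric id like "883", so any other line is a continuation); it is a sketch, not something I've run against the original file:

```python
import re

# Assumption (from the sample rows): every new record begins with a quoted
# numeric id like "883", so any line that doesn't is a continuation line.
RECORD_START = re.compile(r'^"\d+",')

def clean(record):
    record = record.replace('\\N', '')   # drop literal \N markers
    return record.replace('=\\"', '="')  # turn =\" into ="

def merge_records(lines):
    """Join continuation lines, then yield one cleaned line per record."""
    buf = ''
    for line in lines:
        line = line.rstrip('\n')
        if RECORD_START.match(line) and buf:
            yield clean(buf)
            buf = line
        else:
            # continuation: drop the backslash left over from \<newline>
            buf += line.lstrip('\\')
    if buf:
        yield clean(buf)

# tiny demo on in-memory lines (made-up records)
demo = ['"883","line one', '\\line two","1"', '"884","whole","2"']
print(list(merge_records(demo)))
```

Against the real file you would feed it the open file handle instead of a list, writing each yielded record plus a newline to the output file, so only one record is ever held in memory at a time.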




If I understand correctly, you could do as below. This seems to work with the data sample you shared:

import pandas as pd

df = pd.read_csv('story_all.csv', sep=',')

# Chars to replace
chars = [
    '\n',
]

output = df.replace(chars, '', regex=True)
output.to_csv('story_done.csv', sep=',', encoding='utf-8', index=False)
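As a quick sanity check on a toy frame (made-up values, not the real columns), df.replace with regex=True strips embedded newlines like this:

```python
import pandas as pd

# toy frame standing in for the real CSV (hypothetical values)
df = pd.DataFrame({'title': ['some\ntitle', 'other'],
                   'name': ['multi\nline', 'plain']})

out = df.replace(['\n'], '', regex=True)
print(out['title'].tolist())  # ['sometitle', 'other']
```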


Will regex matching work with this? For some reason, I am not able to get a match for some examples I tried.
Sorry, you can try output = df.replace(chars, "", regex=True)
