I have a very large text file on which I want to run multiple regex-based string replacements. Currently I am doing this with Sublime's find-and-replace feature, but on files larger than a GB my system hangs.
These are some of the replacements I currently run in Sublime:
\\\n - Remove every backslash that is followed by a newline.
\n - Remove all newlines.
\=\\\" - Replace every instance of =\" with just ="
In one case, I also want to capture a group in the match and reuse it in the replacement text.
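For context, I believe these rules map to the following calls with Python's re module (the sample text and the grouped id pattern are made-up illustrations, not my real data):

```python
import re

text = 'a=\\"b\\\nc\nd'  # tiny sample: contains =\" , a backslash-newline, and a bare newline

text = re.sub(r'\\\n', '', text)   # remove backslash followed by newline
text = re.sub(r'\n', '', text)     # remove remaining newlines
text = re.sub(r'=\\"', '="', text) # replace =\" with ="

# Grouped replacement: \1 reuses the captured digits (hypothetical pattern)
text = re.sub(r'id=(\d+)', r'id-\1', text)
```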
Some experts around me suggested writing a quick Python script for this, saying performance shouldn't be an issue.
With my limited Python knowledge, I tried the following:
import pandas as pd
import numpy as np
df = pd.read_csv('story_all.csv')
output = df.str.replace('\n', '')
output.to_csv('story_done.csv', sep='\n', encoding='utf-8')
However, it isn't working, and I suspect I'm overcomplicating things.
Note: The fact that the text file is a CSV doesn't really matter; I just need to run some string replacements. The newlines required by the CSV row structure must be preserved while doing so.
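To show what I'm after, here is a minimal line-by-line sketch of the kind of script I have in mind (file names and sample rows are made up, and it assumes the only newlines to remove are the backslash-escaped ones, so the row-ending newlines survive):

```python
import re

# Create a tiny sample input, standing in for the real multi-GB file.
with open('story_in.csv', 'w') as f:
    f.write('"883","line one\\\n continued","x=\\"y\\""\n')
    f.write('"884","plain row","z"\n')

# Stream the file line by line so memory use stays constant.
with open('story_in.csv') as src, open('story_out.csv', 'w') as dst:
    for line in src:
        line = re.sub(r'=\\"', '="', line)  # turn =\" into ="
        if line.endswith('\\\n'):
            line = line[:-2]  # drop backslash + newline: the record continues
        dst.write(line)

with open('story_out.csv') as f:
    result = f.read()
```

Because the file is iterated rather than loaded whole, this should stay well under the memory pressure that made Sublime hang.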
The error I am getting is below:
Traceback (most recent call last):
  File "replace.py", line 4, in <module>
    df = pd.read_csv('story_all.csv')
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 8058, saw 65
An example of the CSV file's contents:
id,title,name_in_english,type,water_directory_term,org_work_area_term,org_type_term,defined_state,org_location_taluka_term,org_location_state_term,org_location_village_term,org_name_term,ha_free_term,org_location_dist_term,fax,samprak_bekti,email,phoneno,website/blog,postal_address,sangathan_ke_bare_main,rajya_state,taluka_sahar,jilla_district,kisi_prakar_kaa_sangathan,name,ID,created,status
"883","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"884","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"885","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"886","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"