Memory Limit Using Regex on massive text file

Question

I have a text file of the following form:

('1', '2')
('3', '4')
     .
     .
     .

and I'm trying to get it to look like this:

1 2
3 4
etc...

I've been trying to do this using the re module in Python, by chaining together re.sub calls like so:

for line in file:
    s = re.sub(r"\(", "", line)
    s1 = re.sub(r",", "", s)
    s2 = re.sub(r"'", "", s1)
    s3 = re.sub(r"\)", "", s2)
    output.write(s3)
output.close()

It seems to work great until I get near the end of my output file; then it becomes inconsistent and stops working. I am thinking it is because of the sheer SIZE of the file I am working with; 300MB or approximately 12 million lines.

Can anyone help me confirm that I'm simply running out of memory? Or if it is something else? Suitable alternatives, or ways around this?

It looks like your file is full of representations of two-tuples of strings representing integers - why?! You could ast.literal_eval each line and use csv to write it back out.
– jonrsharpe, Commented Sep 22, 2015 at 15:45
It's processing the file line by line, so I don't see how the size of the file should be causing a problem. Are you sure there isn't something else in your code creating an issue?
– lurker, Commented Sep 22, 2015 at 15:46
You can use a single regex: output.write(re.sub(r"\(\s*'(\d+)',\s*'(\d+)'\s*\)", r"\1 \2", line)). But as I say, that's not your problem. You might need to show more of your code to get an answer to that particular issue.
– lurker, Commented Sep 22, 2015 at 16:08
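
The single-regex suggestion in the comment above can be sketched as a small streaming function. The name convert_lines and the in-memory sample data are assumptions made for illustration and do not come from the thread. Because only one line is held at a time, memory use should not grow with file size, which is consistent with lurker's point that the 300MB size is unlikely to be the cause.

```python
import io
import re

# Pattern from lurker's comment: a parenthesised pair of quoted integers.
PAIR = re.compile(r"\(\s*'(\d+)',\s*'(\d+)'\s*\)")

def convert_lines(src, dst):
    """Rewrite each "('1', '2')" line from src as "1 2" in dst, one line at a time."""
    for line in src:
        # Lines that do not match the pattern pass through unchanged.
        dst.write(PAIR.sub(r"\1 \2", line))

# Small in-memory stand-in for the real file (hypothetical sample data).
sample = io.StringIO("('1', '2')\n('3', '4')\n")
result = io.StringIO()
convert_lines(sample, result)
print(result.getvalue(), end="")
```

With real files, src and dst would be the objects returned by open(), ideally inside a with block so that the output is flushed and closed even if an error occurs.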

Christian Stade-Schuldt · Accepted Answer · 2015-09-22 15:51:57Z

2

You could simplify your code by using a simpler regex that finds all numbers in your input:

import re

with open(file_name) as infile, open(output_name, 'w') as outfile:
    for line in infile:
        # Collect every run of digits on the line and join them with a space.
        outfile.write(' '.join(re.findall(r'\d+', line)))
        outfile.write('\n')
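
A minimal sketch of what the findall approach does to a single line, assuming the input format shown in the question. The helper name digits_only is hypothetical. One caveat worth noting: \d+ matches digits only, so a minus sign on a negative number would be dropped.

```python
import re

def digits_only(line):
    """Join every run of digits found in line with single spaces."""
    return ' '.join(re.findall(r'\d+', line))

print(digits_only("('1', '2')"))      # prints: 1 2
print(digits_only("('-5', '10')"))    # prints: 5 10 (the minus sign is not matched)
```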


Kasravnd · Accepted Answer · 2015-09-22 15:47:28Z

1

Why not load them as Python tuples with ast.literal_eval? Also, instead of opening and closing the files manually, you can use a with statement, which closes the files at the end of the block:

import ast

with open(file_name) as infile, open(output_name, 'w') as outfile:
    for line in infile:
        # literal_eval turns the text "('1', '2')" into the tuple ('1', '2').
        outfile.write(' '.join(ast.literal_eval(line.strip())) + '\n')
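
To see what ast.literal_eval does with one line, here is a minimal sketch. The sample string is an assumption based on the format shown in the question.

```python
import ast

line = "('1', '2')\n"                    # sample line in the question's format
pair = ast.literal_eval(line.strip())    # parse the text as a Python literal
print(pair)                              # a tuple containing two strings
print(' '.join(pair))                    # prints: 1 2
```

Unlike eval, literal_eval only accepts Python literals (strings, numbers, tuples, lists, dicts, and similar), so a malformed line raises an exception rather than executing code.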


siegerts · Accepted Answer · 2015-09-23 14:57:53Z

1

I would use a namedtuple for better performance; it also makes the code more readable.

# Python 3

from collections import namedtuple
from ast import literal_eval
#...

Row = namedtuple('Row', 'x y')
with open(in_file, 'r') as f, open(out_file, 'w') as output:
    for line in f:  # iterate lazily; readlines() would load the whole file
        output.write("{0.x} {0.y}\n".
                     format(Row._make(literal_eval(line))))

edited Sep 23, 2015 at 14:57

answered Sep 22, 2015 at 16:05


2 Comments

Eli Riekeberg Over a year ago

I got this error (my first line is 35 characters long): r = Row._make(line) File "<string>", line 21, in _make TypeError: Expected 2 arguments, got 35

siegerts Over a year ago

@EliRiekeberg, okay, updated to fix that. The answer now uses ast.literal_eval, as mentioned by @Kasramvd, to convert each line from a string to a tuple before passing it to the namedtuple, and it also consolidates the output.write() calls.
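
The error in the exchange above can be reproduced with a short sketch. The sample line is an assumption based on the question's format. Row._make accepts any iterable, and a string is iterated character by character, so the number of "arguments" it reports is the length of the line rather than 2.

```python
from ast import literal_eval
from collections import namedtuple

Row = namedtuple('Row', 'x y')
line = "('1', '2')"   # sample line; the commenter's real line was 35 characters long

try:
    Row._make(line)   # the string is treated as a sequence of characters
except TypeError as exc:
    error = str(exc)
    print(error)

# Parsing the text into a real 2-tuple first gives _make exactly two items.
row = Row._make(literal_eval(line))
print(row.x, row.y)   # prints: 1 2
```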

Matej · Accepted Answer · 2015-09-22 18:47:13Z

0

This is one way to do it without the re module:

with open(r'd:\temp\02\input.txt', 'r') as in_file, \
     open(r'd:\temp\02\output.txt', 'w') as out_file:
    for line in in_file:
        # Strip quotes and parentheses; turn the ", " separator into a space.
        out_file.write(line.replace("'", '').replace('(', '').replace(', ', ' ').replace(')', ''))

