I have a large text file that I need to parse into a pipe-delimited text file using Python. The file looks like this (basically):
product/productId: D7SDF9S9
review/userId: asdf9uas0d8u9f
review/score: 5.0
review/some text here
product/productId: D39F99
review/userId: fasd9fasd9f9f
review/score: 4.1
review/some text here
Each record is separated by two newline characters (\n\n). I have written a parser below.
import re

with open("largefile.txt", "r") as myfile:
    fullstr = myfile.read()

# split the whole file into individual records on blank lines
allsplits = re.split("\n\n", fullstr)

fw = open(outnamename, 'w')
for s in allsplits:
    # split each record on the "key: " prefixes, leaving only the values
    splits = re.split("\n.*?: ", s)
    productId   = splits[0]
    userId      = splits[1]
    profileName = splits[2]
    helpfulness = splits[3]
    rating      = splits[4]
    time        = splits[5]
    summary     = splits[6]
    text        = splits[7]
    fw.write(productId+"|"+userId+"|"+profileName+"|"+helpfulness+"|"+rating+"|"+time+"|"+summary+"|"+text+"\n")
fw.close()
The problem is that the file I am reading in is so large that I run out of memory before the script can complete. I suspect it's bombing out at the allsplits = re.split("\n\n", fullstr) line, since that requires the entire file to be read into memory at once.
Can someone let me know of a way to just read in one record at a time, parse it, write it to a file, and then move to the next record?
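One way this could work, sketched below under the assumption that each record is a blank-line-separated block of "key: value" lines in the fixed order my splits[0]..splits[7] indices imply, is to accumulate lines until a blank line is hit and emit one pipe-delimited row per record. The field list and the file names ("largefile.txt", "parsed.txt") are placeholders, not the real ones:

import re

def format_record(lines, fields):
    # Strip the "whatever/key: " prefix from each line and join the values with pipes.
    values = [re.sub(r"^.*?: ", "", line, count=1) for line in lines]
    # Pad with empty strings if a record has fewer lines than expected.
    values += [""] * (len(fields) - len(values))
    return "|".join(values[:len(fields)])

def parse_records(inname, outname):
    fields = ["productId", "userId", "profileName", "helpfulness",
              "rating", "time", "summary", "text"]
    with open(inname, "r") as fin, open(outname, "w") as fout:
        record = []                      # lines of the current record
        for line in fin:                 # reads one line at a time, never the whole file
            line = line.rstrip("\n")
            if line:                     # still inside a record
                record.append(line)
            elif record:                 # blank line = end of record
                fout.write(format_record(record, fields) + "\n")
                record = []
        if record:                       # last record, if the file doesn't end with a blank line
            fout.write(format_record(record, fields) + "\n")

parse_records("largefile.txt", "parsed.txt")

Since this only ever holds one record's worth of lines in memory, it should stay flat no matter how large the input file is.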