Split String in Text File to Multiple Rows in Python

Question

I have a string within a text file that reads as one row, but I need to split the string into multiple rows based on a separator. If possible, I would like to separate the elements in the string based on the period (.) separating the different line elements listed here:

"Line 1: Element '{URL1}Decimal': 'x' is not a valid value of the atomic type 'xs:decimal'.Line 2: Element '{URL2}pos': 'y' is not a valid value of the atomic type 'xs:double'.Line 3: Element '{URL3}pos': 'y z' is not a valid value of the list type '{list1}doubleList'"

Here is my current script, which reads the .txt file and converts it to a csv, but does not separate each entry into its own row.

import glob
import csv
import os

path = "C:\\Users\\mdl518\\Desktop\\txt_strip\\"

with open(os.path.join(path,"test.txt"), 'r') as infile, open(os.path.join(path,"test.csv"), 'w') as outfile:
       stripped = (line.strip() for line in infile)
       lines = (line.split(",") for line in stripped if line)
       writer = csv.writer(outfile)
       writer.writerows(lines)

If possible, I would like to write to a .txt file with multiple rows, but a .csv would also work. Any help is most appreciated!

zwjjoy · Accepted Answer · 2020-06-15 15:13:36Z

One way to make it work:

import csv
import os

path = "C:\\Users\\mdl518\\Desktop\\txt_strip\\"

# newline="" keeps csv.writer from emitting blank rows between records on Windows
with open(os.path.join(path, "test.txt"), "r") as infile, open(os.path.join(path, "test.csv"), "w", newline="") as outfile:
    stripped = (line.strip() for line in infile)
    # Split each non-empty line on "." and wrap every piece in its own one-column row
    lines = ([sent] for para in (line.split(".") for line in stripped if line) for sent in para)
    writer = csv.writer(outfile)
    writer.writerows(lines)

Explanation below:

The output is a single row because writer.writerows() expects a 2D structure with one inner list per row, and the original code produces only one inner list, which holds the entire paragraph. To visualise it, "lines" is stored as [[s1,s2,s3]], whereas writer.writerows() needs [[s1],[s2],[s3]] to write three rows.

Two changes are needed.

(1) Use the period '.' as the separator: line.split(".")

(2) Iterate over each split list inside the generator expression, so every piece becomes its own row: lines = ([sent] for para in (line.split(".") for line in stripped if line) for sent in para)

str.split() splits a string on the separator and returns the pieces as a list. In your code, that whole list was yielded as a single item of the generator expression, which made the result a 2D array with only one row: [[s1,s2,s3]].
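
The difference between the two shapes can be checked without touching any files. This is a minimal sketch, assuming placeholder strings s1, s2, s3 and a hypothetical helper to_csv_text that writes into an in-memory buffer instead of test.csv:

```python
import csv
import io

# Placeholder sentences standing in for the pieces produced by str.split(".")
sentences = ["s1", "s2", "s3"]


def to_csv_text(rows):
    """Write rows with csv.writer into an in-memory buffer and return the text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# One inner list -> a single CSV row with three columns
one_row = to_csv_text([sentences])

# One inner list per sentence -> three CSV rows with one column each
three_rows = to_csv_text([[s] for s in sentences])

print(repr(one_row))     # 's1,s2,s3\r\n'
print(repr(three_rows))  # 's1\r\ns2\r\ns3\r\n'
```

The "\r\n" endings are csv.writer's default line terminator, which is also why the output file is best opened with newline="" on Windows.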


2 Comments

mdl518 Over a year ago

Zwjjoy - This definitely helps, but there is still one tweak that needs to be made! To clarify, the URLs (e.g. URL1) contain formal web addresses (e.g. standards.iso.org/iso) so the "iso" is being stored on one line of the csv and the remaining URL "org/…/../" is written on the next line, which has a space separating the two. Thanks again for your continued help, most appreciated!

zwjjoy Over a year ago

@mdl518 Happy to help! There are many ways to get it working. I suggest preprocessing your input text file and storing the sentences in a list first. One way to get it working by expanding on your existing code is to use the regex split function re.split(). Suggested code:

import re
lines = (['Line' + sent] for para in (re.split(r'Line', line) for line in stripped if line) for sent in para if sent)
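
A possible variation on the re.split() idea, offered as a sketch rather than the commenter's exact code: split only on a period that is directly followed by "Line <number>:", so periods inside URLs are left alone. The sample text, the address http://standards.iso.org/iso used inside it, and the helper name split_entries are assumptions for illustration only.

```python
import re

# Sample input modeled on the question; the URL is an illustrative placeholder
text = (
    "Line 1: Element '{http://standards.iso.org/iso}Decimal': 'x' is not a valid "
    "value of the atomic type 'xs:decimal'."
    "Line 2: Element '{http://standards.iso.org/iso}pos': 'y' is not a valid "
    "value of the atomic type 'xs:double'."
    "Line 3: Element '{http://standards.iso.org/iso}pos': 'y z' is not a valid "
    "value of the list type '{list1}doubleList'"
)


def split_entries(raw):
    """Split on a period only when 'Line <number>:' follows it immediately."""
    parts = re.split(r"\.(?=Line \d+:)", raw.strip())
    # Drop empty pieces, e.g. when the input is an empty string
    return [part for part in parts if part]


entries = split_entries(text)

# One entry per row; this string can be written to a .txt file as-is
output = "\n".join(entries)
print(output)
```

Note that the separating period is consumed by the split, so each entry is written without its trailing ".". For a csv, wrapping each entry as [entry] gives the one-column rows that writer.writerows() expects.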
