I have a string within a text file that reads as one row, but I need to split the string into multiple rows based on a separator. If possible, I would like to separate the elements in the string based on the period (.) separating the different line elements listed here:

"Line 1: Element '{URL1}Decimal': 'x' is not a valid value of the atomic type 'xs:decimal'.Line 2: Element '{URL2}pos': 'y' is not a valid value of the atomic type 'xs:double'.Line 3: Element '{URL3}pos': 'y z' is not a valid value of the list type '{list1}doubleList'"

Here is my current script, which reads the .txt file and converts it to a csv, but does not separate each entry into its own row.

import glob
import csv
import os

path = "C:\\Users\\mdl518\\Desktop\\txt_strip\\"

with open(os.path.join(path,"test.txt"), 'r') as infile, open(os.path.join(path,"test.csv"), 'w') as outfile:
       stripped = (line.strip() for line in infile)
       lines = (line.split(",") for line in stripped if line)
       writer = csv.writer(outfile)
       writer.writerows(lines)

If possible, I would like to write to a .txt file with multiple rows, but a .csv would also work. Any help is most appreciated!

1 Answer

One way to make it work:

import csv
import os

path = "C:\\Users\\mdl518\\Desktop\\txt_strip\\"

# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open(os.path.join(path, "test.txt"), "r") as infile, open(os.path.join(path, "test.csv"), "w", newline="") as outfile:
    stripped = (line.strip() for line in infile)
    # Split each non-empty line on "." and wrap every piece in its own one-cell row
    lines = ([sent] for para in (line.split(".") for line in stripped if line) for sent in para)
    writer = csv.writer(outfile)
    writer.writerows(lines)

Explanation below:

The output is a single row because writer.writerows() expects an iterable of rows, and in your code that iterable contains only one row: the list produced by splitting the entire paragraph. To visualise it, "lines" is stored as [[s1,s2,s3]], whereas writer.writerows() needs [[s1],[s2],[s3]] to write one sentence per row.

Two changes are needed.

(1) Use the period '.' as the separator: line.split(".")

(2) Iterate over each split list inside the generator expression, so every piece becomes its own row: lines = ([sent] for para in (line.split(".") for line in stripped if line) for sent in para)

str.split() splits a string on a separator and returns the pieces as a list. In your code, each of those lists became a single item of the generator expression, which produced a 2d structure: the whole paragraph was stored as one row, [[s1,s2,s3]].
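The difference between the two shapes can be checked without touching any files. The following is only a minimal sketch: the sample text, the to_csv helper, and the variable names are made up for illustration and are not part of the original script.

```python
import csv
import io

# Shortened, made-up sample in the same "sentence.sentence" shape as the input
text = "Line 1: first message.Line 2: second message.Line 3: third message"


def to_csv(rows):
    """Write rows to an in-memory CSV and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# One row holding three cells: [[s1, s2, s3]]
one_row = [text.split(".")]

# Three rows holding one cell each: [[s1], [s2], [s3]]
three_rows = [[sent] for sent in text.split(".")]

single = to_csv(one_row)
multiple = to_csv(three_rows)

print(single)    # all three messages on one CSV line
print(multiple)  # one message per CSV line
```

Only the second shape gives writerows() a separate row for each sentence.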

2 Comments

Zwjjoy - This definitely helps, but there is still one tweak that needs to be made! To clarify, the URLs (e.g. URL1) contain formal web addresses (e.g. standards.iso.org/iso) so the "iso" is being stored on one line of the csv and the remaining URL "org/…/../" is written on the next line, which has a space separating the two. Thanks again for your continued help, most appreciated!
@mdl518 Happy to help! There are many ways to get it working. I suggest preprocessing your input text file and storing the sentences in a list first. One way to do that while building on your existing code is to split with the regex function re.split(). Suggested code: import re lines = (['Line' + sent] for para in (re.split(r'Line', line) for line in stripped if line) for sent in para if sent)
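As a variation on the re.split() idea in the comment above, the split can be anchored on the "Line <number>:" marker with a zero-width lookahead, so periods inside URLs are left alone and the marker does not have to be glued back on. This is a sketch under assumptions: the sample string and the split_messages helper are hypothetical, every message is assumed to start with "Line <number>:", and splitting on an empty match requires Python 3.7 or newer.

```python
import re


def split_messages(text):
    """Split text just before each 'Line <number>:' marker, keeping the marker."""
    # (?=...) is a zero-width lookahead, so the split consumes no characters
    parts = re.split(r"(?=Line \d+:)", text)
    # Drop the empty piece that precedes the first marker
    return [part.strip() for part in parts if part.strip()]


# Made-up sample; the URL only illustrates a value that contains periods
sample = (
    "Line 1: Element '{standards.iso.org/iso}Decimal': 'x' is not valid."
    "Line 2: Element '{standards.iso.org/iso}pos': 'y' is not valid"
)

messages = split_messages(sample)

# One message per row, ready to be written to a .txt file
output = "\n".join(messages)
print(output)
```

The trailing period of each message is kept as-is. If the message text itself could contain "Line <number>:", this pattern would split there too, so it would need tightening for such input.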
