
I want to create a CSV file from a bunch of text files in a directory, each with the following structure, so that I can import them into a database later.

Title:
Article title

Word Count:
100

Summary:
Article summary.

Can consist of multiple lines.
    
Keywords:
keyword1, keyword2, keyword3


Article Body:
The rest of the article body.

Till the end of the file.

So the desired result is to have them in a CSV file, with the section names as headers and their contents as values, as follows.

Title          | Word Count | Summary                  | Keywords                     | Article Body |
Article title  | 100        | Article summary.\nCan... | keyword1, keyword2, keyword3 | ...          |
Article2 title | 110        | Article summary.\nCan... | keyword1, keyword2, keyword3 | ...          |

I've tried a few approaches with awk and shell scripting but couldn't succeed so far. Any ideas?

  • Include the expected output with the commands you tried.
  • I've added the expected output. I've extracted parts of a file using awk as follows, but I'm looking for a more generic solution to create the whole file: awk -F: '/^Title:/{v=$2}/^Word Count:/{print v $2}'
  • That doesn't look like CSV, though. The requirement to put comma-separated values inside some of the columns is decidedly weird, and surely hugely impractical. Perhaps a better design, if I can guess what you ultimately want to accomplish, would be to create two or three separate output files, with some sort of identifiers in each to allow you to map between them, sort of like in a relational database.
  • Like 1,Article title,100,Article summary ...,Article body and separately 1001,keyword 1, 1002,keyword two, 1003,third keyword etc., and finally 1,1001 to indicate that article 1 contains keyword 1001, 1,1002, etc. (I made the keywords 1000-based just to make it easier to see which is which; in reality both could start from 1, or any arbitrary number really.)
  • Does Postgres know how to handle CSV files with commas as both delimiter and data? Does the data need to be wrapped in double quotes, and if so, what about embedded/escaped quotes? I'd suggest taking an example set of data (including commas and double quotes) and working out the actual file format that you'll be able to successfully load into Postgres; once you have a valid format/structure for the file, the rest should be (relatively) easy to figure out (a small example of the quoting rules follows).
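
For reference, PostgreSQL's CSV mode follows the usual CSV quoting rules: a field that contains the delimiter, a double quote or a newline must be wrapped in double quotes, and an embedded double quote is doubled. A small hypothetical sample (not taken from the question's data):

title,keywords
"A title, with a comma","keyword1, ""quoted"" keyword2"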

2 Answers


According to the documentation of COPY, PostgreSQL fully supports the CSV format, as well as a text format that is compatible with the lossless TSV format.

Because I'm using awk, I chose to generate a TSV. The reason is that there are newlines in the data, and POSIX awk doesn't allow literal newlines in a user-defined variable. A TSV doesn't have this problem, because you have to replace the literal newlines with their C-style notation \n anyway.
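
To see that escaping in isolation, here is a minimal sketch (not part of the final script) that turns one multi-line record into a single TSV-safe line:

printf 'line one\nline two\n' | awk -v RS='' '{ gsub(/\\/,"\\\\"); gsub(/\n/,"\\n"); print }'
# prints: line one\nline two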

Also, I changed the input format to make it easier to parse. The new rule is that one or more empty lines delimit the records, which means that you can't have truly empty lines inside the content of Summary or Article Body; the work-around is to put a single space character on those lines, as I did in the example below.


Input example:

Title:
Article title

Word Count:
100

Summary:
Article summary.
 
Can consist of multiple lines.

Keywords:
keyword1, keyword2, keyword3


Article Body:
The rest of the article body.
 
Till the end of the file.

And here's the awk command, which accepts multiple files as argument:


awk -v RS='' -v FS='^$' -v OFS='\t' '
    FNR == 1 { ++onr } # the current file number is our "output record number"
    /^[^:\n]+:/ {
        # lossless TSV escaping
        gsub(/\\/,"\\\\")
        gsub(/\n/,"\\n")
        gsub(/\r/,"\\r")
        gsub(/\t/,"\\t")

        # get the current field name
        id = substr($0,1,index($0,":")-1)

        # strip the first line (NOTE: the newline character is escaped)
        sub(/^(\\[^n]|[^\\])*\\n/,"")

        # save the data
        fields[id]           # keep track of the field names that we came across
        records[0,id] = id   # for the header line
        records[onr,id] = $0 # for the output record
    }
    END {
        # print the header (onr == 0) and the records (onr >= 1)
        for (i = 0; i <= onr; i++) {
            out = sep = ""
            for (id in fields) {
                out = out sep records[i,id]
                sep = OFS
            }
            print out
        }
    }
' *.txt

Then the output (I replaced all the literal tabs with | for better legibility). Note that the column order comes from awk's for (id in fields) loop, whose traversal order is unspecified, so it may differ between awk implementations; the header line always stays consistent with the data rows, though:

Summary | Article Body | Word Count | Title | Keywords
Article summary.\n \nCan consist of multiple lines. | The rest of the article body.\n \nTill the end of the file. | 100 | Article title | keyword1, keyword2, keyword3

Postscript: once you have a valid TSV file, you can use a tool like mlr to convert it to CSV, JSON, etc., but for the purpose of importing the data into PostgreSQL it isn't required.
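
For example, a sketch of the conversion with Miller (assuming mlr is installed and that your version implements the backslash escapes of the lossless TSV format; file.tsv is a hypothetical name for the awk output redirected to a file):

mlr --itsv --ocsv cat file.tsv > file.csv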

The SQL statement will be this (untested):

COPY table_name FROM '/path/file.tsv' WITH HEADER;

Remark: you don't need to specify FORMAT or DELIMITER, because the defaults are already text and \t.
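
Two caveats, since the statement above is untested: HEADER with the default text format is only accepted by COPY from PostgreSQL 15 onwards (older versions allow it only with FORMAT csv), and HEADER merely skips the first line rather than mapping columns by name, so an explicit column list matching the generated header is safer. If the server can't read the file directly, psql's client-side \copy works the same way; a sketch with hypothetical database and table names:

psql -d mydb -c "\copy article (summary, content, word_count, title, keywords) FROM 'file.tsv' WITH HEADER"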


Comments

Thanks so much for the effort @fravadona! I made some changes to your script and created an insert statement instead; I thought that would be more practical. But I accepted your answer, and I'll write my version as another answer. I just noticed the COPY statement, by the way; I guess you edited your answer. I'll try that as well. Thanks again.
You're welcome ;-) You can also do that, but the string literals will have to be escaped for SQL, otherwise a simple single quote in the data can break your INSERT statement.
Yes, thank you, that's what I was trying to do now :)

I made some changes to @Fravadona's script and created an INSERT statement instead. That seemed more practical to me, and it works. The accepted answer was really helpful; I'm just adding my version here as a reference, as it might be useful for other people.

awk -v RS='' -v FS='^$' '
    FNR == 1 { fn++ }                   # one output record per input file
    /^[^:\n]+:/ {
        # get the field name from the first line
        fieldName = substr($0,1,index($0,":")-1)

        # strip the "Name:" line, then escape the content for a
        # PostgreSQL E\047...\047 string literal
        sub(/^[^\n]*\n/,"")
        gsub(/\\/,"\\\\")
        gsub(/\n/,"\\n")
        gsub(/\r/,"\\r")
        gsub(/\t/,"\\t")
        gsub("\047","\047\047")         # double the single quotes for SQL

        record[fn,fieldName] = $0
    }
    END {
        # fixed field order, so the values always line up with the column list
        n = split("Summary,Article Body,Word Count,Title,Keywords", cols, ",")
        for (i = 1; i <= fn; i++) {
            out = ""
            for (j = 1; j <= n; j++)
                out = out (j > 1 ? "," : "") "E\047" record[i,cols[j]] "\047"
            print "insert into article(summary, content, word_count, title, keywords) values(" out ");"
        }
    }
' x.txt

Result:

insert into article(summary, content, word_count, title, keywords) values(E'Article summary.\n \nCan consist of multiple lines.',E'The rest of the article body.\n \nTill the end of the file.',E'100',E'Article title',E'keyword1, keyword2, keyword3');
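
To run the generated statement directly, the output can be piped to psql; a sketch, assuming the awk program above is saved as make_insert.awk and the database is named mydb (both names hypothetical):

awk -v RS='' -v FS='^$' -f make_insert.awk x.txt | psql -d mydb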

