
I have a .csv file which I would like to convert into a .jsonl file.

I found the Pandas to_json method:

df = pd.read_csv('DIRECTORY/texts1.csv', sep=';')
df.to_json('DIRECTORY/texts1.json')

However, I am not aware of a function to turn it into a .jsonl format. How can I do this?

  • jsonlines.org Commented May 7, 2021 at 13:34
  • As I said, a lot of attempts to hijack a common practice. Just append the JSON strings at the end of the file you want. That's the whole point. You only need to read to the next newline to read a JSON document instead of reading the entire file. Commented May 7, 2021 at 13:35
  • In fact, ndjson.org appeared before jsonlines.org and contained the same text as the historical json.org site, without having any relation to either Douglas Crockford or ECMA Commented May 7, 2021 at 13:39
  • The whole point of storing a JSON document per line is that you don't have to read either the document or the data into memory. It's the same benefit CSV has. You can read the CSV file line by line, generate a JSON string from each line, and just append it to the target file. This way you could handle e.g. a 10 GB file without using any more memory than necessary to process and serialize a single line. Commented May 7, 2021 at 13:42
  • 1
    From this answer you can see that to_json can write each row in a separate row if you use orient='records', lines=True. From to_json docs: If ‘orient’ is ‘records’ write out line delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list like. Commented May 7, 2021 at 13:47

3 Answers


This is probably a bit late, but I wrote a silly module called csv-jsonl that may help with this sort of thing.

>>> from csv_jsonl import JSONLinesDictWriter
>>> l = [{"foo": "bar", "bat": 1}, {"foo": "bar", "bat": 2}]
>>> with open("foo.jsonl", "w", encoding="utf-8") as _fh:
...     writer = JSONLinesDictWriter(_fh)
...     writer.writerows(l)
...

It extends the native csv module, so it's mostly familiar. Hope it helps.




I'm not sure whether this result is fully compliant with "jsonl" syntax, but it's a hack that might get you toward a relevant outcome.

The primary trick is to export each row of the input file as a separate JSON file, then read each of those JSON files back from disk and treat its content as one distinct jsonl line.

I'm starting from a CSV that contains

hello, from, this, file
another, amazing, line, csv
last, line, of, file

The snippet below builds on another post.

import pandas
df = pandas.read_csv("myfile.csv", header=None)

file_to_write = ""
for index in df.index:
    df.loc[index].to_json("row{}.json".format(index))
    with open("row{}.json".format(index)) as file_handle:
        file_content = file_handle.read()
        file_to_write += file_content + "\n"
        
with open("result.jsonl","w") as file_handle:
    file_handle.write(file_to_write)

The resulting .jsonl file contains

{"0":"hello","1":" from","2":" this","3":" file"}
{"0":"another","1":" amazing","2":" line","3":" csv"}
{"0":"last","1":" line","2":" of","3":" file"}

If the row indices are not desired, those could be removed from the .to_json() line of the Python snippet above.



Thought this should be added. This version builds upon Ben's answer but avoids using temporary files and optimizes string handling, thus addressing the potential inefficiencies and issues in the original script.

import pandas as pd

# Reading the CSV file into a DataFrame without headers
df = pd.read_csv("myfile.csv", header=None)

# Prepare an empty list to collect JSON strings
json_lines = []

# Convert each row to a JSON string and append it to the list
for index, row in df.iterrows():
    # The default orient ('index') keeps the {"0": ..., "1": ...} keys,
    # matching Ben's output; orient='records' on a Series would emit a
    # bare value array instead
    json_str = row.to_json()
    json_lines.append(json_str)

# Join all JSON strings with newline characters and write to a single file
with open("result.jsonl", "w") as file_handle:
    file_handle.write("\n".join(json_lines))

