22

Is it possible to split a file? For example, I have a huge wordlist and I want to split it so that it becomes more than one file. How is this possible?


10 Answers

22

This one splits a file up by newlines and writes it back out. You can change the delimiter easily. It also handles uneven amounts, if the number of lines in your input file isn't a multiple of splitLen (20 in this example).

splitLen = 20         # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
input = open('input.txt', 'r').read().split('\n')

at = 1
for lines in range(0, len(input), splitLen):
    # First, get the list slice
    outputData = input[lines:lines+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1

3 Comments

Might mention that for REALLY BIG FILES, open().read() chews a lot of memory and time. But mostly it's okay.
Oh, I know. I just wanted to throw together a working script quickly, and I normally work with small files. I end up with shorthand like that.
This method is actually very fast. I split a 1 GB file with 7M lines in 28 seconds, using 1.5 GB of memory. Compared to the approach in stackoverflow.com/questions/20602869/… it is much faster.
15

A better loop for sli's example, which doesn't hog memory:

splitLen = 20         # 20 lines per file
outputBase = 'output' # output.1.txt, output.2.txt, etc.

input = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in input:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1

1 Comment

Careful when copying this code! It leaves open file handles for dest and input. Also, it's not a great idea to shadow the built-in function input.
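
To address that comment, here is a minimal rework of the loop above — a sketch, not the answerer's code — that avoids the input name, uses a with block for the source file, and guarantees the last output handle is closed:

splitLen = 20         # lines per output file
outputBase = 'output' # output0.txt, output1.txt, etc.

dest = None
try:
    with open('input.txt') as src:
        for count, line in enumerate(src):
            if count % splitLen == 0:   # time to rotate to a new output file
                if dest:
                    dest.close()
                dest = open('{}{}.txt'.format(outputBase, count // splitLen), 'w')
            dest.write(line)
finally:
    if dest:
        dest.close()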
9

A solution to split binary files into chapters .000, .001, etc.:

FILE = 'scons-conversion.7z'

MAX  = 500*1024*1024  # 500 MB - max chapter size
BUF  = 50*1024*1024   # 50 MB  - memory buffer size

chapters = 0
uglybuf  = b''  # bytes, since the files are opened in binary mode
with open(FILE, 'rb') as src:
  while True:
    tgt = open(FILE + '.%03d' % chapters, 'wb')
    written = 0
    while written < MAX:
      if len(uglybuf) > 0:
        tgt.write(uglybuf)
      tgt.write(src.read(min(BUF, MAX - written)))
      written += min(BUF, MAX - written)
      uglybuf = src.read(1)
      if len(uglybuf) == 0:
        break
    tgt.close()
    if len(uglybuf) == 0:
      break
    chapters += 1

Comments

3
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the number of parts created.
    """
    with open(file, 'r+b') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'w+b') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += buffer
                    else:
                        return suffix
                suffix += 1


def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'w+b') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'r+b') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break

2 Comments

There is a bug if max_size is an integer multiple of 1024: written <= max_size should be written < max_size. I can't edit it myself because the edit only removes a single character.
@osrpt Note that this introduces a different off-by-one error where it creates an extra zero-byte file if the second-to-last file reads all the remaining bytes (e.g. if you split a file in half, it creates two files plus a third file with zero bytes). I suppose this problem isn't as bad.
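
A possible rework addressing both comments — a sketch, not the original answer's code — counts the bytes actually read and peeks at the next chunk before creating a part, so no empty trailing file is left behind:

def split_file(file, prefix, max_size, buffer=1024):
    """Split file into parts of at most max_size bytes; returns the part count."""
    part = 0
    with open(file, 'rb') as src:
        while True:
            data = src.read(min(buffer, max_size))
            if not data:
                # EOF landed exactly on a part boundary: no empty file created
                break
            with open('{}.{}'.format(prefix, part), 'wb') as tgt:
                written = 0
                while data:
                    tgt.write(data)
                    written += len(data)  # count actual bytes, not the buffer size
                    if written >= max_size:
                        break
                    data = src.read(min(buffer, max_size - written))
            part += 1
    return part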
2

Sure it's possible:

open input file
open output file 1
count = 0
for each line in input file:
    write line to output file
    count = count + 1
    if count >= maxlines:
        close output file
        open next output file
        count = 0
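
A direct, runnable translation of that pseudocode, as a sketch; the filenames input.txt and partN.txt are illustrative, not from the original answer:

maxlines = 1000  # example value

part = 0
count = 0
out = open('part{}.txt'.format(part), 'w')
with open('input.txt') as src:
    for line in src:
        out.write(line)
        count += 1
        if count >= maxlines:
            out.close()
            part += 1
            out = open('part{}.txt'.format(part), 'w')
            count = 0
out.close()

Like the pseudocode, this can leave an empty last file when the line count is an exact multiple of maxlines.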

Comments

2
import re
PATENTS = 'patent.data'

def split_file(filename):
    # Open the file to read
    with open(filename, "r") as r:

        # Counter
        n = 0

        # Start reading the file line by line
        for i, line in enumerate(r):

            # If the line matches the template -- <?xml -- increase counter n
            if re.match(r'\<\?xml', line):
                n += 1

                # This "if" can be deleted; without it, naming starts from 1.
                # Whether to keep it depends on where "re" first finds the
                # template. In my case it was the first line.
                if i == 0:
                    n = 0

            # Append the line to the current output file
            # (note: use the filename argument, not the PATENTS global)
            with open("{}-{}".format(filename, n), "a") as f:
                f.write(line)

split_file(PATENTS)

As a result you will get:

patent.data-0

patent.data-1

patent.data-N
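
Since this reopens the output file in append mode for every input line, a variant that keeps a single output handle open at a time should be noticeably faster on large files — a sketch of the same idea, not the original answer's code:

import re

def split_file(filename):
    out = None
    n = -1
    with open(filename) as src:
        for line in src:
            # Open the next output file on the first line and on each match
            if out is None or re.match(r'\<\?xml', line):
                if out:
                    out.close()
                n += 1
                out = open("{}-{}".format(filename, n), "w")
            out.write(line)
    if out:
        out.close()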

Comments

2

You can use the filesplit module from PyPI.
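
For instance, something like the following — this assumes the filesplit 4.x interface (a Split class in filesplit.split with a bylinecount method); check the project's README for the version you have installed, since older releases used a different API:

from filesplit.split import Split

# Split input.txt into files of 50000 lines each, written to out_dir.
# Class, module, and method names here assume filesplit 4.x.
split = Split("input.txt", "out_dir")
split.bylinecount(50000)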

Comments

1

This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.

Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines each, named file_part1.txt, ..., file_partn.txt, you could do:

import itertools

MAXLINES = 1000  # lines per chunk

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)

Comments

0
import subprocess

subprocess.run('split -l number_of_lines file_path', shell=True)

For example, if you want 50000 lines per file and the path is /home/data, then you can run the command below:

subprocess.run('split -l 50000 /home/data', shell=True)

If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines using the command below, and then divide it by the number of splits you want:

! wc -l file_path

in this case

! wc -l /home/data

Just so you know, the output files will not have a file extension even if the input file has one; you can rename them manually if needed (e.g. on Windows).
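
The count-then-divide step can also be done from Python directly; a sketch, with /home/data and a part count of 10 as example values:

import subprocess

# Total line count: wc -l prints "<count> <path>", so take the first field
total = int(subprocess.check_output(['wc', '-l', '/home/data']).split()[0])

parts = 10                           # desired number of split files
lines_per_file = -(-total // parts)  # ceiling division

subprocess.run('split -l {} /home/data'.format(lines_per_file), shell=True)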

Comments

-1

All the provided answers are good and (probably) work. However, they need to load the file into memory (in whole or in part). We know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).

I found the following is the most efficient way to do it:

import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    os.system(f"ls {PREFIX}??")  # os.system returns the exit status, so don't print it
else:
    print("Failed!")

Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/

Comments
