22

Is it possible to split a file? For example, I have a huge wordlist and I want to split it so that it becomes more than one file. How is this possible?


10 Answers

22

This one splits a file up by newlines and writes it back out. You can change the delimiter easily. It also handles uneven amounts, if the number of lines in your input file isn't a multiple of splitLen (20 in this example).

splitLen = 20         # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
input = open('input.txt', 'r').read().split('\n')

at = 1
for lines in range(0, len(input), splitLen):
    # First, get the list slice
    outputData = input[lines:lines+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1

3 Comments

Might mention that for REALLY BIG FILES, open().read() chews a lot of memory and time. But mostly it's okay.
Oh, I know. I just wanted to throw together a working script quickly, and I normally work with small files. I end up with shorthand like that.
This method is actually very fast. I split a 1 GB file with 7M lines in 28 seconds, using 1.5 GB of memory. Compared to the approach in stackoverflow.com/questions/20602869/… it is much faster.
15

A better loop for sli's example, which doesn't hog memory:

splitLen = 20         # 20 lines per file
outputBase = 'output' # output.1.txt, output.2.txt, etc.

input = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in input:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1

1 Comment

Careful when copying this code! It leaves open file handles for dest and input. Also, it's not a great idea to shadow the built-in function input.
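
To address that comment, here is a minimal rework of the loop above — a sketch, not the answerer's code — that avoids the input name, uses a with block for the source file, and guarantees the last output handle is closed:

splitLen = 20         # lines per output file
outputBase = 'output' # output0.txt, output1.txt, etc.

dest = None
try:
    with open('input.txt') as src:
        for count, line in enumerate(src):
            if count % splitLen == 0:   # time to rotate to a new output file
                if dest:
                    dest.close()
                dest = open('{}{}.txt'.format(outputBase, count // splitLen), 'w')
            dest.write(line)
finally:
    if dest:
        dest.close()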
9

A solution to split binary files into chapters .000, .001, etc.:

FILE = 'scons-conversion.7z'

MAX  = 500*1024*1024  # 500 MB - max chapter size
BUF  = 50*1024*1024   # 50 MB  - memory buffer size

chapters = 0
uglybuf  = b''  # bytes, since the files are opened in binary mode
with open(FILE, 'rb') as src:
  while True:
    tgt = open(FILE + '.%03d' % chapters, 'wb')
    written = 0
    while written < MAX:
      if len(uglybuf) > 0:
        tgt.write(uglybuf)
      tgt.write(src.read(min(BUF, MAX - written)))
      written += min(BUF, MAX - written)
      uglybuf = src.read(1)
      if len(uglybuf) == 0:
        break
    tgt.close()
    if len(uglybuf) == 0:
      break
    chapters += 1

Comments

3
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the number of parts created.
    """
    with open(file, 'r+b') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'w+b') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += buffer
                    else:
                        return suffix
                suffix += 1


def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'w+b') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'r+b') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break

2 Comments

There is a bug if max_size is an integer multiple of 1024: written <= max_size should be written < max_size. I can't edit it myself because the edit only removes a single character.
@osrpt Note that this introduces a different off-by-one error where it creates an extra zero-byte file if the second-to-last file reads all the remaining bytes (e.g. if you split a file in half, it creates two files plus a third file with zero bytes). I suppose this problem isn't as bad.
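
A possible rework addressing both comments — a sketch, not the original answer's code — counts the bytes actually read and peeks at the next chunk before creating a part, so no empty trailing file is left behind:

def split_file(file, prefix, max_size, buffer=1024):
    """Split file into parts of at most max_size bytes; returns the part count."""
    part = 0
    with open(file, 'rb') as src:
        while True:
            data = src.read(min(buffer, max_size))
            if not data:
                # EOF landed exactly on a part boundary: no empty file created
                break
            with open('{}.{}'.format(prefix, part), 'wb') as tgt:
                written = 0
                while data:
                    tgt.write(data)
                    written += len(data)  # count actual bytes, not the buffer size
                    if written >= max_size:
                        break
                    data = src.read(min(buffer, max_size - written))
            part += 1
    return part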
2

Sure it's possible:

open input file
open output file 1
count = 0
for each line in input file:
    write line to output file
    count = count + 1
    if count >= maxlines:
        close output file
        open next output file
        count = 0
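
A direct, runnable translation of that pseudocode, as a sketch; the filenames input.txt and partN.txt are illustrative, not from the original answer:

maxlines = 1000  # example value

part = 0
count = 0
out = open('part{}.txt'.format(part), 'w')
with open('input.txt') as src:
    for line in src:
        out.write(line)
        count += 1
        if count >= maxlines:
            out.close()
            part += 1
            out = open('part{}.txt'.format(part), 'w')
            count = 0
out.close()

Like the pseudocode, this can leave an empty last file when the line count is an exact multiple of maxlines.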

Comments

2
import re
PATENTS = 'patent.data'

def split_file(filename):
    # Open the file to read
    with open(filename, "r") as r:

        # Counter
        n = 0

        # Start reading the file line by line
        for i, line in enumerate(r):

            # If the line matches the template -- <?xml -- increase counter n
            if re.match(r'\<\?xml', line):
                n += 1

                # This "if" can be deleted; without it, naming starts from 1.
                # Whether to keep it depends on where "re" first finds the
                # template. In my case it was the first line.
                if i == 0:
                    n = 0

            # Append the line to the current output file
            # (note: use the filename argument, not the PATENTS global)
            with open("{}-{}".format(filename, n), "a") as f:
                f.write(line)

split_file(PATENTS)

As a result you will get:

patent.data-0

patent.data-1

patent.data-N
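
Since this reopens the output file in append mode for every input line, a variant that keeps a single output handle open at a time should be noticeably faster on large files — a sketch of the same idea, not the original answer's code:

import re

def split_file(filename):
    out = None
    n = -1
    with open(filename) as src:
        for line in src:
            # Open the next output file on the first line and on each match
            if out is None or re.match(r'\<\?xml', line):
                if out:
                    out.close()
                n += 1
                out = open("{}-{}".format(filename, n), "w")
            out.write(line)
    if out:
        out.close()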

Comments

2

You can use the filesplit module from PyPI.
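
For instance, something like the following — this assumes the filesplit 4.x interface (a Split class in filesplit.split with a bylinecount method); check the project's README for the version you have installed, since older releases used a different API:

from filesplit.split import Split

# Split input.txt into files of 50000 lines each, written to out_dir.
# Class, module, and method names here assume filesplit 4.x.
split = Split("input.txt", "out_dir")
split.bylinecount(50000)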

Comments

1

This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.

Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines each, named file_part1.txt, ..., file_partn.txt, you could do:

import itertools

MAXLINES = 1000  # lines per chunk

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)

Comments

0
import subprocess

subprocess.run('split -l number_of_lines file_path', shell=True)

For example, if you want 50000 lines per file and the path is /home/data, then you can run the command below:

subprocess.run('split -l 50000 /home/data', shell=True)

If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines using the command below, and then divide it by the number of splits you want:

! wc -l file_path

in this case

! wc -l /home/data

Just so you know, the output files will not have a file extension even if the input file has one; you can rename them manually if needed (e.g. on Windows).
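
The count-then-divide step can also be done from Python directly; a sketch, with /home/data and a part count of 10 as example values:

import subprocess

# Total line count: wc -l prints "<count> <path>", so take the first field
total = int(subprocess.check_output(['wc', '-l', '/home/data']).split()[0])

parts = 10                           # desired number of split files
lines_per_file = -(-total // parts)  # ceiling division

subprocess.run('split -l {} /home/data'.format(lines_per_file), shell=True)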

Comments

-1

All the provided answers are good and (probably) work. However, they need to load the file into memory (in whole or in part). We know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).

I found the following is the most efficient way to do it:

import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    os.system(f"ls {PREFIX}??")  # os.system returns the exit status, so don't print it
else:
    print("Failed!")

Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/

Comments
