108

I have multiple text files of about 100,000 lines each, and I want to split them into smaller text files of 5,000 lines each.

I used:

split -l 5000 filename.txt

That creates files:

xaa
xab
xac
xad
xae
xaf

files with no extensions. I just want to call them something like:

file01.txt
file02.txt
file03.txt
file04.txt

or, if that is not possible, I just want them to have the ".txt" extension.

5 Comments
  • What platform are you on? You talk about split (a Unix/Linux utility) but tag with batch-file, which is Windows. Commented Aug 11, 2014 at 18:20
  • Mark, I'm on Windows, but have the Cygwin bash shell installed, so I have access to split/csplit. Commented Aug 11, 2014 at 18:22
  • @MarkSetchell Mark, yes I do. Commented Aug 11, 2014 at 18:25
  • Possible duplicate of Batch file to split .csv file Commented Apr 14, 2016 at 21:38
  • This answer with PowerShell can be embedded in a batch file. See this for a basis. Commented Apr 14, 2016 at 21:45

10 Answers

169

I know the question was asked a long time ago, but I am surprised that nobody has given the most straightforward Unix answer:

split -l 5000 -d --additional-suffix=.txt $FileName file
  • -l 5000: split file into files of 5,000 lines each.
  • -d: numerical suffix. This will make the suffix go from 00 to 99 by default instead of aa to zz.
  • --additional-suffix: lets you specify the suffix, here the extension
  • $FileName: name of the file to be split.
  • file: prefix to add to the resulting files.

As always, check out man split for more details.
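
As a quick sanity check, here is the same command run against a small throwaway file (the file names are just illustrative):

```shell
# make a 12-line sample file
seq 12 > sample.txt

# 5 lines per chunk, numeric suffixes, .txt extension (GNU split)
split -l 5 -d --additional-suffix=.txt sample.txt file

ls file*.txt
# file00.txt  file01.txt  file02.txt
```

The first two chunks get 5 lines each and the last chunk gets the remaining 2 lines.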

For Mac, the default version of split is dumbed down. You can install the GNU version using the following command. (see this question for more GNU utils)

brew install coreutils

and then you can execute the above command by replacing split with gsplit. Check out man gsplit for details.

6 Comments

If I could +100 I would! With the syntax you posted I was able to split a >380M file into 10M files in roughly 0.3 seconds.
It seems like -d and --additional-suffix are no longer supported options (OSX 10.12.6)
@StefanoMunarini For Mac, you can install the GNU version of split with brew install coreutils, and then replace split with gsplit in the command above.
And how would you use a delimiter instead of a number of lines?
@AGrush I'm not sure exactly what your use case is, but I think you could use the -t flag, which splits on a user-specified delimiter instead of a newline. You can then use the -l flag to specify how many records you want to group together in each output file.
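
For what it's worth, a minimal sketch of that -t behavior with GNU split (file names here are made up; -t makes the given character the record separator in place of newline):

```shell
# four comma-terminated records in one file
printf 'a,b,c,d,' > records.txt

# ',' as the record separator, 2 records per output file
split -t ',' -l 2 -d --additional-suffix=.txt records.txt part

# the two parts together round-trip the original data
cat part00.txt part01.txt
```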
23

Here's an example in C# (because that's what I was searching for). I needed to split a 23 GB CSV file with around 175 million lines so I could look at the files. I split it into files of one million rows each. This code did it in about 5 minutes on my machine:

var list = new List<string>();
var fileSuffix = 0;

using (var file = File.OpenRead(@"D:\Temp\file.csv"))
using (var reader = new StreamReader(file))
{
    while (!reader.EndOfStream)
    {
        list.Add(reader.ReadLine());

        // flush a chunk once it reaches one million lines
        if (list.Count >= 1000000)
        {
            File.WriteAllLines(@"D:\Temp\split" + (++fileSuffix) + ".csv", list);
            list = new List<string>();
        }
    }
}

// write any remaining lines (skip if the line count was an exact multiple)
if (list.Count > 0)
    File.WriteAllLines(@"D:\Temp\split" + (++fileSuffix) + ".csv", list);

1 Comment

And you can basically just throw it in LINQPad and tweak it to your heart's content. No need to compile anything. Good solution.
15
@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET /a fcount=100
SET /a llimit=5000
SET /a lcount=%llimit%
FOR /f "usebackqdelims=" %%a IN ("%sourcedir%\q25249516.txt") DO (
 CALL :select
 FOR /f "tokens=1*delims==" %%b IN ('set dfile') DO IF /i "%%b"=="dfile" >>"%%c" ECHO(%%a
)
GOTO :EOF
:select
SET /a lcount+=1
IF %lcount% lss %llimit% GOTO :EOF
SET /a lcount=0
SET /a fcount+=1
SET "dfile=%sourcedir%\file%fcount:~-2%.txt"
GOTO :EOF

Here's a native Windows batch script that should accomplish the task.

Now I'll not say that it'll be fast (less than 2 minutes for each 5K-line output file) or that it will be immune to batch character sensitivities. Really depends on the characteristics of your target data.

I used a file named q25249516.txt containing 100K lines of data for my testing.


Revised quicker version

@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET /a fcount=199
SET /a llimit=5000
SET /a lcount=%llimit%
FOR /f "usebackqdelims=" %%a IN ("%sourcedir%\q25249516.txt") DO (
 CALL :select
 >>"%sourcedir%\file$$.txt" ECHO(%%a
)
SET /a lcount=%llimit%
:select
SET /a lcount+=1
IF %lcount% lss %llimit% GOTO :EOF
SET /a lcount=0
SET /a fcount+=1
MOVE /y "%sourcedir%\file$$.txt" "%sourcedir%\file%fcount:~-2%.txt" >NUL 2>nul
GOTO :EOF

Note that I used an llimit of 50000 for testing. It will overwrite the early file numbers if llimit*100 is greater than the number of lines in the file (cure by setting fcount to 1999 and using ~3 in place of ~2 in the file-renaming line).

4 Comments

1 MB taking 5 min is too long.
@shareef: The time taken should depend on the number of lines in the file, not the file size. Not sure whether you mean 1 MB or 1M lines. My test of the latest version was on 1M lines, 11 MB long.
This is good, but it leaves one blank line at the end of each line. Any way to prevent that?
@arya: I do not understand "one blank line at the end of each line". The line endings are Windows-standard CRLF. There are no empty lines in the output. Perhaps you are using a utility that counts both CR and LF as newlines?
9

This "File Splitter" Windows command line program works nicely: https://github.com/dubasdey/File-Splitter

It's open source, simple, documented, proven, and worked for me.

Example:

fsplit -split 50 mb mylargefile.txt

2 Comments

This tool would be ideal, but it replaces non-ASCII characters with garbage characters :). Just saying so you're aware of this problem.
@Michał Stochmal: The tool's documentation mentions "... Split by size produces binary files ...", so you have to split by line numbers instead.
8

You can maybe do something like this with awk:

awk '{outfile=sprintf("file%02d.txt",NR/5000+1);print > outfile}' yourfile

Basically, it computes the name of the output file by taking the record number (NR), dividing it by 5000, adding 1, taking the integer part of that, and zero-padding to 2 places.

By default, awk prints the entire input record when you don't specify anything else. So, print > outfile writes the entire input record to the output file.

As you are running on Windows, you can't use single quotes because cmd doesn't like that. I think you have to put the script in a file and then tell awk to use the file, something like this:

awk -f script.awk yourfile

and script.awk will contain the script like this:

{outfile=sprintf("file%02d.txt",NR/5000+1);print > outfile}

Or, it may work if you do this:

awk "{outfile=sprintf(\"file%02d.txt\",NR/5000+1);print > outfile}" yourfile

1 Comment

This makes the first file one line shorter than the others. The correct formula is (NR-1)/5000+1
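
The off-by-one is easy to confirm on a small file; a sketch of the corrected formula (using 5 lines per file instead of 5000, in a Unix shell):

```shell
seq 12 > yourfile
awk '{outfile=sprintf("file%02d.txt",(NR-1)/5+1); print > outfile}' yourfile

wc -l file01.txt file02.txt file03.txt
# 5, 5, and 2 lines respectively
```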
7

Syntax looks like:

$ split [OPTION] [INPUT [PREFIX]] 

where the output files are named PREFIXaa, PREFIXab, ...

Just use a proper prefix and you're done, or rename the files afterwards with mv. Note that $ mv * *.txt will not work: the shell expands wildcards against existing file names rather than rewriting them, so rename the files in a loop instead, and test it first on a smaller scale.

:)

1 Comment

I love this one, but the suffix gets lost. Any idea how to keep the suffix?
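
To keep the extension, you can rename split's default output (xaa, xab, ...) in a loop; a minimal sketch with a throwaway 10-line file:

```shell
seq 10 > filename.txt
split -l 5 filename.txt      # produces xaa and xab

# append .txt to every chunk
for f in xa?; do mv "$f" "$f.txt"; done

ls x*.txt
# xaa.txt  xab.txt
```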
5

My requirement was a bit different. I often work with Comma Delimited and Tab Delimited ASCII files where a single line is a single record of data. And they're really big, so I need to split them into manageable parts (whilst preserving the header row).

So, I reverted to my classic VBScript method and bashed together a small .vbs script that can be run on any Windows computer (it gets automatically executed by the WScript.exe script host engine on Windows).

The benefit of this method is that it uses text streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and it doesn't really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in size, had about 12 million lines of text, and was split into 25 part files (each with about 500k lines) – the processing took about 2 minutes and never went over 3 MB of memory used at any point.

The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF) as the Text Stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

Option Explicit

Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"  'The full path to the big file
Private Const REPEAT_HEADER_ROW = True                'Set to True to duplicate the header row in each part file
Private Const LINES_PER_PART = 500000                 'The number of lines per part file

Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

sStart = Now()

sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
iLineCounter = 0
iOutputFile = 1

Set oFileSystem = CreateObject("Scripting.FileSystemObject")
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

If REPEAT_HEADER_ROW Then
    iLineCounter = 1
    sHeaderLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sHeaderLine)
End If

Do While Not oInputFile.AtEndOfStream
    sLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sLine)
    iLineCounter = iLineCounter + 1
    If iLineCounter Mod LINES_PER_PART = 0 Then
        iOutputFile = iOutputFile + 1
        Call oOutputFile.Close()
        Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
        If REPEAT_HEADER_ROW Then
            Call oOutputFile.WriteLine(sHeaderLine)
        End If
    End If
Loop

Call oInputFile.Close()
Call oOutputFile.Close()
Set oFileSystem = Nothing

Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())

Comments

3

Here is one in C# that doesn't run out of memory when splitting into large chunks! I needed to split a 95M-line file into files of 10M lines each.

var fileSuffix = 0;
int lines = 0;
Stream fstream = File.OpenWrite($"{filename}.{++fileSuffix}");
StreamWriter sw = new StreamWriter(fstream);

using (var file = File.OpenRead(filename))
using (var reader = new StreamReader(file))
{
    while (!reader.EndOfStream)
    {
        sw.WriteLine(reader.ReadLine());
        lines++;

        if (lines >= 10000000)
        {
              sw.Close();
              fstream.Close();
              lines = 0;
fstream = File.OpenWrite($"{filename}.{++fileSuffix}");
              sw = new StreamWriter(fstream);
        }
    }
}

sw.Close();
fstream.Close();

Comments

0

I have created a simple program for this, and your question helped me complete the solution. I added one more feature and a few configurations, in case you want to add a specific character/string after every few lines (configurable). Please go through the notes. I have added the code files: https://github.com/mohitsharma779/FileSplit

Comments

0

This Python code will split every .txt file in a directory into 1 MB parts:

import os
from tqdm import tqdm

# Directory containing the txt files
directory = r"d:\2022_12_02"

# Split a file's contents into chunks of at most max_size bytes
def split_txt_files(file_path, max_size):
    with open(file_path, 'rb') as file:
        data = file.read()

    # If the file is already within the size limit, return it as a single part
    if len(data) <= max_size:
        return [data]

    # Slice the data into max_size-byte chunks
    return [data[i:i + max_size] for i in range(0, len(data), max_size)]

# Iterate over the txt files in the directory
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        parts = split_txt_files(file_path, 1024 * 1024)  # 1 MB in bytes

        # Set up tqdm to display progress
        progress_bar = tqdm(total=len(parts), desc=f"Splitting {filename}", unit="part")

        for i, part in enumerate(parts):
            part_filename = f"{os.path.splitext(filename)[0]}_part{i+1}.txt"
            part_path = os.path.join(directory, part_filename)

            with open(part_path, 'wb') as part_file:
                part_file.write(part)

            # Update the progress bar
            progress_bar.update(1)
            progress_bar.set_postfix({"Current Part": i+1})

            print(f"File {part_filename} has been created.")

        # Close the progress bar
        progress_bar.close()

Comments
