
Additional explanation based on NTFS filesystem performance

After writing the lower section of this answer, the OP pointed out that the script is running on an NTFS disk, and suspects that may be part of the problem.

This would not be too surprising: There are performance problems with NTFS specifically related to handling many small files. And we are creating small files on the order of millions - per input file.

So, bad NTFS performance would be an alternative explanation for the performance degradation, while the extreme memory use still seems to be related to mmap().

Bad NTFS performance
Configuring NTFS file system for performance


Explaining the memory problem by heavy use of mmap()

The memory problem that occurs with split in your script seems to be related to split's use of mmap().

strace shows the following calls for each output file:

28892 open("xx02", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
28892 fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
28892 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f821582f000
28892 write(3, "sometext\n", 30) = 30
28892 close(3)                          = 0
28892 munmap(0x7f821582f000, 4096)      = 0

Based on the examples, for a rough estimate of the files to handle, we assume input files of 300 MB and output files of 100 B:

That gives us about 3,000,000 files to write. We write only one at a time - but we use mmap(). That means, for each of the files, at least one memory page is used, which is 4096 B in size.

Taking that into account, we touch about 12 GB of memory (1) for one input file (but not all at once). Three million files and 12 GB sounds like it could cause some work for the kernel.
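The back-of-the-envelope arithmetic above, written out (the 300 MB / 100 B sizes are the assumed example values, not measured ones):

```python
input_size = 300 * 10**6   # assumed input file size: 300 MB
output_size = 100          # assumed output file size: ~100 B
page_size = 4096           # one anonymous page mapped per output file

n_files = input_size // output_size   # 3,000,000 output files
touched = n_files * page_size         # total bytes of page mappings over the run

print(n_files)                 # 3000000
print(touched)                 # 12288000000 bytes, i.e. roughly 12 GB
```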

Basically, it looks like split is just not made for this job, because it uses mmap().
That is a good thing in other situations.
But with this extreme input, it messes up the memory management badly - which then needs some time to clean up. (2)

(1) Or address space only? (2) It does not really use too much memory at the same time, but mmaps a huge number of small files in a short time.
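Since the problem is the per-file mmap()/munmap() churn rather than the splitting itself, one workaround is a splitter that uses only plain buffered I/O. A minimal sketch (the function name, chunk size, and output prefix are illustrative choices, not anything from split):

```python
def split_plain(src, chunk_size=100, prefix="chunk_"):
    """Split src into fixed-size chunks using ordinary reads and writes,
    avoiding any per-file memory mapping. Returns the number of chunks."""
    n = 0
    with open(src, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            with open(f"{prefix}{n:07d}", "wb") as out:
                out.write(data)
            n += 1
    return n
```

For example, a 250-byte input with chunk_size=100 yields three output files of 100, 100, and 50 bytes.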
