
Additional explanation based on NTFS filesystem performance

After writing the lower section of this answer, the OP pointed out that the script is running on an NTFS disk, and suspects that may be part of the problem.

This would not be too surprising: There are performance problems with NTFS specifically related to handling many small files. And we are creating small files on the order of millions - per input file.

So, bad NTFS performance would be an alternative explanation for the performance degradation, while the extreme memory use still seems to be related to mmap().

Bad NTFS performance
Configuring NTFS file system for performance


Explaining the memory problem by heavy use of mmap()

The memory problem that occurs with split in your script seems to be related to split's use of mmap().

strace shows the following calls for each output file:

28892 open("xx02", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
28892 fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
28892 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f821582f000
28892 write(3, "sometext\n", 30) = 30
28892 close(3)                          = 0
28892 munmap(0x7f821582f000, 4096)      = 0

Based on the examples, for a rough estimate of the files to handle, we assume input files of 300 MB and output files of 100 B:

That gives us about 3,000,000 files to write. We write only one at a time - but we use mmap(). That means, for each of the files, at least one memory page is used, which is 4096 B in size.

Taking that into account, we touch about 12 GB of memory (1) for one input file (but not all at once). Three million files and 12 GB sounds like it could cause some work for the kernel.
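The back-of-the-envelope arithmetic above, written out (the 300 MB / 100 B sizes are the assumed example values, not measured ones):

```python
input_size = 300 * 10**6   # assumed input file size: 300 MB
output_size = 100          # assumed output file size: ~100 B
page_size = 4096           # one anonymous page mapped per output file

n_files = input_size // output_size   # 3,000,000 output files
touched = n_files * page_size         # total bytes of page mappings over the run

print(n_files)                 # 3000000
print(touched)                 # 12288000000 bytes, i.e. roughly 12 GB
```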

Basically, it looks like split is just not made for this job, because it uses mmap().
That is a good thing in other situations.
But with this extreme input, it messes up the memory management badly - which then needs some time to clean up. (2)

(1) Or address space only? (2) It does not really use too much memory at the same time, but mmaps a huge number of small files in a short time.
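Since the problem is the per-file mmap()/munmap() churn rather than the splitting itself, one workaround is a splitter that uses only plain buffered I/O. A minimal sketch (the function name, chunk size, and output prefix are illustrative choices, not anything from split):

```python
def split_plain(src, chunk_size=100, prefix="chunk_"):
    """Split src into fixed-size chunks using ordinary reads and writes,
    avoiding any per-file memory mapping. Returns the number of chunks."""
    n = 0
    with open(src, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            with open(f"{prefix}{n:07d}", "wb") as out:
                out.write(data)
            n += 1
    return n
```

For example, a 250-byte input with chunk_size=100 yields three output files of 100, 100, and 50 bytes.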
