1

I'm using C++. Using sort from STL is allowed.

I have an array of int, like this :

1 4 1 5 145 345 14 4

The numbers are stored in a char* (i read them from a binary file, 4 bytes per numbers)

I want to do two things with this array :

  1. swap each number with the one after that

    4 1 5 1 345 145 4 14

  2. sort it by group of 2

    4 1 4 14 5 1 345 145


I could code it step by step, but it wouldn't be efficient. What I'm looking for is speed. O(n log n) would be great.

Also, this array can be bigger than 500MB, so memory usage is an issue.


My first idea was to sort the array starting from the end (to swap the numbers 2 by 2) and treating it as a long* (to force the sorting to take 2 int each time). But I couldn't manage to code it, and I'm not even sure it would work.

I hope I was clear enough, thanks for your help : )

11
  • I don't get your sort by group of 2. Commented Feb 9, 2013 at 9:37
  • 1/ swap each number with the one after that doing that would just move the first number to the end right? Commented Feb 9, 2013 at 9:38
  • Your sort-by-two is easy enough with an int [][2] and a block sorter like qsort(). But I'm still stuck way back on the "those numbers are stored in a char* " announcement. Are you saying they are not in native int form? (and no, your proposal for a long* cast will most-assuredly not work, especially since most implementations have a long and int of the same bit depth). Commented Feb 9, 2013 at 9:42
  • seems you should be able to use the stl sort function with your own compare that only uses the first element of each pair. Commented Feb 9, 2013 at 9:47
  • Also, when you say "this array can be bigger than 500MB" do you mean the data size is 500MB (approx 131072000 int values), or there can be 524288000 or more int values (which on a 32bit int system would comprise over 2gB of data). ? Commented Feb 9, 2013 at 9:49

3 Answers 3

2

This is the most memory efficient layout I could come up with. Obviously the vector I'm using would be replaced by the data blob you're using, assuming endian-ness is all handled well enough. The premise of the code below is simple.

  1. Generate 1024 random values in pairs, each pair consisting of the first number between 1 and 500, the second number between 1 and 50.

  2. Iterate the entire list, flipping all even-index values with their following odd-index brethren.

  3. Send the entire thing to std::qsort with an item width of two (2) int32_t values and a count of half the original vector.

  4. The comparator function simply sorts on the immediate value first, and on the second value if the first is equal.

The sample below does this for 1024 items. I've tested it without output for 134217728 items (exactly 536870912 bytes) and the results were pretty impressive for a measly macbook air laptop, about 15 seconds, only about 10 of that on the actual sort. What is ideally most important is no additional memory allocation is required beyond the data vector. Yes, to the purists, I do use call-stack space, but only because q-sort does.

I hope you get something out of it.

Note: I only show the first part of the output, but I hope it shows what you're looking for.

#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <cstdint>


// a most-wacked-out random generator. every other call will
//  pull from a rand modulo either the first, or second template
//  parameter, in alternation.
template<int N,int M>
struct randN
{
    int i = 0;
    int32_t operator ()()
    {
        i = (i+1)%2;
        return (i ? rand() % N : rand() % M) + 1;
    }
};

// compare to integer values by address.
int pair_cmp(const void* arg1, const void* arg2)
{
    const int32_t *left = (const int32_t*)arg1;
    const int32_t *right = (const int32_t *)arg2;
    return (left[0] == right[0]) ? left[1] - right[1] : left[0] - right[0];
}

int main(int argc, char *argv[])
{
    // a crapload of int values
    static const size_t N = 1024;

    // seed rand()
    srand((unsigned)time(0));

    // get a huge array of random crap from 1..50
    vector<int32_t> data;
    data.reserve(N);
    std::generate_n(back_inserter(data), N, randN<500,50>());

    // flip all the values
    for (size_t i=0;i<data.size();i+=2)
    {
        int32_t tmp = data[i];
        data[i] = data[i+1];
        data[i+1] = tmp;
    }

    // now sort in pairs. using qsort only because it lends itself
    //  *very* nicely to performing block-based sorting.
    std::qsort(&data[0], data.size()/2, sizeof(data[0])*2, pair_cmp);
    cout << "After sorting..." << endl;
    std::copy(data.begin(), data.end(), ostream_iterator<int32_t>(cout,"\n"));
    cout << endl << endl;

    return EXIT_SUCCESS;
}

Output

After sorting...
1
69
1
83
1
198
1
343
1
367
2
12
2
30
2
135
2
169
2
185
2
284
2
323
2
325
2
347
2
367
2
373
2
382
2
422
2
492
3
286
3
321
3
364
3
377
3
400
3
418
3
441
4
24
4
97
4
153
4
210
4
224
4
250
4
354
4
356
4
386
4
430
5
14
5
26
5
95
5
145
5
302
5
379
5
435
5
436
5
499
6
67
6
104
6
135
6
164
6
179
6
310
6
321
6
399
6
409
6
425
6
467
6
496
7
18
7
65
7
71
7
84
7
116
7
201
7
242
7
251
7
256
7
324
7
325
7
485
8
52
8
93
8
156
8
193
8
285
8
307
8
410
8
456
8
471
9
27
9
116
9
137
9
143
9
190
9
190
9
293
9
419
9
453
Sign up to request clarification or add additional context in comments.

8 Comments

Well unfortunately OP has now clarified that they meant char* all along. Not std::vector<int>, not std::vector<char*>, but plain char*.
@KonradRudolph with which comment? I see him loading a char buffer from the file with a raw read, but where does he disavow his original claim the file is in binary 32-little-endian format? Is there somewhere up there he now claims these are all ascii chars? The vector in the code above is only so I have something to demonstrate the algorithm with, I expect the OP to bring an int* to the table.
No, it’s 32 bit LE. But your code still assumes std::vector<int32_t> and OP wants to do it in-place (*evil cackle*).
And how, using the algorithm above, is that not possible? I only used a vector to demonstrate the algorithm. I could always run up and replace it with a big-ol-malloc, I suppose. Do you think it warranted that I do so? I usually prefer to post runnable code that demonstrates the solution, but perhaps not in this case. I think you're right. maybe just a function is all that is needed and let him figure out the setup.
But that’s not valid C++ (UB and all that). ;-) But yeah, I see your point (and you got the obligatory upboat).
|
2

With some additional constraints on both your input and your platform, you can probably use an approach like the one you are thinking of. These constraints would include

  • Your input contains only positive numbers (i.e. can be treated as unsigned)
  • Your platform provides uint8_t and uint64_t in <cstdint>
  • You address a single platform with known endianness.

In that case you can divide your input into groups of 8 bytes, do some byte shuffling to arrange each groups as one uint64_t with the "first" number from the input in the lower-valued half and run std::sort on the resulting array. Depending on endianness you may need to do more byte shuffling to rearrange each sorted 8-byte group as a pair of uint32_t in the expected order.

If you can't code this on your own, I'd strongly advise you not to take this approach.

A better and more portable approach (you have some inherent non-portability by starting from a not clearly specified binary file format), would be:

std::vector<int> swap_and_sort_int_pairs(const unsigned char buffer[], size_t buflen) {
   const size_t intsz = sizeof(int);
   // We have to assume that the binary format in buffer is compatible with our int representation
   // we also require an even number of integers
   assert(buflen % (2*intsz) == 0);

   // load pairwise
   std::vector< std::pair<int,int> > pairs;
   pairs.reserve(buflen/(2*intsz));
   for (const unsigned char* bufp=buffer; bufp<buffer+buflen; bufp+= 2*intsz) {
      // It would be better to have a more portable binary -> int conversion
      int first_value = *reinterpret_cast<int*>(bufp);
      int second_value = *reinterpret_cast<int*>(bufp + intsz);
      // swap each pair here
      pairs.emplace_back( second_value, firstvalue );
   }
   // less<pair<..>> does lexicographical ordering, which is what you are looking ofr
   std::sort(pairs.begin(), pairs.end());

   // convert back to linear vector 
   std::vector<int> result;
   result.reserve(2*pairs.size());
   for (auto& entry : pairs) {
      result.push_back(entry.first);
      result.push_back(entry.second);
   }
   return result;
}

Both the inital parse/swap pass (which you need anyway) and the final conversion are O(N), so the total complexity is still (O(N log(N)).

If you can continue to work with pairs, you can save the final conversion. The other way to save that conversion would be to use a hand-coded sort with two-int strides and two-int swap: much more work - and possibly still hard to get as efficient as a well-tuned library sort.

3 Comments

Depending on the way you read your buffer, you can combine the initial parse/swap pass directly with your I/O.
This is how I’d do it (with the caveat you mentioned about UB in reinterpret_cast).
Thanks for the code, I chose to accept WhozCraig's answer because it is much more memory efficient (even though your solution is slightly faster).
0

Do one thing at a time. First, give your data some *struct*ure. It seems that each 8 byte form a unit of the form

struct unit {
    int key;
    int value;
}

If the endianness is right, you can do this in O(1) with a reinterpret_cast. If it isn't, you'll have to live with a O(n) conversion effort. Both vanish compared to the O(n log n) search effort.

When you have an array of these units, you can use std::sort like:

bool compare_units(const unit& a, const unit& b) {
    return a.key < b.key;
}

std::sort(array, length, compare_units);

The key to this solution is that you do the "swapping" and byte-interpretation first and then do the sorting.

7 Comments

This would require me to go through all the data to store it into a struct. Speed is my first priority. It would also double the memory usage (at least before freeing my original array).
using std::pair<int, int> instead of your own struct removes the necessity of defining a comparator. @user1278743 store the data directly into the array of pairs.
@user1278743: You can either use reinterpret_cast<> with a careful struct layout (be sure to unit-test that) or you can do the right thing and struct the data while reading the file instead of passing around a totally unstructured char pointer. The first solution is in O(1), the second in O(n), and both will be much faster than an O(n log n) sort. So, forget about the speed impact.
@thiton That’s an overrated benefit in this case: the fields have no intrinsic meaning, so don’t invent one. With C++11 you should really get used to using tuples more often. Not everywhere, mind you – but more often.
fwiw, std::pairs built-in operator <() does exactly what the OP needs (sort by first unless they're equal, then sort by second).
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.