12

I'm currently sorting by the std::string < operator. The problem with it is that:

30 < 9. The 30 shows up before the 9 since 3 < 9, Windows 9x had this issue to. How could I go about sorting them numerically so that "30 Foxes" shows up after "9 dogs". I should also add that I'm using utf 8 encoding.

Thanks

2
  • possible duplicate of How to implement a natural sort algorithm in c++? Commented Jan 7, 2011 at 6:21
  • This depends on the use-case, of course, but for filenames I actually preferred the pre-XP way. I hate when the software tries to be overly smart, because then it starts to mess things up... try sorting a list of hexadecimal filenames in >=XP. Commented Apr 28, 2013 at 14:13

4 Answers 4

13

You can create a custom comparison function to use with std::sort. This function would have to check if the string begins with a numeric value. If it does, convert the numeric part of each string to an int using some mechanism like a stringstream. Then compare the two integer values. If the values compare equally, compare the non-numeric part of the strings lexicographically. Otherwise, if the strings don't contain a numeric part, simply compare the two strings lexicographically as normal.

Basically, something like the following (untested) comparison function:

bool is_not_digit(char c)
{
    return !std::isdigit(c);
}

bool numeric_string_compare(const std::string& s1, const std::string& s2)
{
    // handle empty strings...

    std::string::const_iterator it1 = s1.begin(), it2 = s2.begin();

    if (std::isdigit(s1[0]) && std::isdigit(s2[0])) {
        int n1, n2;
        std::stringstream ss(s1);
        ss >> n1;
        ss.clear();
        ss.str(s2);
        ss >> n2;

        if (n1 != n2) return n1 < n2;

        it1 = std::find_if(s1.begin(), s1.end(), is_not_digit);
        it2 = std::find_if(s2.begin(), s2.end(), is_not_digit);
    }

    return std::lexicographical_compare(it1, s1.end(), it2, s2.end());
}

And then...

std::sort(string_array.begin(), string_array.end(), numeric_string_compare);

EDIT: Of course, this algorithm is only useful if you're sorting strings where the numeric portion appears at the beginning of the string. If you're dealing with strings where the numeric portion can appear anywhere in the string, then you need a more sophisticated algorithm. See http://www.davekoelle.com/alphanum.html for more information.

Sign up to request clarification or add additional context in comments.

5 Comments

+1 because your code deals with the case when one of them contains a number at its start but the other doesn't.
atoi is ideal for this purpose, you won't even need special code for strings that don't start with numbers (although they will sort to the beginning). Or strtod if you want to control that.
@Ben, yeah, atoi would be much more efficient than std::stringstream.
FWIW, I wrote a similar function that uses strtoll for the stackoverflow question here, it's less correct in another way though - assuming direct character comparison is the desired order when not comparing digits.
Converting to int (e.g., via atoi) assumes the string representation of the number can be successfully converted to an int. With long strings of digits, the conversion would overflow...
3

If you are targeting Windows (XP+) and can afford to convert your strings to utf-16, you can use the StrCmpLogicalW function from Shlwapi. See msdn for details.

Otherwise, ICU provides this functionality in its collators. See UCOL_NUMERIC_COLLATION.

Comments

3

Here's a version that doesn't convert to integer and thus works for long strings of digits regardless of sizeof(int).

#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

int numcmp(const char *a, const char *aend, const char *b, const char *bend)
{
  for (;;) {
    if (a == aend) {
      if (b == bend)
        return 0;
      return -1;
    }
    if (b == bend)
      return 1;
    if (*a == *b) {
      ++a, ++b;
      continue;
    }
    if (!isdigit((unsigned char) *a) || !isdigit((unsigned char) *b))
      return *a - *b;

    // skip leading zeros in both strings
    while (*a == '0' && ++a != aend)
      ;
    while (*b == '0' && ++b != aend)
      ;

    // skip to end of the consecutive digits
    const char *aa = a;
    while (a != aend && isdigit((unsigned char) *a))
      ++a;
    std::ptrdiff_t alen = a - aa;

    const char *bb = b;
    while (b != bend && isdigit((unsigned char) *b))
      ++b;
    std::ptrdiff_t blen = b - bb;

    if (alen != blen)
      return alen - blen;

    // same number of consecutive digits in both strings
    while (aa != a) {
      if (*aa != *bb)
        return *aa - *bb;
      ++aa, ++bb;
    }
  }
}

int numcmp(const std::string& a, const std::string& b)
{
  return numcmp(a.data(), a.data() + a.size(),
                b.data(), b.data() + b.size());
}

int numcmp(const char *a, const char *b)
{
  return numcmp(a, a + strlen(a), b, b + strlen(b));
}

2 Comments

numcmp("0001", "001") == 0, which means when you use numcmp as a key sorting function in a set, only one of these strings will be stored in the set. This is probably not intended.
Yes, indeed. For all solutions to this problem, we'd need to decide what we want to happen if the strings are different but numcmp says they're equal
2

This what worked for me (assuming no leading zeroes), i.e. the idea is that phonetic compare can be applied just to numbers with same number of digits.

    auto numeric_str_cmp = [](const std::string& a, const std::string& b) -> bool {
                                return (a.size() < b.size()) || (a.size() == b.size() && a < b);
                             };
    std::sort(numeric_strings.begin(), numeric_strings.end(), numeric_str_cmp);

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.