Sorting array of string by numbers

Question

I want to sort an array like the following:

["10a","10b","9a","9b","8a","8b"]

When I call,

a = a.sort {|a,b| a <=> b}

it will sort like the following:

["10a","10b","8a","8b","9a","9b"]

The 10 is a string and is not handled as a number. When I first sort by integer and then by string, it will just do the same. Does anybody know how I can handle the 10 as a 10 without making it into an integer? That would mess up the letters a, b etc.

It's interesting that this question has attracted a couple of downvotes and a vote to close, yet there are eight answers to date. Perhaps the wording is not the best, but readers seem to have figured out what was wanted and a diverse collection of interesting approaches have been proposed. — Cary Swoveland
– Cary Swoveland, Commented Jan 19, 2015 at 8:33

Jörg W Mittag · Accepted Answer · 2015-01-19 16:00:39Z

7

When I first sort by integer and then by string, it will just do the same.

That would have been my first instinct, and it seems to work perfectly:

%w[10a 10b 9a 9b 8a 8b].sort_by {|el| [el.to_i, el] }
# => ['8a', '8b', '9a', '9b', '10a', '10b']

edited Jan 19, 2015 at 16:00

answered Jan 15, 2015 at 17:06

Jörg W Mittag

371k79 gold badges457 silver badges666 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

the Tin Man Over a year ago

I was about to add this to my answer and saw yours. Excellent; This should run like blue-blazes.

Jörg W Mittag Over a year ago

@CarySwoveland: Confusion with Kernel#Integer :-D

the Tin Man · Accepted Answer · 2015-01-20 20:58:53Z

2

I'd do something like this:

ary = ["10a","10b","9a","9b","8a","8b"]

sorted_ary = ary.sort_by{ |e|
  /(?<digit>\d+)(?<alpha>\D+)/ =~ e 
  [digit.to_i, alpha]
}

ary        # => ["10a", "10b", "9a", "9b", "8a", "8b"]
sorted_ary # => ["8a", "8b", "9a", "9b", "10a", "10b"]

sorted_by is going to be faster than sort for this sort of problem. Because the value being sorted isn't a direct comparison and we need to dig into it to get the values to use for collation, a normal sort would have to do it multiple times for each element. Instead, using sort_by caches the computed value, and then sorts based on it.

/(?<digit>\d+)(?<alpha>\D+)/ =~ e isn't what you'll normally see for a regular expression. The named-captures ?<digit> and ?<alpha> define the names of local variables that can be accessed immediately, when used in that form.

[digit.to_i, alpha] returns an array consisting of the leading numeric convert to an integer, followed by the character. That array is then used for comparison by sort_by.

Benchmarking sort vs. sort_by using Fruity: I added some length to the array being sorted to push the routines a bit harder for better time resolution.

require 'fruity'

ARY = (%w[10a 10b 9a 9b 8a 8b] * 1000).shuffle

compare do
  cary_to_i_sort_by { ARY.sort_by { |s| s.to_i(36) } }
  cary_to_i_sort    { ARY.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)} }
end

compare do
  jorge_sort_by { ARY.sort_by {|el| [el.to_i, el] } }
  jorg_sort     { ARY.map {|el| [el.to_i, el] }.sort.map(&:last) }
end
# >> Running each test 2 times. Test will take about 1 second.
# >> cary_to_i_sort_by is faster than cary_to_i_sort by 19.999999999999996% ± 1.0%
# >> Running each test once. Test will take about 1 second.
# >> jorge_sort_by is faster than jorg_sort by 10.000000000000009% ± 1.0%

Ruby's sort_by uses a Schwartzian Transform, which can make a major difference in sort speed when dealing with objects where we have to compute the value to be sorted.

Could you run your benchmark for 100_000 instead of 1_000 in the definition of ARY?

require 'fruity'

ARY = (%w[10a 10b 9a 9b 8a 8b] * 100_000).shuffle

compare do
  cary_to_i_sort_by { ARY.sort_by { |s| s.to_i(36) } }
  cary_to_i_sort    { ARY.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)} }
end

compare do
  jorge_sort_by { ARY.sort_by {|el| [el.to_i, el] } }
  jorg_sort     { ARY.map {|el| [el.to_i, el] }.sort.map(&:last) }
end
# >> Running each test once. Test will take about 10 seconds.
# >> cary_to_i_sort_by is faster than cary_to_i_sort by 2x ± 1.0
# >> Running each test once. Test will take about 26 seconds.
# >> jorg_sort is similar to jorge_sort_by

The Wikepedia article has a good efficiency analysis and example that explains why sort_by is preferred for costly comparisons.

Ruby's sort_by documentation also covers this well.

I don't think the size of the array will make much difference. If anything, as the array size grows, if the calculation for the intermediate value is costly, sort_by will still be faster because of its caching. Remember, sort_by is all compiled code, whereas using a Ruby-script-based transform is subject to slower execution as the array is transformed, handed off to sort and then the original object is plucked from the sub-arrays. A larger array means it just has to be done more times.

edited Jan 20, 2015 at 20:58

answered Jan 15, 2015 at 10:20

the Tin Man

161k44 gold badges222 silver badges308 bronze badges

3 Comments

engineersmnky Over a year ago

Very nice solution and I like the use of named groups. Mine was very similar ary.sort_by { |s| s.scan(/(\d+)(\D+)/).flatten.tap{|a| a[0] = a[0].to_i}} but lacks the readability that yours has.

the Tin Man Over a year ago

Thanks. The named-captures (which are similar to named-groups) can make a confusing pattern a lot more readable, especially when using the x flag. And, they can make it a lot easier to extract the values from a complex match, along with making it more readable too. I'm doing a complex nested match against some timestamped filenames that have FQDNs and data-types concatenated and the named-captures let me extract any part with a single match.

Cary Swoveland Over a year ago

Sn, I would expect the sort vs sort_by comparison to depend on the size of the array, with sort generally being favoured for very large arrays. Could you run your benchmark for 100_000 instead of 1_000 in the definition of ARY? This is of course not a knock on sort_by, which is going to be faster for most real-world applications (no pre-sort and post-sort overhead) and is also simpler and reads better. Thanks for the Schwartzian Transform link.

Aleksei Matiushkin · Accepted Answer · 2015-01-15 09:36:29Z

1

▶ a = ["10a","10b","9a","9b","8a","8b"]
▶ a.sort { |a,b| a.to_i == b.to_i ? a <=> b : a.to_i <=> b.to_i }
#=> [
#  [0] "8a",
#  [1] "8b",
#  [2] "9a",
#  [3] "9b",
#  [4] "10a",
#  [5] "10b"
#]

Hope it helps.

answered Jan 15, 2015 at 9:36

Aleksei Matiushkin

121k12 gold badges109 silver badges173 bronze badges

5 Comments

bluexuemei Over a year ago

This answer is perfect.

Splinti Over a year ago

Okay I just had to invert it. Thanks for the answer, was perfect :)

the Tin Man Over a year ago

See @jörgwmittag's answer. His use of sort_by is going to be extremely fast, and more succinct. sort doesn't perform nearly as well as sort_by when the values being sorted have to be manipulated.

Aleksei Matiushkin Over a year ago

I agree with @theTinMan, the Jörg’s answer is better and it should be accepted.

Splinti Over a year ago

It is a good answer, I agree. But that was not what I needed. I also had to invert the array so the 10a comes first, so I had to change the answer @mudasobwa made. And that wouldn't work for my array.

Cary Swoveland · Accepted Answer · 2015-08-21 19:32:13Z

1

Two ways that don't use String#to_i (but rely on the assumption that each string consists of one or more digits followed by one lower case letter).

ary = ["10a","10b","9a","9b","8a","8b","100z", "96b"]

#1

mx = ary.map(&:size).max
ary.sort_by { |s| s.rjust(mx) }
  #=> ["8a", "8b", "9a", "9b", "10a", "10b", "96b", "100z"]

#2

ary.sort_by { |s| s.to_i(36) }
  #=> ["8a", "8b", "9a", "9b", "10a", "10b", "96b", "100z"]

Hmmm, I wonder if:

ary.map { |s| s.rjust(mx) }.sort.map(&:lstrip)

or

ary.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)}

would be faster.

edited Aug 21, 2015 at 19:32

answered Jan 19, 2015 at 6:49

Cary Swoveland

111k6 gold badges69 silver badges105 bronze badges

Comments

sawa · Accepted Answer · 2015-01-15 09:41:33Z

0

["10a","10b","9a","9b","8a","8b"]
.sort_by{|s| s.split(/(\D+)/).map.with_index{|s, i| i.odd? ? s : s.to_i}}
#=> ["8a", "8b", "9a", "9b", "10a", "10b"]

["10a3","10a4", "9", "9aa","9b","8a","8b"]
.sort_by{|s| s.split(/(\D+)/).map.with_index{|s, i| i.odd? ? s : s.to_i}}
#=> ["8a", "8b", "9", "9aa", "9b", "10a3", "10a4"]

answered Jan 15, 2015 at 9:41

sawa

169k51 gold badges287 silver badges398 bronze badges

1 Comment

Cary Swoveland Over a year ago

You may be interested in my benchmark of the proposed solutions, which I just posted.

Cary Swoveland · Accepted Answer · 2015-08-21 17:46:49Z

0

Gentlemen, start your engines!

I decided to benchmark the various solutions that have been offered. One of the things I was curious about was the effect of converting sort_by solutions to sort solutions. For example, I compared my method

def cary_to_i(a)
  a.sort_by { |s| s.to_i(36) }
end

to

def cary_to_i_sort(a)
  a.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)}
end

This always involves mapping the original array to the transformed values within the sort_by block, sorting that array, then mapping the results back to the elements in the original array (when that can be done).

I tried this sort_by-to-sort conversion with some of the methods that use sort_by. Not surprisingly, the conversion to sort was generally faster, though the amount of improvement varied quite a bit.

Methods compared

module Methods
  def mudasobwa(a)
    a.sort { |a,b| a.to_i == b.to_i ? a <=> b : a.to_i <=> b.to_i }
  end

  def jorg(a)
    a.sort_by {|el| [el.to_i, el] }
  end

  def jorg_sort(a)
    a.map {|el| [el.to_i, el] }.sort.map(&:last)
  end

  def the(a)
    a.sort_by {|e| /(?<digit>\d+)(?<alpha>\D+)/ =~ e
      [digit.to_i, alpha] }
  end

  def the_sort(a)
    a.map {|e| /(?<digit>\d+)(?<alpha>\D+)/ =~ e
     [digit.to_i, alpha]}.sort.map {|d,a| d.to_s+a }
  end

  def engineer(a) a.sort_by { |s|
    s.scan(/(\d+)(\D+)/).flatten.tap{ |a| a[0] = a[0].to_i } }
  end

  def sawa(a) a.sort_by { |s|
    s.split(/(\D+)/).map.with_index { |s, i| i.odd? ? s : s.to_i } }
  end

  def cary_rjust(a)
    mx = a.map(&:size).max
    a.sort_by {|s| s.rjust(mx)}
  end

  def cary_rjust_sort(a)
    mx = a.map(&:size).max
    a.map { |s| s.rjust(mx) }.sort.map(&:lstrip)
  end

  def cary_to_i(a)
    a.sort_by { |s| s.to_i(36) }
  end

  def cary_to_i_sort(a)
    a.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)}
  end
end

include Methods
methods = Methods.instance_methods(false)
  #=> [:mudasobwa, :jorg, :jorg_sort, :the, :the_sort,
  #    :cary_rjust, :cary_rjust_sort, :cary_to_i, :cary_to_i_sort]

Test data and helper

def test_data(n)
  a = 10_000.times.to_a.map(&:to_s)
  b = [*'a'..'z']
  n.times.map { a.sample + b.sample }
end

def compute(m,a)
  send(m,a)
end

Confirm methods return the same values

a = test_data(1000)
puts "All methods correct: #{methods.map { |m| compute(m,a) }.uniq.size == 1}"

Benchmark code

require 'benchmark'

indent = methods.map { |m| m.to_s.size }.max

n = 500_000
a = test_data(n)
puts "\nSort random array of size #{n}"
Benchmark.bm(indent) do |bm|
  methods.each do |m|
    bm.report m.to_s do
      compute(m,a)
    end
  end    
end

Test

Sort random array of size 500000
                      user     system      total        real
mudasobwa         4.760000   0.000000   4.760000 (  4.765170)
jorg              2.870000   0.020000   2.890000 (  2.892359)
jorg_sort         2.980000   0.020000   3.000000 (  3.010344)
the               9.040000   0.100000   9.140000 (  9.160944)
the_sort          4.570000   0.090000   4.660000 (  4.668146)
engineer         10.110000   0.070000  10.180000 ( 10.198117)
sawa             27.310000   0.160000  27.470000 ( 27.504958)
cary_rjust        1.080000   0.010000   1.090000 (  1.087788)
cary_rjust_sort   0.740000   0.000000   0.740000 (  0.746132)
cary_to_i         0.570000   0.000000   0.570000 (  0.576570)
cary_to_i_sort    0.460000   0.020000   0.480000 (  0.477372)

Addendum

@theTinMan demonstrated that the comparisons between the sort_by and sort methods is sensitive to the choice of test data. Using the data he used:

def test_data(n)
  (%w[10a 10b 9a 9b 8a 8b] * (n/6)).shuffle
end

I got these results:

Sort random array of size 500000
                      user     system      total        real
mudasobwa         0.620000   0.000000   0.620000 (  0.622566)
jorg              0.620000   0.010000   0.630000 (  0.636018)
jorg_sort         0.640000   0.010000   0.650000 (  0.638493)
the               8.790000   0.090000   8.880000 (  8.886725)
the_sort          2.670000   0.070000   2.740000 (  2.743085)
engineer          3.150000   0.040000   3.190000 (  3.184534)
sawa              3.460000   0.040000   3.500000 (  3.506875)
cary_rjust        0.360000   0.010000   0.370000 (  0.367094)
cary_rjust_sort   0.480000   0.010000   0.490000 (  0.499956)
cary_to_i         0.190000   0.010000   0.200000 (  0.187136)
cary_to_i_sort    0.200000   0.000000   0.200000 (  0.203509)

Notice that the absolute times are also affected.

Can anyone explain the reason for the difference in the benchmarks?

edited Aug 21, 2015 at 17:46

answered Jan 20, 2015 at 4:59

Cary Swoveland

111k6 gold badges69 silver badges105 bronze badges

7 Comments

engineersmnky Over a year ago

Very nice work thank you for the benchmarks and hoorah for your success. Not often when every solution is multiples faster than the others especially when the people that posted here include names like JorgWMittag, theTinMan, and sawa

the Tin Man Over a year ago

I ran your benchmarks using Fruity, and I'm getting opposite results for the sort_by vs the equivalent sort tests. sort_by uses a Schwartzian Transform, which is a well recognized way to reduce the cost/time of sorting complex objects. I'll add the simple tests I ran to my answer.

Cary Swoveland Over a year ago

Readers, for the following to make sense, @JoeFrambach changed the first line of my answer to "Start your engines!." He did the same in several other answers I've given that begin with the same line. I've rolled them all back. Had any women offered answers in any of those questions, I would have said "Ladies and gentlemen, start your engines!", but, after checking, I found all answers were given by members I know to be men, so that made no sense....

Cary Swoveland Over a year ago

...."Gentlemen," or "Ladies and gentlemen," obviously harkens back to the early days of auto racing. To change that to "Start your engines!" misses the point entirely. Joe: get your grubby hands off my answers!

Badger Cat Over a year ago

You said that if women had offered answers you would have said Ladies and Gentlemen. I am a female reader. Does that count or should I provide an answer to be taken into account?

|

Collectives™ on Stack Overflow

Sorting array of string by numbers

6 Answers 6

2 Comments

3 Comments

5 Comments

Comments

1 Comment

7 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

2 Comments

3 Comments

5 Comments

Comments

1 Comment

7 Comments

Your Answer

Sign up or log in

Post as a guest

Related