2

I want to sort an array like the following:

["10a","10b","9a","9b","8a","8b"]

When I call,

a = a.sort {|a,b| a <=> b}

it will sort like the following:

["10a","10b","8a","8b","9a","9b"]

The 10 is a string and is not handled as a number. When I first sort by integer and then by string, it will just do the same. Does anybody know how I can handle the 10 as a 10 without making it into an integer? That would mess up the letters a, b etc.

2
  • 1
    Show us how your expected output looks. Commented Jan 15, 2015 at 10:21
  • It's interesting that this question has attracted a couple of downvotes and a vote to close, yet there are eight answers to date. Perhaps the wording is not the best, but readers seem to have figured out what was wanted and a diverse collection of interesting approaches have been proposed. Commented Jan 19, 2015 at 8:33

6 Answers 6

7

When I first sort by integer and then by string, it will just do the same.

That would have been my first instinct, and it seems to work perfectly:

%w[10a 10b 9a 9b 8a 8b].sort_by {|el| [el.to_i, el] }
# => ['8a', '8b', '9a', '9b', '10a', '10b']
Sign up to request clarification or add additional context in comments.

2 Comments

I was about to add this to my answer and saw yours. Excellent; This should run like blue-blazes.
@CarySwoveland: Confusion with Kernel#Integer :-D
2

I'd do something like this:

ary = ["10a","10b","9a","9b","8a","8b"]

sorted_ary = ary.sort_by{ |e|
  /(?<digit>\d+)(?<alpha>\D+)/ =~ e 
  [digit.to_i, alpha]
}

ary        # => ["10a", "10b", "9a", "9b", "8a", "8b"]
sorted_ary # => ["8a", "8b", "9a", "9b", "10a", "10b"]

sorted_by is going to be faster than sort for this sort of problem. Because the value being sorted isn't a direct comparison and we need to dig into it to get the values to use for collation, a normal sort would have to do it multiple times for each element. Instead, using sort_by caches the computed value, and then sorts based on it.

/(?<digit>\d+)(?<alpha>\D+)/ =~ e isn't what you'll normally see for a regular expression. The named-captures ?<digit> and ?<alpha> define the names of local variables that can be accessed immediately, when used in that form.

[digit.to_i, alpha] returns an array consisting of the leading numeric convert to an integer, followed by the character. That array is then used for comparison by sort_by.


Benchmarking sort vs. sort_by using Fruity: I added some length to the array being sorted to push the routines a bit harder for better time resolution.

require 'fruity'

ARY = (%w[10a 10b 9a 9b 8a 8b] * 1000).shuffle

compare do
  cary_to_i_sort_by { ARY.sort_by { |s| s.to_i(36) } }
  cary_to_i_sort    { ARY.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)} }
end

compare do
  jorge_sort_by { ARY.sort_by {|el| [el.to_i, el] } }
  jorg_sort     { ARY.map {|el| [el.to_i, el] }.sort.map(&:last) }
end
# >> Running each test 2 times. Test will take about 1 second.
# >> cary_to_i_sort_by is faster than cary_to_i_sort by 19.999999999999996% ± 1.0%
# >> Running each test once. Test will take about 1 second.
# >> jorge_sort_by is faster than jorg_sort by 10.000000000000009% ± 1.0%

Ruby's sort_by uses a Schwartzian Transform, which can make a major difference in sort speed when dealing with objects where we have to compute the value to be sorted.


Could you run your benchmark for 100_000 instead of 1_000 in the definition of ARY?

require 'fruity'

ARY = (%w[10a 10b 9a 9b 8a 8b] * 100_000).shuffle

compare do
  cary_to_i_sort_by { ARY.sort_by { |s| s.to_i(36) } }
  cary_to_i_sort    { ARY.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)} }
end

compare do
  jorge_sort_by { ARY.sort_by {|el| [el.to_i, el] } }
  jorg_sort     { ARY.map {|el| [el.to_i, el] }.sort.map(&:last) }
end
# >> Running each test once. Test will take about 10 seconds.
# >> cary_to_i_sort_by is faster than cary_to_i_sort by 2x ± 1.0
# >> Running each test once. Test will take about 26 seconds.
# >> jorg_sort is similar to jorge_sort_by

The Wikepedia article has a good efficiency analysis and example that explains why sort_by is preferred for costly comparisons.

Ruby's sort_by documentation also covers this well.

I don't think the size of the array will make much difference. If anything, as the array size grows, if the calculation for the intermediate value is costly, sort_by will still be faster because of its caching. Remember, sort_by is all compiled code, whereas using a Ruby-script-based transform is subject to slower execution as the array is transformed, handed off to sort and then the original object is plucked from the sub-arrays. A larger array means it just has to be done more times.

3 Comments

Very nice solution and I like the use of named groups. Mine was very similar ary.sort_by { |s| s.scan(/(\d+)(\D+)/).flatten.tap{|a| a[0] = a[0].to_i}} but lacks the readability that yours has.
Thanks. The named-captures (which are similar to named-groups) can make a confusing pattern a lot more readable, especially when using the x flag. And, they can make it a lot easier to extract the values from a complex match, along with making it more readable too. I'm doing a complex nested match against some timestamped filenames that have FQDNs and data-types concatenated and the named-captures let me extract any part with a single match.
Sn, I would expect the sort vs sort_by comparison to depend on the size of the array, with sort generally being favoured for very large arrays. Could you run your benchmark for 100_000 instead of 1_000 in the definition of ARY? This is of course not a knock on sort_by, which is going to be faster for most real-world applications (no pre-sort and post-sort overhead) and is also simpler and reads better. Thanks for the Schwartzian Transform link.
1
▶ a = ["10a","10b","9a","9b","8a","8b"]
▶ a.sort { |a,b| a.to_i == b.to_i ? a <=> b : a.to_i <=> b.to_i }
#=> [
#  [0] "8a",
#  [1] "8b",
#  [2] "9a",
#  [3] "9b",
#  [4] "10a",
#  [5] "10b"
#]

Hope it helps.

5 Comments

This answer is perfect.
Okay I just had to invert it. Thanks for the answer, was perfect :)
See @jörgwmittag's answer. His use of sort_by is going to be extremely fast, and more succinct. sort doesn't perform nearly as well as sort_by when the values being sorted have to be manipulated.
I agree with @theTinMan, the Jörg’s answer is better and it should be accepted.
It is a good answer, I agree. But that was not what I needed. I also had to invert the array so the 10a comes first, so I had to change the answer @mudasobwa made. And that wouldn't work for my array.
1

Two ways that don't use String#to_i (but rely on the assumption that each string consists of one or more digits followed by one lower case letter).

ary = ["10a","10b","9a","9b","8a","8b","100z", "96b"]

#1

mx = ary.map(&:size).max
ary.sort_by { |s| s.rjust(mx) }
  #=> ["8a", "8b", "9a", "9b", "10a", "10b", "96b", "100z"] 

#2

ary.sort_by { |s| s.to_i(36) }
  #=> ["8a", "8b", "9a", "9b", "10a", "10b", "96b", "100z"] 

Hmmm, I wonder if:

ary.map { |s| s.rjust(mx) }.sort.map(&:lstrip)

or

ary.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)}

would be faster.

Comments

0
["10a","10b","9a","9b","8a","8b"]
.sort_by{|s| s.split(/(\D+)/).map.with_index{|s, i| i.odd? ? s : s.to_i}}
#=> ["8a", "8b", "9a", "9b", "10a", "10b"]

["10a3","10a4", "9", "9aa","9b","8a","8b"]
.sort_by{|s| s.split(/(\D+)/).map.with_index{|s, i| i.odd? ? s : s.to_i}}
#=> ["8a", "8b", "9", "9aa", "9b", "10a3", "10a4"]

1 Comment

You may be interested in my benchmark of the proposed solutions, which I just posted.
0

Gentlemen, start your engines!

I decided to benchmark the various solutions that have been offered. One of the things I was curious about was the effect of converting sort_by solutions to sort solutions. For example, I compared my method

def cary_to_i(a)
  a.sort_by { |s| s.to_i(36) }
end 

to

def cary_to_i_sort(a)
  a.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)}
end

This always involves mapping the original array to the transformed values within the sort_by block, sorting that array, then mapping the results back to the elements in the original array (when that can be done).

I tried this sort_by-to-sort conversion with some of the methods that use sort_by. Not surprisingly, the conversion to sort was generally faster, though the amount of improvement varied quite a bit.

Methods compared

module Methods
  def mudasobwa(a)
    a.sort { |a,b| a.to_i == b.to_i ? a <=> b : a.to_i <=> b.to_i }
  end

  def jorg(a)
    a.sort_by {|el| [el.to_i, el] }
  end

  def jorg_sort(a)
    a.map {|el| [el.to_i, el] }.sort.map(&:last)
  end

  def the(a)
    a.sort_by {|e| /(?<digit>\d+)(?<alpha>\D+)/ =~ e
      [digit.to_i, alpha] }
  end

  def the_sort(a)
    a.map {|e| /(?<digit>\d+)(?<alpha>\D+)/ =~ e
     [digit.to_i, alpha]}.sort.map {|d,a| d.to_s+a }
  end

  def engineer(a) a.sort_by { |s|
    s.scan(/(\d+)(\D+)/).flatten.tap{ |a| a[0] = a[0].to_i } }
  end

  def sawa(a) a.sort_by { |s|
    s.split(/(\D+)/).map.with_index { |s, i| i.odd? ? s : s.to_i } }
  end  

  def cary_rjust(a)
    mx = a.map(&:size).max
    a.sort_by {|s| s.rjust(mx)}
  end

  def cary_rjust_sort(a)
    mx = a.map(&:size).max
    a.map { |s| s.rjust(mx) }.sort.map(&:lstrip)
  end

  def cary_to_i(a)
    a.sort_by { |s| s.to_i(36) }
  end

  def cary_to_i_sort(a)
    a.map { |s| s.to_i(36) }.sort.map {|i| i.to_s(36)}
  end
end

include Methods
methods = Methods.instance_methods(false)
  #=> [:mudasobwa, :jorg, :jorg_sort, :the, :the_sort,
  #    :cary_rjust, :cary_rjust_sort, :cary_to_i, :cary_to_i_sort]

Test data and helper

def test_data(n)
  a = 10_000.times.to_a.map(&:to_s)
  b = [*'a'..'z']
  n.times.map { a.sample + b.sample }
end

def compute(m,a)
  send(m,a)
end

Confirm methods return the same values

a = test_data(1000)
puts "All methods correct: #{methods.map { |m| compute(m,a) }.uniq.size == 1}"

Benchmark code

require 'benchmark'

indent = methods.map { |m| m.to_s.size }.max

n = 500_000
a = test_data(n)
puts "\nSort random array of size #{n}"
Benchmark.bm(indent) do |bm|
  methods.each do |m|
    bm.report m.to_s do
      compute(m,a)
    end
  end    
end

Test

Sort random array of size 500000
                      user     system      total        real
mudasobwa         4.760000   0.000000   4.760000 (  4.765170)
jorg              2.870000   0.020000   2.890000 (  2.892359)
jorg_sort         2.980000   0.020000   3.000000 (  3.010344)
the               9.040000   0.100000   9.140000 (  9.160944)
the_sort          4.570000   0.090000   4.660000 (  4.668146)
engineer         10.110000   0.070000  10.180000 ( 10.198117)
sawa             27.310000   0.160000  27.470000 ( 27.504958)
cary_rjust        1.080000   0.010000   1.090000 (  1.087788)
cary_rjust_sort   0.740000   0.000000   0.740000 (  0.746132)
cary_to_i         0.570000   0.000000   0.570000 (  0.576570)
cary_to_i_sort    0.460000   0.020000   0.480000 (  0.477372)

Addendum

@theTinMan demonstrated that the comparisons between the sort_by and sort methods is sensitive to the choice of test data. Using the data he used:

def test_data(n)
  (%w[10a 10b 9a 9b 8a 8b] * (n/6)).shuffle
end

I got these results:

Sort random array of size 500000
                      user     system      total        real
mudasobwa         0.620000   0.000000   0.620000 (  0.622566)
jorg              0.620000   0.010000   0.630000 (  0.636018)
jorg_sort         0.640000   0.010000   0.650000 (  0.638493)
the               8.790000   0.090000   8.880000 (  8.886725)
the_sort          2.670000   0.070000   2.740000 (  2.743085)
engineer          3.150000   0.040000   3.190000 (  3.184534)
sawa              3.460000   0.040000   3.500000 (  3.506875)
cary_rjust        0.360000   0.010000   0.370000 (  0.367094)
cary_rjust_sort   0.480000   0.010000   0.490000 (  0.499956)
cary_to_i         0.190000   0.010000   0.200000 (  0.187136)
cary_to_i_sort    0.200000   0.000000   0.200000 (  0.203509)

Notice that the absolute times are also affected.

Can anyone explain the reason for the difference in the benchmarks?

7 Comments

Very nice work thank you for the benchmarks and hoorah for your success. Not often when every solution is multiples faster than the others especially when the people that posted here include names like JorgWMittag, theTinMan, and sawa
I ran your benchmarks using Fruity, and I'm getting opposite results for the sort_by vs the equivalent sort tests. sort_by uses a Schwartzian Transform, which is a well recognized way to reduce the cost/time of sorting complex objects. I'll add the simple tests I ran to my answer.
Readers, for the following to make sense, @JoeFrambach changed the first line of my answer to "Start your engines!." He did the same in several other answers I've given that begin with the same line. I've rolled them all back. Had any women offered answers in any of those questions, I would have said "Ladies and gentlemen, start your engines!", but, after checking, I found all answers were given by members I know to be men, so that made no sense....
...."Gentlemen," or "Ladies and gentlemen," obviously harkens back to the early days of auto racing. To change that to "Start your engines!" misses the point entirely. Joe: get your grubby hands off my answers!
You said that if women had offered answers you would have said Ladies and Gentlemen. I am a female reader. Does that count or should I provide an answer to be taken into account?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.