
I'm looking to run a search through some files to see if they have a comment block at the top of the file.

Here's what I'm searching for:

#++
#    app_name/dir/dir/filename
#    $Id$
#--

I had this as a REGEX and came up short:

:doc => { :test => '^#--\s+[filename]\s+\$Id'
if @file_text =~ Regexp.new(@rules[rule][:test])
....

Any suggestions?

  • That's not a valid pattern. Perhaps it'd help if you experimented using Rubular. Also, your "requirements" aren't clear at all. Are you looking for files with four commented lines at the top of the file, or that plus they MUST contain the path to the file and the $Id$ string, embedded inside ++ and --? Commented Jul 7, 2014 at 23:09
  • Dear Mr Tin Man, I have tried Rubular. My expressions were failing, so I came here. My requirements: I need to find all four lines, or at least three lines (#++, # filename, #--). P.S. I edited my problem. Commented Jul 7, 2014 at 23:18
  • What are @file_text, @rules, and rule? Commented Jul 7, 2014 at 23:49

3 Answers

5

Check this example:

string = <<EOF
#++
##    app_name/dir/dir/filename
##    $Id$
##--

foo bar
EOF

puts /#\+\+.*\n##.*\n##.*\n##--/.match(string)

The pattern matches two lines starting with ## between a line starting with #++ and a line starting with ##--, and it includes those boundary lines in the match. If I understood the question correctly, this should be what you want.

You can generalize the pattern to match everything between the first #++ and the first ##-- (including them) using the following pattern:

puts /#\+\+.*?##--/m.match(string)
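The same non-greedy /m idea, adjusted for the single-# header format shown in the question (a sketch, not the answer's exact code):

```ruby
# Sketch: the same non-greedy /m approach, adapted to the
# single-# header format from the question.
header = <<EOF
#++
#    app_name/dir/dir/filename
#    $Id$
#--

foo bar
EOF

# /m makes . match newlines, so .*? can span the whole header block.
puts header[/#\+\+.*?#--/m]
```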

7 Comments

I'm running it through my script. I think this works.
You are welcome! I'm new to Ruby, and the regex engine looks very intuitive at first glance! Nice :)
The Ruby regex engine is very similar to the PCRE regex engine, in that all the main features of PCRE can be found in it. However, as with PCRE, you don't need to escape "-" since it isn't a special character. There is a single-line mode available, but instead of "s" you must use "m" to switch it on (it's a little counter-intuitive). The last pattern can be written like this: /#\+\+.*?\n##--/m
@CasimiretHippolyte Or even skip the final \n, like this: /#\+\+.*?##--/m. Many thanks for all the information!
You could also write puts string[/#\+\+.*\n##.*\n##.*\n##--/].
1

Rather than trying to do it all in a single pattern, which will become difficult to maintain as your file headers change or grow, use several small tests that give you granularity. I'd do something like:

lines = '#++
#    app_name/dir/dir/filename
#    $Id$
#--
'

Split the text so you can retrieve the lines you want, and normalize them:

l1, l2, l3, l4 = lines.split("\n").map{ |s| s.strip.squeeze(' ') }

This is what they contain now:

[l1, l2, l3, l4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]

Here's a set of tests, one for each line:

!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/]) # => true

Here's what is being tested and what each returns:

l1[/^#\+\+/] # => "#++"
l2[/^#\s[\w\/]+/] # => "# app_name/dir/dir/filename"
l3[/^#\s\$Id\$/i] # => "# $Id$"
l4[/^#--/] # => "#--"
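If this check is needed in more than one place, the four tests can be bundled into one small predicate (header_comment? is a hypothetical name, not part of the answer):

```ruby
# Hypothetical helper wrapping the four per-line tests above.
def header_comment?(text)
  l1, l2, l3, l4 = text.split("\n").map { |s| s.strip.squeeze(' ') }
  return false unless l1 && l2 && l3 && l4  # fewer than four lines
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
end

header = "#++\n#    app_name/dir/dir/filename\n#    $Id$\n#--\n"
puts header_comment?(header)      # true
puts header_comment?("foo\nbar")  # false
```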

There are many different ways to grab the first "n" rows of a file. Here's a few:

File.foreach('test.txt').to_a[0, 4] # => ["#++\n", "#    app_name/dir/dir/filename\n", "#    $Id$\n", "#--\n"]
File.readlines('test.txt')[0, 4]    # => ["#++\n", "#    app_name/dir/dir/filename\n", "#    $Id$\n", "#--\n"]
File.read('test.txt').split("\n")[0, 4] # => ["#++", "#    app_name/dir/dir/filename", "#    $Id$", "#--"]

The downside of these is they all "slurp" the input file, which, on a huge file will cause problems. It's trivial to write a piece of code that'd open a file, read the first four lines, and return them in an array. This is untested but looks about right:

def get_four_lines(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  ary
end
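One caveat: readline raises EOFError on files shorter than four lines. A variant using gets, which returns nil at end of file, avoids the exception entirely (a sketch; first_four_lines is a hypothetical name):

```ruby
require 'tempfile'

# Variant using gets, which returns nil at EOF instead of raising
# EOFError, so files shorter than four lines are handled gracefully.
def first_four_lines(path)
  File.open(path, 'r') do |fi|
    Array.new(4) { fi.gets }.compact
  end
end

Tempfile.create('header') do |f|
  f.write("#++\n#--\n")  # only two lines
  f.flush
  p first_four_lines(f.path)  # => ["#++\n", "#--\n"]
end
```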

Here's a quick little benchmark to show why I'd go this way:

require 'fruity'

def slurp_file(path)
  File.read(path).split("\n")[0,4] rescue []
end

def read_first_four_from_file(path)
  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  ary
rescue
  []
end

PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }

  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end

Running that as root outputs:

Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0

That's reading approximately 105 files in my /etc directory.

Modifying the test to actually parse the lines and test to return a true/false:

require 'fruity'

def slurp_file(path)
  ary = File.read(path).split("\n")[0,4] 
  !!(/#\+\+\n(.|\n)*?##\-\-/.match(ary.join("\n")))
rescue
  false # return a consistent value to fruity
end

def read_first_four_from_file(path)
  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
  false # return a consistent value to fruity
end

PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }

  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end

Running that again returns:

Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0

[...] Your benchmark isn't fair.

Here's one that's "fair":

require 'fruity'

def slurp_file(path)
  text = File.read(path)
  !!(/#\+\+\n(.|\n)*?##\-\-/.match(text))
rescue
  false # return a consistent value to fruity
end

def read_first_four_from_file(path)
  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
  false # return a consistent value to fruity
end

PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }

  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end

Which outputs:

Running each test once. Test will take about 1 second.
read_four is similar to slurp

Joining the split strings back into a longer string prior to doing the match was the wrong path, so working from the full file's content makes for a more even test.

[...] Just read the first four lines and apply the pattern, that's it

That's not just it. A multiline regex written to find information spanning multiple lines can't be passed single text lines and return accurate results, so it needs to get a long string. Determining how many characters make up four lines would only add overhead and slow the algorithm; that's what the previous benchmark did, and it wasn't "fair".

[...] Depends on your input data. If you run this code over a complete (bigger) source code folder, it will slow it down significantly.

There were 105+ files in the directory. That's a reasonably large number of files, but iterating over a large number of files will not show a difference as Ruby's ability to open files isn't the issue, it's the I/O speed of reading a file in one pass vs. line-by-line. And, from experience I know the line-by-line I/O is fast. Again, a benchmark says:

require 'fruity'

LITTLEFILE = 'little.txt'
MEDIUMFILE = 'medium.txt'
BIGFILE = 'big.txt'

LINES = '#++
#    app_name/dir/dir/filename
#    $Id$
#--
'

LITTLEFILE_MULTIPLIER = 1
MEDIUMFILE_MULTIPLIER = 1_000
BIGFILE_MULTIPLIER = 100_000


def _slurp_file(path)
  File.read(path)
  true # return a consistent value to fruity
end

def _read_first_four_from_file(path)
  ary = []

  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  true # return a consistent value to fruity
end

[
  [LITTLEFILE, LITTLEFILE_MULTIPLIER],
  [MEDIUMFILE, MEDIUMFILE_MULTIPLIER],
  [BIGFILE,    BIGFILE_MULTIPLIER]
].each do |file, mult|

  File.write(file, LINES * mult)
  puts "Benchmarking against #{ file }"
  puts "%s is %d bytes" % [ file, File.size(file)]

  compare do
    slurp                     { _slurp_file(file)                }
    read_first_four_from_file { _read_first_four_from_file(file) }
  end

  puts
end

With the output:

Benchmarking against little.txt
little.txt is 49 bytes
Running each test 128 times. Test will take about 1 second.
slurp is similar to read_first_four_from_file

Benchmarking against medium.txt
medium.txt is 49000 bytes
Running each test 128 times. Test will take about 1 second.
read_first_four_from_file is faster than slurp by 39.99999999999999% ± 10.0%

Benchmarking against big.txt
big.txt is 4900000 bytes
Running each test 128 times. Test will take about 4 seconds.
read_first_four_from_file is faster than slurp by 100x ± 10.0

Reading a small file of four lines, slurping is as fast as reading line by line, but once the file size increases, the overhead of reading the entire file starts to impact the times.

Any solution relying on slurping files is known to be a bad thing; it's not scalable, and can actually cause code to halt due to memory allocation if BIG files are encountered. Reading the first four lines will always run at a consistent speed independent of the file sizes, so use that technique EVERY time there is a chance that the file sizes will vary. Or, at least, be very aware of the impact on run times and the potential problems that can be caused by slurping files.
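For completeness: File.foreach without a block returns an enumerator, and first(4) stops reading after four lines, so it never slurps, unlike the foreach.to_a[0, 4] version shown earlier. A sketch:

```ruby
# File.foreach without a block returns a lazy line enumerator;
# first(4) reads only four lines, so large files are never slurped.
File.write('test.txt', "#++\n#    app_name/dir/dir/filename\n#    $Id$\n#--\nbody\n")
p File.foreach('test.txt').first(4)
# => ["#++\n", "#    app_name/dir/dir/filename\n", "#    $Id$\n", "#--\n"]
File.delete('test.txt')
```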

6 Comments

I'm new to Ruby, but this looks like too much overhead compared to my solution above. (Especially splitting the line looks weird, as Ruby handles multiline patterns like a charm.) Am I wrong?
Like I said, you can do it with a single pattern; however, patterns are prone to hiding problems. It's usually much better to break down the problem into smaller patterns. While there might be a bit more overhead, the time spent debugging problems with a single pattern can outweigh the tiny difference in I/O or parsing caused by several smaller tests. Consider what happens if a file's header changes the order of two rows, and what you have to do with the pattern vs. smaller tests.
Depends on your input data. If you run this code over a complete (bigger) source code folder, it will slow it down significantly. However, for more elaborate parsers, I'm with you. Breaking down the problem into several minor problems will help to keep it maintainable. (I did not downvote it.)
I agree with Hek2mgl. This is a precursor to a gem to run comments through a Rails app. Your code, though very thorough, seems like you're building a mountain out of an ant hill. All it really needs to do is check the file and move on.
Benchmarks have shown us that reading separate lines from a file using foreach is neck-and-neck with slurping, so that's not going to slow down the code. Patterns that are anchored, especially small ones that don't rely on greediness, are extremely fast, unlike those that have to span multiple lines or are unanchored. Again, benchmarks have proven this. I'd recommend carefully testing various ways of accomplishing the same thing, as the results can be very counter-intuitive and very surprising.
0

You might want to try the following pattern: \#\+{2}(?:.|[\r\n])*?\#\-{2}

[Regular expression visualization]

Working demo @ regex101
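The same pattern works unchanged in Ruby (a quick check; note that neither # nor - actually needs escaping):

```ruby
text = "#++\n#    app_name/dir/dir/filename\n#    $Id$\n#--\n"

# (?:.|[\r\n])*? is a portable, non-greedy way to match anything
# including newlines; in Ruby you could also just use .*? with /m.
md = /\#\+{2}(?:.|[\r\n])*?\#\-{2}/.match(text)
puts md[0] if md  # prints the header block, from #++ through #--
```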

Comments
