Ruby Regex or Parser

Question

I have a string I want to parse that looks a bit like GitHub Markdown, but I really don't want the full implementation. The string will be a mixture of "code" blocks and "text" blocks. A code block is three backticks followed by an optional "language", then some code, and finally three more backticks. Non-code is pretty much everything else. I don't (but possibly should) care if the user can't input three backticks in the "text" blocks. Here's an example:

This is some text followed by a code block
```ruby
def function
   "hello"
end
```
Some more text

Of course, there may be more code and text blocks interspersed. I've tried writing a regex for this and it seemed to work, but I couldn't get the groups (in parens) to give me all of the matches, and scan() loses the ordering. I've looked at a couple of Ruby parsers (Treetop, Parslet), but they look a bit big for what I want; I am willing to go that route if that's my best option.

Thoughts?

A couple of people have asked for the RE I was trying (many variations of the one below):

re = 
  /
    ```\s*\w+\s*          # 3 backticks followed by the language
      (?!```).*?          # The code everything that's not 3 backticks
    ```                   # 3 more backticks
    |                     # OR
    (?!```).*             # Some text that doesn't include 3 backticks
  /x                      # Ignore white space in RE

It seems, though, that even in simple cases, for example

md = /(a|b)*/.match("abaaabaa")

I'm not able to get all of the a's and b's, say from md[3], which doesn't exist. I hope that makes more sense; that's why I don't think an RE will work in my case, but I wouldn't mind being proven wrong.

Without knowing what you're using for a regex, it's difficult to help. If you're certain there are no other places three backticks will occur (risky, IMO, but doable), I'm not sure what the issue is. At worst you could line-by-line it.
– Dave Newton, Commented Aug 18, 2012 at 15:40
Your question is a bit vague, but I would suggest going line by line and using a regex to match the lines. Matching a code block would then simply mean matching three backticks with no preceding characters on the line, followed by a valid language token. You would then scan line by line until you match a line with only three backticks. You can avoid backticks in text by using the rules above (no characters before or after the backticks except a valid language). That's at least where I'd start.
– xiy, Commented Aug 18, 2012 at 17:59
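The line-by-line approach described in the comment above might look roughly like the following. This is only a sketch: split_blocks and the [kind, lang, body] segment layout are made-up names for illustration, not something from the thread, and it assumes a fence always starts at the beginning of a line.

```ruby
# Hypothetical helper (not from the thread): walk the text one line at a
# time and return ordered segments of the form [kind, lang, body], where
# kind is :text or :code and lang is nil unless a language was given.
def split_blocks(text)
  segments = []
  buffer = []
  lang = nil
  in_code = false

  text.each_line do |line|
    if !in_code && (m = line.match(/\A`{3}[[:blank:]]*(\w+)?[[:blank:]]*\n?\z/))
      # Opening fence: flush any pending text, then start collecting code.
      segments << [:text, nil, buffer.join] unless buffer.empty?
      buffer = []
      lang = m[1]
      in_code = true
    elsif in_code && line.match?(/\A`{3}[[:blank:]]*\n?\z/)
      # Closing fence: flush the collected code.
      segments << [:code, lang, buffer.join]
      buffer = []
      lang = nil
      in_code = false
    else
      buffer << line
    end
  end

  # Whatever is left is trailing text (or an unterminated code block).
  segments << [in_code ? :code : :text, lang, buffer.join] unless buffer.empty?
  segments
end

fence = "`" * 3 # three backticks, built this way only to keep the sample readable
sample = "Some text\n#{fence}ruby\ndef function\n  \"hello\"\nend\n#{fence}\nSome more text\n"
segments = split_blocks(sample)
segments.each { |kind, language, body| puts "#{kind} (#{language.inspect}): #{body.inspect}" }
```

Because the segments are appended as the lines are read, the original ordering of text and code is preserved.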
A regular expression engine (at least a traditional NFA-like one, such as the one in Ruby) only searches until it finds a match. There is also no multiple capture in a regex: a capture (from group parentheses) only gives you the last text it captured. Also, trying to get all of this with a single regex match call would probably be really slow...
– 0robustus1, Commented Aug 18, 2012 at 20:46
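To illustrate the point about captures: a repeated group such as (a|b)* only remembers its last repetition, whereas String#scan returns every match in left-to-right order. A small sketch (the variable names are just for illustration):

```ruby
# A repeated group keeps only the text from its final repetition.
md = /(a|b)*/.match("abaaabaa")
whole_match  = md[0] # the entire matched string
last_capture = md[1] # only the last "a" or "b" the group matched
missing      = md[2] # there is no second group, so this is nil

# String#scan returns every match, in the order it occurs in the string.
all_matches = "abaaabaa".scan(/a|b/)

# With a block, scan also exposes each match's position via Regexp.last_match.
b_positions = []
"abaaabaa".scan(/b/) { b_positions << Regexp.last_match.begin(0) }

puts whole_match, last_capture, missing.inspect, all_matches.inspect, b_positions.inspect
```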

0robustus1 · Accepted Answer · 2012-08-19 10:54:43Z


I will be making some assumptions here, based on my knowledge of Markdown (GitHub and Stack Overflow flavors) and your question (which isn't very precise about the rest of the text).

1. Every code block starts with a single line that contains only three backticks, an optional language name, and the newline character.

2. Every code block ends with a single line containing only three backticks.

3. A code block is not empty.

If you can accept these assumptions, the following code should work (assuming the text is in the str variable):

regex = %r{
  ^```[[:blank:]]*(?<lang>\w+)?[[:blank:]]*\n  # start of code block; captures the optional :lang
  (?<content>.+?)                              # code block content, captured in :content
  \n[[:blank:]]*```[[:blank:]]*\n              # end of code block
}xm # free-spacing mode (x); . also matches newline (m)

position = 0
matches = []
while (match = regex.match(str, position))
  position = match.end(0)
  matches << [match[:lang], match[:content]]
end

After this, matches contains an array of arrays: each inner array represents one match, with the first element being the (optional) language, which may be nil, and the second element being the content.
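If the text between the code blocks is needed as well, in its original order, the same loop can be extended to collect it. The following is a minimal sketch under the three assumptions listed above; FENCE_RE, ordered_segments, and the [kind, lang, body] layout are hypothetical names, and the regex writes the three backticks as a backtick followed by {3}, which is equivalent.

```ruby
# Same pattern as in the answer, with the three backticks written as `{3}.
FENCE_RE = %r{
  ^`{3}[[:blank:]]*(?<lang>\w+)?[[:blank:]]*\n  # opening fence, optional language
  (?<content>.+?)                               # block body, non-empty and lazy
  \n[[:blank:]]*`{3}[[:blank:]]*\n              # closing fence on its own line
}xm

# Hypothetical helper: returns [kind, lang, body] entries in document order.
def ordered_segments(str)
  segments = []
  position = 0
  while (match = FENCE_RE.match(str, position))
    # Everything between the previous block and this one is plain text.
    text = str[position...match.begin(0)]
    segments << [:text, nil, text] unless text.empty?
    segments << [:code, match[:lang], match[:content]]
    position = match.end(0)
  end
  # Text after the last code block, if any.
  rest = str[position..-1]
  segments << [:text, nil, rest] unless rest.empty?
  segments
end

fence = "`" * 3 # three backticks
sample = "intro\n#{fence}ruby\nputs 1\n#{fence}\nmiddle\n#{fence}\nputs 2\n#{fence}\noutro\n"
result = ordered_segments(sample)
result.each { |segment| p segment }
```

Slicing with the running position (rather than using pre_match) matters here, because pre_match always counts from the start of the string, not from the offset passed to match.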

If you have more assumptions about the text, I could alter the regular expression.

This is the test string I used:

str = %{
this is some random text.
```ruby
  def print
    puts "this is a code block with lang-argument"
  end
```

some other text follows here.
i want some ``` backticks here.

```
  def print
    puts "this is a code block without lang-argument"
  end
```
}

edited Aug 19, 2012 at 10:54

answered Aug 18, 2012 at 18:51

0robustus1


3 Comments

slabounty Over a year ago

This looks pretty reasonable, but I can't seem to get it to work (the regex that is). What is the string you're using to test with? Here's one I've been testing with ... <pre> code = "ruby\\ndef\\nabc\\nend\\n\\nThis is some text.\\nMore text.ruby\\ndef\\ndef\\nend\\n\nStill more text." </pre> I think I can probably get your mechanism and regex to work if I can see what you're testing with. P.S. Can not get it formatted correctly. Hopefully, you can see what I'm getting at.

slabounty Over a year ago

I'm not actually seeing any matches when we're through with the loop. In fact, it doesn't seem to be entering the loop at all, so I don't think there are any matches. I tried grabbing the code directly from the edit block to make sure I had the right code. I'm running Ruby 1.9.3; are you possibly running a different version that might make a difference?

slabounty Over a year ago

I got it. The str variable when I pasted had spaces in front of the backticks. Thanks for the help. Marking solved.
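For pasted input like this, where the fences end up indented, one option is to allow optional blanks before the opening fence as well. This is only a sketch of that variation, not part of the accepted answer; INDENT_TOLERANT_RE and STRICT_RE are hypothetical names, and the three backticks are written as a backtick followed by {3}.

```ruby
# The pattern from the answer: the opening fence must start the line.
STRICT_RE = %r{
  ^`{3}[[:blank:]]*(?<lang>\w+)?[[:blank:]]*\n
  (?<content>.+?)
  \n[[:blank:]]*`{3}[[:blank:]]*\n
}xm

# Variation (an assumption, not from the answer): blanks may precede the opening fence.
INDENT_TOLERANT_RE = %r{
  ^[[:blank:]]*`{3}[[:blank:]]*(?<lang>\w+)?[[:blank:]]*\n
  (?<content>.+?)
  \n[[:blank:]]*`{3}[[:blank:]]*\n
}xm

fence = "`" * 3 # three backticks
indented = "  #{fence}ruby\n  puts 1\n  #{fence}\n"

strict_match   = STRICT_RE.match(indented)          # no match: the fence is indented
tolerant_match = INDENT_TOLERANT_RE.match(indented) # matches despite the indentation

p strict_match
p tolerant_match[:lang], tolerant_match[:content]
```

The captured content keeps its own leading spaces, so the indentation of the code itself is left untouched.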
