
I'm having a hard time porting a POSIX regex to Lua string patterns.

I'm dealing with an HTML response from which I would like to extract the checkboxes that are checked. In particular, I'm interested in the name and value attributes of each checked checkbox.

Here are examples of checkboxes I'm interested in:

<input class="rid-2 form-checkbox" id="edit-2-access-comments" name="2[access comments]" value="access comments" checked="checked" type="checkbox">

<input class="rid-3 form-checkbox real-checkbox" id="edit-3-administer-comments" name="3[administer comments]" value="administer comments" checked="checked" type="checkbox">

By contrast, I'm not interested in this one (an unchecked checkbox):

<input class="rid-2 form-checkbox" id="edit-2-access-printer-friendly-version" name="2[access printer-friendly version]" value="access printer-friendly version" type="checkbox">

Using a POSIX regex, I've used the following pattern in Python: pattern=r'name="(.*)" value="(.*)" checked="checked"' and it just worked.

My first approach in Lua was simply to use this: pattern = 'name="(.-)" value="(.-)" checked="checked"' but it gave strange results (the first capture was as expected, but the second one returned a lot of unneeded HTML).

I've also tried the following pattern: pattern = 'name="(%d?%[.-%])" value="(.-)"%s?(c?).-="?c.-"%s?type="checkbox"'

This time the second capture returned the content of value, but all checkboxes were matched (not only those with the checked="checked" attribute).

For completeness, here's the Lua code (snippet from my Nmap NSE script) that attempts to do this pattern matching:

  pattern = 'name="(.-)" value="(.-)" checked="checked"' 
  data = {}
  for name, value in string.gmatch(res.body, pattern) do
    stdnse.debug(1, string.format("%s %s", name, value))
  end
  • pattern = 'name="([^"]*)" value="([^"]*)" checked="checked"' Commented Oct 1, 2015 at 10:43
  • Thanks Egor it works perfectly now. Commented Oct 2, 2015 at 10:25

2 Answers


I've used the following pattern in Python: pattern=r'name="(.*)" value="(.*)" checked="checked"' and it just worked.

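Python's re module is not POSIX-compliant: by default, . matches any character except a newline there, whereas in POSIX and in Lua patterns . matches any character, including a newline. That is why .* stayed within a single line in Python, while .- in Lua can run across line breaks into later tags.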
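The Lua side of this is easy to check. A minimal sketch, with a sample string made up for illustration:

```lua
-- Made-up two-line sample; not taken from the question's HTML.
local s = "first line\nsecond line"

-- "." consumes the newline, so the lazy capture spans both lines.
local across = s:match("first(.-)second")
print(#across)                          --> 6, i.e. " line\n"

-- Excluding "\n" from the set confines the match to one line,
-- so there is no match at all here.
print(s:match("first([^\n]-)second"))   --> nil
```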

If you want to match a string that has the three attributes above one after another, you can use something like:

local pattern = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'

Why not [^\r\n]-? Because it still fails when several tags share one line. [^\r\n] also matches < and >, so a match can start at name="..." in one tag and run on through later tags on the same line until it finds the value="..." or checked="checked" it is looking for there. In other words, it can "overfire" across tag boundaries.

Note that [^"]*, a negated bracket expression, matches only zero or more characters other than ", which keeps each capture inside a single attribute value and therefore inside one tag.

See Lua demo:

local rx = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'
local s = '<li name="n1"\nvalue="v1"><li name="n2"\nvalue="v1" checked="checked"><li name="n3"\nvalue="v3"   checked="checked">'
for name, value in string.gmatch(s, rx) do
  print(name, value)
end

Output:

n2  v1
n3  v3
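The single-line case described above can be checked the same way. This is only a sketch: the input string and the names loose and strict are made up for illustration, with an unchecked and a checked tag sharing one line.

```lua
-- Two tags on ONE line: the first is unchecked, the second is checked.
local s = '<input name="a" value="1" type="checkbox">'
       .. '<input name="b" value="2" checked="checked" type="checkbox">'

local loose  = 'name="([^\r\n]-)" value="([^\r\n]-)" checked="checked"'
local strict = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'

-- The loose pattern starts in the first tag and runs into the second one.
local n1, v1 = s:match(loose)
print(n1)   --> a
print(v1)   --> 1" type="checkbox"><input name="b" value="2

-- The strict pattern skips the unchecked tag entirely.
local n2, v2 = s:match(strict)
print(n2, v2)   --> b	2
```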

(Updated based on comments) The pattern fails when a tag without checked="checked" precedes a tag with checked="checked" in the input: the second .- keeps expanding past the end of the first tag until it reaches " checked="checked", so it captures unnecessary parts. There are several ways to avoid this; one, suggested by @EgorSkriptunoff, is to use ([^"]*) for the captures; another is to exclude newlines with ([^\r\n]-), which helps only when each tag is on its own line. The following example prints what you expect:

local s = [[
<input class="rid-2 form-checkbox" id="edit-2-access-comments" name="2[access comments]" value="access comments" checked="checked" type="checkbox">
<input class="rid-2 form-checkbox" id="edit-2-access-printer-friendly-version" name="2[access printer-friendly version]" value="access printer-friendly version" type="checkbox">
<input class="rid-3 form-checkbox real-checkbox" id="edit-3-administer-comments" name="3[administer comments]" value="administer comments" checked="checked" type="checkbox">
]]
local pattern = 'name="([^\r\n]-)" value="([^\r\n]-)" checked="checked"' 
for name, value in string.gmatch(s, pattern) do
  print(name, value)
end

The output:

2[access comments]  access comments
3[administer comments]  administer comments

2 Comments

You will see the problem if the first item is unchecked (doesn't have checked="checked") and the second item is checked.
Right; then newlines or quotes need to be forbidden in the pattern, as @EgorSkriptunoff suggested earlier.
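
The failure case from these comments can be reproduced with a short sketch. The two-line input is trimmed from the question's sample, and the variable names are only illustrative:

```lua
-- First tag unchecked, second tag checked, one tag per line.
local html = table.concat({
  '<input name="2[access printer-friendly version]" value="access printer-friendly version" type="checkbox">',
  '<input name="3[administer comments]" value="administer comments" checked="checked" type="checkbox">',
}, "\n")

-- Original pattern: the second ".-" runs from the unchecked tag into the checked one.
local lazy = 'name="(.-)" value="(.-)" checked="checked"'
local name, value = html:match(lazy)
print(name)                                --> 2[access printer-friendly version]
print(value:find("\n", 1, true) ~= nil)    --> true (the capture crossed the line break)

-- Pattern from the comment on the question: quotes are forbidden inside the captures.
local fixed = 'name="([^"]*)" value="([^"]*)" checked="checked"'
local fname, fvalue = html:match(fixed)
print(fname, fvalue)                       --> 3[administer comments]	administer comments
```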
