2

I have a form like this:

<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form> 

And I want the values in this order: method, action, names:

["get", "search.php", ["query"]] 

I don't know how to do this with a regex, especially since the string spans multiple lines. I am also very new to regex.

3
  • You wouldn't do it with regex. Why would you want to do it with regex? Just don't do it. Commented Mar 1, 2015 at 15:06
  • In my opinion, the best way to go is to use any XML parsing module. Commented Mar 1, 2015 at 15:11
  • I would have a read of stackoverflow.com/a/1732454/1319998 before trying to parse HTML with regex :-) Commented Mar 1, 2015 at 15:38

3 Answers

3

The proper way to parse an HTML or XML document is with an HTML (or XML) parser such as BeautifulSoup or lxml. If you still want to use a regex, which is not recommended, you can use re.findall as follows:

>>> import re
>>> # s holds the HTML string from the question
>>> [i for j in re.findall(r'method="([^ >"]*)"|action="([^ >"]*)"|name="([^ >"]*)"',s) for i in j if i]
['get', 'search.php', 'query']

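[^ >"]* matches a run of characters that contains no space, > or double quote. Each match fills only one of the three groups, so the comprehension drops the empty ones.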
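
The one-liner above returns a flat list, while the question asks for the names in a nested list. A minimal sketch of one way to get that shape with plain re follows; the helper name form_fields is made up for illustration, and it assumes double-quoted attribute values and well-behaved markup, which is exactly the assumption a real parser does not need.

```python
import re

s = '''<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form> '''

def form_fields(html):
    """Return [method, action, [input names]] for the first <form> tag found.

    Hypothetical helper: assumes double-quoted attribute values.
    """
    # [^>]* keeps the search inside one tag (it also spans newlines);
    # [^"]* stops at the closing quote of the attribute value.
    method = re.search(r'<form\b[^>]*\bmethod="([^"]*)"', html)
    action = re.search(r'<form\b[^>]*\baction="([^"]*)"', html)
    # Only <input> tags that actually carry a name attribute are collected.
    names = re.findall(r'<input\b[^>]*\bname="([^"]*)"', html)
    return [method.group(1) if method else None,
            action.group(1) if action else None,
            names]

print(form_fields(s))
```

Because a negated character class also matches newlines, the multi-line string needs no special flag here.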

6 Comments

You should remove the double quotes from your character classes to avoid a lot of backtracking and possible false results.
@CasimiretHippolyte Yes, but then it won't give the OP's expected result!
I removed the double quotes in the pattern. And stripped/removed the double quotes after the matching.
@Emyen Yep, that would be a better idea!
@CasimiretHippolyte Ah, yes, sorry! Thanks for the reminder and the lesson, I didn't notice that!
1

I agree with Michal Charemza's comment: go ahead and read the post he linked.

Here is an example using lxml, a very powerful tool for parsing and analyzing HTML.

from lxml.html import fromstring

html = fromstring("""<form id="search" method="get" action="search.php">
                     <input type="text" name="query" value="Search"/>
                     <input type="submit" value="Submit">
                     </form> """)
form = html.forms[0] # selecting the first form in the HTML page

# Extracting the data out of the form, in the order the question asks for.
# Note: lxml reports the method upper-cased ('GET').
print(form.method, form.action, form.inputs.keys())

Enjoy,

Abdul

0

You could use the BeautifulSoup library.

>>> from bs4 import BeautifulSoup
>>> s = '''<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form> '''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a warning
>>> k = []
>>> for i in soup.find_all('form'):
        k.append(i['method'])
        k.append(i['action'])
        k.append([j['name'] for j in i.find_all('input', attrs={'name':True})])

    
>>> k
['get', 'search.php', ['query']]
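
If installing a third-party parser is not an option, roughly the same result can be sketched with the standard library's html.parser module. This is only an illustration, not a replacement for BeautifulSoup: the class name FormCollector and its forms attribute are made up here, and the sketch assumes forms are not nested.

```python
from html.parser import HTMLParser

class FormCollector(HTMLParser):
    """Hypothetical helper: collects [method, action, [input names]] per <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one entry per <form> seen
        self._current = None   # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)    # attrs arrives as a list of (name, value) pairs
        if tag == 'form':
            self._current = [attrs.get('method'), attrs.get('action'), []]
            self.forms.append(self._current)
        elif tag == 'input' and self._current is not None and attrs.get('name'):
            # Inputs without a name (such as the submit button) are skipped.
            self._current[2].append(attrs['name'])

    def handle_endtag(self, tag):
        if tag == 'form':
            self._current = None

s = '''<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form> '''

collector = FormCollector()
collector.feed(s)
print(collector.forms)
```

Self-closing tags such as the first input are routed through handle_starttag by default, so both input styles in the example are covered.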

1 Comment

Why even use re here? Just add the name argument to the list as you already are, no need to regex out the name from the element converted to a string...
