Scraping HTML forms with regex

Question

I have a form like this:

<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form>

And I want the values in this order: method, action, names

["get", "search.php", ["query"]]

I don't know how to do this with regex, especially since the string spans multiple lines. I am also very new to regex.

You wouldn't do it with regex. Why would you want to do it with regex? Just don't do it.
– Daniel Roseman, Commented Mar 1, 2015 at 15:06
In my opinion, the best way is to use any XML parsing module.
– Vivek Sable, Commented Mar 1, 2015 at 15:11
I would have a read of stackoverflow.com/a/1732454/1319998 before trying to parse HTML with regex :-)
– Michal Charemza, Commented Mar 1, 2015 at 15:38
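The commenters recommend a parser over regex. As a minimal sketch of that advice using only the standard library's html.parser (the class name FormScraper is made up for illustration, and each input is simply attributed to the most recently opened form):

```python
from html.parser import HTMLParser


class FormScraper(HTMLParser):
    """Collect [method, action, [input names]] per form (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == 'form':
            self.forms.append([attrs.get('method'), attrs.get('action'), []])
        elif tag == 'input' and 'name' in attrs and self.forms:
            # Assumption: the input belongs to the last form that was opened
            self.forms[-1][2].append(attrs['name'])


s = '''<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form>'''

scraper = FormScraper()
scraper.feed(s)
print(scraper.forms[0])  # ['get', 'search.php', ['query']]
```

Line breaks inside the string need no special handling here, since the parser works on tags rather than on lines.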

Kasravnd · Accepted Answer · 2015-03-01 15:28:33Z

3

The proper way to parse an HTML or XML document is to use an HTML (or XML) parser such as BeautifulSoup or lxml. If you still want to use regex, which is not recommended, you can use re.findall as follows (s holds the form's HTML):

>>> import re
>>> [i for j in re.findall(r'method="([^ >"]*)"|action="([^ >"]*)"|name="([^ >"]*)"',s) for i in j if i]
['get', 'search.php', 'query']

[^ >"]* matches any run of characters other than a space, > or ".
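The flat list above loses the nesting the question asked for. One way to recover it with regex alone is sketched below, assuming the HTML is in a string s, every attribute value is double-quoted, and no > appears inside the opening form tag. The helper name scrape_form is made up for illustration.

```python
import re

s = '''<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form>'''


def scrape_form(html):
    """Return [method, action, [input names]] for the first form, or None."""
    # Capture the attributes of the opening <form ...> tag and the form body.
    # re.DOTALL lets '.' cross line breaks, which handles the multi-line string.
    m = re.search(r'<form\b([^>]*)>(.*?)</form>', html, re.DOTALL | re.IGNORECASE)
    if m is None:
        return None
    form_attrs, body = m.groups()

    def attr(name, text):
        # \b stops e.g. 'action' from matching inside 'formaction'
        found = re.search(r'\b%s="([^"]*)"' % name, text)
        return found.group(1) if found else None

    # Only pick up name="..." when it sits inside an <input ...> tag
    names = re.findall(r'<input\b[^>]*?\bname="([^"]*)"', body, re.IGNORECASE)
    return [attr('method', form_attrs), attr('action', form_attrs), names]


print(scrape_form(s))  # ['get', 'search.php', ['query']]
```

This still breaks on single-quoted or unquoted attribute values, which is one reason the parser-based approaches are preferred.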

edited Mar 1, 2015 at 15:28

answered Mar 1, 2015 at 15:11

Kasravnd

6 Comments

Casimir et Hippolyte Over a year ago

You should remove the double quotes from your character classes to avoid a lot of backtracking and possible false results.

Kasravnd Over a year ago

@CasimiretHippolyte Yes, but then it wouldn't give the OP's expected result!

user3679917 Over a year ago

I removed the double quotes from the pattern and stripped them from the results after matching.

Kasravnd Over a year ago

@Emyen Yep, that would be a better idea!

Kasravnd Over a year ago

@CasimiretHippolyte Ah, yes, sorry! Thanks for the reminder and the lesson, I hadn't noticed that!

Community · Accepted Answer · 2017-05-23 12:22:19Z

1

I agree with Michal Charemza's comment: read the post linked there before going any further.

I will give an example using lxml, a very powerful tool for parsing and analyzing HTML.

from lxml.html import fromstring

html = fromstring("""<form id="search" method="get" action="search.php">
                     <input type="text" name="query" value="Search"/>
                     <input type="submit" value="Submit">
                     </form> """)
form = html.forms[0]  # select the first form in the HTML page

# Extract the data from the form (note: lxml returns form.method in upper case)
print(form.action, form.method, form.inputs.keys())
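lxml upper-cases the value of form.method, so to get the exact list asked for in the question one option is to read the raw attributes instead. A sketch, assuming the same form markup:

```python
from lxml.html import fromstring

doc = fromstring("""<form id="search" method="get" action="search.php">
                    <input type="text" name="query" value="Search"/>
                    <input type="submit" value="Submit">
                    </form>""")
form = doc.forms[0]

# .get() reads the attribute exactly as written in the markup;
# form.inputs.keys() lists only the inputs that have a name
result = [form.get('method'), form.get('action'), list(form.inputs.keys())]
print(result)  # ['get', 'search.php', ['query']]
```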

Enjoy,

Abdul

edited May 23, 2017 at 12:22

CommunityBot

answered Mar 1, 2015 at 17:20

abdul

Winand · Accepted Answer · 2021-05-10 09:38:18Z

0

You could use the BeautifulSoup library.

>>> from bs4 import BeautifulSoup
>>> s = '''<form id="search" method="get" action="search.php">
...       <input type="text" name="query" value="Search"/>
...       <input type="submit" value="Submit">
... </form> '''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
>>> k = []
>>> for i in soup.find_all('form'):
...     k.append(i['method'])
...     k.append(i['action'])
...     k.append([j['name'] for j in i.find_all('input', attrs={'name': True})])
...
>>> k
['get', 'search.php', ['query']]
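The loop above puts every form into one flat list and raises KeyError if a form has no method or action attribute. A sketch of a variant that keeps one list per form and tolerates missing attributes follows. The second form in the sample is made up for illustration.

```python
from bs4 import BeautifulSoup

s = '''<form id="search" method="get" action="search.php">
  <input type="text" name="query" value="Search"/>
  <input type="submit" value="Submit">
</form>
<form id="bare">
  <input type="text" name="q">
</form>'''

soup = BeautifulSoup(s, 'html.parser')
forms = []
for form in soup.find_all('form'):
    names = [inp['name'] for inp in form.find_all('input', attrs={'name': True})]
    # Tag.get returns None instead of raising KeyError when the attribute is absent
    forms.append([form.get('method'), form.get('action'), names])

print(forms)  # [['get', 'search.php', ['query']], [None, None, ['q']]]
```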

edited May 10, 2021 at 9:38

Winand

answered Mar 1, 2015 at 15:27

Avinash Raj

1 Comment

Jon Clements Over a year ago

Why even use re here? Just add the name argument to the list as you already are, no need to regex out the name from the element converted to a string...
