I have the following Python code and am trying to print each user's name and number. This is my regex attempt:

import re


txt = '''Element.update("to_users2", "\n\n\n<div class=\"label-field-pair\">\n  <div class=\"label-field-pair11\">\n    <label for=\"student_grade\">Select member</label>\n    <div class =\"scrolable\" >\n      <div class=\"scroll-inside\">\n        <div class=\"hover\"><a href=\"#\" class=\"all\" onClick=\"add_all_recipient('0000000,1111111,2222222,3333333,4444444,5555555,6666666,7777777,8888888,9999999')\">Select All  <span> Add </span></a>\n\n        </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(0000000)\" success=\"Element.hide('loader')\">user zero M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(1111111)\" success=\"Element.hide('loader')\">user One S ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(2222222)\" success=\"Element.hide('loader')\">user Two A ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(3333333)\" success=\"Element.hide('loader')\">user three H ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(4444444)\" success=\"Element.hide('loader')\">user four M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(5555555)\" success=\"Element.hide('loader')\">user Five O ...<span> Add </span></a>\n\n          </div>\n        \n          \n       
   <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(6666666)\" success=\"Element.hide('loader')\">user six F ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(7777777)\" success=\"Element.hide('loader')\">user Seven Mo ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">user eight ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">\u0650user nine M ...<span> Add </span></a>\n\n          </div>\n        \n      </div>\n    </div>\n  </div>\n</div>\n\n\n");'''


regexp = re.findall(
            r"add_recipient\(([0-9]+)\)\" success=.+>([a-zA-Z0-9\w]+) ", txt)

for x in regexp:
    print(x[1],  x[0])

Running the code above prints:

user 0000000
user 1111111
user 2222222
user 3333333
user 4444444
user 5555555
user 6666666
user 7777777
user 8888888

I need the output to be:

user zero 0000000
user One 1111111
...

How can I get this output? In some cases re.findall returns only user 8888888, and I don't know why. How can I get the full match?

3 Answers

Using regex to parse XML/HTML is bad practice. Use a parser (with a bit of regex help) instead:

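from bs4 import BeautifulSoup
import re

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(txt, 'html.parser')

out = []
# html.parser lowercases attribute names, so onClick is found as onclick
for e in soup.find_all('a', onclick=True):
    # Raw string; matches add_recipient(...) but not add_all_recipient(...)
    m = re.search(r'(?<=add_recipient\().*(?=\))', e['onclick'])
    if m:
        a = m.group()
        # e.contents[0] is the text node before the <span>
        out.append((e.contents[0], a))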

output:

[('user zero M ...', '0000000'),
 ('user One S ...', '1111111'),
 ('user Two A ...', '2222222'),
 ('user three H ...', '3333333'),
 ('user four M ...', '4444444'),
 ('user Five O ...', '5555555'),
 ('user six F ...', '6666666'),
 ('user Seven Mo ...', '7777777'),
 ('user eight ...', '8888888'),
 ('ِuser nine M ...', '9999999')]

Alternative output (only the first two words of each name): replace the last line with:

out.append((' '.join(e.contents[0].split(maxsplit=2)[:2]), a))

output:

[('user zero', '0000000'),
 ('user One', '1111111'),
 ('user Two', '2222222'),
 ('user three', '3333333'),
 ('user four', '4444444'),
 ('user Five', '5555555'),
 ('user six', '6666666'),
 ('user Seven', '7777777'),
 ('user eight', '8888888'),
 ('ِuser nine', '9999999')]
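
The last entry above still carries the combining mark U+0650 in front of "user nine", because that character is part of the source text. If it is unwanted, a minimal sketch using only the standard library could strip it. Note that strip_leading_marks is a hypothetical helper name introduced here for illustration, not something bs4 provides:

```python
import unicodedata


def strip_leading_marks(name):
    """Drop combining marks (such as U+0650) from the start of name."""
    i = 0
    # unicodedata.combining() returns a non-zero class for combining marks
    while i < len(name) and unicodedata.combining(name[i]):
        i += 1
    return name[i:]


cleaned = strip_leading_marks('\u0650user nine')
print(cleaned)
```

Names that do not start with a combining mark are returned unchanged, so the helper could be applied to every e.contents[0] before appending it to out.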

You can add an extra capture group, and change the order in which you print the group values.

Note that you can write [a-zA-Z0-9\w]+ as \w+ because that also matches a-zA-Z0-9.

Instead of .+> you can use [^<>]*>, a negated character class that cannot cross angle brackets, which prevents some backtracking.
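
To illustrate the difference, here is a small sketch on a made-up one-line sample (the sample string and the variable names are assumptions, not taken from the question). Without a newline to stop it, .+ runs to the end of the text and backtracks to the last usable >, so the first number gets paired with the last name:

```python
import re

# Hypothetical sample: two links on a single line, no newline between them
line = ('<a onClick="add_recipient(1)" success="ok">user One S</a>'
        '<a onClick="add_recipient(2)" success="ok">user Two A</a>')

# Greedy .+ may cross tags and swallow the second link
greedy = re.findall(r'add_recipient\(([0-9]+)\)" success=.+>(\w+ \w+)', line)

# [^<>]* cannot cross < or >, so each match stays inside its own tag
bounded = re.findall(r'add_recipient\(([0-9]+)\)" success=[^<>]*>(\w+ \w+)', line)

print(greedy)
print(bounded)
```

Here greedy holds a single mismatched pair, while bounded holds one correct pair per link. This may also be why re.findall sometimes appears to return only one result: if the text arrives without real line breaks, one greedy match can swallow everything.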

import re

txt = '''Element.update("to_users2", "\n\n\n<div class=\"label-field-pair\">\n  <div class=\"label-field-pair11\">\n    <label for=\"student_grade\">Select member</label>\n    <div class =\"scrolable\" >\n      <div class=\"scroll-inside\">\n        <div class=\"hover\"><a href=\"#\" class=\"all\" onClick=\"add_all_recipient('0000000,1111111,2222222,3333333,4444444,5555555,6666666,7777777,8888888,9999999')\">Select All  <span> Add </span></a>\n\n        </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(0000000)\" success=\"Element.hide('loader')\">user zero M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(1111111)\" success=\"Element.hide('loader')\">user One S ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(2222222)\" success=\"Element.hide('loader')\">user Two A ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(3333333)\" success=\"Element.hide('loader')\">user three H ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(4444444)\" success=\"Element.hide('loader')\">user four M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(5555555)\" success=\"Element.hide('loader')\">user Five O ...<span> Add </span></a>\n\n          </div>\n        \n          \n       
   <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(6666666)\" success=\"Element.hide('loader')\">user six F ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(7777777)\" success=\"Element.hide('loader')\">user Seven Mo ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">user eight ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">\u0650user nine M ...<span> Add </span></a>\n\n          </div>\n        \n      </div>\n    </div>\n  </div>\n</div>\n\n\n");'''

for x in re.findall(r"add_recipient\(([0-9]+)\)\" success=[^<>]*>(\w+) (\w+)", txt):
    print(x[1], x[2], x[0])

Output

user zero 0000000
user One 1111111
user Two 2222222
user three 3333333
user four 4444444
user Five 5555555
user six 6666666
user Seven 7777777
user eight 8888888
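
The output above stops at 8888888. The likely reason is the U+0650 combining mark directly before "user nine", which \w does not match. A hedged variant skips any non-word characters right after the >. The shortened sample below is hypothetical and only shaped like txt:

```python
import re

# Hypothetical two-line sample in the same shape as txt from the question
sample = (
    "<a onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">"
    "user eight ...<span> Add </span></a>\n"
    "<a onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">"
    "\u0650user nine M ...<span> Add </span></a>\n"
)

# \W* skips characters such as U+0650 that \w does not match
pattern = r'add_recipient\(([0-9]+)\)" success=[^<>]*>\W*(\w+) (\w+)'

# Reorder each match to (first word, second word, number)
rows = [(first, second, number)
        for number, first, second in re.findall(pattern, sample)]
for row in rows:
    print(*row)
```

On this sample both users are found, including the one whose name starts with the combining mark.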


I'm not a regex expert, but you can try:

out = re.findall(r"add_recipient\(([0-9]+)\)\" success=.+>(\w+\s+\w+)", txt)
print(*[' '.join(i[::-1]) for i in out], sep='\n')

# Output
user zero 0000000
user One 1111111
user Two 2222222
user three 3333333
user four 4444444
user Five 5555555
user six 6666666
user Seven 7777777
user eight 8888888
