How to extract substrings from a string with whitespaces in python using regex?

Question

I have a long string that I got from web scraping with Python. I want to get output in a form like {'XXXXXXXX':'AAAAAAAA','YYYYYYYY':'BBBBBBBB'} and, ideally, put everything in a dataframe.

This is a sample of the very long string:

\\n    display:block\\u0022\\u003E\\n                                  div class= span_6\\u0022\\u003E\\n                                     li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E1. XXXXXXXX\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\/strong\\u003E\\n       \\/li\\u003E\\n                                                        li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E2. YYYYYYYY\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EBBBBBBBB\\/strong\\u003E\\n

Blockquoted for clarity:

\n display:block\u0022\u003E\n
div class= span_6\u0022\u003E\n
li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n
span\u003E1. XXXXXXXX\/span\u003E\n
strong class=\u0022floatright\u0022\u003EAAAAAAAA\/strong\u003E\n
\/li\u003E\n
li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n
span\u003E2. YYYYYYYY\/span\u003E\n
strong class=\u0022floatright\u0022\u003EBBBBBBBB\/strong\u003E\n

I'm trying to do this:

#s = the string 
pattern = "u003E\|(.*?)\|\\/strong"
substring = re.search(pattern, s).group(1) 
print(substring)

but it's failing. What's the best way to do this?

Edit: Expected output is two lists:

list1 = ['XXXXXXXX','YYYYYYYY']
list2 = ['AAAAAAAA','BBBBBBBB']

I hope to put it in a dictionary like d = {'XXXXXXXX':'AAAAAAAA','YYYYYYYY':'BBBBBBBB'} by looping. But I can't even extract 'XXXXXXXX', let alone put it in a dictionary. Sorry for being a newbie.
– nununu, Commented Sep 22, 2021 at 8:41
Butchering your HTML so it no longer makes sense is probably the root problem here.
– tripleee, Commented Sep 22, 2021 at 8:55
Did my answer help? I can adjust it if you provide more details.
– Wiktor Stribiżew, Commented Sep 23, 2021 at 7:08

Wiktor Stribiżew · Accepted Answer · 2021-09-22 09:06:51Z

You can use a solution like this:

import re
s = '\\n    display:block\\u0022\\u003E\\n                                  div class= span_6\\u0022\\u003E\\n                                     li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E1. XXXXXXXX\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\/strong\\u003E\\n       \\/li\\u003E\\n                                                        li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E2. YYYYYYYY\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EBBBBBBBB\\/strong\\u003E\\n'
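# Turn the literal \uXXXX and \n sequences into real characters.
unescaped_s = s.encode('latin-1', 'backslashreplace').decode('unicode-escape')
# Group 1 is the text after "N. ", Group 2 is the text inside the strong element.
pattern = r">\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong"
substrings = re.findall(pattern, unescaped_s)  # list of (key, value) tuples
print(dict(substrings))  # {'XXXXXXXX': 'AAAAAAAA', 'YYYYYYYY': 'BBBBBBBB'}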

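See the online Python demo. The string is unescaped first, and the regex is then applied to the unescaped version. The \/ sequences are not recognized escapes, so they stay in the string as-is, which is why the pattern matches a literal backslash before /span and /strong.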
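To make the unescaping step easier to follow, here is a minimal sketch that runs it on a short fragment of the sample. The names `fragment` and `unescaped` are only illustrative, and the warning filter is my addition: it assumes that newer Python versions may emit a DeprecationWarning for the unrecognized `\/` sequence.

```python
import warnings

# Short fragment of the scraped text: '\\u003E' in this literal is a real
# backslash followed by 'u003E', exactly as in the scraped string.
fragment = 'span\\u003E1. XXXXXXXX\\/span\\u003E'

with warnings.catch_warnings():
    # '\/' is not a recognized escape; silence the DeprecationWarning
    # that newer Python versions may emit for it.
    warnings.simplefilter('ignore', DeprecationWarning)
    unescaped = fragment.encode('latin-1', 'backslashreplace').decode('unicode-escape')

print(unescaped)  # span>1. XXXXXXXX\/span>
```

The `\u003E` sequences become `>` characters, while `\/` is left untouched.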

The regex is

>\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong

Details:

> - a > char
\d+ - one or more digits
\. - a dot
\s* - zero or more whitespaces
([^<>]*) - Group 1: zero or more chars other than < and >
\\/span> - \/span> text
\s* - zero or more whitespaces
[^>]*> - any zero or more chars other than > and then a > char
([^<>]*) - Group 2: zero or more chars other than < and >
\\/strong - \/strong text.
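The breakdown above can also be written into the pattern itself with `re.VERBOSE`, and the resulting pairs can be split into the two lists from the question. This is only a sketch: `unescaped_s` here is a shortened, hand-written stand-in for the unescaped text, and `pairs`, `list1`, `list2` and `d` are illustrative names.

```python
import re

# Hand-written stand-in for the unescaped text; '\\/' in these literals
# is a backslash followed by a slash, as in the real data.
unescaped_s = (
    'li class="borderbottom padleft pad20 nomargin">\n'
    '   span>1. XXXXXXXX\\/span>\n'
    '      strong class="floatright">AAAAAAAA\\/strong>\n'
    '   \\/li>\n'
    'li class="borderbottom padleft pad20 nomargin">\n'
    '   span>2. YYYYYYYY\\/span>\n'
    '      strong class="floatright">BBBBBBBB\\/strong>\n'
)

# Same pattern as in the answer, annotated part by part.
pattern = re.compile(r"""
    > \d+ \. \s*      # a > char, the item number, a dot, optional whitespace
    ([^<>]*)          # Group 1: the key text
    \\/span>          # literal backslash, then /span and a > char
    \s* [^>]* >       # whitespace, then the rest of the opening strong tag
    ([^<>]*)          # Group 2: the value text
    \\/strong         # literal backslash, then /strong
""", re.VERBOSE)

pairs = pattern.findall(unescaped_s)   # list of (key, value) tuples
list1 = [key for key, _ in pairs]      # keys in document order
list2 = [value for _, value in pairs]  # values in document order
d = dict(pairs)

print(list1)  # ['XXXXXXXX', 'YYYYYYYY']
print(list2)  # ['AAAAAAAA', 'BBBBBBBB']
print(d)      # {'XXXXXXXX': 'AAAAAAAA', 'YYYYYYYY': 'BBBBBBBB'}
```

If pandas is available, the same list of tuples should also work as input for a dataframe, for example `pd.DataFrame(pairs, columns=['key', 'value'])`, where the column names are placeholders.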
