I have a long string that I got from web scraping with Python. I want to produce output of the form {'XXXXXXXX': 'AAAAAAAA', 'YYYYYYYY': 'BBBBBBBB'} and, ideally, put everything in a dataframe.

This is a sample of the very long string:

\\n    display:block\\u0022\\u003E\\n                                  div class= span_6\\u0022\\u003E\\n                                     li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E1. XXXXXXXX\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\/strong\\u003E\\n       \\/li\\u003E\\n                                                        li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E2. YYYYYYYY\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EBBBBBBBB\\/strong\\u003E\\n

Broken into lines for clarity:

\n display:block\u0022\u003E\n
div class= span_6\u0022\u003E\n
li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n
span\u003E1. XXXXXXXX\/span\u003E\n
strong class=\u0022floatright\u0022\u003EAAAAAAAA\/strong\u003E\n
\/li\u003E\n
li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n
span\u003E2. YYYYYYYY\/span\u003E\n
strong class=\u0022floatright\u0022\u003EBBBBBBBB\/strong\u003E\n

I'm trying to do this:

#s = the string 
pattern = "u003E\|(.*?)\|\\/strong"
substring = re.search(pattern, s).group(1) 
print(substring)

but it's failing. What's the best way to do this?

Edit: Expected output is two lists:

list1 = ['XXXXXXXX','YYYYYYYY']
list2 = ['AAAAAAAA','BBBBBBBB']
  • First use strip() function Commented Sep 22, 2021 at 8:38
  • What's your expected output? Commented Sep 22, 2021 at 8:38
  • I hope to put it in a dictionary like d = {'XXXXXXXX': 'AAAAAAAA', 'YYYYYYYY': 'BBBBBBBB'} by looping. But I can't even extract the 'XXXXXXXX', let alone put it in a dictionary. Sorry for being a newbie. Commented Sep 22, 2021 at 8:41
  • Butchering your HTML so it no longer makes sense is probably the root problem here. Commented Sep 22, 2021 at 8:55
  • Did my answer help? I can adjust it if you provide more details. Commented Sep 23, 2021 at 7:08

1 Answer

You can use a solution like this:

import re
s = '\\n    display:block\\u0022\\u003E\\n                                  div class= span_6\\u0022\\u003E\\n                                     li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E1. XXXXXXXX\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\/strong\\u003E\\n       \\/li\\u003E\\n                                                        li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E2. YYYYYYYY\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EBBBBBBBB\\/strong\\u003E\\n'
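# Turn the literal \uXXXX and \n escape sequences into real characters
unescaped_s = s.encode('latin-1', 'backslashreplace').decode('unicode-escape')
# Capture (label, value) pairs: the text after "N. " and the text before \/strong
pattern = r">\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong"
substrings = re.findall(pattern, unescaped_s)
print(dict(substrings))  # {'XXXXXXXX': 'AAAAAAAA', 'YYYYYYYY': 'BBBBBBBB'}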

First, the string is unescaped; the regex is then applied to the unescaped version of the input.
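To see what the unescaping step does on its own, here is a minimal sketch on a short fragment. The fragment is a hypothetical shortened piece written in the same escaped form as the scraped string; it is not the full input.

```python
# A short fragment in the same escaped form as the scraped string
# (the doubled backslashes make each escape a literal backslash sequence).
fragment = 'strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\n'

# Encode to bytes, keeping any non-Latin-1 characters as backslash escapes,
# then let the unicode-escape codec interpret the \uXXXX and \n sequences.
unescaped = fragment.encode('latin-1', 'backslashreplace').decode('unicode-escape')

print(repr(fragment))   # still contains literal \u0022, \u003E and \n
print(repr(unescaped))  # now contains real ", > and newline characters
```

After this step, \u0022 has become a double quote and \u003E has become >, which is why the regex can anchor on a plain > character.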

The regex is

>\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong

Details:

  • > - a > char
  • \d+ - one or more digits
  • \. - a dot
  • \s* - zero or more whitespaces
  • ([^<>]*) - Group 1: zero or more chars other than < and >
  • \\/span> - \/span> text
  • \s* - zero or more whitespaces
  • [^>]*> - any zero or more chars other than > and then a > char
  • ([^<>]*) - Group 2: zero or more chars other than < and >
  • \\/strong - a \/strong text.
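
The same pattern can also be written with re.VERBOSE so that each piece listed above sits on its own commented line. The sketch below is only an illustration: text is a hypothetical shortened stand-in for the already-unescaped string, and list1 and list2 follow the names used in the question's edit.

```python
import re

# Hypothetical stand-in for the already-unescaped string (shortened sample).
text = (
    'li class="borderbottom padleft pad20 nomargin">\n'
    '   span>1. XXXXXXXX\\/span>\n'
    '      strong class="floatright">AAAAAAAA\\/strong>\n'
    '   \\/li>\n'
    'li class="borderbottom padleft pad20 nomargin">\n'
    '   span>2. YYYYYYYY\\/span>\n'
    '      strong class="floatright">BBBBBBBB\\/strong>\n'
)

# Same regex as above; in VERBOSE mode whitespace and comments are ignored.
pattern = re.compile(r"""
    > \d+ \. \s*      # a closing angle bracket, the item number, a dot, whitespace
    ([^<>]*)          # Group 1: the label
    \\/span>          # literal backslash followed by /span and an angle bracket
    \s* [^>]* >       # whitespace, then the rest of the opening strong tag
    ([^<>]*)          # Group 2: the value
    \\/strong         # literal backslash followed by /strong
""", re.VERBOSE)

pairs = pattern.findall(text)              # list of (label, value) tuples
list1 = [label for label, _ in pairs]
list2 = [value for _, value in pairs]

print(list1)  # ['XXXXXXXX', 'YYYYYYYY']
print(list2)  # ['AAAAAAAA', 'BBBBBBBB']
```

If pandas is installed, the same list of tuples can be passed to pandas.DataFrame(pairs, columns=["name", "value"]) to get a dataframe; the column names here are placeholders.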