Python: scraping multiple strings with different conditions

Question

My text looks like this:

<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> (pronunciation born <COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF> on 27 December 1965)[3] is an <COREF ID="14">Indian</COREF> film <COREF ID="15">actor</COREF>, <COREF ID="17">producer</COREF>, television <COREF ID="19">presenter</COREF>, and <COREF ID="20">philanthropist</COREF> known for <COREF ID="4" REF="2">his</COREF> Hindi films. <COREF ID="5" REF="2">He</COREF> is the <COREF ID="21">son</COREF> of <COREF ID="16" REF="15">actor</COREF> and screenwriter Salim <COREF ID="6" REF="2">Khan</COREF>. <COREF ID="7" REF="2">Khan</COREF> began <COREF ID="8" REF="2">his</COREF> acting career with <COREF ID="22">Biwi Ho</COREF> To <COREF ID="24">Aisi</COREF> but <COREF ID="18" REF="17">it</COREF> was <COREF ID="9" REF="2">his</COREF> second film <COREF ID="25">Maine Pyar</COREF> <COREF ID="26">Kiya</COREF>(1989), in which <COREF ID="10" REF="2">he</COREF> acted in a lead role, that garnered <COREF ID="11" REF="2">him</COREF> the Filmfare Award for Best Male Debut. <COREF ID="12" REF="2">Khan</COREF> has starred in several commercially successful films, such as <COREF ID="28">Saajan</COREF> (1991), <COREF ID="29">Hum Aapke Hain Koun</COREF>..! (1994), <COREF ID="30">Karan Arjun</COREF> (1995),<COREF ID="31">Judwaa</COREF> (1997), <COREF ID="32">Pyar</COREF> <COREF ID="27" REF="26">Kiya</COREF> To Darna <COREF ID="33">Kya</COREF> (1998), <COREF ID="23" REF="22">Biwi</COREF> No.1 (1999), and Hum Saath <COREF ID="34">Saath Hain</COREF> (1999), having appeared in the highest grossing film nine separate years during <COREF ID="13" REF="2">his</COREF> career, a record that remains unbroken.[4]

What I want to do is:

Get each ID along with its string.
Get only those IDs that have a REF. The result should give the ID string and the REF string. Given the ID and REF numbers, we can look up the strings from result 1 using a map data structure.

I tried in this way:

def doit(text):      
  import re
  matches=re.findall(r'\>(.+?)\<',text)
  # matches is now ['String 1', 'String 2', 'String3']
  return ",".join(matches)
print doit(string)

which returns all the strings individually.

To scrape each ID, I did this:

def doit(text):      
    import re
    #matches = re.findall((?<="ID=")(.*)(?=""))
    matches = re.findall(r'ID=\"(\d+)', text)
    return ",".join(matches)

print doit(string)

to scrape the content between ID=" and ", i.e. the ID number, but it gives an error:

SyntaxError: invalid syntax

What am I doing wrong? Is there a better alternative?

UPDATE:

string = "<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> (pronunciation born <COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF> on 27 December 1965)[3] is an <COREF ID="14">Indian</COREF> film <COREF ID="15">actor</COREF>, <COREF ID="17">producer</COREF>, television <COREF ID="19">presenter</COREF>, and <COREF ID="20">philanthropist</COREF> known for <COREF ID="4" REF="2">his</COREF> Hindi films. <COREF ID="5" REF="2">He</COREF> is the <COREF ID="21">son</COREF> of <COREF ID="16" REF="15">actor</COREF> and screenwriter Salim <COREF ID="6" REF="2">Khan</COREF>. <COREF ID="7" REF="2">Khan</COREF> began <COREF ID="8" REF="2">his</COREF>"

def doit(text):      
    import re
    #matches = re.findall((?<="ID=")(.*)(?=""))
    matches = re.findall(r'ID=\"(\d+)', text)
    return ",".join(matches)

print doit(string)

You've forgotten an opening quote. Also, obligatory stackoverflow.com/a/1732454/113586
– wRAR, Commented Sep 19, 2014 at 9:53
@wRAR: Is this what you mean: re.findall((?<="ID=")(.*)(?=""))
– user3449212, Commented Sep 19, 2014 at 10:03
Don't you see that you are not passing a string to findall?
– wRAR, Commented Sep 19, 2014 at 10:41
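
Both quoting problems raised in these comments can be illustrated with a short sketch. It assumes a shortened sample taken from the question's text, and it uses a corrected form of the lookaround idea, not the exact pattern from the question. A double-quoted Python literal that contains unescaped double quotes is a syntax error, so the sample is wrapped in single quotes, and the pattern is passed to findall as a raw string.

```python
import re

# Single-quoted literal, so the double quotes inside the markup need no escaping.
sample = ('<COREF ID="1">Salman</COREF> '
          '<COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF>')

# The pattern must be a string. (?<=ID=") requires ID=" just before the match
# and (?=") requires a closing quote just after it.
ids = re.findall(r'(?<=ID=")\d+(?=")', sample)
print(ids)  # ['1', '3']
```

The REF="2" attribute is not picked up because the lookbehind only accepts the literal text ID=" in front of the digits.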

Jose Varez · Accepted Answer · 2014-09-19 09:56:43Z

If you just want the IDs and they are all numeric, try this:

re.findall(r'ID=\"(\d+)', text)

\d+ matches only digits, so only the numeric ID is captured.
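
As a quick check of that pattern, here is a minimal sketch run on a two-tag fragment of the question's text (the fragment and the variable names are chosen for illustration only). It also shows that the backslash before the quote is optional inside a raw string.

```python
import re

text = '<COREF ID="14">Indian</COREF> film <COREF ID="16" REF="15">actor</COREF>'

# Literal ID=" followed by one or more digits; only the digits are captured.
ids = re.findall(r'ID=\"(\d+)', text)

# The same pattern without the backslash matches identically.
ids_plain = re.findall(r'ID="(\d+)', text)

print(",".join(ids))  # 14,16
```

The number in REF="15" is skipped because the pattern requires the literal text ID=" before the digits.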

4 Comments

user3449212 Over a year ago

Thanks, but it gives SyntaxError: invalid syntax

user3449212 Over a year ago

Oh, I found the issue: enclosing the string in " " causes the error. I replaced them with ` ' ' ` and now it works.

user3449212 Over a year ago

Well, is it possible to print ID : string this way?

Jose Varez Over a year ago

Yes, try: [x.replace('=',': ') for x in re.findall(r'ID=\"\d+\"', text)]
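
The one-liner above prints the ID numbers only. To pair each ID with the text inside its tag, and to resolve REF numbers through a map as the question describes, one possible sketch follows. It assumes every tag is written exactly as in the sample (ID first, optional REF second, no nested COREF tags). The names id_to_text, refs and resolved are made up for this example, and the input is a shortened version of the question's text.

```python
import re

text = ('<COREF ID="2">Khan</COREF> (born <COREF ID="3" REF="2">Abdul Rashid Salim '
        'Salman Khan</COREF>) is an <COREF ID="15">actor</COREF>. '
        '<COREF ID="5" REF="2">He</COREF> is the son of '
        '<COREF ID="16" REF="15">actor</COREF> Salim.')

# Group 1: ID, group 2: optional REF, group 3: the text inside the tag.
pattern = re.compile(r'<COREF ID="(\d+)"(?: REF="(\d+)")?>(.*?)</COREF>')

id_to_text = {}  # result 1: ID -> string
refs = []        # result 2: (ID, REF) pairs for tags that carry a REF
for coref_id, ref_id, content in pattern.findall(text):
    id_to_text[coref_id] = content
    if ref_id:  # findall yields '' when the optional REF group did not match
        refs.append((coref_id, ref_id))

for coref_id in sorted(id_to_text, key=int):
    print("ID: %s -> %s" % (coref_id, id_to_text[coref_id]))

# Look up both sides of each reference in the map built above.
resolved = [(id_to_text[i], id_to_text.get(r)) for i, r in refs]
print(resolved)
```

For markup that is less regular than this sample, an HTML or XML parser is likely a safer choice than a regular expression, which is the point of the link in the first comment on the question.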
