How can I make sublists from a list based on strings in python?

Question

I have a list like this:

['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

From this I want to make sublists like:

id = ["32a45", "32a47", "32a48"]
date=["2017-01-01", "2017-01-05", "2017-01-07"]

How can I do that?

Thanks.

Edit: This was the original question. It is a broken XML file and the tags are messed up, so I cannot use xmltree on it. That is why I am trying something else.

And how do you get that file (looks like broken xml/html)

– Antti Haapala, Commented Aug 7, 2017 at 7:59
With regex or parse xml, have you tried anything?

– Anton vBR, Commented Aug 7, 2017 at 8:01

RomanPerekhrest · Accepted Answer · 2017-08-07 08:07:23Z

5

A simple solution using the re.search() function:

import re

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids, dates = [], []
for i in l:
    ids.append(re.search(r'id="([^"]+)"', i).group(1))
    dates.append(re.search(r'date="([^"]+)"', i).group(1))

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
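Since the source file is described as broken, some lines may lack an attribute, in which case re.search() returns None and .group(1) raises AttributeError. A minimal sketch of a more tolerant variant follows; the helper name extract_attr and the sample line with a missing date are assumptions made for illustration, not part of the original answer.

```python
import re

def extract_attr(lines, name):
    """Return the value of attribute `name` for each line, or None if absent."""
    # \b is a word boundary, so "id" does not match inside e.g. "uuid".
    pattern = re.compile(r'\b{}="([^"]*)"'.format(re.escape(name)))
    values = []
    for line in lines:
        match = pattern.search(line)
        values.append(match.group(1) if match else None)
    return values

tags = [
    '<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
    '<text id="32a47" language="ENG" time="1:00" timezone="Central">',  # hypothetical line without a date
]

print(extract_attr(tags, "id"))    # ['32a45', '32a47']
print(extract_attr(tags, "date"))  # ['2017-01-01', None]
```

Whether None is the right placeholder depends on what is done with the lists afterwards; skipping the line or raising a clearer error are equally reasonable choices.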

edited Aug 7, 2017 at 8:07

answered Aug 7, 2017 at 8:03


1 Comment

user5017783 Over a year ago

Thanks. This is what I was looking for.

Anton vBR · Answer · 2017-08-07 08:36:30Z

Parsing with ET:

import xml.etree.ElementTree as ET
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

id_ = []
date = []
for string in strings:
    tree = ET.fromstring(string+"</text>") #corrects wrong format
    id_.append(tree.get("id"))
    date.append(tree.get("date"))

print(id_) #  ['32a45', '32a47', '32a48']
print(date) # ['2017-01-01', '2017-01-05', '2017-01-07']
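Appending a closing tag only helps when the opening tag itself is well formed. For fragments that are damaged in other ways, ET.fromstring() raises ET.ParseError. The sketch below shows one way to guard against that; the helper name parse_open_tag and the second, deliberately broken sample are assumptions for illustration, and the hard-coded "</text>" assumes every fragment is a text tag.

```python
import xml.etree.ElementTree as ET

def parse_open_tag(fragment):
    """Parse a lone opening <text ...> tag; return its attributes, or None if unparseable."""
    try:
        # Assumption: the fragment is an unclosed <text> tag, so we close it ourselves.
        return dict(ET.fromstring(fragment + "</text>").attrib)
    except ET.ParseError:
        return None

good = '<text id="32a45" date="2017-01-01">'
bad = '<text id="32a46 date="2017-01-02">'  # hypothetical fragment with a missing quote

print(parse_open_tag(good))  # {'id': '32a45', 'date': '2017-01-01'}
print(parse_open_tag(bad))   # None
```

Lines that come back as None could then be handed to a regex-based fallback instead of stopping the whole run.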

Update, a full compact example, based on your original problem described here: How can I build an sqlite table from this xml/txt file using python?

import xml.etree.ElementTree as ET
import pandas as pd

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

cols = ["id","language","date","time","timezone"]
data = [[ET.fromstring(string+"</text>").get(col) for col in cols] for string in strings]    
df = pd.DataFrame(data,columns=cols)

      id language        date   time timezone
0  32a45      ENG  2017-01-01  11:00  Eastern
1  32a47      ENG  2017-01-05   1:00  Central
2  32a48      ENG  2017-01-07   3:00  Pacific

Now you can use: df.to_sql()

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
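If pandas is not available, roughly the same result can be reached with the standard-library sqlite3 module. This is only a sketch of that alternative, not the to_sql() route itself: the table name texts and the in-memory database are assumptions chosen for the demo, and every column is stored as TEXT.

```python
import sqlite3
import xml.etree.ElementTree as ET

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
           '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
           '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

cols = ["id", "language", "date", "time", "timezone"]
# One tuple per tag, with values in the same order as cols.
rows = [tuple(ET.fromstring(s + "</text>").get(c) for c in cols) for s in strings]

conn = sqlite3.connect(":memory:")  # assumption: in-memory database, use a file path to persist
col_defs = ", ".join('"{}" TEXT'.format(c) for c in cols)
conn.execute("CREATE TABLE texts ({})".format(col_defs))
placeholders = ", ".join("?" for _ in cols)
conn.executemany("INSERT INTO texts VALUES ({})".format(placeholders), rows)
conn.commit()

result = conn.execute('SELECT "id", "date" FROM texts ORDER BY "id"').fetchall()
print(result)  # [('32a45', '2017-01-01'), ('32a47', '2017-01-05'), ('32a48', '2017-01-07')]
conn.close()
```

The column names are quoted because date and time are also names of SQLite functions; quoting is not strictly required here but avoids any ambiguity.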

Andrey Lukyanenko · Answer · 2017-08-07 08:03:16Z

0

# lst is the list of tag strings from the question (avoid shadowing the built-ins list and id)
ids = [s.split(' ')[1].split('=')[1].strip('"') for s in lst]
dates = [s.split(' ')[3].split('=')[1].strip('"') for s in lst]

But the file looks strange. If the original file is HTML or XML, there are better ways to get the data.
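As one possible illustration of such a way, the standard-library html.parser module tolerates unclosed tags, so the fragments can be fed to it as they are. This is a sketch under the assumption that every line is a text start tag; the class name AttrCollector and its records attribute are hypothetical names introduced here.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect the attributes of every <text ...> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "text":
            self.records.append(dict(attrs))

lines = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
         '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
         '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

parser = AttrCollector()
for line in lines:
    parser.feed(line)
parser.close()

ids = [record.get("id") for record in parser.records]
dates = [record.get("date") for record in parser.records]
print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```

Note that HTMLParser lowercases tag and attribute names, which does not matter for this data but could for mixed-case input.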

answered Aug 7, 2017 at 8:03


5 Comments

user5017783 Over a year ago

It is a broken xml file. I wish I could use xmltree on it. Thanks for the answer.

user5017783 Over a year ago

@AntonvBR Yes. That's after building the list. The original file looks something like this: stackoverflow.com/questions/45540531/… That's my primary goal. If you can help with that, that would be awesome. Thanks again :)

Anton vBR Over a year ago

@0x1 Well seeing you extracted the thing already, have a look at my updated answer.

user5017783 Over a year ago

@AntonvBR That's great! Would you post your updated answer to the original question so that I can mark that as the correct answer?

Anton vBR Over a year ago

@0x1 No worries, glad I could help.

Jake Conkerton-Darby · Answer · 2017-08-07 08:13:02Z

0

Since the data you provided appears to be broken or partial XML fragments, I would personally repair the XML and use the xml.etree module to extract the data. However, if the list was built from an XML file that is itself well formed, it would be easier to run xml.etree on that file directly.

An example solution using xml.etree:

from xml.etree import ElementTree as ET

data = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids = []
dates = []
for element in data:
    # Appending a closing tag repairs the fragment into valid XML.
    root = ET.fromstring('{}</text>'.format(element))

    # The parsed root is the <text> element itself, so its attributes
    # can be read directly.
    ids.append(root.attrib['id'])
    dates.append(root.attrib['date'])

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']

answered Aug 7, 2017 at 8:13


3 Comments

Anton vBR Over a year ago

You can remove <root> (ok you just did.. but now it is a copy of what I wrote?)

Jake Conkerton-Darby Over a year ago

Yeah, I was mid-writing mine when yours came up. So we do kind of have duplicate answers, and yours was the reason I simplified by removing the <root> element.

Anton vBR Over a year ago

No worries, happens all the time when you write answers. People are so quick.

anteAdamovic · Answer · 2017-08-07 08:13:20Z

0

Alongside the other answers, which are better, you can also parse the data manually (simpler):

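# lines is the list of tag strings from the question
for line in lines:
    rest = line[line.index('"') + 1:]   # text after the first opening quote
    value = rest[:rest.index('"')]      # the quoted value, here the id
    line = rest[rest.index('"') + 1:]   # remainder, ready for the next value
    print('id: ' + value)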

You can then append it to a new list and repeat the same process for the other values, changing only the variable name.
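A sketch of how that repetition could be wrapped in a loop so that every quoted value is collected in one pass. The helper name quoted_values is hypothetical, and picking values by position assumes that every line lists its attributes in the same order as in the question.

```python
def quoted_values(line):
    """Return every double-quoted value in `line`, in order of appearance."""
    values = []
    rest = line
    while '"' in rest:
        rest = rest[rest.index('"') + 1:]  # skip to just after the opening quote
        if '"' not in rest:                # unbalanced quote: stop instead of failing
            break
        end = rest.index('"')
        values.append(rest[:end])
        rest = rest[end + 1:]              # continue after the closing quote
    return values

lines = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
         '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
         '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids, dates = [], []
for line in lines:
    values = quoted_values(line)
    ids.append(values[0])    # first quoted value is the id in this data
    dates.append(values[2])  # third quoted value is the date in this data

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```

Because it relies on position rather than attribute names, this breaks as soon as a line omits or reorders an attribute.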

answered Aug 7, 2017 at 8:13


AGN Gazer · Answer · 2017-08-07 08:20:06Z

0

Not as elegant as @RomanPerekhrest's solution using re, but here it goes:

def extract(lst, kwd):
   out = []
   for t in lst:
       index1 = t.index(kwd) + len(kwd) + 1
       index2 = index1 + t[index1:].index('"') + 1
       index3 = index2 + t[index2:].index('"')
       out.append(t[index2:index3])
   return out

Then

>>> extract(lst, kwd='id')
['32a45', '32a47', '32a48']

answered Aug 7, 2017 at 8:20


Rachit kapadia · Answer · 2017-08-07 08:35:40Z

0

An easier-to-understand way with the re module. Here is the code:

import re

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
     '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
     '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids = []
dates = []
for i in l:
    # .+? is non-greedy, so the match stops at the first closing quote
    ids.append(re.search(r'id="(.+?)"', i, re.M|re.I).group(1))
    dates.append(re.search(r'date="(.+?)"', i, re.M|re.I).group(1))

Output:

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']

answered Aug 7, 2017 at 8:35
