2

I have a list like this:

['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

From this I want to make sublists like:

id = ["32a45", "32a47", "32a48"]
date=["2017-01-01", "2017-01-05", "2017-01-07"]

How can I do that?

Thanks.

Edit: This was the original question. It is a broken XML file and the tags are messed up, hence I cannot use xmltree, so I am trying something else.

2
  • 3
    And how do you get that file? (It looks like broken XML/HTML.) Commented Aug 7, 2017 at 7:59
  • 2
    With regex or parse xml, have you tried anything? Commented Aug 7, 2017 at 8:01

7 Answers 7

5

A simple solution using the re.search() function:

import re

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids, dates = [], []
for i in l:
    ids.append(re.search(r'id="([^"]+)"', i).group(1))
    dates.append(re.search(r'date="([^"]+)"', i).group(1))

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
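
One caveat: re.search() returns None when an attribute is missing, so .group(1) would raise AttributeError on such a line. A minimal sketch of a more defensive variant, assuming you want None for missing values (the helper name get_attr and the shortened sample tags are made up for illustration):

```python
import re

def get_attr(tag, name):
    """Return the value of attribute `name` in `tag`, or None if absent."""
    # \b keeps 'id' from matching inside a longer name such as 'uid'
    match = re.search(r'\b{}="([^"]*)"'.format(re.escape(name)), tag)
    return match.group(1) if match else None

tags = ['<text id="32a45" language="ENG" date="2017-01-01">',
        '<text id="32a47" language="ENG">']  # second tag has no date

ids = [get_attr(t, 'id') for t in tags]
dates = [get_attr(t, 'date') for t in tags]
print(ids)    # ['32a45', '32a47']
print(dates)  # ['2017-01-01', None]
```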

1 Comment

Thanks. This is what I was looking for.
1

Parsing with ET:

import xml.etree.ElementTree as ET
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

id_ = []
date = []
for string in strings:
    tree = ET.fromstring(string+"</text>") #corrects wrong format
    id_.append(tree.get("id"))
    date.append(tree.get("date"))

print(id_) #  ['32a45', '32a47', '32a48']
print(date) # ['2017-01-01', '2017-01-05', '2017-01-07']

Update: a full, compact example, following your original problem described here: How can I build an sqlite table from this xml/txt file using python?

import xml.etree.ElementTree as ET
import pandas as pd

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

cols = ["id","language","date","time","timezone"]
data = [[ET.fromstring(string+"</text>").get(col) for col in cols] for string in strings]    
df = pd.DataFrame(data,columns=cols)

      id language        date   time timezone
0  32a45      ENG  2017-01-01  11:00  Eastern
1  32a47      ENG  2017-01-05   1:00  Central
2  32a48      ENG  2017-01-07   3:00  Pacific

Now you can use: df.to_sql()

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
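
If installing pandas is not an option, a similar table can probably be built with the standard-library sqlite3 module instead of df.to_sql(). A minimal sketch, assuming an in-memory database and a table named texts (both chosen here for illustration, not taken from the question):

```python
import sqlite3
import xml.etree.ElementTree as ET

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
           '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">']

cols = ["id", "language", "date", "time", "timezone"]
# Same repair trick as above: append the missing closing tag before parsing
rows = [[ET.fromstring(s + "</text>").get(c) for c in cols] for s in strings]

con = sqlite3.connect(":memory:")  # assumption: pass a file path for a persistent database
con.execute('CREATE TABLE texts ("id" TEXT, "language" TEXT, "date" TEXT, '
            '"time" TEXT, "timezone" TEXT)')
con.executemany("INSERT INTO texts VALUES (?, ?, ?, ?, ?)", rows)
con.commit()

stored = con.execute('SELECT "id", "date" FROM texts ORDER BY "id"').fetchall()
print(stored)  # [('32a45', '2017-01-01'), ('32a47', '2017-01-05')]
```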


0
ids = [i.split(' ')[1].split('=')[1].strip('"') for i in lst]
dates = [i.split(' ')[3].split('=')[1].strip('"') for i in lst]

Here lst is your list of strings (avoid calling it list, which shadows the built-in). But the file looks strange: if the original file is HTML or XML, there are better ways to get the data.
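
One such way, as a rough sketch: the standard-library html.parser module tolerates unclosed tags, so it can read the attributes without any repair step. The class name AttrCollector is invented for this example, and it assumes every line is a complete start tag:

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect the attributes of every <text> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "text":
            self.records.append(dict(attrs))

lines = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
         '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
         '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

parser = AttrCollector()
for line in lines:
    parser.feed(line)
parser.close()

ids = [r.get("id") for r in parser.records]
dates = [r.get("date") for r in parser.records]
print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```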

5 Comments

It is a broken xml file. I wish I could use xmltree on it. Thanks for the answer.
@AntonvBR Yes. That's after building the list. The original file looks something like this: stackoverflow.com/questions/45540531/… That's my primary goal. If you can help with that, that would be awesome. Thanks again :)
@0x1 Well seeing you extracted the thing already, have a look at my updated answer.
@AntonvBR That's great! Would you post your updated answer to the original question so that I can mark that as the correct answer?
@0x1 No worries, glad I could help.
0

As your provided data appears to be broken/partial XML fragments, I would personally try repairing the XML and using the xml.etree module to extract the data. However, if you have the correct XML that your current list came from, it would be easier to use the xml.etree module on that data directly.

An example solution using xml.etree:

from xml.etree import ElementTree as ET

data = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids = []
dates = []
for element in data:
    # Appending a closing tag repairs the fragment into valid XML
    # that ET.fromstring() can parse.
    root = ET.fromstring('{}</text>'.format(element))

    # The parsed root is the <text> element itself, so its
    # attributes can be read directly.
    ids.append(root.attrib['id'])
    dates.append(root.attrib['date'])

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']

3 Comments

You can remove <root> (ok you just did.. but now it is a copy of what I wrote?)
Yeah, I was mid-writing mine when yours came up. So we do kind of have duplicate answers, and yours was the reason I simplified by removing the <root> element.
No worries, happens all the time when you write answers. People are so quick.
0

Alongside the other answers, which are better, you can also parse the data manually (simpler):

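for line in lines:
    rest = line[line.index('"')+1:]   # drop everything up to the first quote
    value = rest[:rest.index('"')]    # the id is the text up to the next quote
    line = rest[rest.index('"')+1:]   # remainder, ready for the next attribute
    print('id: ' + value)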

You can then append it to a new list and repeat the same process for the other values, changing only the variable name.
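
The repetition can be avoided by splitting on the quote character once. This is only a sketch: the helper name quoted_values is hypothetical, and it assumes the attributes always appear in the same order, with id first and date third:

```python
def quoted_values(line):
    """Return the double-quoted values in `line`, in order of appearance."""
    # Splitting on '"' alternates between text outside and inside quotes,
    # so the quoted values sit at the odd indices.
    return line.split('"')[1::2]

lines = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
         '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
         '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids, dates = [], []
for line in lines:
    values = quoted_values(line)
    ids.append(values[0])    # assumption: id is always the first attribute
    dates.append(values[2])  # assumption: date is always the third attribute

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```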


0

Not as elegant as @RomanPerekhrest's solution using re, but here it goes:

def extract(lst, kwd):
   out = []
   for t in lst:
       index1 = t.index(kwd) + len(kwd) + 1
       index2 = index1 + t[index1:].index('"') + 1
       index3 = index2 + t[index2:].index('"')
       out.append(t[index2:index3])
   return out

Then

>>> extract(lst, kwd='id')
['32a45', '32a47', '32a48']


0

An easier-to-understand way with the re module. Here is the code:

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">', 
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
 '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

import re
ids = []
dates = []
for i in l:
    ids.append(re.search(r'id="(.+?)"', i, re.M|re.I).group(1))
    dates.append(re.search(r'date="(.+?)"', i, re.M|re.I).group(1))

Output:

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
