I have an XML-like text file that looks like this:

<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">

<s id="1">
foo
bar
</s>
<d>
11235
</d>

<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">

<s id="2">
foo
bar
</s>
<d>
11235
</d>

<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">

<s id="3">
foo
bar
</s>
<d>
11235
</d>

I want to build an SQLite table like the following using Python:

id      language  date        timezone  s        d
32a45   ENG       2017-01-01  Eastern   foo bar  11235
32a47   ENG       2017-01-05  Central   foo bar  11235
32a48   ENG       2017-01-07  Pacific   foo bar  11235

Any idea how I can do this? I cannot use the xmltree module because the XML tags in the original file are malformed (for example, the <text> tags are never closed). I would really appreciate the help. Thanks.

Edit: I can easily collect the opening tag of each text as a string in a list, like this:

['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">', '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">', '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

But I don't know how to extract the id, language, etc. from each string separately.

  • Any particular reason why you're tagging R in this? Commented Aug 7, 2017 at 6:38
  • Yes. If there is a built-in library to do this in R, that would work for me as well. I know there is an RSQLite package in R, but I have not used it before. That said, I don't think R would be any more useful here than Python; I'm just keeping the option open. Commented Aug 7, 2017 at 6:43
  • Is this about parsing XML or about putting something into SQLite? Seems like those are two separate problems, and simple enough that they've been answered individually multiple times on SO. (For instance, the first Google hit searching "python xml parse" brings up python2 help on the xml module, rife with example data and code.) Commented Aug 7, 2017 at 6:56
  • It is mainly about putting something into SQLite. And I cannot use xmltree in this case because of some tagging issues. Commented Aug 7, 2017 at 7:02

1 Answer

Redirected from here:

How can I make sublists from a list based on strings in python?

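import xml.etree.ElementTree as ET
import pandas as pd

# Opening <text> tags as extracted from the file (they have no closing tags).
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

cols = ["id", "language", "date", "time", "timezone"]
# Append the missing closing tag so each string parses as a well-formed element,
# then read every attribute listed in cols from it.
data = [[ET.fromstring(string + "</text>").get(col) for col in cols] for string in strings]
# One row per <text> tag, one column per attribute.
df = pd.DataFrame(data, columns=cols)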
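
The snippet above only recovers the attributes of each <text> tag. As a rough sketch of the remaining steps (picking up the <s> and <d> contents and writing everything to SQLite), something like the following might work. It uses regular expressions instead of an XML parser, on the assumption that the file always looks like the sample in the question: unclosed <text> tags, each followed by one <s>...</s> and one <d>...</d> block. The table name `texts`, the helper names `parse_records` and `build_table`, the all-TEXT column types, and the in-memory database are illustrative choices, not something taken from the question.

```python
import re
import sqlite3

# Sample input copied from the question; in practice this would be read from the file.
raw = '''<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">

<s id="1">
foo
bar
</s>
<d>
11235
</d>

<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">

<s id="2">
foo
bar
</s>
<d>
11235
</d>

<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">

<s id="3">
foo
bar
</s>
<d>
11235
</d>
'''

TEXT_TAG = re.compile(r'<text\b([^>]*)>')                # captures the attribute string
ATTR = re.compile(r'(\w+)="([^"]*)"')                    # name="value" pairs
S_BLOCK = re.compile(r'<s\b[^>]*>(.*?)</s>', re.DOTALL)  # contents of <s>...</s>
D_BLOCK = re.compile(r'<d\b[^>]*>(.*?)</d>', re.DOTALL)  # contents of <d>...</d>


def parse_records(raw):
    """Return one (id, language, date, timezone, s, d) tuple per <text> tag."""
    rows = []
    # Splitting on a pattern with one capturing group yields
    # [before, attrs1, body1, attrs2, body2, ...].
    parts = TEXT_TAG.split(raw)
    for attrs, body in zip(parts[1::2], parts[2::2]):
        fields = dict(ATTR.findall(attrs))
        s = S_BLOCK.search(body)
        d = D_BLOCK.search(body)
        rows.append((
            fields.get("id"),
            fields.get("language"),
            fields.get("date"),
            fields.get("timezone"),
            " ".join(s.group(1).split()) if s else None,  # collapse line breaks
            d.group(1).strip() if d else None,
        ))
    return rows


def build_table(conn, rows):
    """Create the (hypothetical) texts table and insert the parsed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS texts "
        "(id TEXT PRIMARY KEY, language TEXT, date TEXT, timezone TEXT, s TEXT, d TEXT)"
    )
    conn.executemany("INSERT INTO texts VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


rows = parse_records(raw)
conn = sqlite3.connect(":memory:")  # assumption: pass a file path for a persistent database
build_table(conn, rows)

for row in conn.execute("SELECT * FROM texts ORDER BY id"):
    print(row)
```

If the DataFrame route is preferred, pandas also accepts a `sqlite3` connection in `df.to_sql("texts", conn, index=False)`, although the <s> and <d> columns would still have to be added to the frame first.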