I am trying to convert an HTML table to JSON using BeautifulSoup in Python. I was able to convert some other tables to JSON, but this one has a different structure.

I am trying to achieve something like this:

[{
    "name": "abc",
    "age": "21",
    "sex": "m",
    "location": "us",
    "language": "en"
}, {
    "name": "xyz",
    "age": "25",
    "sex": "f",
    "location": "us",
    "language": "en"
}]

The table is:

<table><colgroup><col /><col /><col /><col /><col /></colgroup>
<tbody>
<tr>
<th><span>name</span></th>
<th><span>age</span></th>
<th><span>sex</span></th>
<th><span>location</span></th>
<th><span>language</span></th>
</tr>
<tr>
<td colspan="1">
<p><span>abc</span></p>
</td>
<td colspan="1"><span>21</span></td>
<td colspan="1"><span>m</span></td>
<td colspan="1">us</td>
<td colspan="1">en</td>
</tr>
<tr>
<td colspan="1">
<p><span>xyz</span></p>
</td>
<td colspan="1"><span>25</span></td>
<td colspan="1">f</td>
<td colspan="1">us</td>
<td colspan="1">en</td>
</tr>
</tbody>
</table>

Any help is appreciated.

2 Answers

You can, of course, build a list of dictionaries manually, but you can also do it without any explicit HTML parsing by going through a pandas.DataFrame using pandas.read_html():

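from io import StringIO
from pprint import pprint

import pandas as pd

data = """your HTML"""

# read_html() returns one DataFrame per <table>; take the first.
# Wrapping the string in StringIO avoids the deprecation warning that
# newer pandas versions emit for literal HTML strings.
df = pd.read_html(StringIO(data), flavor="lxml")[0]

# Promote the first row (the <th> cells) to column names and drop it from the data.
# Skip this step if your pandas version already uses the <th> row as the header.
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header

# One dict per row, keyed by column name
pprint(df.to_dict('records'))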

Prints:

[{'age': '21', 'language': 'en', 'location': 'us', 'name': 'abc', 'sex': 'm'},
 {'age': '25', 'language': 'en', 'location': 'us', 'name': 'xyz', 'sex': 'f'}]
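
Building the list of dictionaries manually, as mentioned at the start of this answer, is also an option if pandas is not available. Below is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. The names TableParser and table_to_records are made up for illustration, and the sketch assumes a single, simple table whose first row holds the headers (no nested tables, no cells spanning several columns):

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Join text from nested <p>/<span> tags and trim surrounding whitespace
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Return a list of dicts, using the first table row as the keys."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


html = """<table><colgroup><col /><col /><col /><col /><col /></colgroup>
<tbody>
<tr>
<th><span>name</span></th>
<th><span>age</span></th>
<th><span>sex</span></th>
<th><span>location</span></th>
<th><span>language</span></th>
</tr>
<tr>
<td colspan="1">
<p><span>abc</span></p>
</td>
<td colspan="1"><span>21</span></td>
<td colspan="1"><span>m</span></td>
<td colspan="1">us</td>
<td colspan="1">en</td>
</tr>
<tr>
<td colspan="1">
<p><span>xyz</span></p>
</td>
<td colspan="1"><span>25</span></td>
<td colspan="1">f</td>
<td colspan="1">us</td>
<td colspan="1">en</td>
</tr>
</tbody>
</table>"""

records = table_to_records(html)
print(json.dumps(records, indent=4))
```

With BeautifulSoup the same row-by-row idea should carry over: loop over soup.find_all("tr"), read each cell with get_text(strip=True), and zip the first row's values with every later row.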

1 Comment

I have a problem installing pandas: it depends on numpy, which won't install and fails with this error: DEPRECATION: Uninstalling a distutils installed project (numpy) has been deprecated and will be removed in a future version

I wrote a library to convert HTML to JSON which can do this: html-to-json.

You can install the library from PyPI. Once the library is installed, you would run:

import html_to_json

s = '''<table><colgroup><col /><col /><col /><col /><col /></colgroup>
<tbody>
<tr>
<th><span>name</span></th>
<th><span>age</span></th>
<th><span>sex</span></th>
<th><span>location</span></th>
<th><span>language</span></th>
</tr>
<tr>
<td colspan="1">
<p><span>abc</span></p>
</td>
<td colspan="1"><span>21</span></td>
<td colspan="1"><span>m</span></td>
<td colspan="1">us</td>
<td colspan="1">en</td>
</tr>
<tr>
<td colspan="1">
<p><span>xyz</span></p>
</td>
<td colspan="1"><span>25</span></td>
<td colspan="1">f</td>
<td colspan="1">us</td>
<td colspan="1">en</td>
</tr>
</tbody>
</table>'''


html_to_json.convert_tables(s)

This gives you the following (the \n around the values of the name key appear because the <td> containing each name has newlines in it):

[
  [
    {
      "name": "\nabc\n",
      "age": "21",
      "sex": "m",
      "location": "us",
      "language": "en"
    },
    {
      "name": "\nxyz\n",
      "age": "25",
      "sex": "f",
      "location": "us",
      "language": "en"
    }
  ]
]
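
If the newlines are unwanted, one option is to post-process the result. This is a minimal sketch, assuming the result is the nested list of dictionaries shown above. strip_values is a hypothetical helper written for this example, not part of html-to-json:

```python
def strip_values(obj):
    """Recursively strip leading/trailing whitespace from every string value."""
    if isinstance(obj, str):
        return obj.strip()
    if isinstance(obj, list):
        return [strip_values(item) for item in obj]
    if isinstance(obj, dict):
        # Keys are left untouched; only the values are cleaned
        return {key: strip_values(value) for key, value in obj.items()}
    return obj


# Sample data shaped like the output shown above (shortened to two keys)
tables = [[{"name": "\nabc\n", "age": "21"}, {"name": "\nxyz\n", "age": "25"}]]

cleaned = strip_values(tables)
first_table = cleaned[0]  # the list of row dictionaries
```

The helper walks lists and dictionaries generically, so it does not depend on how deeply the rows are nested.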

Hope this is helpful!
