Pandas JSON Table

Question

I would like to use the first type object (always starts with "document_") from this json as a row and all others as a column

[
  [
    {
      "type": "document_type_bank_statement"
    },
    {
      "type": "sender_iban"
    },
    {
      "type": "sender_vat_id"
    }
  ],
  [
    {
      "type": "document_type_bank_statement"
    },
    {
      "type": "sender_iban"
    },
    {
      "type": "sender_vat_id"
    }
  ],
  [
    {
      "type": "document_type_invoice"
    },
    {
      "type": "sender_zip"
    }
  ]
]

Example:

	sender_iban	sender_vat_id	sender_zip
document_type_bank_statement	2	2	0
document_type_invoice	0	0	1

This will give me the first object:

for type in pagewise_data:
    print(type[0]['type'])

So all the others:

for type in pagewise_data:
    for i in type:
        if not i['type'].startswith('document'):
            print(i['type'])

This is how i get all types

for type in pagewise_data:
    for i in type:
        print(i['type'])

My question now is, how do i get this to work like in my given example table with pandas? so that you can also count the existing types that occur in an array block?

How should that be possible? document_type_invoice has far fewer columns then document_type_bank_statement ? — Patrick Artner
– Patrick Artner, Commented Mar 7, 2022 at 11:27
i have updated the json and table example with exact data as i imagine it, if no field occurs then 0 should be displayed. — Jannik Buscha
– Jannik Buscha, Commented Mar 7, 2022 at 12:05

aaossa · Accepted Answer · 2022-03-07 13:36:05Z

We can start by loading our data using pd.DataFrame.from_dict.

>>> df = pd.DataFrame.from_dict(data)
>>> df                                            0  ...                          2
0  {'type': 'document_type_bank_statement'}  ...  {'type': 'sender_vat_id'}
1  {'type': 'document_type_bank_statement'}  ...  {'type': 'sender_vat_id'}
2         {'type': 'document_type_invoice'}  ...                       None

Then, we can .melt our dataframe to create pairs of values (first value as future ID and the second value that will be counted in a column at the end). We'll drop NaN values too:

>>> df = df.melt(id_vars=[0]).drop("variable", axis=1)
>>> df = df.dropna()
>>> df
                                          0                      value
0  {'type': 'document_type_bank_statement'}    {'type': 'sender_iban'}
1  {'type': 'document_type_bank_statement'}    {'type': 'sender_iban'}
2         {'type': 'document_type_invoice'}     {'type': 'sender_zip'}
3  {'type': 'document_type_bank_statement'}  {'type': 'sender_vat_id'}
4  {'type': 'document_type_bank_statement'}  {'type': 'sender_vat_id'}

Since each value is a dict with a "type" key, we'll use .applymap to take the corresponding value of that key for each cell in our dataframe:

>>> df = df.applymap(lambda d: d["type"])
>>> df
                              0          value
0  document_type_bank_statement    sender_iban
1  document_type_bank_statement    sender_iban
2         document_type_invoice     sender_zip
3  document_type_bank_statement  sender_vat_id
4  document_type_bank_statement  sender_vat_id

Finally, let's use pd.pivot_table to format our data and count the frequency of each term. Use the fill_value argument to fill the output with zeros. Also, we can format our index:

>>> df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
>>> df.index = df.index.map(lambda _id: _id[9:])
>>> df
value                sender_iban  sender_vat_id  sender_zip
0                                                          
type_bank_statement            2              2           0
type_invoice                   0              0           1

Resulting code:

df = pd.DataFrame.from_dict(data)
df = df.melt(id_vars=[0]).drop("variable", axis=1)
df = df.dropna()
df = df.applymap(lambda d: d["type"])
df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
df.index = df.index.map(lambda _id: _id[9:])

wow, works perfect! but i would like to have nothing instead of "value" in the top left, how do i do that? i get an error as soon as i remove this
It's just the name of the column index, you can remove it using df.columns.name = None

Jan Tkacik · Accepted Answer · 2022-03-07 14:18:46Z

Following code should work:

from collections import Counter
from itertools import chain
import pandas as pd

def extract_type(value):
    if value is not None:
        return value["type"]
    return value


def count_values(df):
    counter = Counter(
        list(chain.from_iterable([list(row[1]) for row in df.iterrows()]))
    )
    del counter[None]
    return pd.DataFrame.from_dict(counter.items()).set_index(0).T


original_data = pd.DataFrame.from_records(test_data)
extracted_types = original_data.applymap(extract_type).set_index(0)
counted_values = extracted_types.groupby(0).apply(count_values).fillna(0).droplevel(1)

If test_data is data you provided counted_values should contain:

0	sender_iban	sender_vat_id	sender_zip
document_type_bank_statement	2.0	2.0	0.0
document_type_invoice	0.0	0.0	1.0

Collectives™ on Stack Overflow

Pandas JSON Table

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related