1

I would like to use the first type object (always starts with "document_") from this json as a row and all others as a column

[
  [
    {
      "type": "document_type_bank_statement"
    },
    {
      "type": "sender_iban"
    },
    {
      "type": "sender_vat_id"
    }
  ],
  [
    {
      "type": "document_type_bank_statement"
    },
    {
      "type": "sender_iban"
    },
    {
      "type": "sender_vat_id"
    }
  ],
  [
    {
      "type": "document_type_invoice"
    },
    {
      "type": "sender_zip"
    }
  ]
]

Example:

sender_iban sender_vat_id sender_zip
document_type_bank_statement 2 2 0
document_type_invoice 0 0 1

This will give me the first object:

for type in pagewise_data:
    print(type[0]['type'])

So all the others:

for type in pagewise_data:
    for i in type:
        if not i['type'].startswith('document'):
            print(i['type'])

This is how i get all types

for type in pagewise_data:
    for i in type:
        print(i['type'])

My question now is, how do i get this to work like in my given example table with pandas? so that you can also count the existing types that occur in an array block?

2
  • How should that be possible? document_type_invoice has far fewer columns then document_type_bank_statement ? Commented Mar 7, 2022 at 11:27
  • i have updated the json and table example with exact data as i imagine it, if no field occurs then 0 should be displayed. Commented Mar 7, 2022 at 12:05

2 Answers 2

1

We can start by loading our data using pd.DataFrame.from_dict.

>>> df = pd.DataFrame.from_dict(data)
>>> df                                            0  ...                          2
0  {'type': 'document_type_bank_statement'}  ...  {'type': 'sender_vat_id'}
1  {'type': 'document_type_bank_statement'}  ...  {'type': 'sender_vat_id'}
2         {'type': 'document_type_invoice'}  ...                       None

Then, we can .melt our dataframe to create pairs of values (first value as future ID and the second value that will be counted in a column at the end). We'll drop NaN values too:

>>> df = df.melt(id_vars=[0]).drop("variable", axis=1)
>>> df = df.dropna()
>>> df
                                          0                      value
0  {'type': 'document_type_bank_statement'}    {'type': 'sender_iban'}
1  {'type': 'document_type_bank_statement'}    {'type': 'sender_iban'}
2         {'type': 'document_type_invoice'}     {'type': 'sender_zip'}
3  {'type': 'document_type_bank_statement'}  {'type': 'sender_vat_id'}
4  {'type': 'document_type_bank_statement'}  {'type': 'sender_vat_id'}

Since each value is a dict with a "type" key, we'll use .applymap to take the corresponding value of that key for each cell in our dataframe:

>>> df = df.applymap(lambda d: d["type"])
>>> df
                              0          value
0  document_type_bank_statement    sender_iban
1  document_type_bank_statement    sender_iban
2         document_type_invoice     sender_zip
3  document_type_bank_statement  sender_vat_id
4  document_type_bank_statement  sender_vat_id

Finally, let's use pd.pivot_table to format our data and count the frequency of each term. Use the fill_value argument to fill the output with zeros. Also, we can format our index:

>>> df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
>>> df.index = df.index.map(lambda _id: _id[9:])
>>> df
value                sender_iban  sender_vat_id  sender_zip
0                                                          
type_bank_statement            2              2           0
type_invoice                   0              0           1

Resulting code:

df = pd.DataFrame.from_dict(data)
df = df.melt(id_vars=[0]).drop("variable", axis=1)
df = df.dropna()
df = df.applymap(lambda d: d["type"])
df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
df.index = df.index.map(lambda _id: _id[9:])
Sign up to request clarification or add additional context in comments.

2 Comments

wow, works perfect! but i would like to have nothing instead of "value" in the top left, how do i do that? i get an error as soon as i remove this
It's just the name of the column index, you can remove it using df.columns.name = None
0

Following code should work:

from collections import Counter
from itertools import chain
import pandas as pd

def extract_type(value):
    if value is not None:
        return value["type"]
    return value


def count_values(df):
    counter = Counter(
        list(chain.from_iterable([list(row[1]) for row in df.iterrows()]))
    )
    del counter[None]
    return pd.DataFrame.from_dict(counter.items()).set_index(0).T


original_data = pd.DataFrame.from_records(test_data)
extracted_types = original_data.applymap(extract_type).set_index(0)
counted_values = extracted_types.groupby(0).apply(count_values).fillna(0).droplevel(1)

If test_data is data you provided counted_values should contain:

0 sender_iban sender_vat_id sender_zip
document_type_bank_statement 2.0 2.0 0.0
document_type_invoice 0.0 0.0 1.0

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.