I am working on an NLP model that parses questions and their corresponding answers from news articles. However, to work within limited RAM, I’ve had to break each article down into sentences, parse the QnA pairs for each sentence individually, and append the results. Depending on how many articles, paragraphs, and sentences there are, the processing spits out an increasingly complex nested mess:
[[{'answer': 'Meta Platforms ',
'question': 'What is the name of the company that became two of the most talked-about social media companies in recent months?'},
{'answer': ' Twitter ',
'question': 'What was the name of the two most talked-about social media companies in recent months?'},
{'answer': 'Elon Musk ', 'question': 'Who took a 9.2% stake in Twitter?'}],
[{'answer': '$54.20 ',
'question': 'How much did Musk bid to acquire Twitter?'},
{'answer': ' $44 billion ',
'question': 'How much did Musk bid to acquire Twitter?'}],
[{'answer': 'a "poison pill" defense ',
'question': "What did Twitter initially adopt against Musk's offer?"},
{'answer': 'failing to provide adequate information about its spam and bot accounts ',
'question': 'Why did Musk try to back out of the deal?'}],
[{'answer': 'Twitter ',
'question': 'Which company has a stock price below Musk\'s "best and final" offer?'},
{'answer': ' 26% ',
'question': 'What is Twitter\'s stock price below Musk\'s "best and final" offer?'},
{'answer': 'Getty Images ',
'question': "What is the name of the image source that Twitter's stock price remains 26% below Musk's offer?"},
{'answer': 'February ', 'question': "When did Meta's downfall start?"}],
[{'answer': 'ByteDance ', 'question': 'Who was TikTok?'},
{'answer': ' ⁇!--//--> ⁇! ',
'question': "What was the name of the company's first-quarter report in April?"},
{'answer': ' ⁇!-- googletag.cmd.push ',
'question': "What was the name of the company's first-quarter report?"},
{'answer': 'Sheryl Sandberg ',
'question': "Who was Meta's chief operating officer?"}],
[{'answer': 'deteriorating macro environment ',
'question': 'What caused Snap to reduce its second-quarter guidance in late May?'},
{'answer': ' deteriorating macro environment ',
'question': 'What caused Snap to reduce its second-quarter guidance in late May?'},
{'answer': 'Twitter and Meta ',
'question': 'Which two companies have been terrible investments over the past 12 months?'}],
[{'answer': 'Twitter ',
'question': "What company's stock has tumbled more than 30% during that period?"},
{'answer': ' Meta ', 'question': "Who's stock has plummeted over 40%?"},
{'answer': ' 30% ',
'question': "How much has Twitter's stock tumbled during that period?"}],
This is a partial example. When simply doing:
pd.DataFrame(data)
It results in a DataFrame with 13578 rows × 32 columns, so there is a lot of nesting, and its depth is unpredictable and depends on the articles provided. I tried adapting flatten-dict and deep flatten to get the data into a more familiar shape, but neither option got me any closer.
What I need is to turn the output into a DataFrame with 2 columns, question and answer. I’ve tried specifying the columns when flattening, but it always results in errors. Any tips on how to go about this for an arbitrary, unknown nesting depth?
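For illustration, here is a minimal sketch of one possible way to get that two-column shape. It assumes that every leaf of the structure is a dict with 'question' and 'answer' keys holding strings, and that everything above the leaves is a list or tuple of unknown depth. The names iter_qa, to_qa_frame, and sample are hypothetical and only used for this example; the sample data is a small hand-nested subset of the output shown above.

```python
import pandas as pd


def iter_qa(node):
    """Yield every dict holding both 'question' and 'answer', at any nesting depth."""
    if isinstance(node, dict):
        if "question" in node and "answer" in node:
            yield node
        else:
            # Assumption: a dict without both keys may still wrap deeper QnA dicts.
            for value in node.values():
                yield from iter_qa(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_qa(item)
    # Anything else (None, NaN, stray strings) is skipped.


def to_qa_frame(data):
    """Build a two-column DataFrame from arbitrarily nested QnA output."""
    rows = [
        {
            "question": str(qa["question"]).strip(),
            "answer": str(qa["answer"]).strip(),  # drop the padding spaces around answers
        }
        for qa in iter_qa(data)
    ]
    return pd.DataFrame(rows, columns=["question", "answer"])


# Deliberately uneven nesting to mimic the unknown depth.
sample = [
    [{"answer": "Elon Musk ", "question": "Who took a 9.2% stake in Twitter?"}],
    [[{"answer": " $44 billion ", "question": "How much did Musk bid to acquire Twitter?"}]],
    [],
]

df = to_qa_frame(sample)
print(df)
```

Walking the structure with a generator, rather than passing it straight to pd.DataFrame, avoids having pandas spread each nesting level across columns. Duplicate or garbled pairs are kept as-is; something like df.drop_duplicates() could be applied afterwards if that is wanted.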