I have an unstructured file that I need to parse using Python. After some initial manipulation while retrieving the file, the data is in the following format (the titles are simply dummies; they can be anything, e.g. INDEX, LENGTH, WIDTH, etc.):
data = [
[" title1-a", "title2-a", "title3-a", " title4-a"],
["title1-b ", " title2-b", "title3-b ", "title4-b"],
["title3-c", " title4-c "],
["title1-a ", " title5-a"],
["title1-b", " title5-b"],
["title5-c "]
]
The above data is a dummy. The real data set looks like this:
real = [
['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4'],
['TIME', 'WWIR'],
['DAYS', 'STB/DAY'],
['I1']
]
Note that each title is a concatenation of entries from three lists! So, grouped in threes, the real data is:
real = [[
['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']],[
['TIME', 'WWIR'],
['DAYS', 'STB/DAY'],
['I1']
]]
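As a minimal sketch of that grouping, assuming every title block spans exactly three consecutive rows, the flat list can be sliced into blocks as below. The name `group_rows` and its `rows_per_title` parameter are only illustrative.

```python
def group_rows(rows, rows_per_title=3):
    """Split a flat list of rows into consecutive blocks of rows_per_title rows."""
    return [
        rows[start:start + rows_per_title]
        for start in range(0, len(rows), rows_per_title)
    ]


real = [
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4'],
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1'],
]

# Each block holds the three rows that together describe one set of titles.
grouped = group_rows(real)
for block in grouped:
    print(block)
```

Slicing only copies the outer list; the inner row lists are shared with `real`, not duplicated.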
The real data would be parsed to produce the following strings:
TIME DAYS
YEARS YEARS
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1
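The strings above suggest that a short last row lines up with the rightmost columns of the rows above it. Under that assumption, one block could be joined column by column roughly as follows. `join_block` is a hypothetical helper, and it builds padded copies of the rows instead of modifying them in place, which trades a little copying for leaving the input untouched.

```python
def join_block(block):
    """Join one block of title rows column by column into title strings."""
    width = max(len(row) for row in block)
    # Left-pad short rows with empty strings so they line up with the
    # rightmost columns (an assumption inferred from the expected strings).
    padded = [[''] * (width - len(row)) + row for row in block]
    titles = []
    for column in zip(*padded):
        # Strip every entry and drop the empty padding before joining.
        parts = [cell.strip() for cell in column]
        titles.append(' '.join(part for part in parts if part))
    return titles


block = [
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1'],
]
print(join_block(block))  # ['TIME DAYS', 'WWIR STB/DAY I1']
```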
The objectives are as follows:
- Concatenate associated title entries;
- The order of titles MUST be preserved;
- No duplication allowed;
- Minimize copy operations where possible.
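For the ordering and no-duplication objectives, one possible approach is `dict.fromkeys`, since dictionaries keep their keys in insertion order (guaranteed from Python 3.7 onwards), so only the first occurrence of each title survives. A minimal sketch with a made-up sample list:

```python
# Sample titles with one repeated entry ('TIME DAYS'); the list is illustrative only.
titles = [
    'TIME DAYS',
    'YEARS YEARS',
    'WWPR STB/DAY P1',
    'TIME DAYS',
    'WWIR STB/DAY I1',
]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique_titles = list(dict.fromkeys(titles))
print(unique_titles)
```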
Based on the dummy data, the desired output would look like this:
output = [
"title1-a title1-b",
"title2-a title2-b",
"title3-a title3-b title3-c",
"title4-a title4-b title4-c",
"title5-a title5-b title5-c"
]
I have developed a solution. That said, there must be a cleaner and more efficient way, so I would be keen to investigate alternative solutions. The following is the code I developed to get the above data into the desired output format.
def _getTitleData(title_data):
    seen = set()
    titleRows = 3
    # bundle title row(s)
    titles = [
        title_data[index:index + titleRows]
        for index in range(0, len(title_data), titleRows)
    ]
    # apply padding to simplify concatenation
    # (note: this left-pads the last row in place, so title_data is modified)
    for title in titles:
        firstRow = title[0]
        lastRow = title[-1]
        lengthFirstRow = len(firstRow)
        lengthLastRow = len(lastRow)
        if lengthFirstRow > lengthLastRow:
            for index in range(lengthFirstRow - lengthLastRow):
                lastRow.insert(0, '')
    # strip each entry, then concatenate titles column by column
    titles = [
        ' '.join(entry.strip() for entry in word).strip()
        for title in titles
        for word in zip(*title)
    ]
    # remove duplicate entries while preserving order
    titles = [
        title
        for title in titles
        if not (title in seen or seen.add(title))
    ]
    for title in titles:
        print(title)
    return titles