Data Manipulation using Python

Question

I have an unstructured file which I need to parse using Python. After some initial manipulation while retrieving the file, the data is in the following format (the titles are simply dummies; they can be anything, e.g. INDEX, LENGTH, WIDTH etc.):

data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"], 
    ["title1-b ", " title2-b", "title3-b ", "title4-b"], 
    ["title3-c", " title4-c  "],
    ["title1-a ", "  title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "]
]

The data above is a dummy. The real data set looks like this:

real = [
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4'],
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]

Note that each title is a concatenation of entries from three consecutive lists, so the real data groups as follows:

real = [[
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']],[
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]]
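
The grouping shown above can be sketched with plain slicing. This is only an illustration: `group_rows` and `rows_per_group` are names introduced here, and the sketch assumes every title block spans exactly three rows, as in the examples. It is run on the dummy data to keep the lines short.

```python
def group_rows(rows, rows_per_group=3):
    """Split a flat list of rows into consecutive groups of rows_per_group."""
    # Slicing builds new outer lists, but the inner row lists are shared
    # with the input rather than copied.
    return [
        rows[start:start + rows_per_group]
        for start in range(0, len(rows), rows_per_group)
    ]


# dummy data from the question
data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"],
    ["title1-b ", " title2-b", "title3-b ", "title4-b"],
    ["title3-c", " title4-c  "],
    ["title1-a ", "  title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "],
]

groups = group_rows(data)
print(len(groups))    # 2
print(groups[1][2])   # ['title5-c ']
```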

The real data would be parsed to produce the following strings:

TIME DAYS
YEARS YEARS
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1
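
One way to arrive at the strings above without padding or modifying the input is sketched here. It is an assumption-laden illustration, not a definitive implementation: `build_titles` and `rows_per_group` are hypothetical names, title blocks are assumed to span three rows, and rows shorter than the widest row in a block are assumed to be right-aligned (which is what the expected strings imply, since P1 pairs with the first WWPR).

```python
def build_titles(rows, rows_per_group=3):
    """Join right-aligned title columns, keeping first occurrences in order."""
    titles = {}  # dicts preserve insertion order (Python 3.7+)
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        width = max(len(row) for row in group)
        for col in range(width):
            parts = []
            for row in group:
                # A short row is treated as right-aligned: its first entry
                # sits under column (width - len(row)) of the widest row.
                idx = col - (width - len(row))
                if idx >= 0 and row[idx].strip():
                    parts.append(row[idx].strip())
            if parts:
                # setdefault leaves an existing key untouched, so duplicates
                # are dropped and the first position is kept
                titles.setdefault(" ".join(parts), None)
    return list(titles)


# real data from the question, as a flat list of rows
real = [
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4'],
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1'],
]

titles = build_titles(real)
for title in titles:
    print(title)
```

Because the short rows are indexed with an offset instead of being padded, the input lists are left unchanged and no row is copied.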

The objectives are as follows:

Concatenate associated title entries;
The order of titles MUST be preserved;
No duplication allowed;
Minimize copy operations where possible.

Based on the dummy data, the desired output would look like the one below

output = [
    "title1-a title1-b", 
    "title2-a title2-b",
    "title3-a title3-b title3-c",
    "title4-a title4-b title4-c",
    "title5-a title5-b title5-c"
]

I have developed a solution, but there must be a cleaner and more efficient way, so I am keen to see alternative solutions. The following is the code I developed to get the above data into the desired output format.

def _getTitleData(title_data):
    seen = set()
    titleRows = 3

    # bundle title rows into groups of titleRows
    titles = [
        title_data[index:index + titleRows]
        for index in range(0, len(title_data), titleRows)
    ]

    # left-pad the last row of each group so it lines up with the first row
    # (note: this modifies the input lists in place)
    for title in titles:
        firstRow = title[0]
        lastRow = title[-1]
        for _ in range(len(firstRow) - len(lastRow)):
            lastRow.insert(0, '')

    # strip each entry and concatenate the non-empty ones column by column,
    # so stray whitespace does not end up inside the joined title
    titles = [
        ' '.join(part.strip() for part in column if part.strip())
        for title in titles
        for column in zip(*title)
    ]

    # remove duplicate entries while preserving order
    titles = [
        title
        for title in titles
        if not (title in seen or seen.add(title))
    ]

    for title in titles:
        print(title)
    return titles

:s will there really be no other suitable "pythonic" solution to the above?
– e.doroskevic, Commented Aug 10, 2018 at 19:16

Nhi Vương · Accepted Answer · 2018-08-10 15:12:29Z

2

Please take a look at my suggestion:

data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"], 
    ["title1-b ", " title2-b", "title3-b ", "title4-b"], 
    ["title3-c", " title4-c  "],
    ["title1-a ", "  title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "]
]

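# collect every entry once, with surrounding spaces removed
unique = set()

for row in data:
    for title in row:
        unique.add(title.strip(" "))

# sorted() accepts any iterable, so no intermediate list is needed
print(sorted(unique))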

answered Aug 10, 2018 at 15:12

Nhi Vương

3 Comments

Vasilis G. Over a year ago

Slightly modify your code to give the OP's desired output, which is a concatenation of titles for each row. Your code is working great btw.

Nhi Vương Over a year ago

Oops, I just realized that I missed the concatenation part.

e.doroskevic Over a year ago

Excellent solution. I will vote it up. That said, I cannot use sorted because the titles in the original post are simply dummies. They are just there to demonstrate the concept. I need to make sure I preserve the order of the titles.
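
The ordering concern raised in this comment can be illustrated with a small sketch. The sample list below is made up for the illustration (LENGTH M is not from the question's data); `dict.fromkeys` is one common way to drop duplicates while keeping first-seen order, relying on dicts preserving insertion order in Python 3.7 and later.

```python
# made-up sample: duplicates present, not in alphabetical order
titles = ["WWPR STB/DAY P1", "TIME DAYS", "WWPR STB/DAY P1", "LENGTH M", "TIME DAYS"]

# sorted(set(...)) removes duplicates but reorders the titles alphabetically
alphabetical = sorted(set(titles))
print(alphabetical)
# ['LENGTH M', 'TIME DAYS', 'WWPR STB/DAY P1']

# dict.fromkeys keeps the first occurrence of each title in its original position
in_order = list(dict.fromkeys(titles))
print(in_order)
# ['WWPR STB/DAY P1', 'TIME DAYS', 'LENGTH M']
```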

Vasilis G. · Accepted Answer · 2018-08-10 16:30:21Z

1

Based on the real data you provided, this is the solution that I came up with:

real = [[
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']],[
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]]

# pad the last row of each group on the left so it lines up with the first row
for group in real:
    group[-1] = [' '] * (len(group[0]) - len(group[-1])) + group[-1]

# join each column into a title, keeping only the first occurrence of each
dups = set()
output = []
for group in real:
    for column in zip(*group):
        title = " ".join(column)
        if title not in dups:
            dups.add(title)
            output.append(title)

for title in output:
    print(title)

Output:

TIME DAYS  
YEARS YEARS  
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1
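
The first two lines of the output above end in blanks because the `' '` padding entry is joined like any other title part. A minimal sketch of one way to avoid that, assuming blank entries carry no meaning, is to join only the non-blank parts:

```python
# one column from the first group, where the last entry is the ' ' padding
column = ("TIME", "DAYS", " ")

padded = " ".join(column)
print(repr(padded))   # 'TIME DAYS  '

# skip parts that are empty or whitespace-only before joining
clean = " ".join(part for part in column if part.strip())
print(repr(clean))    # 'TIME DAYS'
```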

edited Aug 10, 2018 at 16:30

answered Aug 10, 2018 at 15:13

Vasilis G.

4 Comments

e.doroskevic Over a year ago

Also an excellent solution. This said, I cannot use sorted. The titles in dummy data are only for show. The real titles can be anything e.g. LENGTH, WIDTH, COST etc. This said, I do need to make sure I preserve the order.

Vasilis G. Over a year ago

Thanks for the feedback. I will edit my answer but it will be a bit longer.

e.doroskevic Over a year ago

I appreciate it very much. Python is not a language I normally write in. I can already see a massive improvement to my current code. Your time is very much appreciated.

e.doroskevic Over a year ago

I have added the real case. The data which I parse, and the result of it being parsed. I hope these edits make the objectives more explicit.
