I have an unstructured file that I need to parse using Python. After some initial manipulation while retrieving the file, the data is in the following format (the titles are simply dummies; they can be anything, e.g. INDEX, LENGTH, WIDTH etc.):

data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"], 
    ["title1-b ", " title2-b", "title3-b ", "title4-b"], 
    ["title3-c", " title4-c  "],
    ["title1-a ", "  title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "]
]

The above data is a dummy. The real data set looks like this:

real = [
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4'],
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]

Note that each title is a concatenation of entries from three consecutive lists, so the real data can be grouped like this:

real = [[
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']],[
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]]
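The regrouping above can be sketched as slicing the flat list into consecutive chunks of three rows. The helper name `group_rows` and the fixed group size of three are assumptions made for illustration; the slices create new outer lists, but the row lists themselves are shared rather than copied.

```python
def group_rows(rows, size=3):
    """Split a flat list of title rows into consecutive groups of `size` rows."""
    # Each slice is a new outer list; the inner row lists are not copied.
    return [rows[i:i + size] for i in range(0, len(rows), size)]


flat = [
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4'],
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1'],
]

grouped = group_rows(flat)
print(len(grouped))  # number of title groups
```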

The real data would be parsed to produce the following strings:

TIME DAYS
YEARS YEARS
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1

The objectives are as follows:

  1. Concatenate associated title entries;
  2. The order of titles MUST be preserved;
  3. No duplication allowed;
  4. Minimize copy operations where possible.
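Objectives 2 and 3 together amount to an order-preserving de-duplication. One possible sketch, assuming Python 3.7 or later (where `dict` is guaranteed to keep insertion order); the helper name `unique_in_order` is hypothetical:

```python
def unique_in_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


titles = ["TIME DAYS", "YEARS YEARS", "WWPR STB/DAY P1", "TIME DAYS", "WWIR STB/DAY I1"]
print(unique_in_order(titles))
```

Unlike `sorted(set(...))`, this never reorders the titles, so it works for arbitrary title names.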

Based on the dummy data, the desired output would be:

output = [
    "title1-a title1-b", 
    "title2-a title2-b",
    "title3-a title3-b title3-c",
    "title4-a title4-b title4-c",
    "title5-a title5-b title5-c"
]
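As a rough end-to-end sketch of how the dummy data could be turned into the output above without padding or otherwise modifying the input rows: shorter rows are addressed through an index offset so that they line up with the right-hand end of the first row. The function name `merge_titles`, the group size of three, and the assumption that the first row of each group is the longest are all illustrative, not taken from the original file format.

```python
def merge_titles(rows, size=3):
    """Join title parts column by column, keeping first occurrences only."""
    seen = set()
    merged = []
    for start in range(0, len(rows), size):
        group = rows[start:start + size]
        width = len(group[0])  # assumption: the first row is the longest
        for col in range(width):
            parts = []
            for row in group:
                # Right-align shorter rows against the first row.
                idx = col - (width - len(row))
                if 0 <= idx < len(row):
                    part = row[idx].strip()
                    if part:
                        parts.append(part)
            title = " ".join(parts)
            if title not in seen:
                seen.add(title)
                merged.append(title)
    return merged


data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"],
    ["title1-b ", " title2-b", "title3-b ", "title4-b"],
    ["title3-c", " title4-c  "],
    ["title1-a ", "  title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "],
]

for title in merge_titles(data):
    print(title)
```

Stripping each part before joining avoids doubled spaces inside a title, and because nothing is inserted into the rows, `data` is left unchanged.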

I have developed a solution, but there must be a cleaner and more efficient way, so I would be keen to see alternative solutions. The following is the code I developed to get the above data into the desired output format.

def _getTitleData(title_data):
    seen = set()
    titleRows = 3

    # bundle title rows into groups of titleRows
    titles = [
        title_data[index:index + titleRows]
        for index in range(0, len(title_data), titleRows)
    ]

    # left-pad the last row of each group so it lines up with the first row
    # (note: this modifies the rows of title_data in place)
    for title in titles:
        firstRow = title[0]
        lastRow = title[-1]
        padding = len(firstRow) - len(lastRow)
        if padding > 0:
            lastRow[:0] = [''] * padding

    # strip each part, skip the empty padding and concatenate column by column
    titles = [
        ' '.join(part.strip() for part in word if part.strip())
        for title in titles
        for word in zip(*title)
    ]

    # remove duplicate entries while preserving order
    titles = [
        title
        for title in titles
        if not (title in seen or seen.add(title))
    ]

    for title in titles:
        print(title)
    return titles
  • Is there really no other suitable "Pythonic" solution to the above? Commented Aug 10, 2018 at 19:16

2 Answers

Please take a look at my suggestion:

data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"], 
    ["title1-b ", " title2-b", "title3-b ", "title4-b"], 
    ["title3-c", " title4-c  "],
    ["title1-a ", "  title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "]
]

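# Collect every individual title once, with surrounding whitespace removed.
unique = set()

for row in data:
    for title in row:
        unique.add(title.strip())

# sorted() accepts any iterable, so no intermediate list is needed.
# Note: this orders the titles alphabetically and does not concatenate them.
print(sorted(unique))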

3 Comments

Slightly modify your code to give the OP's desired output, which is a concatenation of the titles in each column. Your code is working great, btw.
Oops, I just realized that I missed the concatenation part.
Excellent solution. I will vote it up. This said, I cannot use sorted because titles in original post are simply dummies. These titles are just there to demonstrate concept. Nevertheless, I need to make sure I preserve the order of titles.

Based on the real data you provided, this is the solution that I came up with:

real = [[
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']],[
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]]

# Left-pad the last row of each group so it lines up with the first row.
for group in real:
    group[-1] = [''] * (len(group[0]) - len(group[-1])) + group[-1]

# Join column by column, skip the empty padding, keep first occurrences only.
seen = set()
output = []
for group in real:
    for column in zip(*group):
        title = " ".join(part for part in column if part)
        if title not in seen:
            seen.add(title)
            output.append(title)

for title in output:
    print(title)

Output:

TIME DAYS
YEARS YEARS
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1
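A possible variation that avoids padding the rows altogether: walk each row from its right-hand end with `itertools.zip_longest`, so that shorter rows are right-aligned automatically, then restore the original column order. The helper name `right_aligned_columns` is hypothetical, and this is only a sketch under the assumption that shorter rows always belong to the rightmost columns, as they do in the data above.

```python
from itertools import zip_longest


def right_aligned_columns(group):
    """Return the columns of `group`, with shorter rows right-aligned."""
    # Zipping the reversed rows pairs up the last entries first; missing
    # entries on the left of a short row are filled with "".
    reversed_columns = list(
        zip_longest(*(reversed(row) for row in group), fillvalue="")
    )
    # Undo the reversal so columns come out in their original order.
    return reversed_columns[::-1]


real = [[
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']], [
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]]

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
output = list(dict.fromkeys(
    " ".join(part for part in column if part)
    for group in real
    for column in right_aligned_columns(group)
))

for title in output:
    print(title)
```

The input lists are only read, never modified, at the cost of building one temporary list of column tuples per group.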

4 Comments

Also an excellent solution. This said, I cannot use sorted. The titles in dummy data are only for show. The real titles can be anything e.g. LENGTH, WIDTH, COST etc. This said, I do need to make sure I preserve the order.
Thanks for the feedback. I will edit my answer but it will be a bit longer.
I appreciate it very much. Python is not a language I normally write in. I can already see a massive improvement to my current code. Your time is much appreciated.
I have added the real case. The data which I parse, and the result of it being parsed. I hope these edits make the objectives more explicit.
