
Given array1, I want to find the first occurrence of each unique csv entry. The array is already ordered by date, newest first, so the first occurrence of an entry is also its most recent.

array1=(url://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ url://root/sub1/sub2/2022-07-22/ a.csv/)

I want to return an array containing the most recent occurrence of each unique csv entry (with the full paths):

array2=(url://root/sub1/sub2/2022-10-22/a.csv/ url://root/sub1/sub2/2022-10-22/b.csv/ url://root/sub1/sub2/2022-08-22/c.csv/ url://root/sub1/sub2/2022-08-22/d.csv/)

and an array of all the duplicate entries (with the full paths):

array3=(url://root/sub1/sub2/2022-09-22/a.csv/ url://root/sub1/sub2/2022-09-22/b.csv/ url://root/sub1/sub2/2022-08-22/a.csv/ url://root/sub1/sub2/2022-08-22/b.csv/ url://root/sub1/sub2/2022-07-22/a.csv/)

My thought process is as follows: loop through the array; if the element is a URL path, collect it together with the csv elements that follow it into a new array, stopping when the next element is another URL path. If a later URL path contains csv files already collected, write them to a duplicate array; if it contains new csv files, append them to the new array.
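The loop described above can be sketched directly in bash. This is only a sketch under two assumptions: every directory element starts with `url://`, and every csv element ends in `.csv/`.

```shell
#!/bin/bash
# Sketch of the described loop (assumes directory elements start with
# "url://" and csv elements end in ".csv/").
declare -A seen                           # csv names we have already met
array1=(url://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ url://root/sub1/sub2/2022-07-22/ a.csv/)
array2=(); array3=()
prefix=
for elem in "${array1[@]}"; do
    if [[ $elem == url://* ]]; then
        prefix=$elem                      # remember the current directory
    elif [[ -n ${seen[$elem]} ]]; then
        array3+=( "$prefix$elem" )        # csv already seen: duplicate
    else
        seen[$elem]=1
        array2+=( "$prefix$elem" )        # first occurrence: most recent
    fi
done
echo "${array2[@]}"
echo "${array3[@]}"
```

Because the array is ordered newest first, marking a name in `seen` on first sight is enough; every later hit is, by construction, an older duplicate.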


1 Answer


Would you please try the following:

#!/bin/bash

declare -A seen                                                         # check if the csv element has appeared

array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)
array2=(); array3=()

while read -r first others; do                                          # split the line into "s3:.." and others
    read -r -a ary <<< "$others"                                        # split others into list of csv's
    dup=(); new=()                                                      # temporary arrays
    for i in "${ary[@]}"; do                                            # loop over the csv's
        (( seen[$i]++ )) && dup+=( "$i" ) || new+=( "$i" )              # classify the csv as duplicate or new
    done

    for i in "${new[@]}"; do                                            # loop over the array of unique entries
        array2+=( "${first}${i}" )                                      # append the full path to array2
    done
    for i in "${dup[@]}"; do                                            # loop over the array of duplicate entries
        array3+=( "${first}${i}" )                                      # append the full path to array3
    done
done < <(sed -E 's# (s3://)#'\\$'\n''\1#g' <<< "${array1[*]}")          # construct 2-d structure from array1

echo "${array2[@]}"
echo "${array3[@]}"

Output:

s3://root/sub1/sub2/2022-10-22/a.csv/ s3://root/sub1/sub2/2022-10-22/b.csv/ s3://root/sub1/sub2/2022-08-22/c.csv/ s3://root/sub1/sub2/2022-08-22/d.csv/
s3://root/sub1/sub2/2022-09-22/a.csv/ s3://root/sub1/sub2/2022-09-22/b.csv/ s3://root/sub1/sub2/2022-08-22/a.csv/ s3://root/sub1/sub2/2022-08-22/b.csv/ s3://root/sub1/sub2/2022-07-22/a.csv/

As array1 effectively has a 2-D structure, I've first rearranged the elements with sed into:

s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/
s3://root/sub1/sub2/2022-07-22/ a.csv/

and then processed them line by line.
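To see that rearrangement in isolation, here is a trimmed sketch (with a shortened array1 for brevity). The `'\\'$'\n'` quoting places a backslash followed by a real newline in the replacement, which works with both GNU and BSD sed; a literal `\n` in the replacement is a GNU extension.

```shell
#!/bin/bash
# Isolating the sed step: every " s3://" boundary becomes a newline,
# turning the flat word list into one line per directory.
array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/)
sed -E 's# (s3://)#'\\$'\n''\1#g' <<< "${array1[*]}"
```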
