
TL;DR: How do I filter ls/find output with grep, using an array as the pattern?

Background story: I have a pipeline which I have to rerun for datasets that ran into an error. Which datasets ran into an error is recorded in a tab-separated file. I want to delete the files for which the pipeline ran into an error.

To do so, I extracted the dataset names from another file containing the finished datasets and saved them in a bash array (ds1 ds2 ...), but now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.

This is the folder structure (X=1-30): datasets/dsX/results/dsX.tsv

Not excluding the finished datasets, i.e. deleting the folders of both the failed and the finished datasets, works like a charm:

#1. move content to a trash folder
ls /datasets/*/results/* | xargs -I '{}' mv '{}' ./trash/

#2. delete the empty folders
find /datasets/*/. -type d -empty -delete

But since I want to exclude the finished datasets I thought it would be clever to save them in an array:

#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' "$path/$log_pf")
echo "${finished[@]}"
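For instance, the extraction step behaves like this with a made-up two-column log file (the file name `log_pf.tsv`, its contents, and the header line are all hypothetical stand-ins for `$path/$log_pf`):

```shell
#!/usr/bin/env bash
# Hypothetical log file: tab-separated, dataset name in column 2,
# first line is a header (which is why mapfile uses -s 1 to skip it).
printf 'status\tdataset\nok\tds2\nok\tds4\n' > log_pf.tsv

# awk prints column 2 of every line; mapfile -t strips newlines and
# -s 1 discards the first line read (the "dataset" header).
mapfile -t -s 1 finished < <(awk '{print $2}' log_pf.tsv)

echo "${finished[@]}"   # prints: ds2 ds4
rm -f log_pf.tsv
```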

This works as expected, but now I am stuck filtering the ls output using that array (pseudocode below):

#trying to ignore the dataset in the array - not working
ls -I${finished[@]} -d /datasets/*/
#trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}

What do you think about my current ideas? Is this possible using bash only? I guess in Python I could do this easily, but for training purposes I want to do it in bash.


2 Answers


grep can read its patterns from a file using the -f option. Note that file names containing newlines will cause problems.

If you need to process the input somehow, you can use process substitution:

grep -f <(process the input...)
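Applied to the question's setup, this could look like the following sketch. The listing and the contents of the `finished` array are made up for illustration; `-x` (whole-line match) and `-F` (fixed strings, not regexes) are added so dataset names cannot partially match or be misread as patterns:

```shell
#!/usr/bin/env bash
# Stand-in for the output of: ls /datasets/*/
all_datasets=$(printf '%s\n' ds1 ds2 ds3 ds4)

# Stand-in for the array built with mapfile
finished=(ds2 ds4)

# -v inverts the match, -x requires whole-line matches, -F treats the
# patterns as fixed strings; -f <(...) feeds the array in as patterns
failed=$(grep -vxF -f <(printf '%s\n' "${finished[@]}") <<< "$all_datasets")

echo "$failed"   # prints: ds1 and ds3, one per line
```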

3 Comments

I know, but the file is tab-separated with several columns, therefore I am extracting the dataset-name column and saving it in an array.
Extending the answer, use -f with a process substitution: grep -f <(printf "%s\n" "${finished[@]}")
@glenn jackman thank you for the quick extension of the other comment. Seems to work :) If you want the points, you can add it as an extra answer; otherwise I will accept the answer of choroba.

I must admit I'm confused about what you're doing, but if you're just trying to produce a list of files excluding those stored in column 2 of some other file, and your file/directory names can't contain spaces, then that'd be:

find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -
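A commented sketch of the same awk idiom with made-up data (the file `log.tsv` and the here-string stand in for `"$path/$log_pf"` and find's output, respectively):

```shell
#!/usr/bin/env bash
# Hypothetical log file: column 2 holds the finished dataset names.
printf 'ok\tds1\nok\tds3\n' > log.tsv

# NR==FNR is true only while reading the FIRST file (log.tsv):
#   a[$2]      -> record column 2 as a key of array a
#   next       -> skip the second block for these lines
# For the SECOND input ("-", here a here-string instead of find output):
#   !($0 in a) -> print only lines that are NOT keys of a
result=$(awk 'NR==FNR{a[$2]; next} !($0 in a)' log.tsv - <<< $'ds1\nds2\nds3')

echo "$result"   # prints: ds2 (the only name not listed in log.tsv)
rm -f log.tsv
```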

If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.

4 Comments

Hello Ed, sorry for coming back so late... First of all, thank you for sharing your wiki; it convinced me to switch from parsing ls to using find :). However, since in the original question I asked how to grep an array, I accepted @choroba's answer.
Do you mind elaborating in pseudocode what this awk command does? I'm still a bash novice.
Sorry, it's been too long, so I don't remember what the question was about and don't want to re-learn it. Basically, though, it's saving some field of a file in an array, and then if a line of the find output is not in the array (i.e. was never the 2nd field of that file), it prints that line.
Fair enough, I figured it out in the meantime. If someone else needs to understand it, look at the answer of Walter A at the link below; he has written a brilliant breakdown of this one-liner. stackoverflow.com/questions/32481877/what-is-nr-fnr-in-awk
