select and move unique files based on some pattern

Question

I have a list of files on a Linux machine that differ only by date, so I have to find the unique files and place them in another directory. "Unique" here refers to the name of the file up to the second _, so 100001_ABC and 100001_XYZ in the example below.

100001_ABC_25Sep2020_1200-25Sep2020_1300.csv  
100001_XYZ_30Sep2020_1300-30Sep2020_1400.csv  
100001_XYZ_30Sep2020_1400-30Sep2020_1500.csv

I want the uniquely named files to be placed under this directory:

/home/vikrant_singh_rana/uniquefiles/

The script should only copy the files below:

100001_ABC_25Sep2020_1200-25Sep2020_1300.csv  
100001_XYZ_30Sep2020_1300-30Sep2020_1400.csv

Here's my shell script:

#!/bin/bash
set +o posix
#reading file names into file_array
readarray -t file_array < <(
    cd "/home/vikrant_singh_rana/unzipfiles"
    printf "%s\n" * | cut -d"_" -f2 | cut -d"-" -f1 | sort -u )

#print items of array
printf '%s\n' "${file_array[@]}"


for i in "${file_array[@]}"; do
        #echo $i
        find /home/vikrant_singh_rana/unzipfiles/ -type f -name "*$i*.csv" -exec awk '!seen[$0]++' {} +
done

The script can find the unique names correctly, but I can't find how to move them to the other directory.

Is uniqueness defined only by this substring 25Sep2020? Do you always want to move the alphabetically first file of each unique group?
– thanasisp, Commented Oct 16, 2020 at 5:14
If you added one more line to your example, with XYZ_01Sep2020 and ABC_01Sep2020, it would be clear to everyone.
– thanasisp, Commented Oct 16, 2020 at 5:22
The files will only be like that. The script parses and selects the unique files correctly; I just need to route those selected files to some other directory.
– Vikrant Singh Rana, Commented Oct 16, 2020 at 5:27
I have accepted only one answer, but thank you all for such wonderful answers.
– Vikrant Singh Rana, Commented Oct 16, 2020 at 22:33

Stéphane Chazelas · Accepted Answer · 2020-10-16 12:36:54Z

With zsh.

typeset -A files
for f (*_*_*.csv(.On)) files[${(M)f#*_*_}]=$f
mv -- $files target-directory/

The . glob qualifier restricts to regular files while On sorts in reverse order so that in the end the associative array contains the first file in alphabetical order for a given key (here the part up to the second _).

Rather than lexical order, you may want to order by modification time (consider that 100001_XYZ_01Oct2020_0000-01Oct2020_0100 would come before 100001_XYZ_30Sep2020_2200-30Sep2020_2300 in lexical order, for instance). To do so, replace On with om, which sorts files from newest to oldest, so that you end up moving the oldest file as opposed to the one that comes first in lexical order.
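To see the ordering problem on the two names mentioned above, here is a small bash sketch that sorts them byte-wise. The variable names are only illustrative.

```bash
#!/bin/bash
# The two example names from the paragraph above.
names=(
    100001_XYZ_30Sep2020_2200-30Sep2020_2300.csv
    100001_XYZ_01Oct2020_0000-01Oct2020_0100.csv
)
# Byte-wise sort: "01Oct..." sorts before "30Sep..." even though
# 1 October is the later date.
lexical_first=$(printf '%s\n' "${names[@]}" | LC_ALL=C sort | head -n 1)
printf 'first in lexical order: %s\n' "$lexical_first"
```

The name picked first is the 01Oct2020 one, which is why a purely lexical order can select the wrong file around a month boundary.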

Or you could define a sorting order based on the first timestamp in the file name:

zmodload zsh/datetime
bydate() strftime -rs REPLY %d%b%Y_%H%M ${${REPLY%-*}#*_*_}

And use nO+bydate instead of On/om.

With bash and GNU tools, you could do something similar (though without restricting to regular files or sorting by modification time) with:

shopt -s failglob
printf '%s\0' *_*_*.csv | sort -zsmut_ -k1,2 | xargs -r0 mv -t target-dir --

(all of -z, -s, -r, -0, -t are GNU extensions).
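As a rough illustration of what those sort options do, the following sketch runs them, minus -z, on the three sample names from the question as newline-delimited text. That is safe here only because these names contain no newlines. The variable name first_per_key is made up for the example, and GNU sort is assumed.

```bash
#!/bin/bash
# -t_ -k1,2 : the key is the first two "_"-separated fields
# -m        : merge mode; the input is already sorted, as glob output is
# -s -u     : stable, and keep only the first line of each run of equal keys
first_per_key=$(
    printf '%s\n' \
        100001_ABC_25Sep2020_1200-25Sep2020_1300.csv \
        100001_XYZ_30Sep2020_1300-30Sep2020_1400.csv \
        100001_XYZ_30Sep2020_1400-30Sep2020_1500.csv |
    LC_ALL=C sort -smut_ -k1,2
)
printf '%s\n' "$first_per_key"
```

Only the ABC file and the first XYZ file remain, which matches the two files the question asks for.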

The sorting by timestamp extracted from the file names could be done with:

printf '%s\0' *_*_*.csv |
                   #  key   year       month      day        HHMM
  LC_ALL=C sort -zt_ -k1,2 -k3.6,3.9n -k3.3,3.5M -k3.1,3.2n -k3.11,3.14n |
  LC_ALL=C sort -zsmut_ -k1,2 |
  xargs -r0 mv -t target-dir --

If, as the key, you want the part between the first and second occurrences of _, replace ${(M)f#*_*_} with ${${f#*_}%%_*} (or ${${(s[_])f}[2]}) or -k1,2 with -k2,2.
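For readers who stay in bash, the two key choices could be sketched with plain parameter expansions as below. The variable names (f, rest, key12, tmp, key2) are hypothetical, and unlike the zsh ${(M)f#*_*_} form, key12 here does not keep the trailing _.

```bash
#!/bin/bash
# One sample name from the question.
f=100001_XYZ_30Sep2020_1300-30Sep2020_1400.csv

rest=${f#*_*_}        # everything after the second "_"
key12=${f%"_$rest"}   # fields 1 and 2 together: 100001_XYZ

tmp=${f#*_}           # drop the first field and its "_"
key2=${tmp%%_*}       # field 2 only: XYZ

printf 'key12=%s key2=%s\n' "$key12" "$key2"
```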

thanasisp · Accepted Answer · 2020-10-16 12:45:30Z

This solution works for arbitrary file names:

target_dir="path/to/dir"

find -maxdepth 1 -type f -name '*.csv' -print0 | sort -z | awk '
    BEGIN {RS=ORS="\0"; FS=OFS="_"}
    !seen[$2]++' | xargs -r0 echo mv -t "$target_dir" --

We use the null character as the separator throughout the pipeline to protect the file names, sort to put them in alphabetical order, and GNU awk to exclude duplicates. Test it, and if it prints a reasonable mv command, remove echo to run it.

(All of the options used above for null separation, such as -z, are GNU extensions.)
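To illustrate why the null separator matters, here is a small sketch that counts the same two files once as newline-delimited and once as NUL-delimited output. The directory demo_dir and both file names are hypothetical test data, and a find that supports -print0 is assumed.

```bash
#!/bin/bash
# demo_dir and both file names are hypothetical test data.
demo_dir=$(mktemp -d)
touch "$demo_dir/1_A_x.csv" "$demo_dir/1_B"$'\n'"odd_y.csv"

# Newline-delimited: the embedded newline makes two files look like three lines.
line_count=$(( $(find "$demo_dir" -type f -name '*.csv' | wc -l) ))
# NUL-delimited: one NUL byte per file, so the count is right.
nul_count=$(( $(find "$demo_dir" -type f -name '*.csv' -print0 | tr -cd '\0' | wc -c) ))

printf 'lines: %s, NUL-terminated names: %s\n' "$line_count" "$nul_count"
rm -rf -- "$demo_dir"
```

Any line-based tool placed in the newline-delimited pipeline would treat the second file as two separate names.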

If your file names are as well-behaved as in the example, you can simply do:

ls -1 *.csv | awk -F_ '!seen[$2]++' | xargs -d'\n' echo mv -t target/dir --

Note the glob fetches the files in alphabetical order.

answered Oct 16, 2020 at 5:38 by thanasisp, edited Oct 16, 2020 at 12:45

Could you please also suggest a solution that requires only minimal changes to my original script?
– Vikrant Singh Rana, Commented Oct 16, 2020 at 5:43
I added one more solution. I wouldn't use arrays for this; I hope it is helpful and readable.
– thanasisp, Commented Oct 16, 2020 at 5:52

terdon (edited by Stéphane Chazelas) · Accepted Answer · 2020-10-16 14:17:26Z

I would just use an array to hold the names you've seen and move only the "new" names:

declare -A seen=()
name_seen='seen[$name]++' # work around to avoid ACE vulnerability
for i in /home/vikrant_singh_rana/unzipfiles/*_*_*; do 
    name=${i##*/} # remove directory part
    name=${name%"_${name#*_*_}"} # retain first two fields
    (( name_seen )) || mv -- "$i" /home/vikrant_singh_rana/uniquefiles/
done

answered Oct 16, 2020 at 9:13 by terdon ♦, edited Oct 16, 2020 at 14:17 by Stéphane Chazelas

I like this approach also. It's very close to the original script. Thanks.
– Vikrant Singh Rana, Commented Oct 16, 2020 at 10:29
@StéphaneChazelas D'oh! Of course, thanks. I was testing with local files and only added the path before posting. I changed it to use ${i##*/} instead, which should be safe.
– terdon ♦, Commented Oct 16, 2020 at 13:56
It also won't work if $name contains ] or \ characters (that's actually a command injection vulnerability, like when there's a file called x] + a[$(reboot)).
– Stéphane Chazelas, Commented Oct 16, 2020 at 14:05
@StéphaneChazelas Oh man, the mv was wrong in any case; that should have been mv -- $i, not mv -- $name. I missed that! And you're right about the \ and the ]; the latter is particularly likely to be found in media file names. I changed to using basename instead; that should be safe, right?
– terdon ♦, Commented Oct 16, 2020 at 14:10
See the edit for how to avoid the ACE (arbitrary command execution). I also removed cut: you can't use line-based utilities on file names. basename doesn't help here. The problem is with (( seen[$name]++ )), where $name is expanded before the arithmetic expression is evaluated.
– Stéphane Chazelas, Commented Oct 16, 2020 at 14:18
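
As an alternative to the name_seen workaround in the answer, one could avoid arithmetic evaluation of the subscript altogether and test the array element with an ordinary parameter expansion. The following is only a sketch under assumptions: bash 4 or later for associative arrays, and invented test data (demo_dir, src, dst and every file name, including one modeled on the hostile name from the comment above but with a harmless touch in place of reboot).

```bash
#!/bin/bash
# Sketch with hypothetical test data: demo_dir, src, dst and all file names
# are invented for the demonstration.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/src" "$demo_dir/dst"
cd "$demo_dir" || exit 1
touch src/100001_ABC_1.csv src/100001_XYZ_1.csv src/100001_XYZ_2.csv \
      'src/x] + a[$(touch INJECTED)_k_1.csv'

declare -A seen=()
moved=0
for path in src/*_*_*; do
    name=${path##*/}                # remove directory part
    key=${name%"_${name#*_*_}"}     # retain first two fields
    # ${seen[$key]+set} is an ordinary parameter expansion: $key is expanded
    # once as the subscript and is never evaluated as arithmetic.
    if [[ -z ${seen[$key]+set} ]]; then
        seen[$key]=1
        mv -- "$path" dst/
        moved=$((moved + 1))
    fi
done
printf 'moved %d file(s)\n' "$moved"
```

If the embedded command had been executed, a file called INJECTED would appear in demo_dir; checking for its absence is a quick way to test either approach.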


Philippos · Accepted Answer · 2020-10-16 06:30:50Z

Why use arrays, loops or awk when there are built-in tools like uniq with the -w option (GNU version)?

mv $(ls *csv|uniq -w 10) /home/vikrant_singh_rana/uniquefiles/
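
To show what -w 10 does, this sketch feeds the three sample names from the question to GNU uniq without touching any files. It relies on every key being exactly 10 characters long, as in the example; the variable name kept is made up.

```bash
#!/bin/bash
# Sample names from the question; "100001_ABC" is exactly 10 characters long.
kept=$(
    printf '%s\n' \
        100001_ABC_25Sep2020_1200-25Sep2020_1300.csv \
        100001_XYZ_30Sep2020_1300-30Sep2020_1400.csv \
        100001_XYZ_30Sep2020_1400-30Sep2020_1500.csv |
    uniq -w 10   # GNU uniq: compare only the first 10 characters of each line
)
printf '%s\n' "$kept"
```

With keys of varying length, a fixed width would no longer line up with the second _, so this shortcut fits only names shaped like the ones in the question.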

answered Oct 16, 2020 at 6:00 by Philippos, edited Oct 16, 2020 at 6:30

Performance obviously doesn't matter. According to the comments on the question, the date should NOT be considered, only ABC, XYZ. As all files have the same start, -w 10 does the right thing. The file pattern is given, so no whitespace or control characters will appear. But thank you for the hint about my sort mistake.
– Philippos, Commented Oct 16, 2020 at 6:30
Welcome. I am still not sure about the uniqueness; perhaps only ABC matters in the end, as I see from the comments again. And yes, the date is not needed; I will modify that. Thank you too.
– thanasisp, Commented Oct 16, 2020 at 7:20
Note that although this works fine for the OP's example, it will fail if the file names contain newlines, spaces, or globbing characters.
– terdon ♦, Commented Oct 16, 2020 at 9:10
@terdon, also if file names start with - (could easily be fixed with --) or if some of those csv files are directories (could be fixed with -d).
– Stéphane Chazelas, Commented Oct 16, 2020 at 14:27
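
The failure with spaces mentioned in the comments can be reproduced with a short sketch. The directory demo_dir and the file name are hypothetical; the point is only to compare an unquoted $(ls ...) with a glob expanded directly.

```bash
#!/bin/bash
# demo_dir and the file name are hypothetical; note the space in the name.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch '100001_ABC 25Sep2020.csv'

# Unquoted command substitution: the single name is split at the space.
split_words=( $(ls -- *csv) )
# A glob expanded directly into an array keeps the name intact.
glob_words=( *csv )

printf 'via $(ls): %d word(s), via glob: %d word(s)\n' \
    "${#split_words[@]}" "${#glob_words[@]}"
```

With the split version, mv would be handed two names that do not exist instead of the one file that does.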
