
For each PDF in a folder, I have three files related to it:

original.pdf
original.txt
original_roc_mrc.pdf
original_roc_mrc_updated.pdf

Now I need a script that does the following:

  1. Check whether original.pdf and original_roc_mrc_updated.pdf have the same number of pages.

  2. Check whether original_roc_mrc.pdf is at most about 20% larger than original_roc_mrc_updated.pdf.

  3. If both checks pass, delete original.pdf, original.txt and original_roc_mrc.pdf. If either 1) or 2) fails, do nothing to the "pack".

  • What have you tried so far? Commented Aug 14, 2022 at 17:21

1 Answer


I don't have pdftk installed and don't want to install all the Java dependencies it requires, so here's a script that uses pdfinfo from poppler-utils to get the number of pages and then does the rest.

#!/usr/bin/env bash

# function to get filesize
filesize() {
    stat -c '%s' "$1"
}

# function to get number of pages
numpages() {
    pdfinfo "$1" | sed -n 's/^Pages:\s*\([0-9]*\)\s*/\1/p'
}

# get number of pages for these two files
pages1="$(numpages original.pdf)"
pages2="$(numpages original_roc_mrc_updated.pdf)"

# get filesizes of these two files
size1="$(filesize original_roc_mrc_updated.pdf)"
size2="$(filesize original_roc_mrc.pdf)"

# maximum allowed size for original_roc_mrc.pdf: 120% of size1,
# i.e. size1 plus 1/5 of size1 (integer division, rounded down)
maxsize=$(( size1 + size1/5 ))

# see if pages1=pages2 and size2 <= maxsize
if [[ pages1 -eq pages2 ]] &&
    [[ size2 -le maxsize ]] ; then
    rm original.pdf original.txt original_roc_mrc.pdf
fi
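
One caveat with the test above: inside `[[ ]]`, `-eq` and `-le` evaluate their operands arithmetically, so an empty value (for example when pdfinfo or stat fails on a missing file) counts as 0. If both page counts come back empty, the comparison succeeds anyway. Below is a minimal sketch of a stricter check; `safe_to_delete` is a hypothetical helper name, not part of the script above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only if all four values are non-empty
# decimal integers, the page counts match, and the _roc_mrc size is at
# most 120% of the _updated size.
# Arguments: pages_original pages_updated size_updated size_mrc
safe_to_delete() {
    (( $# == 4 )) || return 1

    local v
    for v in "$@"; do
        # reject empty or non-numeric values instead of treating them as 0
        [[ $v =~ ^[0-9]+$ ]] || return 1
    done

    # 10# forces base 10 so a value with leading zeros is not read as octal
    local pages1=$(( 10#$1 )) pages2=$(( 10#$2 ))
    local size1=$(( 10#$3 )) size2=$(( 10#$4 ))

    # integer arithmetic: size1/5 is rounded down
    local maxsize=$(( size1 + size1/5 ))

    (( pages1 == pages2 && size2 <= maxsize ))
}
```

It could be used in place of the two `[[ ]]` tests, for example `if safe_to_delete "$pages1" "$pages2" "$size1" "$size2"; then rm ...; fi`.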

You could probably replace the numpages function with this if you'd rather use pdftk:

numpages() {
    pdftk "$1" dump_data | grep NumberOfPages | awk '{print $2}'
}
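
Either version of the page-count function just picks one number out of the tool's text output. The sketch below runs that parsing step on sample lines instead of real PDFs. The sample text is an assumption about the usual output shape (`Pages:` followed by spaces for pdfinfo, `NumberOfPages: N` for `pdftk ... dump_data`), and `parse_pdfinfo_pages` and `parse_pdftk_pages` are hypothetical names. It uses awk with an exact first-field match rather than the sed expression, since `\s` is a GNU extension that not every sed supports.

```shell
#!/usr/bin/env bash

# Hypothetical helpers: read tool output on stdin, print the page count.
parse_pdfinfo_pages() {
    # match only the line whose first field is exactly "Pages:"
    # (so "Page size:" lines are ignored)
    awk '$1 == "Pages:" { print $2; exit }'
}

parse_pdftk_pages() {
    awk '$1 == "NumberOfPages:" { print $2; exit }'
}

# Sample input, assumed to resemble real tool output
printf 'Title:          Example\nPages:          12\nPage size:      595 x 842 pts\n' |
    parse_pdfinfo_pages
printf 'InfoKey: Title\nNumberOfPages: 12\n' |
    parse_pdftk_pages
```

With real files, the parsers would sit at the end of the pipeline, e.g. `pdfinfo "$1" | parse_pdfinfo_pages`.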

To apply the same check to every .pdf in the folder without mrc in its name, you can use a loop like this (mostly the same code as in the question's edits):

#!/usr/bin/env bash

# function to get filesize
filesize() {
    stat -c '%s' "$1"
}

# function to get number of pages
numpages() {
    pdfinfo "$1" | sed -n 's/^Pages:\s*\([0-9]*\)\s*/\1/p'
}


for filename in *.pdf ; do

    # skip files with "mrc" in their name
    if [[ "$filename" =~ "mrc" ]] ; then
        continue
    fi

    # determine common part of filenames
    commonname="${filename%.pdf}"
    
    # get number of pages for these two files
    pages1="$(numpages "$filename")"
    pages2="$(numpages "${commonname}_roc_mrc_updated.pdf")"

    # get filesizes of these two files
    size1="$(filesize "${commonname}_roc_mrc_updated.pdf")"
    size2="$(filesize "${commonname}_roc_mrc.pdf")"

    # maximum allowed size for the _roc_mrc file: 120% of size1,
    # i.e. size1 plus 1/5 of size1 (integer division, rounded down)
    maxsize=$(( size1 + size1/5 ))

    if [[ pages1 -eq pages2 ]] &&
        [[ size2 -le maxsize ]] ; then
        rm "$filename" "${commonname}.txt" "${commonname}_roc_mrc.pdf"
    fi
done
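
Since the loop deletes files, it may be worth running it once in a mode that only prints what it would remove. The following is a sketch of that idea, not the script above: `cleanup_packs` and the `DRY_RUN` variable are hypothetical names, file sizes are read with `wc -c` instead of GNU `stat -c`, the page count is parsed with awk, and packs with a missing companion file or an unreadable page count are skipped.

```shell
#!/usr/bin/env bash

# number of pages via pdfinfo (poppler-utils), parsed with awk here
numpages() {
    pdfinfo "$1" 2>/dev/null | awk '$1 == "Pages:" { print $2; exit }'
}

# file size in bytes; wc -c is a portable stand-in for GNU stat -c '%s'
filesize() {
    wc -c < "$1" | tr -d '[:space:]'
}

# Hypothetical wrapper: process every pack in directory $1 (default: .).
# With DRY_RUN=1 (the default in this sketch) it only prints what it would do.
cleanup_packs() {
    local dir=${1:-.} filename commonname updated mrc
    local pages1 pages2 size1 size2 maxsize

    for filename in "$dir"/*.pdf; do
        # skip companion files, and the unexpanded pattern if nothing matches
        [[ -f $filename && ${filename##*/} != *mrc* ]] || continue

        commonname=${filename%.pdf}
        updated=${commonname}_roc_mrc_updated.pdf
        mrc=${commonname}_roc_mrc.pdf

        # skip incomplete packs rather than comparing empty values
        [[ -f $updated && -f $mrc ]] || continue

        pages1=$(numpages "$filename")
        pages2=$(numpages "$updated")
        [[ $pages1 =~ ^[0-9]+$ && $pages2 =~ ^[0-9]+$ ]] || continue

        size1=$(filesize "$updated")
        size2=$(filesize "$mrc")
        maxsize=$(( size1 + size1/5 ))

        if (( pages1 == pages2 && size2 <= maxsize )); then
            if [[ ${DRY_RUN:-1} == 1 ]]; then
                printf 'would remove: %s %s %s\n' "$filename" "$commonname.txt" "$mrc"
            else
                rm -f -- "$filename" "$commonname.txt" "$mrc"
            fi
        fi
    done
}
```

Calling `cleanup_packs /path/to/folder` previews the deletions, and `DRY_RUN=0 cleanup_packs /path/to/folder` performs them.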
  • It looks almost perfect, but it lacks a for loop that iterates over the original PDFs only (not over the three companion files that each original has). Commented Aug 14, 2022 at 23:19
  • It probably wouldn't be hard to add that loop, but you didn't describe the folder/naming structure for the original PDFs at all, so I wouldn't be able to say how to do that without more information. Feel free to fill in those details if you want. Commented Aug 15, 2022 at 1:01
  • All the PDFs are in the same folder. I have edited the original message because I tried to write the final script, and I think it works fine. Tell me if this version is OK. If so, or if you prefer another version, please put the final one in your answer. Commented Aug 15, 2022 at 1:42
  • You need to use a code block, because right now the comments are being treated as markdown headers. I'd also put the two functions above the for loop so you're not redefining them over and over. But otherwise, that looks fine. I'll add to my answer as well. Commented Aug 15, 2022 at 1:49
