I have been searching for this in forums and on stackoverflow; it must be here somewhere but I couldn't find it.
I'm on a Mac, using the terminal to run a shell script to rename some pdf files based on file content.
I have a directory full of pdfs that I'm exporting to text files using the opensource pdfbox. The resulting files have the same name as the pdf file but end in .txt. I created the text files so that I could find a string inside the file with the format Page xx Question xx; for example Page 43 Question 2. Given this example, I would like to rename the pdf file as pg43_q2.pdf
I think the regular expression I want is this:
/Page\s+(\d+)Question\s+(\d+)
but I'm not sure how to read the two captured numbers and save them into a string that I can use as a filename.
The script I have so far is:
#!/bin/sh
PDF_FILE_PATH=$1
echo "Converting pdfs at $PDF_FILE_PATH"
find "$PDF_FILE_PATH" -name '*.pdf' -print0 | while IFS= read -r -d '' filename; do
echo $filename
java -jar pdfbox-app-1.6.0.jar ExtractText "$filename" "$filename.txt"
NEWNAME=$(sed -n -e '/Page/s/Page\s+\(\d+\)\s+Question\s+\(\d+\).*$/pg\1_q\2/p' "$filename.txt")
echo "Renaming pdf $filename to $NEWNAME"
# I would do this next but the $NEWNAME is empty
# mv "filename" "PDF_FILE_PATH$NEWNAME"
done
... but the sed command is not putting anything into the NEWNAME variable.
I'm not particularly attached to sed, any suggestions would be appreciated
Latest edit to script uses the following sed command:
newname=$(sed -nE -e '/Page/s/^.*Page[[:blank:]]+([0-9]+)[[:blank:]]+Question[[:blank:]]+([0-9]+).*$/pg\1_q\2.pdf/p' "$filename.txt")
This works about 50% of the time, but the rest of the time the newname variable is empty when I go to rename the file.
The third line of a converted file that does work:
Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3)
The third line of a converted file that doesn't work:
Unit 2 Review Page 258 Question 16 a) (a – 4)(a + 7) = a(a + 7) – 4(a + 7) = a2 + 7a – 4a – 28 = a2 + 3a – 28 b) (2x + 3)(5x + 2) = 2x(5x + 2) + 3(5x + 2) = 10x2 + 4x + 15x + 6 = 10x2 + 19x + 6 c) (–x + 5)(x + 5) = –x(x + 5) + 5(x + 5) = –x2 – 5x + 5x + 25 = –x2 + 25 d) (3y + 4)2 = (3y + 4)(3y + 4) = 3y(3y + 4) + 4(3y + 4) = 9y2 + 12y + 12y + 16 = 9y2 + 24y + 16 e) (a – 3b)(4a – b) = a(4a – b) – 3b(4a – b) = 4a2 – ab – 12ab + 3b2 = 4a2 – 13ab + 3b2 f) (v – 1)(2v2 – 4v – 9) = v(2v2 – 4v – 9) – 1(2v2 – 4v – 9) = 2v3 – 4v2 – 9v – 2v2 + 4v + 9 = 2v3 – 6v2 – 5v + 9