I use Ghostscript to render the pages of a PDF file as JPG images and then run Tesseract on them to save the text content. The setup looks like this:

  • Ghostscript located in c:\engine\gs\
  • Tesseract located in c:\engine\tesseract\
  • The web directory holding the PDF/JPG/TXT files is file/tmp/

Code:

$pathgs = "c:\\engine\\gs\\";
$pathtess = "c:\\engine\\tesseract\\";
$pathfile = "file/tmp/";

// Strip images
putenv("PATH=".$pathgs);
$exec = "gs -dNOPAUSE -sDEVICE=jpeg -r300 -sOutputFile=".$pathfile."strip%d.jpg ".$pathfile."upload.pdf -q -c quit";
shell_exec($exec);

// OCR
putenv("PATH=".$pathtess);
$exec = "tesseract.exe '".$pathfile."strip1.jpg' '".$pathfile."ocr' -l eng";
exec($exec, $msg);
print_r($msg);
echo file_get_contents($pathfile."ocr.txt");

Stripping the image (it's just one page) works fine, but Tesseract echoes:

Array
  (
    [0] => Tesseract Open Source OCR Engine v3.01 with Leptonica
    [1] => Cannot open input file: 'file/tmp/strip1.jpg'
  )

and no ocr.txt file is generated, which leads to a 'failed to open stream' error in PHP.

  • Copying strip1.jpg into the c:/engine/tesseract/ folder and running Tesseract from the command line (tesseract strip1.jpg ocr.txt -l eng) works without any issue.
  • Replacing the putenv() call with exec(c:/engine/tesseract/tesseract ... ) returns the error shown above.
  • Keeping strip1.jpg in the Tesseract folder and running exec(tesseract 'c:/engine/tesseract/strip1.jpg' ... ) returns the error shown above.
  • Leaving out the apostrophes around path/strip1.jpg returns an empty array as the message and does not create the ocr.txt file.
  • Writing the command directly into the exec() call instead of using $exec makes no difference.

What am I doing wrong?

  • Rather than a relative path (file/tmp/strip1.jpg), try a fully-qualified path? Commented Apr 17, 2012 at 21:12
  • @halfer: I have tried many different paths, including the full path from c: to tmp, with and without apostrophes, but it made no difference at all. Having apostrophes around the path/file name was wrong, so I removed them all. exec(dir path) clearly lists the content of the /file/tmp folder, including strip1.jpg. It looks like Tesseract finds the file but crashes before it starts working, returning no $msg and no ocr.txt. But why does it work from the command line and not in PHP? Ghostscript has no trouble with this at all. Commented Apr 19, 2012 at 17:50

2 Answers

Halfer, you made my day :-)

Not exactly the way described in your post, but like this:

$path = str_replace("index.php", "../".$pathfile, $_SERVER['SCRIPT_FILENAME']);

$descriptors = array(
   0 => array("pipe", "r"),
   1 => array("pipe", "w"),
   2 => array("pipe", "w")
);
$cwd = $pathtess;
$command = "tesseract ".$path."strip1.jpg ".$path."ocr -l eng";

$process = proc_open($command, $descriptors, $pipes, $cwd);

if(is_resource($process)) {
    fclose($pipes[0]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    proc_close($process);
}

echo file_get_contents($path."ocr.txt");
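
The snippet above closes all three pipes without reading them, so anything Tesseract prints is discarded. As a minimal sketch (the helper name run_and_capture is hypothetical and not part of the original answer), the pipes can be read before closing so that messages such as "Cannot open input file" become visible. It reads stdout completely before stderr, which should be fine for the few lines Tesseract prints but is not a general solution for large outputs.

```php
<?php
// Hypothetical helper (not from the original answer): run a command via
// proc_open() and return its exit code together with stdout and stderr.
function run_and_capture(string $command, ?string $cwd = null): array
{
    $descriptors = array(
        0 => array("pipe", "r"), // stdin of the child process
        1 => array("pipe", "w"), // stdout of the child process
        2 => array("pipe", "w"), // stderr of the child process
    );

    $process = proc_open($command, $descriptors, $pipes, $cwd);
    if (!is_resource($process)) {
        return array("exit" => -1, "stdout" => "", "stderr" => "proc_open failed");
    }

    fclose($pipes[0]); // nothing is sent to the child
    $stdout = stream_get_contents($pipes[1]);
    $stderr = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    return array(
        "exit"   => proc_close($process), // exit code of the child
        "stdout" => $stdout,
        "stderr" => $stderr,
    );
}

// Possible use with the variables from the answer:
// print_r(run_and_capture($command, $pathtess));
```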

3 Comments

Out of interest, what was the problem? I can't see any environment stuff being set in there.
If only I knew; I previously experimented with the full path to /file/tmp, but under exec() it did not work. With proc_open it works, and that's the main thing. Anyway, I will try running this path under exec() again to rule out mistakes made during my experiments.
Well... whatever I did during what felt like 2,000 attempts, I performed complete bullshit, wasting my and other people's time :-s Running $command with exec() works absolutely fine, no complaints, perfect! I suppose I typed rubbish during my attempts with the full path, or kept apostrophes around it, or whatever. Well, at least I learned something about proc_open... Please accept my apologies! brgds David
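
The error message in the question prints the file name with the apostrophes included, which is consistent with them being passed to Tesseract literally: the Windows command interpreter treats only double quotes as quoting characters. A minimal sketch of building the command with an absolute directory and platform-appropriate quoting follows. The helper name build_tesseract_command and the directory layout in the commented usage are assumptions for illustration, not something confirmed in the thread.

```php
<?php
// Hypothetical helper (not from the original thread): build the Tesseract
// command line from an absolute directory, quoting each argument.
function build_tesseract_command(string $dir, string $image, string $outBase, string $lang = "eng"): string
{
    // Normalise to exactly one trailing directory separator.
    $dir = rtrim($dir, "/\\") . DIRECTORY_SEPARATOR;

    // escapeshellarg() wraps the argument in the quote style of the
    // current platform: double quotes on Windows, single quotes elsewhere.
    return "tesseract "
        . escapeshellarg($dir . $image) . " "
        . escapeshellarg($dir . $outBase)
        . " -l " . escapeshellarg($lang);
}

// Assumed layout: the script sits one level below the directory that
// contains file/tmp/, mirroring the "../" used in the answer.
// $dir = realpath(__DIR__ . "/../file/tmp");
// exec(build_tesseract_command($dir, "strip1.jpg", "ocr"), $msg, $status);
// print_r($msg);
```

Passing a third argument to exec() also returns the exit status, which helps to tell a crash apart from a run that simply printed nothing.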

Perhaps missing environment variables in PHP are the problem here. Have a look at my question here to see whether setting HOME or PATH sorts this out.
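
One thing worth noting in this context: putenv("PATH=".$pathtess), as used in the question, replaces the whole PATH for the request instead of extending it. Whether that plays any role in the failure is not established in this thread. A minimal sketch of prepending a directory instead is shown below; the helper name prepend_to_path is hypothetical.

```php
<?php
// Hypothetical helper (not from the original thread): put $dir at the
// front of PATH for the current request instead of replacing PATH.
function prepend_to_path(string $dir): string
{
    $current = getenv("PATH");      // false if PATH is not set
    $dir = rtrim($dir, "/\\");      // drop trailing slashes or backslashes

    $new = ($current === false || $current === "")
        ? $dir
        : $dir . PATH_SEPARATOR . $current; // ";" on Windows, ":" elsewhere

    putenv("PATH=" . $new);
    return $new;
}

// Possible use with the variable from the question:
// prepend_to_path($pathtess);
```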
