
The PDF files do not contain any images or text objects. The text visible on the page is rendered as paths.

I tried PdfPig (https://github.com/UglyToad/PdfPig) with the following code:

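using PdfDocument document = PdfDocument.Open(stream, SkiaRenderingParsingOptions.Instance);
string ptxt = "";
// Visit every page; the original snippet used `page` without defining it.
foreach (var page in document.GetPages())
{
    // PdfPath does not override ToString(), so this only appends the type name.
    foreach (PdfPath p in page.Paths)
        ptxt += p.ToString();
}
Console.WriteLine(ptxt);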

The output is:

UglyToad.PdfPig.Graphics.PdfPath

How can such PDF files be converted to plain text? If direct conversion is not possible, how can the PDF be converted to an image that can be passed to OCR?

Some of the PDFs may also contain text objects from which text can be extracted directly.

PdfPig exposes a Paths collection which can be used to retrieve every Path object. How can each Path object be converted to an image? PDF viewer source code should contain this.

How can OpenCV or SkiaSharp be used for this conversion?

This is a .NET 9 ASP.NET MVC application.

  • You need to find a library that does PDF to image conversion (recommendations are not allowed here; you have to ask such questions on softwarerecs.stackexchange.com) and then perform OCR on that image. Commented Nov 11 at 8:03
  • @KJ The MVC controller should convert such PDFs without human intervention. How can this be implemented? How can PDF paths be converted to an image to pass to OCR, or is there another automated solution? Commented Nov 11 at 8:04
  • @iPDFdev I posted the question at softwarerecs.stackexchange.com/questions/94890/… Commented Nov 11 at 8:10
  • @iPDFdev Is it possible to read the paths from PDF files and use OpenCV or SkiaSharp to create an image from them? Commented Nov 11 at 8:25
  • For very simple cases you can extract the paths and draw them using SkiaSharp (or any other graphics engine). For general PDF use, however, you would have to implement a full PDF renderer; simple path extraction would not be enough. Commented Nov 12 at 8:06
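
The "very simple case" from the last comment could be sketched roughly as follows. This is only an illustration, not a full renderer: the class and method names (PathRasterizer, RenderPathsToPng) are made up for this example, and the PdfPig command types (PdfSubpath.Move, Line, CubicBezierCurve, Close) are assumed to match a recent PdfPig release, so they may need adjusting for the version in use. It draws every path in black, ignores colors, line widths, fill rules, clipping, page rotation and a crop box that does not start at the origin.

```csharp
using System;
using System.IO;
using SkiaSharp;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;

// Hypothetical helper: rasterizes only the vector paths of one page.
public static class PathRasterizer
{
    // scale = 300 / 72f gives roughly 300 DPI, because PDF user space
    // uses 72 units per inch by default.
    public static byte[] RenderPathsToPng(Stream pdf, int pageNumber, float scale)
    {
        using PdfDocument document = PdfDocument.Open(pdf);
        var page = document.GetPage(pageNumber);

        int width = (int)Math.Ceiling(page.Width * scale);
        int height = (int)Math.Ceiling(page.Height * scale);

        using var bitmap = new SKBitmap(width, height);
        using var canvas = new SKCanvas(bitmap);
        canvas.Clear(SKColors.White);

        // PDF has its origin at the bottom-left with y pointing up;
        // Skia has its origin at the top-left with y pointing down.
        SKPoint Map(PdfPoint p) =>
            new SKPoint((float)p.X * scale, (float)(page.Height - p.Y) * scale);

        using var fill = new SKPaint
        {
            Style = SKPaintStyle.Fill, Color = SKColors.Black, IsAntialias = true
        };
        using var stroke = new SKPaint
        {
            Style = SKPaintStyle.Stroke, Color = SKColors.Black,
            StrokeWidth = scale, // simplification: fixed width of 1 PDF unit
            IsAntialias = true
        };

        foreach (var pdfPath in page.Paths)
        {
            if (pdfPath.IsClipping) continue; // clipping paths are not painted

            using var skPath = new SKPath();
            foreach (var subpath in pdfPath)
            {
                foreach (var command in subpath.Commands)
                {
                    switch (command)
                    {
                        case PdfSubpath.Move m:
                            skPath.MoveTo(Map(m.Location));
                            break;
                        case PdfSubpath.Line l:
                            skPath.LineTo(Map(l.To));
                            break;
                        case PdfSubpath.CubicBezierCurve c:
                            skPath.CubicTo(Map(c.FirstControlPoint),
                                           Map(c.SecondControlPoint),
                                           Map(c.EndPoint));
                            break;
                        case PdfSubpath.Close _:
                            skPath.Close();
                            break;
                    }
                }
            }

            if (pdfPath.IsFilled) canvas.DrawPath(skPath, fill);
            if (pdfPath.IsStroked) canvas.DrawPath(skPath, stroke);
        }

        using SKImage image = SKImage.FromBitmap(bitmap);
        using SKData data = image.Encode(SKEncodedImageFormat.Png, 100);
        return data.ToArray();
    }
}
```

A controller action could call something like PathRasterizer.RenderPathsToPng(stream, 1, 300 / 72f) and hand the returned PNG bytes to whichever OCR engine is chosen. For anything beyond simple black-on-white path text, a complete PDF rendering library is likely the safer route, as the comment points out.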
