The PDF files do not contain any images or text. The text visible on the page is rendered as paths.
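I tried PdfPig (https://github.com/UglyToad/PdfPig) with the following code:
    using PdfDocument document = PdfDocument.Open(stream, SkiaRenderingParsingOptions.Instance);
    string ptxt = "";
    // Visit every page and append the string form of each path on it
    foreach (Page page in document.GetPages())
    {
        foreach (PdfPath p in page.Paths)
            ptxt += p.ToString();
    }
    Console.WriteLine(ptxt);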
The output is only the type name:
UglyToad.PdfPig.Graphics.PdfPath
How can such PDF files be converted to plain text? If direct conversion is not possible, how can a PDF be converted to an image that can be passed to OCR?
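For the image route, one possible approach is to render whole pages instead of individual paths. The sketch below assumes the PdfPig.Rendering.Skia package is installed (it is where SkiaRenderingParsingOptions comes from) and that its AddSkiaPageFactory and GetPageAsPng extension methods behave as described in that package's README. The names PdfPageRenderer, RenderPagesToPng, outputDir and the default scale of 3 are hypothetical and chosen only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Graphics.Colors;
using UglyToad.PdfPig.Rendering.Skia;

public static class PdfPageRenderer // hypothetical helper, not part of PdfPig
{
    // Renders every page of the PDF to a PNG file in outputDir
    // and returns the paths of the files that were written.
    public static List<string> RenderPagesToPng(Stream pdfStream, string outputDir, float scale = 3f)
    {
        var files = new List<string>();

        using PdfDocument document = PdfDocument.Open(pdfStream, SkiaRenderingParsingOptions.Instance);

        // Registers the Skia renderer so that pages can be drawn.
        document.AddSkiaPageFactory();

        // PdfPig page numbers start at 1.
        for (int p = 1; p <= document.NumberOfPages; p++)
        {
            string file = Path.Combine(outputDir, $"page_{p}.png");

            using var fs = new FileStream(file, FileMode.Create);
            // A larger scale gives more pixels per glyph, which may help OCR (assumption).
            using var ms = document.GetPageAsPng(p, scale, RGBColor.White);
            ms.WriteTo(fs);

            files.Add(file);
        }

        return files;
    }
}
```

The resulting PNG files could then be handed to whichever OCR engine is used. Rendering the full page keeps the glyph outlines in their original positions, which is usually what OCR needs.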
Other PDFs may also contain text objects from which text can be extracted directly.
PdfPig exposes a Paths collection which can be used to retrieve every path object. How can each path object be converted to an image? A PDF viewer's source code should contain this logic.
How can OpenCV or SkiaSharp be used for this conversion?
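To make the path question concrete, here is a minimal sketch of drawing the page's paths by hand with SkiaSharp. It assumes a recent PdfPig version in which PdfPath is a list of PdfSubpath objects and the commands are PdfSubpath.Move, PdfSubpath.Line, PdfSubpath.CubicBezierCurve and PdfSubpath.Close; older versions may name these differently. PathRasterizer and DrawPathsToPng are hypothetical names. The sketch fills every filled path in black, skips clipping and stroke-only paths, and ignores colors, fill rules and any offset of the page origin, so it is a simplification and not a full renderer.

```csharp
using System;
using SkiaSharp;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Graphics;

public static class PathRasterizer // hypothetical helper, not part of PdfPig
{
    // Draws the filled paths of one page onto a white bitmap and returns it as PNG bytes.
    public static byte[] DrawPathsToPng(Page page, float scale = 3f)
    {
        int width = (int)Math.Ceiling(page.Width * scale);
        int height = (int)Math.Ceiling(page.Height * scale);

        using var bitmap = new SKBitmap(width, height);
        using var canvas = new SKCanvas(bitmap);
        canvas.Clear(SKColors.White);

        using var paint = new SKPaint
        {
            Color = SKColors.Black,
            Style = SKPaintStyle.Fill,
            IsAntialias = true
        };

        // PDF coordinates start at the bottom-left corner, Skia at the top-left,
        // so the Y axis is flipped while scaling.
        float X(PdfPoint pt) => (float)(pt.X * scale);
        float Y(PdfPoint pt) => (float)((page.Height - pt.Y) * scale);

        foreach (PdfPath pdfPath in page.Paths)
        {
            // Skip clipping paths and paths that are only stroked (simplification).
            if (pdfPath.IsClipping || !pdfPath.IsFilled)
                continue;

            using var skPath = new SKPath();
            foreach (PdfSubpath subpath in pdfPath)
            {
                foreach (var command in subpath.Commands)
                {
                    switch (command)
                    {
                        case PdfSubpath.Move m:
                            skPath.MoveTo(X(m.Location), Y(m.Location));
                            break;
                        case PdfSubpath.Line l:
                            skPath.LineTo(X(l.To), Y(l.To));
                            break;
                        case PdfSubpath.CubicBezierCurve c:
                            skPath.CubicTo(
                                X(c.FirstControlPoint), Y(c.FirstControlPoint),
                                X(c.SecondControlPoint), Y(c.SecondControlPoint),
                                X(c.EndPoint), Y(c.EndPoint));
                            break;
                        case PdfSubpath.Close _:
                            skPath.Close();
                            break;
                    }
                }
            }

            canvas.DrawPath(skPath, paint);
        }

        using SKImage image = SKImage.FromBitmap(bitmap);
        using SKData data = image.Encode(SKEncodedImageFormat.Png, 100);
        return data.ToArray();
    }
}
```

Drawing all paths of a page onto one bitmap, instead of one image per path, keeps neighbouring glyph outlines together, which is probably more useful as OCR input.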
This is a .NET 9 ASP.NET MVC application.