How do I replace a string in a PDF file using NodeJS?

Question

I have a template PDF file, and I want to replace some marker strings to generate new PDF files and save them. What's the best/simplest way to do this? I don't need to add graphics or anything fancy, just a simple text replacement, so I don't want anything too complicated.

Thanks!

Edit: Just found HummusJS, I'll see if I can make progress and post it here.

Hi there Manuel! Did you find a solution?

– Fredefl, Commented Nov 17, 2016 at 21:44
I am also curious..

– Ran Halprin, Commented Dec 8, 2016 at 22:51
I have the same situation. Did you find a solution?

– M.Abulsoud, Commented Sep 30, 2019 at 12:07

ginjaemocoes · Accepted Answer · 2020-12-16 18:40:21Z

16

I found this question by searching, so I think it deserves an answer. I found the answer by BrighTide here: https://github.com/galkahana/HummusJS/issues/71#issuecomment-275956347

Basically, there is a very powerful package called Hummus, which uses a cross-platform library written in C++. I think the answer given in that GitHub comment can be wrapped in a function like this:

var hummus = require('hummus');

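/**
 * Converts a string to an array of byte values, the form the write stream expects
 *
 * @param {string} str - input string
 * @returns {number[]} the bytes of the string
 */
function strToByteArray(str) {
  var myBuffer = [];
  // Buffer.from replaces the deprecated new Buffer(str) constructor
  var buffer = Buffer.from(str);
  for (var i = 0; i < buffer.length; i++) {
      myBuffer.push(buffer[i]);
  }
  return myBuffer;
}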

function replaceText(sourceFile, targetFile, pageNumber, findText, replaceText) {
    var writer = hummus.createWriterToModify(sourceFile, {
        modifiedFilePath: targetFile
    });
    var sourceParser = writer.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
    var pageObject = sourceParser.parsePage(pageNumber);
    var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();
    var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');
    // Read the page's original content stream into a byte array
    var data = [];
    var readStream = sourceParser.startReadingFromStream(textStream);
    while(readStream.notEnded()){
        Array.prototype.push.apply(data, readStream.read(10000));
    }
    // Buffer.from replaces the deprecated new Buffer(data) constructor
    var string = Buffer.from(data).toString().replace(findText, replaceText);

    // Overwrite the original content object with the modified text
    var objectsContext = writer.getObjectsContext();
    objectsContext.startModifiedIndirectObject(textObjectId);

    var stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(string));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();

    writer.end();
}

// replaceText('source.pdf', 'output.pdf', 0, /REPLACEME/g, 'My New Custom Text');
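
A note on the findText argument: the function passes it straight to String.prototype.replace, so a plain string marker replaces only the first occurrence in the page's content, while a regular expression with the g flag replaces all of them. A minimal illustration, assuming a made-up content fragment (the sample string is hypothetical, not taken from a real PDF):

```javascript
// Hypothetical fragment of decoded page content containing the marker twice.
const content = 'BT (REPLACEME) Tj ET BT (REPLACEME) Tj ET';

// A string pattern replaces only the first occurrence.
const first = content.replace('REPLACEME', 'Hello');

// A regex with the g flag replaces every occurrence.
const all = content.replace(/REPLACEME/g, 'Hello');

console.log(first);
console.log(all);
```

This is why the usage line above passes /REPLACEME/g rather than the string 'REPLACEME'.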

UPDATE:
The version used when this example was written was 1.0.83; the API may have changed since then.

UPDATE 2: I recently ran into a problem with another PDF file that used a different font. For some reason the text was split into small chunks, i.e. the string QWERTYUIOPASDFGHJKLZXCVBNM1234567890- was represented as -286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-). I had no better idea than to build a regex. So instead of this one line:

var string = Buffer.from(data).toString().replace(findText, replaceText);

I have something like this now:

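var string = Buffer.from(data).toString();

// REPLACE_ME and REPLACE_WITH_THIS are placeholders for the marker and its replacement
var characters = REPLACE_ME;
var match = [];
for (var a = 0; a < characters.length; a++) {
    // Escape regex metacharacters so every marker character is matched literally
    var escaped = characters[a].replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // Each character may be preceded by a number and wrapped in parentheses
    match.push('(-?[0-9]+)?(\\()?' + escaped + '(\\))?');
}

string = string.replace(new RegExp(match.join('')), function(m, m1) {
    // m1 holds the number before the first character (the spacing), if there was one
    return (m1 || '') + '( ' + REPLACE_WITH_THIS + ')';
});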
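
To make the regex approach easier to follow, here is a self-contained sketch that runs it against the chunked string quoted above, without any PDF library. The helper names (escapeRegExp, buildSplitTextPattern), the surrounding "[...] TJ" wrapper and the replacement text "Hello" are illustrative assumptions, not part of the original answer:

```javascript
// Escape regex metacharacters so each marker character is matched literally.
function escapeRegExp(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern that tolerates an optional number and optional parentheses
// around every character of the marker.
function buildSplitTextPattern(marker) {
  const parts = [];
  for (const ch of marker) {
    parts.push('(-?[0-9]+)?(\\()?' + escapeRegExp(ch) + '(\\))?');
  }
  return new RegExp(parts.join(''));
}

// The chunked representation from the answer, wrapped in an array for illustration.
const chunked = '[-286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-)] TJ';

const replaced = chunked.replace(
  buildSplitTextPattern('QWERTYUIOPASDFGHJKLZXCVBNM1234567890-'),
  // m1 is the number before the first character; keep it if present.
  (m, m1) => (m1 || '') + '( Hello)'
);

console.log(replaced); // [-286( Hello)] TJ
```

The whole chunked run collapses into a single string, so any spacing adjustments between the original chunks are lost; that seems acceptable for short markers but may shift the layout slightly.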

edited Dec 16, 2020 at 18:40 by ginjaemocoes
answered Jun 28, 2017 at 9:55 by Alex K

18 Comments

Nithin Over a year ago

I am getting the following error: TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

Alex K Over a year ago

@Nithin the version used was 1.0.83, maybe something changed... but did you try with the simplest pdf file first? is the text selectable when you open the pdf?

Alex K Over a year ago

@Nithin that means the text is vectorized, you cannot replace it as it's being presented as vectors

Alex K Over a year ago

@Nithin I would inspect what pageObject.getDictionary().toJSObject() returns, without trying to guess

M.Abulsoud Over a year ago

@Nithin Where did you reach with this?
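
One possible cause of the "getObjectID is not a function" error (an assumption, not confirmed anywhere in this thread): the PDF format allows a page's Contents entry to be an array of stream references instead of a single reference, and an array has no getObjectID method. The sketch below uses duck typing so it runs without the library; contentsObjectIds is a hypothetical helper, and toJSArray() is the array accessor as I understand the HummusJS API, so please check it against the version you have installed:

```javascript
// Hypothetical helper: collect object IDs whether Contents is a single
// reference or an array of references.
function contentsObjectIds(contents) {
  if (contents && typeof contents.getObjectID === 'function') {
    return [contents.getObjectID()];
  }
  if (contents && typeof contents.toJSArray === 'function') {
    // Assumed: every array element is itself an indirect reference.
    return contents.toJSArray().map((ref) => ref.getObjectID());
  }
  return []; // no content, or an object type this sketch does not handle
}

// Stand-ins for parser objects, only to show the two shapes.
const singleRef = { getObjectID: () => 12 };
const refArray = {
  toJSArray: () => [{ getObjectID: () => 12 }, { getObjectID: () => 15 }],
};

console.log(contentsObjectIds(singleRef));
console.log(contentsObjectIds(refArray));
```

If a page does have several content streams, the marker could sit in any of them, so each one would need the same read, replace and write treatment.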


Syas · Accepted Answer · 2022-04-15 01:08:23Z

Building on Alex's (and others') solution, I noticed an issue where some non-text data was becoming corrupted. I tracked this down to encoding/decoding the PDF text as utf-8 instead of as a binary string. Anyway, here's a modified solution that:

Avoids corrupting non-text data
Uses streams instead of files
Allows multiple patterns/replacements
Uses the MuhammaraJS package which is a maintained fork of HummusJS (should be able to swap in HummusJS just fine as well)
Is written in TypeScript (feel free to remove the types for JS)

import muhammara from "muhammara";

interface Pattern {
  searchValue: RegExp | string;
  replaceValue: string;
}

/**
 * Modify a PDF by replacing text in it
 */
const modifyPdf = ({
  sourceStream,
  targetStream,
  patterns,
}: {
  sourceStream: muhammara.ReadStream;
  targetStream: muhammara.WriteStream;
  patterns: Pattern[];
}): void => {
  const modPdfWriter = muhammara.createWriterToModify(sourceStream, targetStream, { compress: false });
  const numPages = modPdfWriter
    .createPDFCopyingContextForModifiedFile()
    .getSourceDocumentParser()
    .getPagesCount();

  for (let page = 0; page < numPages; page++) {
    const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile();
    const objectsContext = modPdfWriter.getObjectsContext();

    const pageObject = copyingContext.getSourceDocumentParser().parsePage(page);
    const textStream = copyingContext
      .getSourceDocumentParser()
      .queryDictionaryObject(pageObject.getDictionary(), "Contents");
    const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();

    let data: number[] = [];
    const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream);
    while (readStream.notEnded()) {
      const readData = readStream.read(10000);
      data = data.concat(readData);
    }

    const pdfPageAsString = Buffer.from(data).toString("binary"); // key change 1

    let modifiedPdfPageAsString = pdfPageAsString;
    for (const pattern of patterns) {
      modifiedPdfPageAsString = modifiedPdfPageAsString.replaceAll(pattern.searchValue, pattern.replaceValue);
    }

    // Create what will become our new text object
    objectsContext.startModifiedIndirectObject(textObjectID);

    const stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(modifiedPdfPageAsString));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();
  }

  modPdfWriter.end();
};

/**
 * Create a byte array from a string, as muhammara expects
 */
const strToByteArray = (str: string): number[] => {
  const myBuffer = [];
  const buffer = Buffer.from(str, "binary"); // key change 2
  for (let i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
  }
  return myBuffer;
};
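
The two "key change" lines matter because a utf-8 round trip is lossy for arbitrary bytes, while "binary" (latin1) maps every byte to exactly one character and back. A small demonstration that needs no PDF library; the sample bytes are made up for illustration:

```javascript
// Made-up bytes such as might appear in non-text PDF data:
// 0xE9 and 0xFF do not form valid UTF-8 here.
const original = [0x28, 0xe9, 0xff, 0x29];

// Round trip through utf-8: invalid bytes decode to U+FFFD,
// which re-encodes as three bytes, so the data changes.
const viaUtf8 = Array.from(
  Buffer.from(Buffer.from(original).toString('utf8'), 'utf8')
);

// Round trip through "binary": one byte per character in both directions.
const viaBinary = Array.from(
  Buffer.from(Buffer.from(original).toString('binary'), 'binary')
);

const same = (a, b) => a.length === b.length && a.every((v, i) => v === b[i]);

console.log('utf8 preserved bytes:', same(viaUtf8, original));
console.log('binary preserved bytes:', same(viaBinary, original));
```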

And then to use modifyPdf:

/**
 * Fill a PDF with template data
 */
export const fillPdf = async (sourceBuffer: Buffer): Promise<Buffer> => {
  const sourceStream = new muhammara.PDFRStreamForBuffer(sourceBuffer);
  const targetStream = new muhammara.PDFWStreamForBuffer();

  modifyPdf({
    sourceStream,
    targetStream,
    patterns: [{ searchValue: "home", replaceValue: "emoh" }], // TODO use actual patterns
  });

  return targetStream.buffer;
};
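
One caveat when a pattern's searchValue is a RegExp: String.prototype.replaceAll throws a TypeError if the regex does not have the g flag. A small guard could normalize patterns before they are used; ensureGlobal is a hypothetical helper name, not part of the answer's code:

```javascript
// Hypothetical helper: strings and global regexes pass through unchanged,
// any other regex is rebuilt with the g flag added.
function ensureGlobal(searchValue) {
  if (typeof searchValue === 'string' || searchValue.global) {
    return searchValue;
  }
  return new RegExp(searchValue.source, searchValue.flags + 'g');
}

const page = '(home) Tj (home) Tj';

let threw = false;
try {
  page.replaceAll(/home/, 'emoh'); // non-global regex: TypeError
} catch (err) {
  threw = err instanceof TypeError;
}

const fixed = page.replaceAll(ensureGlobal(/home/), 'emoh');

console.log(threw);
console.log(fixed);
```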

I'm getting results like "Tm [<0003000400050006000700080006>] TJ" when converting to a string. I tried both utf-8 and binary, e.g. Buffer.from(data).toString('utf-8').
Maybe newer versions have changed the API; this example is not working for me. I am new to this, so if you have time, could you update it to work with the latest version of MuhammaraJS? Thank you.
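
A note on the output above: `<0003000400050006000700080006>` is a hex string rather than a literal one. A likely explanation (an assumption, since the PDF in question is not available) is that the font uses two-byte codes that index glyphs, so the marker never appears as readable text in the page content, whichever encoding the buffer is decoded with. A sketch that splits such a hex string into its codes; hexStringToCodes is a hypothetical helper and the two-byte width is assumed:

```javascript
// Hypothetical helper: split a PDF hex string into fixed-width numeric codes.
function hexStringToCodes(hex, bytesPerCode = 2) {
  const digits = hex.replace(/[<>\s]/g, ''); // drop delimiters and whitespace
  const width = bytesPerCode * 2;            // two hex digits per byte
  const codes = [];
  for (let i = 0; i < digits.length; i += width) {
    codes.push(parseInt(digits.slice(i, i + width), 16));
  }
  return codes;
}

console.log(hexStringToCodes('<0003000400050006000700080006>'));
// [ 3, 4, 5, 6, 7, 8, 6 ]
```

Mapping such codes back to characters generally needs the font's own mapping tables, which is beyond what a plain string replacement can do.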

Tilal Ahmad · Accepted Answer · 2020-08-17 12:06:52Z

There is another Node.js package, asposepdfcloud (the Aspose.PDF Cloud SDK for Node.js), that you can use to replace text in a PDF document. Its free plan offers 150 credits monthly. Here is sample code to replace text in a PDF document; don't forget to install asposepdfcloud first.

const { PdfApi } = require("asposepdfcloud");
const { TextReplaceListRequest } = require("asposepdfcloud/src/models/textReplaceListRequest");
const { TextReplace } = require("asposepdfcloud/src/models/textReplace");

// Get App key and App SID from https://aspose.cloud
// Declared with const so the assignment does not create an implicit global
const pdfApi = new PdfApi("xxxxx-xxxxx-xxxx-xxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxb");

var fs = require('fs');

const name = "02_pages.pdf";
const remoteTempFolder = "Temp";
//const localTestDataFolder = "C:\\Temp";
//const path = remoteTempFolder + "\\" + name;
//var data = fs.readFileSync(localTestDataFolder + "\\" + name);
    
const textReplace = new TextReplace();
textReplace.oldValue = "origami";
textReplace.newValue = "aspose";
textReplace.regex = false;

const textReplace1 = new TextReplace();
textReplace1.oldValue = "candy";
textReplace1.newValue = "biscuit";
textReplace1.regex = false;

const trr = new TextReplaceListRequest();
trr.textReplaces = [textReplace, textReplace1];

// Upload File
//pdfApi.uploadFile(path, data).then((result) => {
//    console.log("Uploaded File");
//}).catch(function(err) {
//    // Deal with an error
//    console.log(err);
//});


// Replace text
pdfApi.postDocumentTextReplace(name, trr, null, remoteTempFolder).then((result) => {    
    console.log(result.body.code);                  
}).catch(function(err) {
    // Deal with an error
    console.log(err);
});

P.S: I'm developer evangelist at aspose.

"I'm developer evangelist at aspose" - And so I have noticed. And I'm pretty sure advertising here is not exactly allowed.
