14

I have a template PDF file, and I want to replace some marker strings to generate new PDF files and save them. What's the best/simplest way to do this? I don't need to add graphics or anything fancy, just a simple text replacement, so I don't want anything too complicated.

Thanks!

Edit: Just found HummusJS, I'll see if I can make progress and post it here.

3
  • 1
    Hi there Manuel! Did you find a solution? Commented Nov 17, 2016 at 21:44
  • 1
    I am also curious.. Commented Dec 8, 2016 at 22:51
  • I have the same situation. Did you find a solution? Commented Sep 30, 2019 at 12:07

3 Answers

16

I found this question by searching, so I think it deserves an answer. I found one by BrighTide here: https://github.com/galkahana/HummusJS/issues/71#issuecomment-275956347

Basically, there is a very powerful package called Hummus, which wraps a cross-platform library written in C++. I think the answer given in that GitHub comment can be turned into a function like this:

var hummus = require('hummus');

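/**
 * Converts a string to an array of byte values,
 * the form in which the data is passed to the hummus write stream below.
 *
 * @param {string} str - input string
 * @returns {number[]} the bytes of the string
 */
function strToByteArray(str) {
  var myBuffer = [];
  var buffer = Buffer.from(str); // new Buffer() is deprecated
  for (var i = 0; i < buffer.length; i++) {
      myBuffer.push(buffer[i]);
  }
  return myBuffer;
}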

function replaceText(sourceFile, targetFile, pageNumber, findText, replaceText) {  
    var writer = hummus.createWriterToModify(sourceFile, {
        modifiedFilePath: targetFile
    });
    var sourceParser = writer.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
    var pageObject = sourceParser.parsePage(pageNumber);
    var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();
    var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');
    //read the original block of text data
    var data = [];
    var readStream = sourceParser.startReadingFromStream(textStream);
    while(readStream.notEnded()){
        Array.prototype.push.apply(data, readStream.read(10000));
    }
    var string = Buffer.from(data).toString().replace(findText, replaceText);

    //Create and write our new text object
    var objectsContext = writer.getObjectsContext();
    objectsContext.startModifiedIndirectObject(textObjectId);

    var stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(string));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();

    writer.end();
}

// replaceText('source.pdf', 'output.pdf', 0, /REPLACEME/g, 'My New Custom Text');

UPDATE:
The version used at the time of writing this example was 1.0.83; things may have changed since.

UPDATE 2: Recently I ran into an issue with another PDF file that used a different font. For some reason the text was split into small chunks, i.e. the string QWERTYUIOPASDFGHJKLZXCVBNM1234567890- was represented as -286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-) in the content stream. I couldn't think of anything better than making up a regex. So instead of this one line:

var string = Buffer.from(data).toString().replace(findText, replaceText);

I have something like this now:

var string = Buffer.from(data).toString();

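var characters = REPLACE_ME; // the marker string to look for
var match = [];
for (var a = 0; a < characters.length; a++) {
    // each character may be preceded by a number and wrapped in parentheses
    match.push('(-?[0-9]+)?(\\()?' + characters[a] + '(\\))?');
}

string = string.replace(new RegExp(match.join('')), function(m, m1) {
    // m1 holds the first captured item: the spacing number before the first character, if any
    return (m1 || '') + '( ' + REPLACE_WITH_THIS + ')';
});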
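
For anyone who wants to try the chunk-matching idea without Hummus, here is a minimal standalone sketch. It assumes the marker occupies whole chunks of the array, as in the QWERTY example above. The names escapeRegExp, buildKernedPattern and replaceKernedText are made up for illustration and are not part of Hummus. It also escapes regex special characters in the marker, and it assumes the replacement text contains no parentheses or backslashes.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern that matches `marker` even when it is split into chunks
// such as -286(Q)9(WER)24(T): every character may be preceded by a number
// and an opening parenthesis, and followed by a closing parenthesis.
function buildKernedPattern(marker) {
  var parts = marker.split('').map(function (ch, index) {
    // Only the number before the first character is captured, so it can be kept.
    var number = index === 0 ? '(-?[0-9.]+)?' : '(?:-?[0-9.]+)?';
    return number + '\\(?' + escapeRegExp(ch) + '\\)?';
  });
  return new RegExp(parts.join(''));
}

// Replace the first occurrence of `marker` in a content-stream string.
// Assumption: `replacement` contains no parentheses or backslashes, which
// would need escaping inside a PDF string literal.
function replaceKernedText(content, marker, replacement) {
  return content.replace(buildKernedPattern(marker), function (match, number) {
    return (number || '') + '(' + replacement + ')';
  });
}

var sample = '[-286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)]TJ';
console.log(replaceKernedText(sample, 'QWERTYUIOP', 'Hello'));
```

The whole run of chunks collapses into a single chunk holding the new text, and the number in front of the first chunk is kept so the text starts at the same offset.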

18 Comments

I am getting the following error: TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function
@Nithin the version used was 1.0.83, maybe something changed... but did you try with the simplest pdf file first? is the text selectable when you open the pdf?
@Nithin that means the text is vectorized, you cannot replace it as it's being presented as vectors
@Nithin I would inspect what pageObject.getDictionary().toJSObject() returns, without trying to guess
@Nithin Where did you reach with this?
1

Building on Alex's (and others') solution, I noticed an issue where some non-text data was becoming corrupted. I tracked this down to encoding/decoding the PDF text as utf-8 instead of as a binary string. Anyway, here's a modified solution that:

  • Avoids corrupting non-text data
  • Uses streams instead of files
  • Allows multiple patterns/replacements
  • Uses the MuhammaraJS package which is a maintained fork of HummusJS (should be able to swap in HummusJS just fine as well)
  • Is written in TypeScript (feel free to remove the types for JS)

import muhammara from "muhammara";

interface Pattern {
  searchValue: RegExp | string;
  replaceValue: string;
}

/**
 * Modify a PDF by replacing text in it
 */
const modifyPdf = ({
  sourceStream,
  targetStream,
  patterns,
}: {
  sourceStream: muhammara.ReadStream;
  targetStream: muhammara.WriteStream;
  patterns: Pattern[];
}): void => {
  const modPdfWriter = muhammara.createWriterToModify(sourceStream, targetStream, { compress: false });
  const numPages = modPdfWriter
    .createPDFCopyingContextForModifiedFile()
    .getSourceDocumentParser()
    .getPagesCount();

  for (let page = 0; page < numPages; page++) {
    const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile();
    const objectsContext = modPdfWriter.getObjectsContext();

    const pageObject = copyingContext.getSourceDocumentParser().parsePage(page);
    const textStream = copyingContext
      .getSourceDocumentParser()
      .queryDictionaryObject(pageObject.getDictionary(), "Contents");
    const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();

    let data: number[] = [];
    const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream);
    while (readStream.notEnded()) {
      const readData = readStream.read(10000);
      data = data.concat(readData);
    }

    const pdfPageAsString = Buffer.from(data).toString("binary"); // key change 1

    let modifiedPdfPageAsString = pdfPageAsString;
    for (const pattern of patterns) {
      // Note: replaceAll throws a TypeError if searchValue is a RegExp without the global (g) flag
      modifiedPdfPageAsString = modifiedPdfPageAsString.replaceAll(pattern.searchValue, pattern.replaceValue);
    }

    // Create what will become our new text object
    objectsContext.startModifiedIndirectObject(textObjectID);

    const stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(modifiedPdfPageAsString));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();
  }

  modPdfWriter.end();
};

/**
 * Create a byte array from a string, as muhammara expects
 */
const strToByteArray = (str: string): number[] => {
  const myBuffer = [];
  const buffer = Buffer.from(str, "binary"); // key change 2
  for (let i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
  }
  return myBuffer;
};

And then to use it:

/**
 * Fill a PDF with template data
 */
export const fillPdf = async (sourceBuffer: Buffer): Promise<Buffer> => {
  const sourceStream = new muhammara.PDFRStreamForBuffer(sourceBuffer);
  const targetStream = new muhammara.PDFWStreamForBuffer();

  modifyPdf({
    sourceStream,
    targetStream,
    patterns: [{ searchValue: "home", replaceValue: "emoh" }], // TODO use actual patterns
  });

  return targetStream.buffer;
};
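
To illustrate why the "binary" encoding matters, here is a small standalone sketch that needs only Node's built-in Buffer. The helper name roundTrip and the sample bytes are made up for illustration. It converts bytes to a string and back: with utf8, bytes that are not valid UTF-8 come back changed, while binary (an alias for latin1) maps every byte to exactly one character and back.

```javascript
// Convert bytes to a string and back again using the given encoding.
function roundTrip(bytes, encoding) {
  const asString = Buffer.from(bytes).toString(encoding);
  return Array.from(Buffer.from(asString, encoding));
}

// "BT" followed by two bytes that are not valid UTF-8, similar to what
// non-text data inside a content stream might contain.
const original = [0x42, 0x54, 0x80, 0xff];

const viaBinary = roundTrip(original, 'binary');
const viaUtf8 = roundTrip(original, 'utf8');

// Compare each round trip with the original bytes.
console.log('binary preserved bytes:', viaBinary.join(',') === original.join(','));
console.log('utf8 preserved bytes:', viaUtf8.join(',') === original.join(','));
```

Plain ASCII text survives either way, which is probably why the problem only shows up on pages that also carry non-text bytes.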

2 Comments

I get results like "Tm [<0003000400050006000700080006>] TJ" when converting to a string. I tried both utf-8 and binary --> Buffer.from(data).toString('utf-8');
Maybe newer versions have changed the API; this example is not working for me. I am new to this, so if you have time, could you update it to work with the latest version of MuhammaraJS? Thank you
-6

There is another Node.js package, asposepdfcloud (Aspose.PDF Cloud SDK for Node.js). You can use it to replace text in a PDF document. Its free plan offers 150 credits monthly. Here is sample code to replace text in a PDF document; don't forget to install asposepdfcloud first.

const { PdfApi } = require("asposepdfcloud");
const { TextReplaceListRequest } = require("asposepdfcloud/src/models/textReplaceListRequest");
const { TextReplace } = require("asposepdfcloud/src/models/textReplace");

// Get App key and App SID from https://aspose.cloud
const pdfApi = new PdfApi("xxxxx-xxxxx-xxxx-xxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxb");

var fs = require('fs');

const name = "02_pages.pdf";
const remoteTempFolder = "Temp";
//const localTestDataFolder = "C:\\Temp";
//const path = remoteTempFolder + "\\" + name;
//var data = fs.readFileSync(localTestDataFolder + "\\" + name);
    
const textReplace = new TextReplace();
textReplace.oldValue = "origami";
textReplace.newValue = "aspose";
textReplace.regex = false;

const textReplace1 = new TextReplace();
textReplace1.oldValue = "candy";
textReplace1.newValue = "biscuit";
textReplace1.regex = false;

const trr = new TextReplaceListRequest();
trr.textReplaces = [textReplace, textReplace1];

// Upload File
//pdfApi.uploadFile(path, data).then((result) => {
//    console.log("Uploaded File");
//}).catch(function(err) {
//    // Deal with an error
//    console.log(err);
//});


// Replace text
pdfApi.postDocumentTextReplace(name, trr, null, remoteTempFolder).then((result) => {    
    console.log(result.body.code);                  
}).catch(function(err) {
    // Deal with an error
    console.log(err);
});

P.S: I'm developer evangelist at aspose.

1 Comment

"I'm developer evangelist at aspose" - And so I have noticed. And I'm pretty sure advertising here is not exactly allowed.
