
I have read a PDF file using PDFBox in Java, converted the data to text, and saved it in a String. I have found that much of the text data is surrounded by X'C2A0'. For instance:

X'436C756233AC2A04469616D6F6E64C2A0'       Club:__Diamond__

__ is X'C2A0'

I want to search for "Club:__", then parse between the two __ for "Diamond". I have tried something like:

String TAG = "\\xC2A0";                     // Tag in PDF

int pos = text.indexOf(TAG, positionInText);

but I never get any hits. How do I specify TAG?

EDIT:

Maybe some clarification is needed. I used PDFBOX as such:

   public void toText() throws IOException
   {
       this.pdfStripper = null;
       this.pdDoc = null;
       this.cosDoc = null;

       file = new File(filePath);
       parser = new PDFParser(new RandomAccessFile(file,"r"));      // update for PDFBox V 2.0

       parser.parse();
       cosDoc = parser.getDocument();
       pdfStripper = new PDFTextStripper();
       pdDoc = new PDDocument(cosDoc);
       pdDoc.getNumberOfPages();
       pdfStripper.setStartPage(1);
       pdfStripper.setEndPage(10);

       // reading text from page 1 to 10
       // if you want to get text from full pdf file use this code
       // pdfStripper.setEndPage(pdDoc.getNumberOfPages());

       text = pdfStripper.getText(pdDoc);
   }

text is a field defined as String. This text String is what I am trying to parse.

  • Confusing question. What's \\xC2A0? Can you post an actual example? Commented Nov 27, 2016 at 21:04
  • Why not TAG="Club" ? Commented Nov 27, 2016 at 21:05
  • The hex is wrong, the 33 should only be one 3. If you convert the hex to bytes, then decode using UTF-8, you get Club:_Diamond_, where the two underscores are C2A0 (UTF-8) aka 'NO-BREAK SPACE' (U+00A0). It's a 2-byte UTF-8 encoding of the single NBSP character (A0). Commented Nov 27, 2016 at 21:09
  • Is the string above literal.. i.e. String data = "X'436C756233AC2A04469616D6F6E64C2A0'"... or is this from a hexdump / debugger tool? Commented Nov 27, 2016 at 21:26
  • @Andreas one 3 is correct, my mistake, I mistyped it. Everything you are saying sounds correct. How do I code my TAG for this? The string data is hand-typed from a hexdump. I can search for "Club", but mainly I want to parse between two X'C2A0'. Commented Nov 27, 2016 at 22:12

2 Answers

It's not completely clear from your question if the string you are searching is hex-encoded itself or is a normal character string that in the file contains 2-byte sequences with the character values 0xc2 0xa0.

Assuming the latter case, the byte sequence 0xC2 0xA0 in the file is the UTF-8 encoding of the Unicode code point U+00A0, the no-break space, which corresponds to the &nbsp; entity in HTML.

If the file contains these two-byte sequences, then when the data is read into your Java string (assuming the byte stream was decoded as UTF-8), each of these sequences becomes a single character 0xA0 (U+00A0) in your string.
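To make the byte-versus-character distinction concrete, here is a minimal sketch that decodes the bytes from the question's hex dump (with the stray extra 3 dropped, as noted in the comments) as UTF-8. The class and method names are made up for illustration, and it only demonstrates the decoding step, not how PDFBox builds its text internally.

```java
import java.nio.charset.StandardCharsets;

public class NbspDecodeDemo {
    // Decodes raw bytes as UTF-8 text (hypothetical helper for this demo).
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Club:" + C2 A0 + "Diamond" + C2 A0
        byte[] raw = {
            0x43, 0x6C, 0x75, 0x62, 0x3A, (byte) 0xC2, (byte) 0xA0,
            0x44, 0x69, 0x61, 0x6D, 0x6F, 0x6E, 0x64, (byte) 0xC2, (byte) 0xA0
        };
        String text = decodeUtf8(raw);
        System.out.println(text.length());        // 14 chars decoded from 16 bytes
        System.out.println((int) text.charAt(5)); // 160, i.e. 0xA0
    }
}
```

Each two-byte C2 A0 pair collapses into one char, which is why a search for a two-character tag never matches.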

You should be able to write a regular expression to find data delimited by pairs of these.
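One possible shape for such a regular expression is sketched below. It assumes the extracted text really contains "Club:" followed immediately by a no-break space, as in the question's example; the class name, the `findClub` helper, and the exact pattern are illustrative and not taken from the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NbspRegexDemo {
    // "Club:", a no-break space, then capture everything up to the next no-break space.
    private static final Pattern CLUB =
            Pattern.compile("Club:\\u00A0([^\\u00A0]*)\\u00A0");

    // Returns the captured value, or null when the pattern does not occur.
    static String findClub(String text) {
        Matcher m = CLUB.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Club:\u00A0Diamond\u00A0";
        System.out.println(findClub(text)); // prints Diamond
    }
}
```

If the real text has ordinary spaces or line breaks between the label and the delimiter, the pattern would need adjusting.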


1 Comment

Try searching for \xA0

@Jim Garrison your answer got me searching. I still do not understand UTF-8 encoding. Your last 2 paragraphs were right on. I guess PDFBOX is using UTF-8 to read the PDF file. I used the following:

private final String TAG = "\u00A0";                    // Tag &nbsp X'C2A0'

to find and parse data between two x'C2A0' tags.
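For completeness, a rough sketch of how that TAG might be used with indexOf to pull out the value. The `valueAfter` helper and the `label` parameter are hypothetical names, and the sample string stands in for the text returned by PDFTextStripper.

```java
public class NbspIndexOfDemo {
    private static final String TAG = "\u00A0"; // no-break space, U+00A0

    // Returns the text between the first two TAGs found after the label,
    // or null if the label or either TAG is missing.
    static String valueAfter(String text, String label) {
        int labelPos = text.indexOf(label);
        if (labelPos < 0) return null;
        int start = text.indexOf(TAG, labelPos + label.length());
        if (start < 0) return null;
        int end = text.indexOf(TAG, start + TAG.length());
        if (end < 0) return null;
        return text.substring(start + TAG.length(), end);
    }

    public static void main(String[] args) {
        String text = "Club:\u00A0Diamond\u00A0";
        System.out.println(valueAfter(text, "Club:")); // prints Diamond
    }
}
```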
