java hex data in string

Question

I have read a PDF file using PDFBox in Java, converted the data to text, and saved it in a String. I have found that much of the text data is surrounded by X'C2A0'. For instance:

X'436C756233AC2A04469616D6F6E64C2A0'       Club:__Diamond__

__ is X'C2A0'

I want to search for "Club:__", then parse between the two __ for "Diamond". I have tried something like:

String TAG = "\\xC2A0";                     // Tag in PDF

int pos = text.indexOf(TAG, positionInText);

but I never get any hits. How do I specify TAG?

EDIT:

Maybe some clarification is needed. I used PDFBOX as such:

   public void toText() throws IOException
   {
       this.pdfStripper = null;
       this.pdDoc = null;
       this.cosDoc = null;

       file = new File(filePath);
       parser = new PDFParser(new RandomAccessFile(file,"r"));      // update for PDFBox V 2.0

       parser.parse();
       cosDoc = parser.getDocument();
       pdfStripper = new PDFTextStripper();
       pdDoc = new PDDocument(cosDoc);
       pdDoc.getNumberOfPages();
       pdfStripper.setStartPage(1);
       pdfStripper.setEndPage(10);

       // reading text from page 1 to 10
       // if you want to get text from full pdf file use this code
       // pdfStripper.setEndPage(pdDoc.getNumberOfPages());

       text = pdfStripper.getText(pdDoc);
   }

text is a field defined as a String. This text String is what I am trying to parse.

Confusing question. What's \\xC2A0? Can you post an actual example?
– shmosel, Commented Nov 27, 2016 at 21:04
The hex is wrong, the 33 should only be one 3. If you convert the hex to bytes, then decode using UTF-8, you get Club:_Diamond_, where the two underscores are C2A0 (UTF-8) aka 'NO-BREAK SPACE' (U+00A0). It's a 2-byte UTF-8 encoding of the single NBSP character (A0).
– Andreas, Commented Nov 27, 2016 at 21:09
Is the string above a literal, i.e. String data = "X'436C756233AC2A04469616D6F6E64C2A0'", or is this from a hexdump / debugger tool?
– Adam, Commented Nov 27, 2016 at 21:26
@Andreas one 3 is correct, my mistake, I mistyped. Everything you are saying sounds correct. How do I code my TAG for this? The string data is hand-typed from a hexdump. I can search for "Club", but mainly I want to parse between two X'C2A0'.
– todivefor, Commented Nov 27, 2016 at 22:12

Jim Garrison · Accepted Answer · 2016-11-27 22:18:41Z

1

It's not completely clear from your question whether the string you are searching is itself hex-encoded, or is a normal character string whose source file contains two-byte sequences with the byte values 0xC2 0xA0.

Assuming the latter case: in the file, the byte sequence 0xC2 0xA0 is the UTF-8 encoding of the Unicode code point U+00A0, the non-breaking space, which corresponds to the &nbsp; entity in HTML.

If the file contains these two-byte sequences, then once it is read into a Java String (assuming the byte stream was decoded as UTF-8), each sequence becomes a single character with the value 0xA0.
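To make the byte-versus-character distinction concrete, here is a minimal sketch that decodes the corrected byte sequence from the question as UTF-8. The class name NbspDecodeDemo and the helper decodeUtf8 are made up for illustration; they are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

final class NbspDecodeDemo {
    // Hypothetical helper: decodes raw bytes as UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Club:" + C2 A0 + "Diamond" + C2 A0 (the question's hex dump with the typo fixed)
        byte[] raw = {
            0x43, 0x6C, 0x75, 0x62, 0x3A, (byte) 0xC2, (byte) 0xA0,
            0x44, 0x69, 0x61, 0x6D, 0x6F, 0x6E, 0x64, (byte) 0xC2, (byte) 0xA0
        };
        String s = decodeUtf8(raw);
        System.out.println(raw.length);            // 16 bytes
        System.out.println(s.length());            // 14 chars: each C2 A0 pair became one char
        System.out.println((int) s.charAt(5));     // 160, i.e. U+00A0
    }
}
```

The two-byte pair never appears in the String itself, which is why a search for a "C2A0"-style tag finds nothing.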

You should be able to write a regular expression to find data delimited by pairs of these.
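One possible shape for such a regular expression, as a sketch only: it assumes the label is literally "Club:" and that the value itself never contains a non-breaking space. The class NbspRegexDemo and the method extractClub are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NbspRegexDemo {
    // "Club:", one NBSP, then capture everything up to (not including) the next NBSP.
    private static final Pattern CLUB =
            Pattern.compile("Club:\u00A0([^\u00A0]*)\u00A0");

    // Returns the delimited value, or null when the pattern does not occur.
    static String extractClub(String text) {
        Matcher m = CLUB.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Club:\u00A0Diamond\u00A0";
        System.out.println(extractClub(text));     // Diamond
    }
}
```

The negated character class keeps the match from running past the closing delimiter when several delimited fields appear on the same line.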


1 Comment

Jim Garrison Over a year ago

Try searching for \xA0

todivefor · Accepted Answer · 2016-11-28 15:05:15Z

0

@Jim Garrison your answer got me searching. I still do not understand UTF-8 encoding, but your last two paragraphs were right on. I guess PDFBox is using UTF-8 to read the PDF file. I used the following:

private final String TAG = "\u00A0";                    // Tag &nbsp X'C2A0'

to find and parse data between two x'C2A0' tags.
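A minimal sketch of how that search might look with indexOf, assuming the value of interest sits between the first two tags after the "Club:" label. The class NbspIndexOfDemo, the method between, and the sample text are illustrative, not the original code.

```java
final class NbspIndexOfDemo {
    private static final String TAG = "\u00A0";    // NBSP, U+00A0

    // Returns the text between the first TAG at or after 'from' and the following TAG,
    // or null when two tags cannot be found.
    static String between(String text, int from) {
        int start = text.indexOf(TAG, from);
        if (start < 0) {
            return null;
        }
        int valueStart = start + TAG.length();
        int end = text.indexOf(TAG, valueStart);
        if (end < 0) {
            return null;
        }
        return text.substring(valueStart, end);
    }

    public static void main(String[] args) {
        String text = "Name:\u00A0Bob\u00A0\nClub:\u00A0Diamond\u00A0";
        int label = text.indexOf("Club:");
        System.out.println(between(text, label));  // Diamond
    }
}
```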
