I have read a PDF file using PDFBox in Java, converted the data to text, and saved it in a String. I have found that a lot of the text data is surrounded by X'C2A0'. For instance:
X'436C756233AC2A04469616D6F6E64C2A0' Club:__Diamond__
__ is X'C2A0'
I want to search for "Club:__", then parse between the two __ for "Diamond". I have tried something like:
String TAG = "\\xC2A0"; // Tag in PDF
int pos = text.indexOf(TAG, positionInText);
but I never get any hits. How do I specify TAG?
EDIT:
Maybe some clarification is needed. I used PDFBOX as such:
public void toText() throws IOException
{
    this.pdfStripper = null;
    this.pdDoc = null;
    this.cosDoc = null;
    file = new File(filePath);
    parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
    parser.parse();
    cosDoc = parser.getDocument();
    pdfStripper = new PDFTextStripper();
    pdDoc = new PDDocument(cosDoc);
    pdDoc.getNumberOfPages();
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(10);
    // reading text from page 1 to 10
    // if you want to get text from full pdf file use this code
    // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
    text = pdfStripper.getText(pdDoc);
}
text is a field defined as String. This text String is what I am trying to parse.
\\xC2A0? Can you post an actual example?
33 should only be one 3. If you convert the hex to bytes, then decode using UTF-8, you get Club:_Diamond_, where the two underscores are C2A0 (UTF-8) aka 'NO-BREAK SPACE' (U+00A0).
It's a 2-byte UTF-8 encoding of the single NBSP character (A0).
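Following the comments above, here is a minimal sketch of how the tag could be specified, assuming the String returned by PDFTextStripper holds the no-break space as the single character U+00A0 (the bytes C2 A0 only exist in its UTF-8 encoded form). The class name NbspTagDemo and the helper valueAfter are hypothetical names made up for this illustration, and the byte array is the hex dump from the question with the doubled 3 corrected to 3A.

```java
import java.nio.charset.StandardCharsets;

public class NbspTagDemo {
    // U+00A0 NO-BREAK SPACE is one Java char; C2 A0 is only its UTF-8 byte form.
    static final String NBSP = "\u00A0";

    // Hypothetical helper: returns the text between the NBSP that follows
    // `label` and the next NBSP, or null if either delimiter is missing.
    static String valueAfter(String text, String label) {
        String tag = label + NBSP;
        int start = text.indexOf(tag);
        if (start < 0) {
            return null; // label followed by NBSP not found
        }
        start += tag.length();
        int end = text.indexOf(NBSP, start);
        if (end < 0) {
            return null; // no closing NBSP
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Bytes from the question's hex dump, with "33A" corrected to "3A" (':').
        byte[] raw = {
            0x43, 0x6C, 0x75, 0x62, 0x3A, (byte) 0xC2, (byte) 0xA0,
            0x44, 0x69, 0x61, 0x6D, 0x6F, 0x6E, 0x64, (byte) 0xC2, (byte) 0xA0
        };
        String text = new String(raw, StandardCharsets.UTF_8);
        System.out.println(valueAfter(text, "Club:")); // prints "Diamond"
    }
}
```

In the original snippet, the literal "\\xC2A0" is the seven-character string \xC2A0 (a backslash followed by "xC2A0"), which never occurs in the extracted text, so indexOf returns -1. Replacing it with "\u00A0" should make the search match, provided the extracted text really contains U+00A0 at those positions.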