Get string after string with trailing whitespaces

Question

I am currently learning how to use regex and have hit a problem I can't figure out. These are the test strings (they actually come from OCR'd PDFs):

string1 = 'Beleg-Nr.:12123-23131'; // no spaces after the colon
string2 = 'Beleg-Nr.:    12121-214331'; // a tab after the colon
string3 = 'Beleg-Nr.:        12-982831'; // a tab and spaces after the colon

I want to extract just the numbers. For that I use this pattern:

pattern = '/(?<=Beleg-Nr\.:[ \t]*)(.*)

This will get me the pure numbers for string1 and string2 but isn't working on string3 (it gives me additional whitespace before the number).

What am I missing here?

Edit: Thanks for all the helpful advice. The software that OCRs on the fly is able to suppress whitespace on its own in regexes. This did the trick. The resulting pattern is:

(?<=Beleg-Nr\.:[\s]*)(.*)

Wait, you just want digits, right? Then just use: (\d+)-(\d+)$
– Rohit Jain, commented Aug 6, 2013 at 10:24
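
The suggestion in this comment can be tried out with a short sketch. Python is used purely for illustration, since the question does not say which regex engine the OCR software uses, and `extract_number_parts` is a hypothetical helper name, not something from the thread.

```python
import re

# Two digit groups separated by a hyphen, anchored at the end of the string.
DIGITS_AT_END = re.compile(r"(\d+)-(\d+)$")


def extract_number_parts(text):
    """Return the two digit groups at the end of text, or None if absent."""
    match = DIGITS_AT_END.search(text)
    return match.groups() if match else None


# The three test strings from the question, with explicit tab escapes.
samples = [
    "Beleg-Nr.:12123-23131",
    "Beleg-Nr.:\t12121-214331",
    "Beleg-Nr.:\t   12-982831",
]
for sample in samples:
    print(extract_number_parts(sample))
```

Because the pattern never mentions the label, the whitespace after the colon does not matter. The trade-off is the `$` anchor: if the OCR output has trailing spaces after the number, this pattern would not match.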

Alma Do · Accepted Answer · 2013-08-06 10:24:20Z

3

You can use the "\s" shorthand, which matches both spaces and tabs (so you do not need to combine them in a character class via []).

answered Aug 6, 2013 at 10:24

jerone · Accepted Answer · 2013-08-06 10:28:04Z

2

This works for me:

/(Beleg-Nr.:\s*)(.*)/

http://regexr.com?35rj6
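
As a rough illustration of this capture-group approach, here is a minimal Python sketch. Python's built-in `re` module only accepts fixed-width lookbehind, so a pattern like the one in the question would be rejected there, and matching the label in a group is the portable alternative. The helper name `value_after_label` is hypothetical, and the dot after "Nr" is escaped here, unlike in the pattern above.

```python
import re

# Group 1 matches the label plus any whitespace and is discarded;
# group 2 is the rest of the line, which is the value we want.
LABEL_THEN_VALUE = re.compile(r"(Beleg-Nr\.:\s*)(.*)")


def value_after_label(text):
    """Return whatever follows the label and its whitespace, or None."""
    match = LABEL_THEN_VALUE.search(text)
    return match.group(2) if match else None


print(value_after_label("Beleg-Nr.:\t   12-982831"))
```

The caller has to read the second group rather than the whole match, which is the price for not using a lookbehind.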

answered Aug 6, 2013 at 10:28

mishik · Accepted Answer · 2013-08-06 10:31:39Z

2

The problem is that [ ]* will match only spaces. You need to use \s, which matches any whitespace character (in JavaScript, for example, \s includes at least [ \f\n\r\t\v\u00A0\u2028\u2029]):

/(?<=Beleg-Nr.:\s*)(.*)/

Side note: * is greedy by default, so it will try to match as many whitespace characters as possible; you therefore do not need a negated [^\s] in your last () group.
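
To make the difference between the character classes concrete, here is a small Python sketch. The sample gap (a tab, three spaces, and a no-break space) is an assumption chosen for illustration. OCR output might contain such characters, but the thread does not confirm what the customer document actually holds. `matched_length` is a hypothetical helper.

```python
import re

# Assumed sample gap: one tab, three spaces, one no-break space (U+00A0).
gap = "\t   \u00a0"


def matched_length(pattern, text):
    """Return how many leading characters of text the pattern consumes."""
    return re.match(pattern, text).end()


print(matched_length(r"[ ]*", gap))    # space-only class stops at the tab
print(matched_length(r"[ \t]*", gap))  # space/tab class stops at U+00A0
print(matched_length(r"\s*", gap))     # \s consumes the whole gap
```

In Python, `\s` on `str` patterns covers Unicode whitespace. Other engines may define `\s` more narrowly, so it is worth checking the documentation of the engine in use.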

edited Aug 6, 2013 at 10:31

answered Aug 6, 2013 at 10:26

3 Comments

Sebastian Over a year ago

This works well for my 2 test documents. But somehow the customer document still gets messed up and has whitespace before the number.

urzeit Over a year ago

Well, whether \s exists depends on which regex implementation is used, right?

Sebastian Over a year ago

@mishik Unfortunately, I cannot show the documents right away.

urzeit · Accepted Answer · 2013-08-06 10:54:22Z

0

Just replace the (.*) with a more restrictive pattern ([^ ]+$, for example). Also note that an unescaped . after Beleg-Nr matches any other character as well.

The $ in my example matches the end of the line and thus ensures that everything up to the end of the line is part of the match.

I'd suggest excluding tabs as well:

pattern = '/(?<=Beleg-Nr\.:[ \t]*)([^ \t]+)$
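
A hedged Python sketch of the same idea follows. Python's `re` module does not allow a variable-width lookbehind, so the label is matched normally and the value is taken from a capture group instead. `strict_value` is a hypothetical name.

```python
import re

# Label, optional spaces/tabs, then one or more characters that are neither
# space nor tab, which must run all the way to the end of the string.
STRICT = re.compile(r"Beleg-Nr\.:[ \t]*([^ \t]+)$")


def strict_value(text):
    """Return the blank-free value after the label, or None if none matches."""
    match = STRICT.search(text)
    return match.group(1) if match else None


print(strict_value("Beleg-Nr.:\t   12-982831"))
print(strict_value("Beleg-Nr.:12123-23131 extra"))
```

Unlike (.*), this version refuses input with anything after the number, so unexpected OCR output shows up as a missing match instead of a polluted value.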

edited Aug 6, 2013 at 10:54

answered Aug 6, 2013 at 10:23
