1

I have a column in a pandas data frame called sample_id. Each entry contains a string, and from this string I'd like to pull a numeric pattern that will have one of two forms:

1-234-5-6789

or

123-4-5648

I'm having trouble defining the correct regex pattern for this. So far I have been experimenting with the following:

re.findall(pattern=r'\b2\w+', string=str(data['sample_id']))

But this only pulls values that start with 2, and only the first chunk of the numeric pattern. How do I express the above patterns, including the dashes?

  • What is your expected output? Do you want only the numbers? Commented Oct 30, 2018 at 17:22
  • Numbers and hyphens would be best. Commented Oct 30, 2018 at 17:26
  • Something like (?<![\d-])(?:\d-)?\d{3}-\d-\d{4}(?![\d-]) ? Commented Oct 30, 2018 at 17:41

3 Answers

1

A vertical pipe | makes an OR in a regular expression, so you can use:

import re

test1 = '123-4-5648'
test2 = '1-234-5-6789'

# Two alternatives joined by |: the four-chunk form first, then the three-chunk form
pattern = r'[0-9]-[0-9]{3}-[0-9]-[0-9]{4}|[0-9]{3}-[0-9]-[0-9]{4}'
re.findall(pattern, test1)  # ['123-4-5648']
re.findall(pattern, test2)  # ['1-234-5-6789']

[0-9] matches a single digit in the range 0 through 9 (inclusive), {4} indicates that four such digits should occur in a row, - means a hyphen, and | means an OR and separates the two patterns you mention.
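Since the data in the question lives in a pandas column, the same alternation can be applied row by row instead of to str(data['sample_id']). The following is a minimal sketch, assuming a hypothetical data frame whose sample_id strings are made up for illustration; only the two numeric forms come from the question. The numeric_id column name is also an assumption.

```python
import pandas as pd

# Hypothetical data: the text around each ID is invented for illustration.
data = pd.DataFrame(
    {"sample_id": ["run A 1-234-5-6789 x", "batch 123-4-5648", "no id here"]}
)

# Same alternation as above, wrapped in one capture group for str.extract.
pattern = r"([0-9]-[0-9]{3}-[0-9]-[0-9]{4}|[0-9]{3}-[0-9]-[0-9]{4})"

# str.extract keeps the first match in each row; rows without a match become NaN.
data["numeric_id"] = data["sample_id"].str.extract(pattern, expand=False)
print(data)
```

If a row can contain several IDs, Series.str.findall with the same pattern (without the outer group) would return a list per row instead.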

1

You could use an optional group (?:\d-)? to match one digit and a hyphen, followed by \d{3}-\d-\d{4}, which matches the digit layout shared by both examples.

(?:\d-)?\d{3}-\d-\d{4}

Instead of using a word boundary \b, if the value must not be preceded by a non-whitespace character, you could prepend (?<!\S) to the regex, and if it must not be followed by one, you could append (?!\S).
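To see how the optional group and the whitespace lookarounds behave together, here is a small sketch using only the standard re module. The sample strings are hypothetical and chosen to show one match of each form plus two rejections.

```python
import re

# Optional leading "digit-" group, then the shared 3-1-4 digit layout,
# wrapped in the whitespace lookarounds described above.
pattern = re.compile(r"(?<!\S)(?:\d-)?\d{3}-\d-\d{4}(?!\S)")

# Hypothetical sample strings for illustration.
samples = [
    "id 1-234-5-6789 ok",  # first form, surrounded by whitespace
    "123-4-5648",          # second form, whole string
    "x123-4-5648",         # preceded by a non-whitespace character
    "123-4-56489",         # followed by an extra digit
]

results = [pattern.findall(s) for s in samples]
print(results)
# [['1-234-5-6789'], ['123-4-5648'], [], []]
```

The last two strings are rejected by the lookarounds, not by the digit pattern itself, so dropping (?<!\S) and (?!\S) would let both of them match.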

0

If there will only be at most one hyphen between two numbers, then ^[0-9]+(-[0-9]+)+$ would work well. It uses the normal*(special normal*)* pattern, where normal is [0-9] and special is -.

1 Comment

This matches 1-1111111-1-1-1-1-1-1-1-1-1-1, which is not one of OP's patterns.
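The over-matching noted in the comment can be checked directly. The sketch below compares the loose pattern with an anchored version of the optional-group pattern from the other answer; the strict pattern and the variable names are illustrative choices, not part of this answer.

```python
import re

# Loose pattern from this answer: any run of digit groups joined by hyphens.
loose = re.compile(r"^[0-9]+(-[0-9]+)+$")
# Stricter alternative (an assumption here): only the two layouts from the question.
strict = re.compile(r"^(?:[0-9]-)?[0-9]{3}-[0-9]-[0-9]{4}$")

candidates = ["1-234-5-6789", "123-4-5648", "1-1111111-1-1-1-1-1-1-1-1-1-1"]

# Map each candidate to (matches loose, matches strict).
results = {
    text: (bool(loose.match(text)), bool(strict.match(text)))
    for text in candidates
}
for text, flags in results.items():
    print(text, flags)
```

The loose pattern accepts all three strings, while the strict one accepts only the first two.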
