First of all, I checked these previous posts, but they did not help me: 1, 2, and 3.
I have this string (or similar ones) that needs to be handled with a regex:

"Text Table 6-2: Management of children study and actions"

  1. Detect the word Table and the word(s) before it, if any
  2. Detect the numbers that follow, which can be in one of these formats: 6 or 6-2 or 66-22 or 66-2
  3. Finally, capture the rest of the string (in this case: Management of children study and actions)

After doing so, the return value must look like this:

Items 1 and 2 as one string, and the rest as another string,
e.g. Text Table 6-2, Management of children study and actions

Below is my code:

import re

mystr = "Text Table 6-2:    Management of children study and actions"


if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
    parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
    print(parts_of_title)
    print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])

The first requirement matches as it should, but the second does not. So I changed the code to use compile, but then the regex behaved differently. The code looks like this:

import re

mystr = "Text Table 6-2:    Management of children study and actions"


if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
    parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
    print(parts_of_title)

Output:

True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']

So based on this, how can I achieve this while keeping the code clean and readable? And why does using compile change the matching?

2 Answers

The matching changes because:

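  • In the first part, you call .group().split(), where .group() returns the full match as a string, so this is str.split splitting on whitespace.

  • In the second part, you call re.compile("...").split(), where re.compile returns a regular expression object. Its split method splits the string at each match and also includes the text of every capture group in the result.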
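
To make the difference concrete, here is a small sketch that runs both calls side by side. It reuses the pattern from the question; the shortened input string and the variable names are only for illustration.

```python
import re

# Pattern copied from the question; note the single [0-9] that sits
# outside the last capture group.
pattern = re.compile(
    r"([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?"
)
s = "Text Table 6-2: Management"

# str.split on the full match: plain whitespace splitting.
full_match = pattern.search(s).group()
words = full_match.split()

# Pattern.split on the input: the text around the match, plus the
# content of each capture group in between.
pieces = pattern.split(s)

print(words)   # ['Text', 'Table', '6-2']
print(pieces)  # ['', 'Text ', 'Table', '-2', ': Management']
```

The digit 6 is missing from the second list because it is consumed by the match but is not inside any capture group.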

In the pattern, [a-zA-Z0-9]+[ ] matches only a single word. Also, in [0-9]([-][0-9]+)? the first (single) digit is outside the capture group, which is why split returns '-2' instead of '6-2'. It also means the pattern accepts only one digit before the hyphen.

You could write the pattern with 4 capture groups:

^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)

See a regex demo.

import re

pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2:    Management of children study and actions"
m = re.match(pattern, s)
if m:
    print(m.groups())

Output

('Text ', 'Table', '6-2', 'Management of children study and actions')

If you want points 1 and 2 as one string, you can use 2 capture groups instead:

^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)

Regex demo

The output will be

('Text Table 6-2', 'Management of children study and actions')
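
As a minimal sketch of how the two-group pattern could be wrapped to return the two strings the question asks for (the names TITLE_RE and split_title are hypothetical, chosen only for this example):

```python
import re

# Two-group pattern from the answer above, compiled once for reuse.
TITLE_RE = re.compile(
    r"^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)"
)


def split_title(title):
    """Return (label, description), or None when the title does not match."""
    m = TITLE_RE.match(title)
    if m is None:
        return None
    return m.group(1), m.group(2)


result = split_title("Text Table 6-2:    Management of children study and actions")
if result:
    print(", ".join(result))
```

Returning None for a non-matching title keeps the caller's check explicit, which avoids running the regex twice as in the original match-then-search code.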

2 Comments

Is there a way for me to learn to write regex the way you did? @The fourth bird
@Ahmad There are some very informative sites like rexegg.com/regex-quickstart.html and regular-expressions.info

You already have answers, but I wanted to try your problem for practice, so here is what I came up with in case you are interested:

((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)

And here is the link to my tests: https://regex101.com/r/7VpPM2/1
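
For comparison, a rough sketch of how this pattern could be used from Python. It assumes re.search and the sample string from the question; the variable names are illustrative. Unlike the two-group pattern, the number lands in its own group, and because the pattern ends with a literal ": " any extra spaces after the colon stay in the last group, so it is stripped here.

```python
import re

# Pattern from this answer: label, number, and description as three groups.
pattern = re.compile(
    r"((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?"
    r"((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)"
)
s = "Text Table 6-2:    Management of children study and actions"

m = pattern.search(s)
if m:
    label, number, rest = m.groups()
    # Rebuild the "label number" string and drop leftover leading spaces.
    print(f"{label} {number}, {rest.strip()}")
```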
