
Here's the string:

SCOPE OF WORK: Supply &  Flensburg House, MMDA Colony,     PAN#: AAYCS8310G
installation Arumbakkam,Chennai,Tamil Nadu,
  xxxxxx

The things that will change in the string are:

Flensburg House, MMDA Colony,

and

Arumbakkam,Chennai,Tamil Nadu,

These parts of the string can contain letters, digits, commas, #, - and _

The remaining parts of the string will stay exactly as they are, including the spacing.

Here's the regex I am using:

SCOPE OF WORK: Supply &  [A-Za-z,\s]]*PAN#: [A-Z]{5}[0-9]{4}[A-Z]{1}\n    installation [A-Za-z]\n      xxxxxx

Ultimately what I need to obtain is:

Flensburg House, MMDA Colony,     
installation Arumbakkam,Chennai,Tamil Nadu,

I don't think my regex is entirely right and I need help on how to go about this.

1 Answer

A few things I noticed about your current pattern:

  • You are trying to match more space characters than are present in the text;
  • The character classes for the two substrings differ: the 2nd one is missing the space and the comma and, having no quantifier, matches only a single character. Both are also missing the # symbol and digits;

Assuming you just need to get these two substrings in capture groups (excluding the trailing comma), try:
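To make the second point concrete, here is a minimal sketch comparing the character classes. The sample value 'Flat #12, MMDA Colony,' is made up for illustration and does not come from the question:

```python
import re

# The question's second class, [A-Za-z], has no quantifier,
# so it can only ever match exactly one letter.
one_letter = re.fullmatch(r'[A-Za-z]', 'Arumbakkam,Chennai,Tamil Nadu,')

# The first class, once quantified, still rejects digits and '#'.
# 'Flat #12, MMDA Colony,' is a hypothetical value for illustration.
no_digits = re.fullmatch(r'[A-Za-z,\s]+', 'Flat #12, MMDA Colony,')

# \w (letters, digits, underscore) plus ',', ' ', '#' and '-'
# covers every character the question lists.
full = re.fullmatch(r'[\w, #-]+', 'Flat #12, MMDA Colony,')

print(one_letter, no_digits, full is not None)
```

Only the last class accepts the whole value, which is why the pattern suggested in this answer uses it for both groups.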

^SCOPE OF WORK: Supply &  ([\w, #-]+),\s+PAN#: [A-Z]{5}[0-9]{4}[A-Z]\s+installation ([\w, #-]+),\s+x{6}$


  • ^ - Start-line anchor;
  • SCOPE OF WORK: Supply & - A literal match of this substring including the two trailing spaces;
  • ([\w, #-]+) - A 1st capture group to match 1+ characters from the given class, where \w covers [A-Za-z0-9_] (for Python 3 str patterns it also matches other Unicode word characters unless re.ASCII is set), so every character you mentioned is included;
  • ,\s+PAN#: - A literal match of this substring, including the leading comma and 1+ whitespace characters;
  • [A-Z]{5}[0-9]{4}[A-Z] - Verification that what follows is 5 uppercase letters, 4 digits and a single uppercase letter (no need to quantify a single character);
  • \s+installation - 1+ whitespace characters (including the newline), up to the literal word "installation" and its trailing space;
  • ([\w, #-]+) - A 2nd capture group to match the same pattern as the 1st group;
  • ,\s+x{6} - Match the trailing comma, 1+ whitespace characters and 6 trailing x's;
  • $ - End-line anchor.

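import re

# Sample input from the question (three lines, spacing preserved)
s = """SCOPE OF WORK: Supply &  Flensburg House, MMDA Colony,     PAN#: AAYCS8310G
installation Arumbakkam,Chennai,Tamil Nadu,
  xxxxxx"""

# With two capture groups, findall returns one (group 1, group 2) tuple per match
matches = re.findall(r'^SCOPE OF WORK: Supply &  ([\w, #-]+),\s+PAN#: [A-Z]{5}[0-9]{4}[A-Z]\s+installation ([\w, #-]+),\s+x{6}$', s)

print(matches)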

Prints:

[('Flensburg House, MMDA Colony', 'Arumbakkam,Chennai,Tamil Nadu')]
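If the text holds several such blocks, the same pattern can be compiled with re.MULTILINE so that ^ and $ anchor at every line rather than only at the ends of the whole string. The following is a sketch under assumptions: the second record and the group names first and second are hypothetical, added only for illustration.

```python
import re

# Same pattern as above, with named groups (names are hypothetical)
# and re.MULTILINE so ^ and $ match at each line boundary.
PATTERN = re.compile(
    r'^SCOPE OF WORK: Supply &  (?P<first>[\w, #-]+),\s+PAN#: [A-Z]{5}[0-9]{4}[A-Z]\s+'
    r'installation (?P<second>[\w, #-]+),\s+x{6}$',
    re.MULTILINE,
)

# First record is from the question; the second one is made up.
text = """SCOPE OF WORK: Supply &  Flensburg House, MMDA Colony,     PAN#: AAYCS8310G
installation Arumbakkam,Chennai,Tamil Nadu,
  xxxxxx
SCOPE OF WORK: Supply &  Flat #12, 2nd-Cross,     PAN#: ABCDE1234F
installation Block_B,Chennai,Tamil Nadu,
  xxxxxx"""

# One dict per matched block, keyed by group name.
records = [m.groupdict() for m in PATTERN.finditer(text)]

for record in records:
    print(record)
```

Named groups are optional here; they just make each result self-describing when there is more than one match to handle.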