I want to extract the lines between a keyword and a marker line in some text data. Here is my data:

CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: [email protected]
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.

I need to extract the lines under the keyword "CUSTOMER SUPPLIED DATA:" up to the point where the "*** System" line starts (that is, the lines between CUSTOMER SUPPLIED DATA: and the first *** System line).

I have tried the following code:

m = re.search(r'CUSTOMER SUPPLIED DATA:\s*([^\n]+)', dt["chat_consolidation"][546])

m.group(1)

which gives me only the first line between CUSTOMER SUPPLIED DATA: and the *** System line.

The output is like this:

[out]: - topic: Sign in & Password Support

But my required output should be like this,

[Out]: - topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: [email protected]
- I need help with: Forgot password or ID

Thanks in advance for helping me.

  • If there is a blank line between the fields you want to extract and the *** System line, you could call split() with "\n\n" as the separator and keep only the first element. Commented Nov 12, 2018 at 8:07
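The split() idea from the comment above could look roughly like this. It is only a sketch: the sample text is a shortened copy of the data in the question, and it assumes the text starts with the keyword line and that the first blank line really does mark the end of the fields.

```python
# Shortened sample of the chat text from the question (assumed layout).
text = """CUSTOMER SUPPLIED DATA:
- topic: Sign in & Password Support
- First Name: Brenda
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?
"""

# Everything before the first blank line: the keyword line plus the fields.
first_block = text.split("\n\n")[0]

# Drop the "CUSTOMER SUPPLIED DATA:" line and keep the remaining lines.
header, _, fields = first_block.partition("\n")
print(fields)
```

If other text can appear before the keyword, or the blank line is sometimes missing, this will return the wrong block, so a regex anchored on the keyword is the safer choice in that case.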

2 Answers

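You would need the third-party regex module for this, because the pattern relies on \K and \G, which the built-in re module does not support.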

x="""CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: [email protected]
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.
- topic: Sign in & Password Support
- First Name: Brenda  
  """
import regex
print(regex.findall(r"CUSTOMER SUPPLIED DATA: \n\K|\G(?!^)(-[^\n]+)\n", x, flags=regex.VERSION1))

Output: ['', '- topic: Sign in & Password Support', '- First Name: Brenda', '- Last Name: Delacruz', '- Account number: xxxxxxxxx', '- U-verse 4-digit PIN: My PIN is', '- 4 digit PIN: xxxx', '- Email: [email protected]', '- I need help with: Forgot password or ID']

See demo.

https://regex101.com/r/naH3C7/2
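If installing the regex module is not an option, a similar list of lines can probably be obtained with only the standard library by working in two steps. This is a different technique from the \K/\G pattern above, not a translation of it, and the sample text is shortened. It assumes the keyword is present and that "***" never occurs inside the customer fields.

```python
import re

# Shortened sample; the trailing "- topic" line after the System message
# is there to show that it is not picked up.
x = """CUSTOMER SUPPLIED DATA:
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?
- topic: Sign in & Password Support
"""

# Step 1: keep only the text between the keyword and the first "***" marker.
after_keyword = x.split("CUSTOMER SUPPLIED DATA:", 1)[1]
block = after_keyword.split("***", 1)[0]

# Step 2: collect every line in that block that starts with "-".
lines = re.findall(r"^-[^\n]*", block, flags=re.MULTILINE)
print(lines)
```

Unlike the findall output above, this list has no leading empty string, because the keyword itself is never part of a match.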


@vks is correct that the regex module would be better if you want to split it up like that. However, if you really just want what you ask for (a string with everything between CUSTOMER SUPPLIED DATA: and "*** System:"), changing the regexp to something like this works as well:

re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\*  System:", x, re.DOTALL).group(1)

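With "([^\n]+)" you ask it to capture everything up to the first \n, so the match stops at the end of the first line, which is probably not what you want. With re.DOTALL, "." also matches newlines, so the lazy "(.+?)" keeps going across lines until it reaches the *** System marker.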
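Here is a slightly more defensive sketch of the same re.DOTALL idea. The function name customer_block is made up for this example, the sample string is shortened, and the pattern is loosened a little (\s+ instead of two literal spaces, plus \s* to trim the blank line before the marker), so treat it as one possible variant rather than the exact pattern above.

```python
import re

def customer_block(text):
    """Return the text between the keyword and the first '*** System:' marker, or None."""
    m = re.search(
        r"CUSTOMER SUPPLIED DATA:\s*(.+?)\s*\*\*\*\s+System:",
        text,
        re.DOTALL,  # let "." match newlines so the capture can span lines
    )
    # re.search returns None when there is no match, so guard before group().
    return m.group(1) if m else None

sample = (
    "CUSTOMER SUPPLIED DATA: \n"
    "- topic: Sign in & Password Support\n"
    "- First Name: Brenda\n"
    "\n"
    "  ***  System::[chat.automatonClientOutcome] Hello!\n"
)

print(customer_block(sample))
print(customer_block("no keyword here"))
```

Checking for None matters when this is applied to many chat records, since calling .group(1) directly raises AttributeError on any record that lacks the keyword or the marker.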
