Data Set

Cider
631

Spruce
871

Honda
18813

Nissan
3292

Pine
10621

Walnut
10301

Code

#!/usr/bin/python
import re

text = "Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\nNissan\n3292\n\nPine\n10621\n\nWalnut\n10301\n\n"

f1 = re.findall(r"(Cider|Pine)\n(.*)",text)

print(f1)

Current Result

[('Cider', '631'), ('Pine', '10621')]

Question:

How do I change the regex so that it matches everything except several specified strings, e.g. (Honda|Nissan)?

Desired Result

[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
Comment: Exclude them: ^(?!Honda|Nissan)[a-zA-Z]+\n\d+ (Commented Oct 14, 2021 at 15:32)

2 Answers

You can use a negative lookahead to skip any line that consists only of one of the names or only of digits, and then match two lines, the first of which starts with a non-whitespace character.

^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)

The pattern matches:

  • ^ Start of the line (with re.MULTILINE)
  • (?! Negative lookahead, assert that what is directly to the right is not
    • (?:Honda|Nissan|\d+)$ Any of the alternatives, followed by the end of the line
  • ) Close lookahead
  • (\S.*) Capture group 1, match a non-whitespace char followed by the rest of the line
  • \n Match a newline
  • (.*) Capture group 2, match the rest of the next line (zero or more characters except a newline)

import re

text = ("Cider\n"
            "631\n\n"
            "Spruce\n"
            "871\n\n"
            "Honda\n"
            "18813\n\n"
            "Nissan\n"
            "3292\n\n"
            "Pine\n"
            "10621\n\n"
            "Walnut\n"
            "10301")
f1 = re.findall(r"^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)", text, re.MULTILINE)

print(f1)

Output

[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
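
The role of the \d+ alternative in the lookahead is easiest to see by leaving it out. The sketch below is only an illustration, assuming the same sample text as above; the names with_digits and without_digits are made up for the comparison.

```python
import re

text = ("Cider\n631\n\n"
        "Spruce\n871\n\n"
        "Honda\n18813\n\n"
        "Nissan\n3292\n\n"
        "Pine\n10621\n\n"
        "Walnut\n10301")

# Lookahead as in the answer: skip the two names and digit-only lines.
with_digits = re.findall(
    r"^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)", text, re.MULTILINE)

# Hypothetical variant without \d+: after "Honda" is skipped, the line
# "18813" is free to match as a first line of a pair.
without_digits = re.findall(
    r"^(?!(?:Honda|Nissan)$)(\S.*)\n(.*)", text, re.MULTILINE)

print(with_digits)
print(without_digits)
```

Without \d+, the number lines that follow the excluded names are picked up as group 1 and paired with the empty line after them, so unwanted tuples such as ('18813', '') show up in the second result.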

If the line should start with an uppercase char A-Z and the next line should consist of only digits:

^(?!Honda|Nissan)([A-Z].*)\n(\d+)$

This pattern matches:

  • ^ Start of the line (with re.MULTILINE)
  • (?!Honda|Nissan) Negative lookahead, assert not Honda or Nissan directly to the right
  • ([A-Z].*) Capture group 1, match an uppercase char A-Z followed by the rest of the line
  • \n Match a newline
  • (\d+) Capture group 2, match 1+ digits
  • $ End of the line (with re.MULTILINE)
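
A minimal sketch of this second pattern in use, assuming the sample data from the question. The helper find_pairs and its excluded parameter are hypothetical names introduced here for illustration; re.escape is used so that names containing regex metacharacters would still be treated literally.

```python
import re

def find_pairs(text, excluded):
    """Return (name, number) pairs whose name does not start with an excluded word."""
    # Build "Honda|Nissan" from the list; an empty list means no lookahead at all.
    alternatives = "|".join(re.escape(name) for name in excluded)
    lookahead = f"(?!{alternatives})" if excluded else ""
    pattern = rf"^{lookahead}([A-Z].*)\n(\d+)$"
    # re.MULTILINE makes ^ and $ match at every line, not only at the string ends.
    return re.findall(pattern, text, re.MULTILINE)

text = ("Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\n"
        "Nissan\n3292\n\nPine\n10621\n\nWalnut\n10301")

pairs = find_pairs(text, ["Honda", "Nissan"])
print(pairs)
```

Because this lookahead has no $ inside it, it rejects any line that merely starts with an excluded word (for example a name like Hondas), which is stricter than the first pattern.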

Comments

@Lacer You have to add re.MULTILINE
that worked! thanks.
@Lacer You are welcome. If all strings start with an uppercase char A-Z and the second line should have only digits ^(?!Honda|Nissan)([A-Z].*)\n(\d+)$ regex101.com/r/CNkdLD/1
@Lacer I have added a breakdown of the patterns in the answer.
thank you for the help and the explanation. Greatly appreciate it.

Be careful with the caret ‘^’ symbol here: it does not invert a group.

f1 = re.findall(r"(\s?^(Cider|Pine))\n(.*)",text)

Keep in mind that the caret has two meanings in a regex. Inside a character class, as in [^0-9], it negates the class. Anywhere else it is an anchor that asks “does the match start at the beginning of the string?” (or at the beginning of a line with re.MULTILINE).

That is why putting an optional whitespace character in front of the caret, as in the pattern above, does not turn it into an inverse operator. The caret is still an anchor, so the pattern matches only the listed names rather than everything except them. To exclude whole words, use a negative lookahead such as (?!Honda|Nissan).
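
The two meanings of the caret can be checked directly. This is a minimal sketch, assuming Python's re module and a shortened copy of the question's data; the variable names are illustrative only.

```python
import re

text = "Cider\n631\n\nHonda\n18813\n\nPine\n10621"

# Inside a character class the caret negates: [^0-9\n]+ matches runs of
# characters that are neither digits nor newlines.
non_digits = re.findall(r"[^0-9\n]+", text)

# Outside a character class the caret is an anchor (start of a line with
# re.MULTILINE), so this still selects Cider and Pine rather than excluding them.
anchored = re.findall(r"^(Cider|Pine)\n(.*)", text, re.MULTILINE)

# Excluding whole words takes a negative lookahead instead.
excluded = re.findall(r"^(?!Cider|Pine)([A-Za-z]+)\n(.*)", text, re.MULTILINE)

print(non_digits)
print(anchored)
print(excluded)
```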
