I'm using Notepad++'s regular expression search to find all strings in a .txt document that do not contain a specific value (HIJ in the example below), where every string begins with the same value (ABC in the example below).

How would I go about doing this?

Example

  • Every String starts with ABC
  • ABC never appears in a string other than at the beginning; ABCABC123 would be two strings: "ABC" and "ABC123"
  • HIJ may appear multiple times in a string
  • I need to find the strings that do not contain HIJ
  • Input is one long file with no line breaks, but does contain special characters (*, ^, @, ~, :) and spaces

Example Input:

ABC1234HIJ56ABC7@HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J

The example input would be viewed as the following strings:

ABC1234HIJ56
ABC7@HIJ
ABC89
ABCHIJ0ABE:HIJ
ABC12~34HI456J

The third and fifth strings both lack "HIJ" and are therefore included in the output; all others are excluded.

Example desired output:

ABC89
ABC12~34HI456J
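
The segmentation described above can be cross-checked outside Notepad++. Here is a minimal Python sketch, assuming the whole file has been read into one string and that nothing precedes the first ABC. The function name `strings_without` is made up for illustration:

```python
def strings_without(text, start="ABC", banned="HIJ"):
    """Split text into strings beginning with `start`; keep those lacking `banned`."""
    # Anything before the first `start` is dropped (assumed to be empty here).
    pieces = [start + rest for rest in text.split(start)[1:]]
    return [piece for piece in pieces if banned not in piece]


example = "ABC1234HIJ56ABC7@HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"
print(strings_without(example))  # ['ABC89', 'ABC12~34HI456J']
```

This uses plain string splitting rather than a regular expression, so it only serves as a reference result to compare a regex against.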

I am 99% new to RegEx and will be looking more into it in the future. My job description changed suddenly earlier this week when someone else in the company left, so I have been doing this manually: searching for (ABC|HIJ) and going through the search results looking for "ABC" appearing twice in a row. Supposedly the former employee was able to do this in an automated way, but left no documentation.

Any help would be appreciated!

This question repeats a prior question of mine, which I formatted very badly and which seems to have sunk out of sight.

3 Answers

2

You can find the items you want with:

ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+(?=ABC|$)

Note: in this first pattern, you can replace (?=ABC|$) with (?!HIJ), because the possessive group can only stop before HIJ, before ABC, or at the end of the string.

pattern details:

ABC
(?:            # non-capturing group
    [^HA]+     # one or more characters that are neither H nor A
  |            # OR
    H(?!IJ)    # an H not followed by IJ
  |            # OR
    A(?!BC)    # an A not followed by BC
)*+            # repeat the group zero or more times (possessive: no backtracking)
(?=ABC|$)      # followed by "ABC" or the end of the string
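
To check the pattern outside Notepad++, here is a small Python sketch. It assumes Python's `re` module and swaps the possessive `*+` for a plain `*`, because `re` only accepts possessive quantifiers from Python 3.11 on. The matches are the same, since none of the alternatives can consume HIJ or the start of ABC however the engine backtracks. The variable names are illustrative:

```python
import re

# Same pattern as above, with `*` in place of the possessive `*+`.
pattern = re.compile(r"ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*(?=ABC|$)")

example = "ABC1234HIJ56ABC7@HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"
matches = pattern.findall(example)
print(matches)  # ['ABC89', 'ABC12~34HI456J']
```

Without the possessive quantifier, a failing attempt can backtrack heavily on long runs of characters, so for large files the original `*+` form is the safer choice.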

Note: to remove everything except the items you want, you can use this search and replace:

search: (?:ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+HIJ.*?(?=ABC|$))+|(?=ABC)
replace: \r\n

5 Comments

You need the /g modifier to get multiple matches.
@blackmind: No, since it is to be used in Notepad++, the user has the choice of clicking a "find" or a "find all" button.
Missed that part. Well, FYI to anyone else not using Notepad++.
@blackmind: Note that the g flag doesn't always exist (PHP, Python); in that case, whether the search is global is determined by the function you use.
THANK YOU! This works perfectly! Plus I understand how it works! Great answer!
0

You could use this pattern:

(ABC(?:(?!HIJ).)*?)(?=ABC|\R)

(               # Capturing Group (1)
  ABC           # "ABC"
  (?:           # Non Capturing Group
    (?!         # Negative Look-Ahead
      HIJ       # "HIJ"
    )           # End of Negative Look-Ahead
    .           # Any character except line break
  )             # End of Non Capturing Group
  *?            # (zero or more)(lazy)
)               # End of Capturing Group (1)
(?=             # Look-Ahead
  ABC           # "ABC"
  |             # OR
  \R            # <line break>
)               # End of Look-Ahead

0

You can use the following expression to match your criterion:

(^ABC(?:(?!HIJ).)*$)

This matches a line that starts with ABC and uses a negative lookahead before each character to make sure HIJ never appears. The pattern works when each string is on its own line.

For input on a single line (as in your question), a slight modification of this works:

(ABC(?:(?!HIJ).)*?)(?=ABC|$)
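
Both forms can be tried in Python as a rough check. This sketch assumes the `re` module, with `re.MULTILINE` standing in for Notepad++'s line-based `^` and `$`. The sample data is the question's example and the variable names are illustrative:

```python
import re

example = "ABC1234HIJ56ABC7@HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"

# Single-line form: lazily extend until the next ABC or the end of the input.
single_line = re.compile(r"(ABC(?:(?!HIJ).)*?)(?=ABC|$)")
single_hits = single_line.findall(example)
print(single_hits)  # ['ABC89', 'ABC12~34HI456J']

# Separated form: the same strings, one per line.
separated = "ABC1234HIJ56\nABC7@HIJ\nABC89\nABCHIJ0ABE:HIJ\nABC12~34HI456J"
per_line = re.compile(r"(^ABC(?:(?!HIJ).)*$)", re.MULTILINE)
line_hits = per_line.findall(separated)
print(line_hits)  # ['ABC89', 'ABC12~34HI456J']
```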
