1

I have a messy list of strings (list_strings). I can remove the unwanted leading characters with a regex, but I am struggling to also remove the closing brackets ]. How can I remove those as well? I think I am very close.

import re

# the list to clean
list_strings = ['[ABC1: text1]', '[[DC: this is a text]]', '[ABC-O: potatoes]', '[[C-DF: hello]]']

# remove everything from the opening [ up to and including the colon
for string in list_strings:
  cleaned = re.sub(r'[\[A-Z\d\-]+:\s*', '', string)
  print(cleaned)

# current output

>>>text1]
>>>this is a text]]
>>>potatoes]
>>>hello]]

Desired output:

text1
this is a text
potatoes
hello

4 Answers

4

Here is a fix for the OP's own attempt. Your regex already does most of the work; the only missing piece is an alternation (|) whose second branch matches one or more ] characters at the end of the string, so they are substituted too.

import re
list_strings = ['[ABC1: text1]', '[[DC: this is a text]]', '[ABC-O: potatoes]', '[[C-DF: hello]]']
for string in list_strings:
  cleaned = re.sub(r'[\[A-Z\d\-]+:\s+|\]+$', '', string)
  print(cleaned)
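
To check the alternation against the sample data, here is a minimal sketch. The helper name strip_tag is hypothetical, and the sketch uses \s* (as in the question) rather than \s+, on the assumption that a label with no space after the colon should be handled too.

```python
import re

# Left branch: leading brackets and label up to the colon plus optional whitespace.
# Right branch: a run of ] at the very end of the string.
PATTERN = re.compile(r'[\[A-Z\d\-]+:\s*|\]+$')

def strip_tag(s):
    """Remove the '[LABEL: ' prefix and any trailing ']' characters."""
    return PATTERN.sub('', s)

list_strings = ['[ABC1: text1]', '[[DC: this is a text]]', '[ABC-O: potatoes]', '[[C-DF: hello]]']
cleaned = [strip_tag(s) for s in list_strings]
print(cleaned)  # ['text1', 'this is a text', 'potatoes', 'hello']
```

Because the left branch is not anchored, it would also match an uppercase word followed by a colon inside the text itself; anchoring it with ^ avoids that.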

3

I'd skip regex here and use split() and rstrip() instead:

list_strings = ['[ABC1: text1]', '[[DC: this is a text]]', '[ABC-O: potatoes]', '[[C-DF: hello]]']

cleaned = [s.split(': ')[1].rstrip(']') for s in list_strings]
print(cleaned) # ['text1', 'this is a text', 'potatoes', 'hello']
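
If the text itself may contain ': ', indexing the result of split(': ') with [1] would cut it short. A possible variant, sketched with str.partition, splits at the first separator only. The second sample string and the helper name clean are made up for illustration.

```python
# The second string is a hypothetical input whose text contains another ': '
list_strings = ['[ABC1: text1]', '[[DC: note: keep this]]']

def clean(s):
    # partition splits at the first ': ' only, so later colons stay in the text
    _, sep, rest = s.partition(': ')
    # if there is no ': ' at all, fall back to the whole string
    text = rest if sep else s
    # rstrip(']') removes only the trailing run of ] characters
    return text.rstrip(']')

cleaned = [clean(s) for s in list_strings]
print(cleaned)  # ['text1', 'note: keep this']
```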

3

I would use a list comprehension here:

list_strings = ['[ABC1: text1]', '[[DC: this is a text]]', '[ABC-O: potatoes]', '[[C-DF: hello]]']
cleaned = [x.split(':')[1].strip().replace(']', '') for x in list_strings]
print(cleaned)  # ['text1', 'this is a text', 'potatoes', 'hello']
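
One thing to be aware of: replace(']', '') removes every ] in the string, not only the trailing ones. The sketch below contrasts it with rstrip(']') on a made-up input that has a bracket inside the text; the variable names are illustrative only.

```python
# Hypothetical input with a bracket pair inside the text
s = '[[DC: see item[0] here]]'

# maxsplit=1 splits at the first colon only, keeping any later colons
body = s.split(':', 1)[1].strip()

with_replace = body.replace(']', '')  # drops the interior ] as well
with_rstrip = body.rstrip(']')        # drops only the trailing ] run

print(with_replace)  # see item[0 here
print(with_rstrip)   # see item[0] here
```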

2

You can use

cleaned = re.sub(r'^\[+[A-Z\d-]+:\s*|]+$', '', string)

Alternatively, to make sure the string starts with one or more [ followed by a label and a colon, and ends with one or more ], you may use

cleaned = re.sub(r'^\[+[A-Z\d-]+:\s*(.*?)\s*]+$', r'\1', string)

And if you simply want to extract the text inside the brackets, you may use

# First match only
m = re.search(r'\[+[A-Z\d-]+:\s*(.*?)\s*]', string)
if m:
    print(m.group(1))

# All matches
matches = re.findall(r'\[+[A-Z\d-]+:\s*(.*?)\s*]', string)
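
As a rough illustration of the difference between the two calls, the sketch below runs both on a single string that holds several bracketed items. The input string and the variable names are assumptions made for the example.

```python
import re

# Hypothetical input: several bracketed items in one string
text = '[ABC1: text1] and [[DC: this is a text]]'

pattern = r'\[+[A-Z\d-]+:\s*(.*?)\s*]'

# re.search stops at the first match; group(1) is the captured text
m = re.search(pattern, text)
first = m.group(1) if m else None

# With one capturing group, re.findall returns a list of the group-1 values
all_texts = re.findall(pattern, text)

print(first)      # text1
print(all_texts)  # ['text1', 'this is a text']
```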

Details

  • ^ - start of string
  • \[+ - one or more [ chars
  • [A-Z\d-]+ - one or more uppercase ASCII letters, digits or - chars
  • : - a colon
  • \s* - zero or more whitespace chars
  • | - or
  • ]+$ - one or more ] chars at the end of string.

Also, (.*?) is capturing group 1; it matches zero or more chars other than line break chars, as few as possible. \1 in the replacement refers to the text captured by this group.
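
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only a sketch of the same two patterns; the names PATTERN, cleaned and extracted are illustrative.

```python
import re

# The substitution pattern, one commented piece per line
PATTERN = re.compile(r'''
    ^\[+          # start of string, then one or more opening brackets
    [A-Z\d-]+     # label: uppercase ASCII letters, digits or hyphens
    :             # a colon
    \s*           # zero or more whitespace chars
    |             # or
    \]+$          # one or more closing brackets at the end of the string
''', re.VERBOSE)

list_strings = ['[ABC1: text1]', '[[DC: this is a text]]', '[ABC-O: potatoes]', '[[C-DF: hello]]']
cleaned = [PATTERN.sub('', s) for s in list_strings]
print(cleaned)  # ['text1', 'this is a text', 'potatoes', 'hello']

# Capture-group variant: the whole string is matched and replaced by group 1
extracted = re.sub(r'^\[+[A-Z\d-]+:\s*(.*?)\s*]+$', r'\1', '[[C-DF: hello ]]')
print(extracted)  # hello
```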
