I have a bytes object:

b'\n\x1b\t\xff\xff\xff\x7f@^\x8a?\x11\x00\x00\x00@\xe8HL\xbf\x19\x00\x00\x00\x00\x95\xb0\xd9?\x127\r\xc9\xd5"=\x15\xc9\xd5"=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07Bollard0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?' b'\n\x1b\t\x01\x00\x00\x00\xa4\x9b\xb0\xbf\x11\x01\x00\x00\xc0/\xe3\x90?\x19\x01\x00\x00\xa0U\xc4\xef?\x127\r|\x934=\x15|\x934=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07TV Series0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?'

Using a regex, I want to extract the words (in this case "Movies", "Movies" and "TV Series").

What I tried:

Extract word from string Using python regex

Extracting words from a string, removing punctuation and returning a list with separated words

Python regex for finding all words in a string

  • It is not clear what you are doing and why you expect just Movies and TV Series. Please show your code and explain what does not work. Commented Jul 1, 2020 at 9:12

1 Answer

Usually you would convert bytes into a string using the .decode() method. However, your bytes contain values (such as \xff) that are not valid ASCII or UTF-8, so decoding with either codec fails.
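To see the failure concretely, here is a minimal sketch that tries to decode just the first few bytes of the data. The names `sample` and `decoded_ok` are made up for this illustration.

```python
# First seven bytes of the data from the question; \xff is never valid in UTF-8.
sample = b'\n\x1b\t\xff\xff\xff\x7f'

try:
    sample.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(exc)  # reports the offending byte and its position
```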

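My suggestion is to go through each byte and map it to the character with the same code point. For values below 128 that is the ASCII character; the other bytes become Latin-1 characters or control characters, which you can filter out afterwards.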

raw= b'\n\x1b\t\xff\xff\xff\x7f@^\x8a?\x11\x00\x00\x00@\xe8HL\xbf\x19\x00\x00\x00\x00\x95\xb0\xd9?\x127\r\xc9\xd5"=\x15\xc9\xd5"=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07Bollard0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?' b'\n\x1b\t\x01\x00\x00\x00\xa4\x9b\xb0\xbf\x11\x01\x00\x00\xc0/\xe3\x90?\x19\x01\x00\x00\xa0U\xc4\xef?\x127\r|\x934=\x15|\x934=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07TV Series0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?'
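# chr(b) maps each byte value (0-255) to the character with that code point,
# so every byte survives, including the non-ASCII ones.
string = "".join(chr(b) for b in raw)
print(string)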
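As a side note, mapping every byte through chr() should give the same result as decoding with the Latin-1 codec, which never fails because it assigns a character to all 256 byte values. A small sketch on a shortened sample (the variable names are only for illustration):

```python
# A short slice in the style of the question's data.
sample = b'\n\x0fMovies"\x07TV Series0\x01\xf0?'

looped = "".join(chr(b) for b in sample)   # byte-by-byte, as in the answer
decoded = sample.decode("latin-1")         # one-step equivalent

print(looped == decoded)
```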

After that, you can use a regex to find the words. It's usually a good idea to set a minimum word length so that short runs of noise are dropped.

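import re
# Raw string so that \W reaches the regex engine unchanged
for word in re.split(r'\W', string):
    # Keep only candidates longer than 3 characters
    if len(word) > 3:
        print(word)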

That will give you:

Movies
Bollard0
Movies
Series0

You have not mentioned "Bollard0", but I assume that was a mistake.

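If you want spaces to be part of your string, you'll need to adapt the regex. \W matches any non-word character, and a space is one of them, which is why "TV Series" is split and only "Series0" passes the length filter. The trailing 0 stays attached because digits count as word characters.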
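One possible adaptation, sketched here under the assumption that the wanted words consist only of ASCII letters separated by single spaces, is to search the bytes directly with a bytes pattern. The helper name `extract_words` and its `min_len` parameter are hypothetical.

```python
import re

# Runs of ASCII letters, optionally joined by single spaces (e.g. b"TV Series").
WORD_RE = re.compile(rb'[A-Za-z]+(?: [A-Za-z]+)*')

def extract_words(data, min_len=4):
    """Return letter-only words of at least min_len characters found in data."""
    return [m.decode("ascii") for m in WORD_RE.findall(data) if len(m) >= min_len]

# Shortened sample in the style of the question's data.
sample = b'\n\x0fMovies"\x07TV Series0\x01\x11\x00\xf0?'
print(extract_words(sample))  # ['Movies', 'TV Series']
```

Because digits are not in the character class, the trailing 0 is dropped, and short accidental letter runs such as "HL" fall below the length limit. On the full data from the question this also returns "Bollard".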
