Using Python 3.

I would like to parse a set of strings which are of the same format. One example I have is a list of books in the format:

title (year), author

e.g. "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"

I'd like to extract the book's title, year and author from these strings using something elegant.

Something like:

book = "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"
data = parsing_function(book, format)

where:

  • format is some input that describes the format of the input string. A coded way of saying "title first, then the year in the brackets, then the author after the comma". Something like format = '{title} ({year}), {author}'
  • data is the extracted title, year, etc. This could be a list or even better a dictionary.

This is inspired by the way Pandas parses date/time strings into datetime objects with pandas.to_datetime. A format argument is passed to the function to describe how the date/time is represented, like:

>>> pandas.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)

Is there a similar method of separating data in a string into different variables?

I can see a way to write a function for this specific case (e.g. using str.split() on the brackets/comma and separating that way), but I'm looking for a generic function that can be used on strings in any consistent format.

Thank you

1 Answer

You could use regular expressions matching your structured string:

import re

book = "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"

m = re.match(r'(.+) \((.+)\), (.+)', book)
title, year, author = m.groups()
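
Note that re.match returns None when the string does not fit the pattern, in which case m.groups() raises an AttributeError. A minimal sketch of a guarded version follows. It assumes the year is always four digits, and the names BOOK_RE and parse_book are hypothetical, chosen only for illustration:

```python
import re

# Assumption: the year is exactly four digits; loosen \d{4} if that does not hold.
BOOK_RE = re.compile(r'(.+) \((\d{4})\), (.+)')


def parse_book(book):
    """Return (title, year, author), or None if the string does not fit."""
    m = BOOK_RE.fullmatch(book)
    if m is None:
        return None
    title, year, author = m.groups()
    # Convert the year to an int so callers do not have to.
    return title, int(year), author


book = "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"
print(parse_book(book))
print(parse_book("not a book entry"))
```

Using fullmatch instead of match also rejects strings with unexpected trailing text.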

You can name the capturing groups (the stuff between unescaped parentheses "(...)") to make things more explicit.

m = re.match(r'(?P<title>.+) \((?P<year>.+)\), (?P<author>.+)', book)
m.group("title")
# "The Hitchhiker's Guide to the Galaxy"
m.group("year")
# '1979'
m.group("author")
# 'Douglas Adams'
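
If you want the generic parsing_function(book, format) from the question, one possible approach is to translate the '{name}' template into a named-group regex and return m.groupdict(). This is only a sketch under some assumptions: field names are valid Python identifiers, every field matches lazily with .+?, and the names compile_format and parse_with_format are made up for this example.

```python
import re


def compile_format(fmt):
    """Turn a template like '{title} ({year}), {author}' into a compiled regex."""
    # With one capturing group, re.split alternates:
    # literal text, field name, literal text, field name, ...
    parts = re.split(r'\{(\w+)\}', fmt)
    pattern = ''
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Literal text: escape it so '(' and ')' are matched verbatim.
            pattern += re.escape(part)
        else:
            # Field name: becomes a lazy named capturing group.
            pattern += f'(?P<{part}>.+?)'
    return re.compile(pattern)


def parse_with_format(text, fmt):
    """Return a dict of the extracted fields, or None if text does not fit fmt."""
    m = compile_format(fmt).fullmatch(text)
    return m.groupdict() if m else None


book = "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"
data = parse_with_format(book, '{title} ({year}), {author}')
print(data)
# {'title': "The Hitchhiker's Guide to the Galaxy", 'year': '1979', 'author': 'Douglas Adams'}
```

All values come back as strings, and because each field is just .+?, a template with ambiguous separators may split the text in a way you did not intend.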

3 Comments

If it's getting more complex, something like a word tokenization package, as used in natural language processing, could be a good option too.
I am not sure that you would gain much. Regexes are quite powerful in parsing strings of consistent patterns. Unless you are venturing into context-sensitive territory (where simple tokenization won't be enough, but serious NLP parsing algorithms are required) this should be fine. And it comes with the standard library.
Thanks @schwobaseggl, exactly what I was looking for
