Separating data in a string into different variables

Question

Using Python 3.

I would like to parse a set of strings which are of the same format. One example I have is a list of books in the format:

title (year), author

e.g. "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"

I'd like to extract the book's title, year and author from these strings using something elegant.

Something like:

book = "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"
data = parsing_function(book, format)

where:

format is some input that describes the format of the input string. A coded way of saying "title first, then the year in the brackets, then the author after the comma". Something like format = '{title} ({year}), {author}'
data is the extracted title, year, etc. This could be a list or even better a dictionary.

This is inspired by the way Pandas parses date/time strings into datetime variables (see pandas.to_datetime). A format variable is passed in to the function to show how the date/time is represented, like:

>>> pandas.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)

Is there a similar method of separating data in a string into different variables?

I can see a way to write a function for this specific case (e.g. using str.split() on the brackets/comma and separating that way), but I'm looking for a generic function that can be used on strings in any consistent format.

Thank you

user2390182 · Accepted Answer · 2020-05-02 07:30:28Z

2

You could use regular expressions matching your structured string:

import re

book = "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"

m = re.match(r'(.+) \((.+)\), (.+)', book)
title, year, author = m.groups()
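
Note that re.match returns None when the string does not fit the pattern, so calling m.groups() on a malformed entry would raise an AttributeError. A minimal sketch of a guarded version follows; the helper name split_book is hypothetical and only for illustration:

```python
import re

def split_book(book):
    """Return (title, year, author), or None if the string does not fit."""
    m = re.match(r'(.+) \((.+)\), (.+)', book)
    if m is None:
        # The string did not follow the "title (year), author" layout.
        return None
    return m.groups()

print(split_book("The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"))
print(split_book("not a book entry"))
```

The second call prints None instead of failing, which is easier to handle when looping over many strings.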

You can name the capturing groups (the stuff between unescaped parentheses "(...)") to make things more explicit.

m = re.match(r'(?P<title>.+) \((?P<year>.+)\), (?P<author>.+)', book)
m.group("title")
# "The Hitchhiker's Guide to the Galaxy"
m.group("year")
# '1979'
m.group("author")
# 'Douglas Adams'
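
To get closer to the generic parsing_function(book, format) from the question, the named-group regex can be generated from a template string. The following is only a sketch under some assumptions: placeholders are written as {name}, each name is a valid Python identifier used once, and every field matches greedily with .+ as in the pattern above. The parameter name fmt is my own choice.

```python
import re

def parsing_function(text, fmt):
    """Parse text against a template such as '{title} ({year}), {author}'.

    Returns a dict of the extracted fields, or None if text does not fit.
    """
    # re.split with a capturing group alternates literal text (even
    # indices) with placeholder names (odd indices).
    parts = re.split(r'\{(\w+)\}', fmt)
    pattern = ''
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Literal text: escape it so "(" and ")" are matched verbatim.
            pattern += re.escape(part)
        else:
            # Placeholder: turn it into a named capturing group.
            pattern += '(?P<%s>.+)' % part
    m = re.fullmatch(pattern, text)
    return m.groupdict() if m else None

book = "The Hitchhiker's Guide to the Galaxy (1979), Douglas Adams"
data = parsing_function(book, '{title} ({year}), {author}')
print(data["title"], data["year"], data["author"])
```

Because the fields are greedy, a string in which the separators also occur inside a field (for example a title that itself contains " (") may be split differently than intended; a stricter per-field pattern such as \d{4} for the year would be needed in that case.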

edited May 2, 2020 at 7:30

answered May 2, 2020 at 7:24

user2390182


3 Comments

The Fool Over a year ago

If it's getting more complex, something like a word tokenization package, as used in natural language processing, could be a good option too.

user2390182 Over a year ago

I am not sure that you would gain much. Regexes are quite powerful in parsing strings of consistent patterns. Unless you are venturing into context-sensitive territory (where simple tokenization won't be enough, but serious NLP parsing algorithms are required) this should be fine. And it comes with the standard library.

Chris Browne Over a year ago

Thanks @schwobaseggl, exactly what I was looking for
