
I'm new to Python and trying to analyse some data. I've imported and concatenated all the CSV files in a folder into a single dataframe. I'm trying to extract part of each file name to use as a header, and after searching I found that you'd normally use regex.

The filenames look like this: 'Varying Concentration2_20190712-145158_Base Media.csv', 'Varying Concentration2_20190712-145158_250 g per l.csv', etc. So the part I'm trying to extract is after the last _ and before the .csv.

I've tried:

for fname in all_data:
    res = re.findall("(?<=_)(\w+).csv$", fname)
    if not res: continue
    print (res)

I also tried "(?<=[0-9]+_)(\w+)", but neither pattern seems to work.

The desired output would be a list containing 'Base Media', '150g per l' and so on.


4 Answers


Here is an option which avoids regex and instead uses the built-in string split method, twice:

filename = 'Varying Concentration2_20190712-145158_Base Media.csv'
parts = filename.split('_')
nameonly = parts[len(parts)-1].split('.')[0]
print(nameonly)

Output:

Base Media

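If the part of the filename after the last underscore could also contain dots, this answer would need to be adjusted, because split('.')[0] cuts at the first dot rather than at the extension.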
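As a rough sketch of that adjustment, one could strip only the final extension and split only at the final underscore. os.path.splitext is from the standard library; the header_from_filename helper name and the second sample filename (with a dot in the label) are made up here for illustration.

```python
import os

def header_from_filename(filename):
    # Drop only the final extension, so dots earlier in the name survive.
    stem, _ext = os.path.splitext(filename)
    # Keep only what follows the last underscore.
    return stem.rsplit('_', 1)[-1]

print(header_from_filename('Varying Concentration2_20190712-145158_Base Media.csv'))
# Hypothetical filename whose label contains a dot.
print(header_from_filename('Varying Concentration2_20190712-145158_0.5 g per l.csv'))
```

This still assumes the label itself never contains an underscore.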


3 Comments

Or just: extract = filename.split('_')[-1].split('.')[0]
Or: header = filename.rsplit('_', 1)[-1].rsplit('.', 1)[0]
When I saw your better versions I was just like :-O

You can do:

(?<=_)[^_]+(?=\.csv$)
  • (?<=_) is a zero-width positive lookbehind that requires a _ immediately before the match

  • [^_]+ matches one or more characters that are not _; this is our desired portion

  • (?=\.csv$) is a zero-width positive lookahead that makes sure .csv follows the match at the end of the string

If you don't want to use lookarounds, you can use a plain pattern and put the desired part in the first (and only) capturing group, then get the output with match.group(1) instead of match.group():

_([^_]+)\.csv$ 

Example:

In [38]: text = 'Varying Concentration2_20190712-145158_Base Media.csv'

In [39]: re.search(r'(?<=_)[^_]+(?=\.csv$)', text).group()
Out[39]: 'Base Media'

In [40]: text = 'Varying Concentration2_20190712-145158_250 g per l.csv'

In [41]: re.search(r'(?<=_)[^_]+(?=\.csv$)', text).group()
Out[41]: '250 g per l'
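To get the list the question asks for, the same idea can be applied in a loop. This is only a sketch: the filenames list is a stand-in for however the names were collected (the 'notes.txt' entry is invented to show the skip), and it uses the capturing-group form of the pattern.

```python
import re

# Capturing-group variant of the pattern, compiled once for reuse.
pattern = re.compile(r'_([^_]+)\.csv$')

# Hypothetical list of filenames, e.g. as collected from a folder.
filenames = [
    'Varying Concentration2_20190712-145158_Base Media.csv',
    'Varying Concentration2_20190712-145158_250 g per l.csv',
    'notes.txt',
]

headers = []
for fname in filenames:
    match = pattern.search(fname)
    if match is None:
        continue  # skip names that do not fit the pattern
    headers.append(match.group(1))

print(headers)  # ['Base Media', '250 g per l']
```

Checking for None before calling group() avoids an AttributeError on names that do not match.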


Use the following:

^.*_(.*)\.csv$

Because the first .* is greedy, this skips everything up to the last _, then captures everything between that _ and .csv.
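A short sketch of how this pattern might be used from Python, taking one of the filenames from the question as input (the label variable name is just for illustration):

```python
import re

text = 'Varying Concentration2_20190712-145158_250 g per l.csv'

# The greedy .* consumes up to the last underscore; group 1 holds the rest
# of the name without the .csv extension.
match = re.match(r'^.*_(.*)\.csv$', text)
label = match.group(1) if match else None

print(label)  # 250 g per l
```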


You can use:

_([^._]+)\.csv

and take the first capturing group.

Explanation:

_([^._]+) finds a _ and, to ensure it is the last one in the string, excludes _ from the repeated character class. It also excludes the dot, to avoid matching into the extension .csv; that is why the repetition is [^._]+. It is wrapped in parentheses, ([^._]+), making it a capturing group that you can use later. The dot before csv is escaped so that it matches a literal dot rather than any character.

In Python:

>>> import re
>>> text = 'Varying Concentration2_20190712-145158_Base Media.csv'
>>> re.search(r'_([^._]+)\.csv', text).group(1)
'Base Media'
