Similar strings: creating 3 separate dataframes using RegEx and Pandas in Python

Question

I am trying to create groups for the numbers in several very similar strings, but I can't seem to separate the expressions; I only started learning RegEx recently. I want 3 dataframes: one each for "V1", "V2", and "V3". I only want the first value within each bracket. For example, in V1, row 1-22, the first pair is (75.43,0.0) and I just want 75.43. Hopefully that makes sense; I'm a bit stuck.

TEXT,TEXT,20190726,TEXT,TEXT00000,,NORMAL;
*
TEXT,TEXT-LT.V1,,,4.0,TEXT,NORMAL;
1-22,,{(75.43,0.0),(75.43,110.0),(75.45,119.0),(96.54,139.0),(109.25,159.0)},
23,,{(20.82,0.0),(20.82,110.0),(20.84,119.0),(41.93,139.0),(54.64,159.0)},
24,,{(81.26,0.0),(81.26,110.0),(81.28,119.0),(102.37,139.0),(115.08,159.0)},
*
*
TEXT,TEXT,20190726,TEXT,TEXT00000,,NORMAL;
*
TEXT,TEXT-TEXT.V2,,,4.0,TEXT,NORMAL;
1-22,,{(74.93,0.0),(74.93,110.0),(74.95,119.0),(74.95,139.0),(74.95,163.0)},
23,,{(24.98,0.0),(24.98,110.0),(25.00,119.0),(25.00,139.0),(25.00,163.0)},
24,,{(80.76,0.0),(80.76,110.0),(80.78,119.0),(80.78,139.0),(80.78,163.0)},
*
*
TEXT,TEXT,20190726,TEXT,TEXT00000,,NORMAL;
*
TEXT,TEXT-TEXT.V3,,,2.0,TEXT,NORMAL;
1-22,,{(74.94,0.0),(74.94,70.0),(75.46,147.0),(96.54,167.0),(109.25,186.0),(109.27,210.0)},
23-24,,{(80.77,0.0),(80.77,70.0),(81.29,147.0),(102.37,167.0),(115.08,186.0),(115.10,210.0)},
*

What I tried
import re
import pandas as pd

# Read the whole file into one string
with open("TextFile.txt", "r") as f:
    TextFile_str = f.read()
# Captures the hour label, five mandatory first values, and an optional sixth
Value_Only = re.compile(r'(\d+-?\d+),+\{\((\d+\.\d+),\d+\.\d+\),\((\d+\.\d+),\d+\.\d+\),\((\d+\.\d+),\d+\.\d+\),\((\d+\.\d+),\d+\.\d+\),\((\d+\.\d+),\d+\.\d+\),*\(*(\d*\.*\d*),*\d*\.*\d*\)*\}*')
match_Value = Value_Only.findall(TextFile_str)
match_Value_df = pd.DataFrame(match_Value)
match_Value_df.columns = ['Hour', 'Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6']

How it looks
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0   1-22   75.43   75.43   75.45   96.54  109.25        
1     23   20.82   20.82   20.84   41.93   54.64        
2     24   81.26   81.26   81.28  102.37  115.08        
3   1-22   74.93   74.93   74.95   74.95   74.95        
4     23   24.98   24.98   25.00   25.00   25.00        
5     24   80.76   80.76   80.78   80.78   80.78        
6   1-22   74.94   74.94   75.46   96.54  109.25  109.27
7  23-24   80.77   80.77   81.29  102.37  115.08  115.10

Ideally I want to have 3 separate dataframes for V1, V2, and V3.

Expected Result
Dataframe 1 
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0   1-22   75.43   75.43   75.45   96.54  109.25        
1     23   20.82   20.82   20.84   41.93   54.64        
2     24   81.26   81.26   81.28  102.37  115.08

Dataframe 2
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0   1-22   74.93   74.93   74.95   74.95   74.95        
1     23   24.98   24.98   25.00   25.00   25.00        
2     24   80.76   80.76   80.78   80.78   80.78 

Dataframe 3
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0   1-22   74.94   74.94   75.46   96.54  109.25  109.27
1  23-24   80.77   80.77   81.29  102.37  115.08  115.10

I haven't run the code fully, but you might try shortening it with the pattern r'((\d+-?\d+),+\{((((\d+\.\d+),\d+\.\d+)),)+\},)' and then creating three dataframes based on the indices of the matches in your findall. I always find the site regexpal.com helpful when testing regex code.
– Todd Burus, Commented Jul 26, 2019 at 14:54
I thought about creating dataframes based on the indices of my findall matches before, but sometimes the text file will have more indices than this. For example, the hours could be 1-2, 3-4, 5-6, 7-12, 12-24. Thanks, I'll try shortening it. I've also recently found out about the site regexr.com. Both seem to work just fine!
– flamingbird123, Commented Jul 26, 2019 at 14:58
You could add an alternative for the V's, like r'(V\d+)|((\d+-?\d+),+\{((((\d+\.\d+)\,+(\d+\.\d+)),)+\},))'. This would give you the opportunity to find the indices between them to break things up.
– Todd Burus, Commented Jul 26, 2019 at 15:16
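
The idea in the comments above, using the V1/V2/V3 header lines as section boundaries, might look roughly like the following. This is only a sketch under assumptions drawn from the sample in the question: the helper name parse_sections is hypothetical, the header is assumed to contain ".V<number>," and each data row is assumed to sit on its own line as "hours,,{(value,offset),...},". The shortened sample string is made up from the question's data for illustration.

```python
import re
import pandas as pd

# Header lines look like "TEXT,TEXT-LT.V1,,,4.0,TEXT,NORMAL;" (assumed from the sample).
HEADER = re.compile(r'\.(V\d+),')
# Data rows look like "1-22,,{(75.43,0.0),...},": capture the hour label and the brace body.
ROW = re.compile(r'^(\d+(?:-\d+)?),+\{(.*?)\}', re.MULTILINE)
# First number of each "(value,offset)" pair.
FIRST = re.compile(r'\((\d+(?:\.\d+)?),')


def parse_sections(text):
    """Return {section name: DataFrame}, one per V header (hypothetical helper)."""
    # Splitting on a pattern with one group yields [preamble, 'V1', body1, 'V2', body2, ...]
    parts = HEADER.split(text)
    frames = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        rows = [[hour] + FIRST.findall(pairs) for hour, pairs in ROW.findall(body)]
        width = max((len(r) for r in rows), default=1)
        columns = ['Hour'] + [f'Value {i}' for i in range(1, width)]
        # Pad short rows with '' so every row has the same number of columns.
        rows = [r + [''] * (width - len(r)) for r in rows]
        frames[name] = pd.DataFrame(rows, columns=columns)
    return frames


# Shortened sample, for illustration only.
sample = """TEXT,TEXT-LT.V1,,,4.0,TEXT,NORMAL;
1-22,,{(75.43,0.0),(75.43,110.0)},
23,,{(20.82,0.0),(20.82,110.0)},
*
TEXT,TEXT-TEXT.V2,,,4.0,TEXT,NORMAL;
1-22,,{(74.93,0.0),(74.93,110.0),(74.95,119.0)},
"""
frames = parse_sections(sample)
print(frames['V1'])
```

Because the split is driven by the headers rather than by a particular hour label, this should not depend on how many hour ranges each section contains, and it accepts any number of pairs per row instead of hard-coding five or six.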

Code Different · Accepted Answer · 2019-07-26 20:26:11Z

If I understand you correctly, you want to split the dataframe at every row where Hour equals '1-22'. Try this:

s = (match_Value_df['Hour'] == '1-22').cumsum()
dfs = []
for i in range(s.min(), s.max() + 1):
    subDF = match_Value_df.loc[s == i]
    dfs.append(subDF)

Result:

dfs[0]:
   Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0  1-22   75.43   75.43   75.45   96.54  109.25        
1    23   20.82   20.82   20.84   41.93   54.64        
2    24   81.26   81.26   81.28  102.37  115.08        

dfs[1]:
   Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
3  1-22   74.93   74.93   74.95   74.95   74.95        
4    23   24.98   24.98   25.00   25.00   25.00        
5    24   80.76   80.76   80.78   80.78   80.78        

dfs[2]:
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
6   1-22   74.94   74.94   75.46   96.54  109.25  109.27
7  23-24   80.77   80.77   81.29  102.37  115.08  115.10

If you want to get them into 3 different variables:

v1, v2, v3 = dfs[:3]
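
The same split can also be written with groupby. A minimal sketch, assuming (as in the sample) that every block starts with an Hour of '1-22'; the small match_Value_df built here is a stand-in for the one in the question with only two value columns, and reset_index(drop=True) is an addition so that each piece is numbered from 0, as in the expected result.

```python
import pandas as pd

# Stand-in for the question's match_Value_df (two value columns kept for brevity).
match_Value_df = pd.DataFrame(
    [['1-22', '75.43', '75.43'],
     ['23', '20.82', '20.82'],
     ['24', '81.26', '81.26'],
     ['1-22', '74.93', '74.93'],
     ['23', '24.98', '24.98'],
     ['24', '80.76', '80.76'],
     ['1-22', '74.94', '74.94'],
     ['23-24', '80.77', '80.77']],
    columns=['Hour', 'Value 1', 'Value 2'],
)

# A new block starts on every row whose Hour is '1-22' (assumption from the sample).
block_id = (match_Value_df['Hour'] == '1-22').cumsum()

# One DataFrame per block, each re-indexed from 0.
dfs = [group.reset_index(drop=True) for _, group in match_Value_df.groupby(block_id)]
v1, v2, v3 = dfs
```

If the first hour range is not always '1-22', the comparison would need a different block marker.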
