
I am trying to extract groups of numbers from several very similar blocks of text, but I can't work out how to separate the blocks with my expression (I only started learning regex recently). I want three dataframes, one each for "V1", "V2", and "V3", and I only want the first value from each pair of parentheses. For example, from the first pair in the 1-22 row of V1, I just want 75.43. Hopefully that makes sense; I'm a bit stuck.

TEXT,TEXT,20190726,TEXT,TEXT00000,,NORMAL;
*
TEXT,TEXT-LT.V1,,,4.0,TEXT,NORMAL;
1-22,,{(75.43,0.0),(75.43,110.0),(75.45,119.0),(96.54,139.0),(109.25,159.0)},
23,,{(20.82,0.0),(20.82,110.0),(20.84,119.0),(41.93,139.0),(54.64,159.0)},
24,,{(81.26,0.0),(81.26,110.0),(81.28,119.0),(102.37,139.0),(115.08,159.0)},
*
*
TEXT,TEXT,20190726,TEXT,TEXT00000,,NORMAL;
*
TEXT,TEXT-TEXT.V2,,,4.0,TEXT,NORMAL;
1-22,,{(74.93,0.0),(74.93,110.0),(74.95,119.0),(74.95,139.0),(74.95,163.0)},
23,,{(24.98,0.0),(24.98,110.0),(25.00,119.0),(25.00,139.0),(25.00,163.0)},
24,,{(80.76,0.0),(80.76,110.0),(80.78,119.0),(80.78,139.0),(80.78,163.0)},
*
*
TEXT,TEXT,20190726,TEXT,TEXT00000,,NORMAL;
*
TEXT,TEXT-TEXT.V3,,,2.0,TEXT,NORMAL;
1-22,,{(74.94,0.0),(74.94,70.0),(75.46,147.0),(96.54,167.0),(109.25,186.0),(109.27,210.0)},
23-24,,{(80.77,0.0),(80.77,70.0),(81.29,147.0),(102.37,167.0),(115.08,186.0),(115.10,210.0)},
*
What I tried
import re
import pandas as pd

# Read the whole file; the context manager closes it afterwards
with open("TextFile.txt", "r") as f:
    TextFile_str = f.read()
# Captures the hour label, then the first number of each pair; the sixth pair is optional
Value_Only = re.compile(r'(\d+-?\d+),+\{\((\d+\.\d+),\d+\.\d+\),\((\d+\.\d+),\d+\.\d+\),\((\d+\.\d+),\d+\.\d+\),\((\d+\.\d+),\d+\.\d+\),\((\d+\.\d+),\d+\.\d+\),*\(*(\d*\.*\d*),*\d*\.*\d*\)*\}*')
match_Value = Value_Only.findall(TextFile_str)
match_Value_df = pd.DataFrame(match_Value)
match_Value_df.columns = ['Hour', 'Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6']

#How it looks 
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0   1-22   75.43   75.43   75.45   96.54  109.25        
1     23   20.82   20.82   20.84   41.93   54.64        
2     24   81.26   81.26   81.28  102.37  115.08        
3   1-22   74.93   74.93   74.95   74.95   74.95        
4     23   24.98   24.98   25.00   25.00   25.00        
5     24   80.76   80.76   80.78   80.78   80.78        
6   1-22   74.94   74.94   75.46   96.54  109.25  109.27
7  23-24   80.77   80.77   81.29  102.37  115.08  115.10

Ideally I want to have 3 separate dataframes for V1, V2, and V3.

Expected Result
Dataframe 1 
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0   1-22   75.43   75.43   75.45   96.54  109.25        
1     23   20.82   20.82   20.84   41.93   54.64        
2     24   81.26   81.26   81.28  102.37  115.08

Dataframe 2
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0   1-22   74.93   74.93   74.95   74.95   74.95        
1     23   24.98   24.98   25.00   25.00   25.00        
2     24   80.76   80.76   80.78   80.78   80.78 

Dataframe 3
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0   1-22   74.94   74.94   75.46   96.54  109.25  109.27
1  23-24   80.77   80.77   81.29  102.37  115.08  115.10
  • I haven't run the code fully, but you might try shortening it with the code r'((\d+-?\d+),+\{((((\d+\.\d+),\d+\.\d+)),)+\},)' and then creating three dataframes based off of the indices of matches in your findall. I always find the site regexpal.com helpful when testing Regex code. Commented Jul 26, 2019 at 14:54
  • I thought about creating dataframes based on the indices of my findall matches before, but sometimes the text file will have more rows than this. For example, the hours could be 1-2, 3-4, 5-6, 7-12, 12-24. Thanks, I'll try shortening it. I've also recently found out about regexr.com. Both sites seem to work just fine! Commented Jul 26, 2019 at 14:58
  • You could add a call for the V's like r'(V\d+)|((\d+-?\d+),+\{((((\d+\.\d+)\,+(\d+\.\d+)),)+\},))'. This would give you the opportunity to find the indices between them to break things up. Commented Jul 26, 2019 at 15:16

1 Answer

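If I understand you correctly, you want to start a new dataframe every time Hour equals '1-22'. The comparison below marks each row where a block starts, and cumsum() turns those marks into a running block number. Try this: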

s = (match_Value_df['Hour'] == '1-22').cumsum()
dfs = []
for i in range(s.min(), s.max() + 1):
    subDF = match_Value_df.loc[s == i]
    dfs.append(subDF)

Result:

dfs[0]:
   Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
0  1-22   75.43   75.43   75.45   96.54  109.25        
1    23   20.82   20.82   20.84   41.93   54.64        
2    24   81.26   81.26   81.28  102.37  115.08        

dfs[1]:
   Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
3  1-22   74.93   74.93   74.95   74.95   74.95        
4    23   24.98   24.98   25.00   25.00   25.00        
5    24   80.76   80.76   80.78   80.78   80.78        

dfs[2]:
    Hour Value 1 Value 2 Value 3 Value 4 Value 5 Value 6
6   1-22   74.94   74.94   75.46   96.54  109.25  109.27
7  23-24   80.77   80.77   81.29  102.37  115.08  115.10
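
For reference, the same cumulative-sum idea can be sketched end to end with groupby instead of an explicit loop. This is only a minimal sketch: the small df below is a hypothetical stand-in for match_Value_df with a single value column, and resetting the index in each block is an optional extra that the loop above does not do.

```python
import pandas as pd

# Hypothetical stand-in for match_Value_df (one value column kept for brevity)
df = pd.DataFrame({
    "Hour": ["1-22", "23", "24", "1-22", "23", "24", "1-22", "23-24"],
    "Value 1": ["75.43", "20.82", "81.26", "74.93", "24.98", "80.76", "74.94", "80.77"],
})

# Every '1-22' row starts a new block; cumsum turns those starts into ids 1, 2, 3
block_id = (df["Hour"] == "1-22").cumsum()

# groupby yields one sub-frame per block id; reset_index restarts each block at 0
dfs = [group.reset_index(drop=True) for _, group in df.groupby(block_id)]

print(len(dfs))
print(dfs[2])
```

Like the loop, this still relies on every block beginning with an Hour of exactly '1-22'.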

If you want to get them into 3 different variables:

v1, v2, v3 = dfs[:3]
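
Since the comments on the question mention that the hour ranges can vary between files, splitting on '1-22' may not always hold. A possible alternative, sketched here under assumptions, is to key each block on its header line (the one ending in .V1, .V2, and so on) and collect the first number of every pair in the rows that follow. The function name parse_sections, the three patterns, and the shortened sample text are all hypothetical and not from the original post; values are kept as strings, as in the question's output.

```python
import re
import pandas as pd

# Shortened, hypothetical sample in the same layout as the question's file
sample = """TEXT,TEXT,20190726,TEXT,TEXT00000,,NORMAL;
*
TEXT,TEXT-LT.V1,,,4.0,TEXT,NORMAL;
1-22,,{(75.43,0.0),(75.43,110.0),(75.45,119.0)},
23,,{(20.82,0.0),(20.82,110.0),(20.84,119.0)},
*
TEXT,TEXT-TEXT.V3,,,2.0,TEXT,NORMAL;
1-22,,{(74.94,0.0),(74.94,70.0),(75.46,147.0),(96.54,167.0)},
"""

header_re = re.compile(r"\.(V\d+),")                       # block label, e.g. V1
row_re = re.compile(r"^(\d+(?:-\d+)?),+\{(.*)\},?\s*$")    # hour label + braces body
first_re = re.compile(r"\((\d+(?:\.\d+)?),")               # first number of each pair


def parse_sections(text):
    """Return a dict mapping each V label to a DataFrame of its rows."""
    rows = {}
    current = None
    for line in text.splitlines():
        header = header_re.search(line)
        if header:
            current = header.group(1)
            rows[current] = []
            continue
        row = row_re.match(line)
        if row and current is not None:
            rows[current].append([row.group(1)] + first_re.findall(row.group(2)))

    frames = {}
    for label, data in rows.items():
        # Pad short rows with '' so every row in a block has the same width
        width = max((len(r) for r in data), default=1)
        columns = ["Hour"] + [f"Value {i}" for i in range(1, width)]
        padded = [r + [""] * (width - len(r)) for r in data]
        frames[label] = pd.DataFrame(padded, columns=columns)
    return frames


frames = parse_sections(sample)
print(frames["V1"])
```

Because the number of pairs is not fixed in the pattern, this sketch copes with rows that have five, six, or any other number of pairs, and with single-digit hour labels.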