python dataframe operations

Question

I have a dataset of historical precipitation records (1990-2020) for different locations (latitude and longitude), stored as a table with 5 attributes (lat, lon, year, month, prec). The dataset is organized in groups by latitude, longitude and time. For example:

INPUT

lan/lon/year/month/prec
-17/18/1990/1/0.4
-17/18/1990/2/0.02
-17/18/1990/3/0.12
-17/18/1990/4/0.06
.
.
.
-17/18/2020/12/0.35
-17/20/1990/1/0.2
-17/20/1990/2/0.2
-17/20/1990/3/0.2
-17/20/1990/4/0.2
.
.
.
-17/20/2020/12/0.08
-18/20/1990/1/0.11
-18/20/1990/2/0.11
-18/20/1990/3/0.11
.
.
.
.

EXPECTED OUTPUT (accumulation period=3)

lan/lon/year/month/prec/prec_3
-17/18/1990/1/0.4/-
-17/18/1990/2/0.02/-
-17/18/1990/3/0.12/0.54
-17/18/1990/4/0.06/0.2
.
.
.
-17/18/2020/12/0.35/12.58
-17/20/1990/1/0.2/-
-17/20/1990/2/0.2/-
-17/20/1990/3/0.2/0.6
-17/20/1990/4/0.2/0.8
.
.
.
-17/20/2020/12/0.08/35.0
-18/20/1990/1/0.11/-
-18/20/1990/2/0.11/-
-18/20/1990/3/0.11/0.33
.
.
.
.

I want to analyze this time series by performing calculations on the precipitation variable: summing it over different accumulation periods (for example, 3 and 6 months) for each coordinate pair, and then fitting the data to a probability distribution. Does anyone know how to compute these "sums" so that each one covers only the given time period and never uses records from a different latitude and longitude?

Additional information

There are monthly records from 1990 to 2020. The calculation must restart whenever the longitude or latitude changes, since that indicates a different point. All the records are in CSV format. The data is ordered and has no NaN values.

1. How often should the accumulation reset? 2. Is your data a text file in the format you presented? 3. Are you summing the most frequent 3 months, or summing every month but only reporting the sum from the third month onward?
– Bill Huang, Commented Nov 2, 2020 at 17:05
Thank you. For each location there are monthly records from 1990 to 2020. The calculation must restart when the longitude or latitude changes, since that indicates a different point. All the records are in CSV format. The data is ordered and has no NaN values.
– KaSan, Commented Nov 2, 2020 at 17:09
You didn't answer my Q3. Your expected output doesn't look like a 3-month accumulation as you described in the text. Could you explain?
– Bill Huang, Commented Nov 2, 2020 at 17:15
If I have a one-year series (12 months) and I want a 3-month accumulation, the field where I store the result (prec_3) starts having values in March, where it is the sum of January, February and March. For April it is the sum of February, March and April, and so on. The accumulation continues until the series ends, which in this question is when the latitude or longitude changes, since that is another location.
– KaSan, Commented Nov 2, 2020 at 17:21

Bill Huang · Accepted Answer · 2020-11-02 17:42:08Z

It looks like .rolling(period).sum() is what you are looking for.

Input csv file

lan/lon/year/month/prec
-17/18/1990/1/0.4
-17/18/1990/2/0.02
-17/18/1990/3/0.12
-17/18/1990/4/0.06
-17/18/2020/12/0.35
-17/20/1990/1/0.2
-17/20/1990/2/0.2
-17/20/1990/3/0.2
-17/20/1990/4/0.2
-17/20/2020/12/0.08
-18/20/1990/1/0.11
-18/20/1990/2/0.11
-18/20/1990/3/0.11

Code

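import pandas as pd
# Sort so the rows follow the (lan, lon) group order that groupby().rolling() returns.
df = pd.read_csv(path_to_file, sep="/").sort_values(["lan","lon","year","month"])
df["prec_3"] = df.groupby(["lan","lon"])["prec"].rolling(3).sum().values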

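Note that the row ordering is changed by the pre-sorting step so that it matches the order of the grouped rolling-sum output, which is assigned positionally via .values. The original row order can be recovered with df.sort_index() if needed.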
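If keeping the rows in file order matters, one possible variant is to let pandas align the result by index instead of assigning positionally. The sketch below is not part of the original answer: it uses groupby(...).transform with a rolling sum, reads a small made-up sample from an in-memory string (csv_text is a hypothetical stand-in for the real file), and assumes the rows of each location are already in chronological order.

```python
import io
import pandas as pd

# Hypothetical sample in the same "/"-separated layout as the question.
csv_text = """lan/lon/year/month/prec
-17/18/1990/1/0.4
-17/18/1990/2/0.02
-17/18/1990/3/0.12
-17/18/1990/4/0.06
-18/20/1990/1/0.11
-18/20/1990/2/0.11
-18/20/1990/3/0.11
"""
df = pd.read_csv(io.StringIO(csv_text), sep="/")

# transform() returns a Series aligned to df's own index, so the rows keep
# their file order and no .values / sort_index() step is needed.
# Assumption: rows inside each (lan, lon) group are already chronological.
df["prec_3"] = (
    df.groupby(["lan", "lon"])["prec"]
      .transform(lambda s: s.rolling(3).sum())
)
print(df)
```

The first two rows of every location come out as NaN because rolling(3) needs three observations by default.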

Output

print(df)  # the original ordering is preserved in the index

    lan  lon  year  month  prec  prec_3
10  -18   20  1990      1  0.11     NaN
11  -18   20  1990      2  0.11     NaN
12  -18   20  1990      3  0.11    0.33
0   -17   18  1990      1  0.40     NaN
1   -17   18  1990      2  0.02     NaN
2   -17   18  1990      3  0.12    0.54
3   -17   18  1990      4  0.06    0.20
4   -17   18  2020     12  0.35    0.53
5   -17   20  1990      1  0.20     NaN
6   -17   20  1990      2  0.20     NaN
7   -17   20  1990      3  0.20    0.60
8   -17   20  1990      4  0.20    0.60
9   -17   20  2020     12  0.08    0.48
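The question also mentions 6-month accumulations. A minimal sketch of how the same idea might be extended to several periods is shown below. It is an illustration, not the answerer's code: the eight-month series is made up, and the rolling window counts rows, so it assumes each location has one row for every consecutive month with no gaps.

```python
import io
import pandas as pd

# Hypothetical eight-month series for two locations (values are made up).
rows = ["lan/lon/year/month/prec"]
for lan, lon, base in [(-17, 18, 0.1), (-17, 20, 0.2)]:
    for month in range(1, 9):
        rows.append(f"{lan}/{lon}/1990/{month}/{base}")

df = pd.read_csv(io.StringIO("\n".join(rows)), sep="/")
df = df.sort_values(["lan", "lon", "year", "month"])

for period in (3, 6):  # accumulation periods in months
    # One new column per period; grouping keeps each window inside
    # a single (lan, lon) location.
    df[f"prec_{period}"] = (
        df.groupby(["lan", "lon"])["prec"]
          .transform(lambda s, p=period: s.rolling(p).sum())
    )
print(df)
```

Each prec_N column stays NaN for the first N-1 months of every location, which corresponds to the "-" entries in the expected output of the question.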
