5

I have a large data set i need to conduct time-series analysis on. The data is currently in excel formated as below:

+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
| ID | Q1 09 variable X | Q2 09 variable X | Q3 09 variable X | Q4 09 variable X | Q1 10 variable X | Q2 10 variable X | Q3 10 variable X | Q4 10 variable X | Q1 09 variable Y | Q2 09 variable Y | Q3 09 variable Y | Q4 09 variable Y | Q1 10 variable Y | Q2 10 variable Y | Q3 10 variable Y | Q4 10 variable Y |
+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|  1 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
|  2 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
|  3 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
|  4 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
|  5 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
|  6 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
|  7 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
|  8 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
|  9 | X                | X                | X                | X                | X                | X                | X                | X                | Y                | Y                | Y                | Y                | Y                | Y                | Y                | Y                |
+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+

The above table is an example. The dataset contains data on all quarters up to Q1 2019. I need the data to look like this:

+----+-------+------------+------------+
| ID |   T   | Variable X | Variable Y |
+----+-------+------------+------------+
|  1 | Q1 09 | X          | Y          |
|  1 | Q2 09 | X          | Y          |
|  1 | Q3 09 | X          | Y          |
|  1 | Q4 09 | X          | Y          |
|  1 | Q1 10 | X          | Y          |
|  1 | Q2 10 | X          | Y          |
|  1 | Q3 10 | X          | Y          |
|  1 | Q4 10 | X          | Y          |
|  2 | Q1 09 | X          | Y          |
|  2 | Q2 09 | X          | Y          |
|  2 | Q3 09 | X          | Y          |
|  2 | Q4 09 | X          | Y          |
|  2 | Q1 10 | X          | Y          |
|  2 | Q2 10 | X          | Y          |
|  2 | Q3 10 | X          | Y          |
|  2 | Q4 10 | X          | Y          |
|  3 | Q1 09 | X          | Y          |
|  3 | Q2 09 | X          | Y          |
|  3 | Q3 09 | X          | Y          |
|  3 | Q4 09 | X          | Y          |
|  3 | Q1 10 | X          | Y          |
|  3 | Q2 10 | X          | Y          |
|  3 | Q3 10 | X          | Y          |
|  3 | Q4 10 | X          | Y          |
|  4 | Q1 09 | X          | Y          |
|  4 | Q2 09 | X          | Y          |
|  4 | Q3 09 | X          | Y          |
|  4 | Q4 09 | X          | Y          |
|  4 | Q1 10 | X          | Y          |
|  4 | Q2 10 | X          | Y          |
|  4 | Q3 10 | X          | Y          |
|  4 | Q4 10 | X          | Y          |
|  5 | Q1 09 | X          | Y          |
|  5 | Q2 09 | X          | Y          |
|  5 | Q3 09 | X          | Y          |
|  5 | Q4 09 | X          | Y          |
|  5 | Q1 10 | X          | Y          |
|  5 | Q2 10 | X          | Y          |
|  5 | Q3 10 | X          | Y          |
|  5 | Q4 10 | X          | Y          |
|  6 | Q1 09 | X          | Y          |
|  6 | Q2 09 | X          | Y          |
|  6 | Q3 09 | X          | Y          |
|  6 | Q4 09 | X          | Y          |
|  6 | Q1 10 | X          | Y          |
|  6 | Q2 10 | X          | Y          |
|  6 | Q3 10 | X          | Y          |
|  6 | Q4 10 | X          | Y          |
|  7 | Q1 09 | X          | Y          |
|  7 | Q2 09 | X          | Y          |
|  7 | Q3 09 | X          | Y          |
|  7 | Q4 09 | X          | Y          |
|  7 | Q1 10 | X          | Y          |
|  7 | Q2 10 | X          | Y          |
|  7 | Q3 10 | X          | Y          |
|  7 | Q4 10 | X          | Y          |
|  8 | Q1 09 | X          | Y          |
|  8 | Q2 09 | X          | Y          |
|  8 | Q3 09 | X          | Y          |
|  8 | Q4 09 | X          | Y          |
|  8 | Q1 10 | X          | Y          |
|  8 | Q2 10 | X          | Y          |
|  8 | Q3 10 | X          | Y          |
|  8 | Q4 10 | X          | Y          |
|  9 | Q1 09 | X          | Y          |
|  9 | Q2 09 | X          | Y          |
|  9 | Q3 09 | X          | Y          |
|  9 | Q4 09 | X          | Y          |
|  9 | Q1 10 | X          | Y          |
|  9 | Q2 10 | X          | Y          |
|  9 | Q3 10 | X          | Y          |
|  9 | Q4 10 | X          | Y          |
+----+-------+------------+------------+

The data set is very large, several thousands data points in total.

I previosuly posted regarding this issue, but I guess I did not formulate the question properly. i i tried the code below, but it left out the Y column.

df.columns = [
    df.columns.to_series().groupby(level=0).cumcount().map({0: 'X', 1: 'Y'}),
    df.columns
]

df.stack().rename_axis(['ID', 'T']).reset_index()
1
  • Sidenote: this is definitely a usecase for pd.wide_to_long, just couldn't get it to work with the stubnames argument. Commented Jul 16, 2019 at 8:51

2 Answers 2

2

We can split your dataframe by variables (X, Y) with df.filter. Then we melt your columns to rows with .melt. Finally we concat the seperated data together again with pd.concat.

The last line is to clean up your T column by removing everything with format: variable X

X = df[['ID']].join(df.filter(regex='X$')).melt(id_vars='ID', var_name='T', value_name='Variable X')
Y = df[['ID']].join(df.filter(regex='Y$')).melt(id_vars='ID', var_name='T', value_name='Variable Y')

df = pd.concat([X, Y[['Variable Y']]], axis=1).sort_values(['ID', 'T']).reset_index(drop=True)

df['T'] = df['T'].str.replace('\svariable\s[A-Za-z]', '')

Output

    ID      T Variable X Variable Y
0    1  Q1 09          X          Y
1    1  Q1 10          X          Y
2    1  Q2 09          X          Y
3    1  Q2 10          X          Y
4    1  Q3 09          X          Y
5    1  Q3 10          X          Y
6    1  Q4 09          X          Y
7    1  Q4 10          X          Y
8    2  Q1 09          X          Y
9    2  Q1 10          X          Y
10   2  Q2 09          X          Y
11   2  Q2 10          X          Y
12   2  Q3 09          X          Y
13   2  Q3 10          X          Y
14   2  Q4 09          X          Y
15   2  Q4 10          X          Y
16   3  Q1 09          X          Y
17   3  Q1 10          X          Y
18   3  Q2 09          X          Y
19   3  Q2 10          X          Y
20   3  Q3 09          X          Y
21   3  Q3 10          X          Y
22   3  Q4 09          X          Y
23   3  Q4 10          X          Y
24   4  Q1 09          X          Y
25   4  Q1 10          X          Y
26   4  Q2 09          X          Y
27   4  Q2 10          X          Y
28   4  Q3 09          X          Y
29   4  Q3 10          X          Y
..  ..    ...        ...        ...
42   6  Q2 09          X          Y
43   6  Q2 10          X          Y
44   6  Q3 09          X          Y
45   6  Q3 10          X          Y
46   6  Q4 09          X          Y
47   6  Q4 10          X          Y
48   7  Q1 09          X          Y
49   7  Q1 10          X          Y
50   7  Q2 09          X          Y
51   7  Q2 10          X          Y
52   7  Q3 09          X          Y
53   7  Q3 10          X          Y
54   7  Q4 09          X          Y
55   7  Q4 10          X          Y
56   8  Q1 09          X          Y
57   8  Q1 10          X          Y
58   8  Q2 09          X          Y
59   8  Q2 10          X          Y
60   8  Q3 09          X          Y
61   8  Q3 10          X          Y
62   8  Q4 09          X          Y
63   8  Q4 10          X          Y
64   9  Q1 09          X          Y
65   9  Q1 10          X          Y
66   9  Q2 09          X          Y
67   9  Q2 10          X          Y
68   9  Q3 09          X          Y
69   9  Q3 10          X          Y
70   9  Q4 09          X          Y
71   9  Q4 10          X          Y

[72 rows x 4 columns]
Sign up to request clarification or add additional context in comments.

6 Comments

Please assume that the "X " and "Y" variables will be numbers
You can change df.filter(regex='X$') to whatever your data represents, 'X$' means ends with X. So you can change the X to both variable numbers. @MadsAndersen
well the numbers will be different across all the observations, so can't just insert the numbers
Then how do you differentiate the variables from each other? You show X,Y which stand for two different variables in the columns. @MadsAndersen
No, the data is currently formatted such that the first columns contain data on the first variable, "X", across the quarters. the next columns contain data on the other variable, "Y", across the quarters. so the variables are differentiated based what the columns are named.
|
1

Create MultiIndex by all columns with variables, so possible reshape by DataFrame.stack - both T columns are joined together:

df = df.set_index('ID')
df.columns = df.columns.str.split(n=2, expand=True)
df = df.stack([0,1]).rename_axis(('ID','T','T1')).reset_index()
df['T'] = df['T'] + ' ' + df.pop('T1')
print (df.head(10))
   ID      T variable X variable Y
0   1  Q1 09          X          Y
1   1  Q1 10          X          Y
2   1  Q2 09          X          Y
3   1  Q2 10          X          Y
4   1  Q3 09          X          Y
5   1  Q3 10          X          Y
6   1  Q4 09          X          Y
7   1  Q4 10          X          Y
8   2  Q1 09          X          Y
9   2  Q1 10          X          Y

Or if possible create MultiIndex by first 5 values and another values of columns:

df = df.set_index('ID')
df.columns = pd.MultiIndex.from_tuples(zip(df.columns.str[:5], df.columns.str[5:]))
df = df.stack(0).rename_axis(('ID','T')).reset_index()

If order is important create column with ordered categorical and last sort values:

df = df.set_index('ID')
c = df.columns.str[:5]
df.columns = pd.MultiIndex.from_tuples(zip(c, df.columns.str[5:]))
df1 = df.stack(0).rename_axis(('ID','T')).reset_index()
df1['T'] = pd.CategoricalIndex(df1['T'], ordered=True, categories=c.unique())
df1 = df1.sort_values(['ID','T'])
print (df1.head(10))
    ID      T  variable X  variable Y
0    1  Q1 09           X           Y
2    1  Q2 09           X           Y
4    1  Q3 09           X           Y
6    1  Q4 09           X           Y
1    1  Q1 10           X           Y
3    1  Q2 10           X           Y
5    1  Q3 10           X           Y
7    1  Q4 10           X           Y
8    2  Q1 09           X           Y
10   2  Q2 09           X           Y

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.