
I have the following unstructured data (read from a csv).

data = [[b'id' b'datetime' b'anomaly_length' b'affected_sensors' b'reason']
 [b'1' b'2019-12-20 08:09' b'26' b'all' b'Open Windows']
 [b'1' b'2019-12-20 08:10' b'26' b'all' b'Open Windows']
 [b'1' b'2019-12-20 08:11' b'26' b'all' b'Open Windows']
 [b'1' b'2019-12-20 08:12' b'26' b'all' b'Open Windows']
 [b'1' b'2019-12-20 08:13' b'26' b'all' b'Open Windows']
 [b'1' b'2019-12-20 08:14' b'26' b'all' b'Open Windows']
 [b'1' b'2019-12-20 08:15' b'26' b'all' b'Open Windows']
 [b'1' b'2019-12-20 08:16' b'26' b'all' b'Open Windows']
 [b'1' b'2019-12-20 08:17' b'26' b'all' b'Open Windows']]
 ...

I currently create separate typed arrays with the following code:

import datetime as dt
import numpy as np

labels_id = np.array(data[1:, 0], dtype=int)
labels = [dt.datetime.strptime(date.decode("utf-8"), '%Y-%m-%d %H:%M') for date in np.array(data[1:, 1])]
labels_length = np.array(data[1:, 2], dtype=int)

This code is necessary because I need the data with the correct datatypes. I pass all of these arrays into a function and access them by index. I don't like this solution, but since the function is called multiple times, I don't want to cast the data inside the function on every call.

Original function code:

def special_find(labels_id, labels, labels_length):
    for i, id in enumerate(labels_id):
        print(id)
        print(labels[i])
        print(labels_length[i])
...

Expected: I want a structured array that contains only the needed columns:

structured_data = [[1 datetime.datetime(2019, 12, 20, 8, 9) 26],
 [1 datetime.datetime(2019, 12, 20, 8, 10) 26],
 [1 datetime.datetime(2019, 12, 20, 8, 11) 26],
 [1 datetime.datetime(2019, 12, 20, 8, 12) 26],
 [1 datetime.datetime(2019, 12, 20, 8, 13) 26],
 [1 datetime.datetime(2019, 12, 20, 8, 14) 26],
...

I know I could concatenate all the created arrays, but I don't think that is a good solution. Instead, I am searching for something like this:

structured_data = np.array(data[1:, 0:3], dtype=...)

UPDATE: here are some sample values for a csv file

id,datetime,anomaly_length,affected_sensors,reason
1,2019-12-20 08:09,26,all,Open Windows
1,2019-12-20 08:10,26,all,Open Windows
1,2019-12-20 08:11,26,all,Open Windows
1,2019-12-20 08:12,26,all,Open Windows
1,2019-12-20 08:13,26,all,Open Windows
1,2019-12-20 08:14,26,all,Open Windows
1,2019-12-20 08:15,26,all,Open Windows
1,2019-12-20 08:16,26,all,Open Windows
1,2019-12-20 08:17,26,all,Open Windows
  • Use Pandas. NumPy loses lots of its usefulness when your data isn't all of the same type. Commented Jan 9, 2020 at 14:03
  • @Seb can you give me a code example? Commented Jan 9, 2020 at 14:05
  • It might be easier to get the structured array when reading the csv. You can specify dtype=None or your own dtype. Commented Jan 9, 2020 at 17:03
  • The pandas read_csv is powerful and fast. You could use to_records to get a structured array from the dataframe. Regardless, handling that date/time column can be tricky, since possible types include strings, datetime objects, and np.datetime64. Commented Jan 9, 2020 at 19:33
  • I'll second what @Seb suggested, use Pandas. Can you share at least part of your data? See: minimal reproducible example. Commented Jan 10, 2020 at 0:12

3 Answers


Since you've already converted the columns to NumPy arrays of the correct data type, it is easy to create a Pandas DataFrame from them, for example:

import pandas as pd

df = pd.DataFrame({
    'id': labels_id,
    'datetime': labels,
    'anomaly_length': labels_length
})
>>> df
   id            datetime  anomaly_length
0   1 2019-12-20 08:09:00              26
1   1 2019-12-20 08:10:00              26
2   1 2019-12-20 08:11:00              26
3   1 2019-12-20 08:12:00              26
4   1 2019-12-20 08:13:00              26
5   1 2019-12-20 08:14:00              26
6   1 2019-12-20 08:15:00              26
7   1 2019-12-20 08:16:00              26
8   1 2019-12-20 08:17:00              26

The Pandas docs have a good introduction on how to work with these objects.
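For reference, the original special_find loop could then take the DataFrame directly instead of three separate arrays. A minimal sketch using itertuples (the attribute names simply mirror the column labels above):

def special_find(df):
    # Each row comes back as a named tuple, so no index bookkeeping is needed.
    for row in df.itertuples(index=False):
        print(row.id)
        print(row.datetime)
        print(row.anomaly_length)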


2 Comments

Is it possible to use pandas directly for reading the csv and filling the dataframe?

I tried to recreate your csv file with:

In [23]: cat stack59665655.txt                                                  
id, datetime, anomaly_length, affected_sensors, reason
1, 2019-12-20 08:09, 26, all, Open Windows
1, 2019-12-20 08:10, 26, all, Open Windows
1, 2019-12-20 08:11, 26, all, Open Windows

With pandas I can read it with:

In [24]: data = pd.read_csv('stack59665655.txt')                                
In [25]: data                                                                   
Out[25]: 
   id           datetime   anomaly_length  affected_sensors         reason
0   1   2019-12-20 08:09               26               all   Open Windows
1   1   2019-12-20 08:10               26               all   Open Windows
2   1   2019-12-20 08:11               26               all   Open Windows
In [26]: data.dtypes                                                            
Out[26]: 
id                    int64
 datetime            object
 anomaly_length       int64
 affected_sensors    object
 reason              object
dtype: object

The object columns contain strings. I suspect pandas has a way of converting that datetime string column to datetime objects or np.datetime64.
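For example, something like this might work at load time (a sketch; skipinitialspace and parse_dates are documented read_csv parameters, and the filename matches the session above):

import pandas as pd

# Sketch: strip the stray spaces after each delimiter and parse the
# datetime column into datetime64 while reading.
data = pd.read_csv('stack59665655.txt', skipinitialspace=True,
                   parse_dates=['datetime'])
print(data.dtypes)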

The simple conversion to an array produces an object dtype array:

In [27]: data.to_numpy()                                                        
Out[27]: 
array([[1, ' 2019-12-20 08:09', 26, ' all', ' Open Windows'],
       [1, ' 2019-12-20 08:10', 26, ' all', ' Open Windows'],
       [1, ' 2019-12-20 08:11', 26, ' all', ' Open Windows']],
      dtype=object)

to_records produces a record array, a variant on a structured array. Note the compound dtype:

In [28]: data.to_records()                                                      
Out[28]: 
rec.array([(0, 1, ' 2019-12-20 08:09', 26, ' all', ' Open Windows'),
           (1, 1, ' 2019-12-20 08:10', 26, ' all', ' Open Windows'),
           (2, 1, ' 2019-12-20 08:11', 26, ' all', ' Open Windows')],
          dtype=[('index', '<i8'), ('id', '<i8'), (' datetime', 'O'), (' anomaly_length', '<i8'), (' affected_sensors', 'O'), (' reason', 'O')])
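
As an aside, the automatic index field can be dropped at the source; index=False is a documented to_records parameter:

# Drop the DataFrame index so only the csv columns become fields.
rec = data.to_records(index=False)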

Instead, using genfromtxt with its auto-dtype mode:

In [29]: data1 = np.genfromtxt('stack59665655.txt', dtype=None, names=True,
    ...:                       delimiter=',', encoding=None, autostrip=True)
In [30]: data1                                                                  
Out[30]: 
array([(1, '2019-12-20 08:09', 26, 'all', 'Open Windows'),
       (1, '2019-12-20 08:10', 26, 'all', 'Open Windows'),
       (1, '2019-12-20 08:11', 26, 'all', 'Open Windows')],
      dtype=[('id', '<i8'), ('datetime', '<U16'), ('anomaly_length', '<i8'), ('affected_sensors', '<U3'), ('reason', '<U12')])

I could convert the datetime field with:

In [31]: data1['datetime']                                                      
Out[31]: 
array(['2019-12-20 08:09', '2019-12-20 08:10', '2019-12-20 08:11'],
      dtype='<U16')
In [32]: data1['datetime'].astype('datetime64[m]')                              
Out[32]: 
array(['2019-12-20T08:09', '2019-12-20T08:10', '2019-12-20T08:11'],
      dtype='datetime64[m]')

Changing this in-place actually requires defining a new dtype.

Or I could construct a custom dtype, for example by modifying the one deduced for data1:

In [45]: dt = data1.dtype.descr
In [46]: dt[1] = ('datetime', 'datetime64[m]')
In [47]: dt = np.dtype(dt)
In [48]: dt                                                                     
Out[48]: dtype([('id', '<i8'), ('datetime', '<M8[m]'), ('anomaly_length', '<i8'), ('affected_sensors', '<U3'), ('reason', '<U12')])

In [49]: data2 = np.genfromtxt('stack59665655.txt', dtype=dt, names=True,
    ...:                       delimiter=',', encoding=None, autostrip=True)
In [50]: data2                                                                  
Out[50]: 
array([(1, '2019-12-20T08:09', 26, 'all', 'Open Windows'),
       (1, '2019-12-20T08:10', 26, 'all', 'Open Windows'),
       (1, '2019-12-20T08:11', 26, 'all', 'Open Windows')],
      dtype=[('id', '<i8'), ('datetime', '<M8[m]'), ('anomaly_length', '<i8'), ('affected_sensors', '<U3'), ('reason', '<U12')])

To use datetime objects, I'd have to use a converter in genfromtxt.
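
A rough sketch of that (converters is a documented genfromtxt parameter; with names=True the column name should work as the key, otherwise the column index can be used):

import datetime as dt

# Sketch: parse the datetime column into datetime.datetime objects while
# loading; the resulting field ends up with object dtype.
to_dt = lambda s: dt.datetime.strptime(s.strip(), '%Y-%m-%d %H:%M')
data3 = np.genfromtxt('stack59665655.txt', dtype=None, names=True,
                      delimiter=',', encoding=None, autostrip=True,
                      converters={'datetime': to_dt})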

2 Comments

Re "I suspect pandas has a way of converting that datetime string column to datetime objects or np.datetime64": read_csv() has a parse_dates parameter, that might work?
Thanks for the answer! I actually combined read_csv with converters.

I combined read_csv from pandas with converters:

import pandas as pd
import datetime as dt

filename = './data.csv'

# Convert the datetime column to datetime objects while reading the csv.
to_date = lambda value: dt.datetime.strptime(value, '%Y-%m-%d %H:%M')
values = pd.read_csv(filename, converters={'datetime': to_date})
print(values.dtypes)


OUTPUT:
id                           int64
datetime            datetime64[ns]
anomaly_length               int64
affected_sensors            object
reason                      object
dtype: object
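
From here, the structured array asked for in the question could be built by selecting just the needed columns, for example (a sketch; to_records is the standard DataFrame method):

# Keep only the needed columns and turn them into a NumPy record array.
structured_data = values[['id', 'datetime', 'anomaly_length']].to_records(index=False)
print(structured_data.dtype)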

Comments
