Python - Data processing with array

Question

I'm using Python to work with data from CSV files. After reading a CSV file into an array (a list of lists), my data looks like this:

data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]

I know I can use a "for" loop, but my data has millions of records and the "for" loop takes too long to run. Is there a way to do the tasks listed below without using "for"?

Convert the value in column 1 from string to integer (e.g. "10" -> 10)
Add "http://" to the value in column 4 (e.g. "url10" -> "http://url10")
Convert the value in column 5 to boolean (e.g. "False" -> False)

Thank you a lot!

Sounds like a use case for map (docs.python.org/3/library/functions.html#map).
– jobnz, Commented Apr 17, 2018 at 20:28

Ajax1234 · Accepted Answer · 2018-04-17 20:26:45Z

2

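You can use map with a predefined function. With an already-defined function, map can be slightly faster than the equivalent list comprehension on large inputs, although the difference is usually small and every row is still processed once: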

def clean_data(row):
    val, date, name, url, truthy = row
    return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']


data = [
["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]
print(list(map(clean_data, data)))

Output:

[[10, '2018-03-22 14:38:18.329963', 'name 10', 'http://url10', True], [11, '2018-03-22 14:38:18.433497', 'name 11', 'http://url11', False], [12, '2018-03-22 14:38:18.532312', 'name 12', 'http://url12', False]]
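
How map compares with a list comprehension depends on the Python version and the data, so it may be worth measuring on your own input. Below is a minimal sketch using the standard-library timeit module. The helper names with_map and with_listcomp and the repeated rows sample are made up for illustration, and no particular timing result is implied:

```python
import timeit


def clean_data(row):
    val, date, name, url, truthy = row
    return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']


# Hypothetical sample: two rows repeated to simulate a larger input.
rows = [
    ["10", "2018-03-22 14:38:18.329963", "name 10", "url10", "True"],
    ["11", "2018-03-22 14:38:18.433497", "name 11", "url11", "False"],
] * 5000


def with_map():
    return list(map(clean_data, rows))


def with_listcomp():
    return [clean_data(row) for row in rows]


# Take the best of three runs for each variant to reduce noise.
map_time = min(timeit.repeat(with_map, number=1, repeat=3))
comp_time = min(timeit.repeat(with_listcomp, number=1, repeat=3))
print('map:', map_time)
print('list comprehension:', comp_time)
```

Both variants produce the same list, so the choice mostly comes down to readability unless the measurement shows a clear gap on your data.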

answered Apr 17, 2018 at 20:26

Ajax1234


2 Comments

castaway2000 Over a year ago

nice solution. this is simple and clean.

Nhan Tran Over a year ago

Superb! Thank you a lot!

Sphinx · Answer · 2018-04-17 20:51:09Z

0

Pandas is another option, if you don't mind spending some time loading your data into a DataFrame first.

Below is one solution using Pandas, followed by a simple timing comparison against the map solution.

import pandas as pd
from datetime import datetime

data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
] * 10000  # repeat 10000 times to simulate large data; you can test with a bigger number

# Pandas (note: building the DataFrame happens outside the timed section)
df = pd.DataFrame(data=data, columns=['seq', 'datetime', 'name', 'url', 'boolean'])
pandas_beg = datetime.now()
df['seq'] = df['seq'].astype(int)          # "10" -> 10
df['url'] = 'http://' + df['url']          # "url10" -> "http://url10"
df['boolean'] = df['boolean'] == 'True'    # "False" -> False
pandas_end = datetime.now()
print('pandas: ', (pandas_end - pandas_beg))

# map
def clean_data(row):
    val, date, name, url, truthy = row
    return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']

map_beg = datetime.now()
result = list(map(clean_data, data))
map_end = datetime.now()
print('map: ', (map_end - map_beg))

Output:

pandas:  0:00:00.016091
map:  0:00:00.036025
[Finished in 0.997s]
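
Since the data originally comes from a CSV file, another option might be to let pandas parse the types while reading, instead of converting a list of strings afterwards. This is a rough sketch, assuming the file has no header row and no quoted fields. The in-memory csv_text is a stand-in for the real file, and the column names are reused from the code above:

```python
import io

import pandas as pd

# Stand-in for the real CSV file (assumed: no header row, unquoted fields).
csv_text = (
    "10,2018-03-22 14:38:18.329963,name 10,url10,True\n"
    "11,2018-03-22 14:38:18.433497,name 11,url11,False\n"
    "12,2018-03-22 14:38:18.532312,name 12,url12,False\n"
)

columns = ['seq', 'datetime', 'name', 'url', 'boolean']

# With default settings, read_csv infers an integer column for 'seq'
# and a boolean column for the True/False values.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=columns)

# Only the URL prefix still needs an explicit step.
df['url'] = 'http://' + df['url']

print(df.dtypes)
print(df.head())
```

With a real file, the io.StringIO(csv_text) argument would be replaced by the file path. The datetime column stays a string here, as in the original code.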

edited Apr 17, 2018 at 20:51

answered Apr 17, 2018 at 20:44

Sphinx


1 Comment

Nhan Tran Over a year ago

Cool. Thank you mate
