2

I'm using Python to work with data from csv files, and after reading csv into an array, my data looks like this:

data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]

I know I can use "for" loop but my data has around millions of records and the "for" loop takes too long time to run, so do you have any idea to do task listed below without using "for"?

  1. Convert value from string to integer in column 1 (ie: "10" -> 10)
  2. Add "http://" in column 3 (ie: "url10" -> "http://url10")
  3. Convert value in column 4 to boolean (ie: "False" -> False)

Thank you a lot!

1

2 Answers 2

2

You can use map with a predefined function. map is slightly faster than a list comprehension on larger input:

def clean_data(row):
   val, date, name, url, truthy = row
   return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']


data = [
["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]
print(list(map(clean_data, data)))

Output:

[[10, '2018-03-22 14:38:18.329963', 'name 10', 'http://url10', True], [11, '2018-03-22 14:38:18.433497', 'name 11', 'http://url11', False], [12, '2018-03-22 14:38:18.532312', 'name 12', 'http://url12', False]]
Sign up to request clarification or add additional context in comments.

2 Comments

nice solution. this is simple and clean.
Superb! Thank you a lot!
0

Pandas should be one option if you don't mind taking some time to load your data to the dataframe first.

Below is one solution using Pandas, then simply compare the time cost with map solution.

import pandas as pd
from datetime import datetime
data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]*10000 #multiply 10000 to simulate large data, you can test with one bigger number.

#Pandas
df = pd.DataFrame(data=data, columns=['seq', 'datetime', 'name', 'url', 'boolean'])
pandas_beg = datetime.now()
df['seq'] = df['seq'].astype(int)
df['url'] = 'http://' + df['url']
df['boolean'] = df['boolean'] == 'True'
pandas_end = datetime.now()
print('pandas: ', (pandas_end - pandas_beg))

#map
def clean_data(row):
   val, date, name, url, truthy = row
   return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']
map_beg = datetime.now()
result = list(map(clean_data, data))
map_end = datetime.now()
print('map: ', (map_end - map_beg))

Output:

pandas:  0:00:00.016091
map:  0:00:00.036025
[Finished in 0.997s]

1 Comment

Cool. Thank you mate

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.