I have created a function, imgs_to_df() (which relies on img_to_vec()), that takes a list of URLs pointing to JPGs (e.g. https://live.staticflickr.com/65535/48123413937_54bb53e98b_o.jpg), downloads and resizes each image, and converts the results into a dataframe of RGB values, where each row is a different image and each column is the R, G, or B value of a pixel of the (resized) image.

However, the function is very slow, especially once it gets into lists of hundreds or thousands of links, so I need a way to parallelize or otherwise make the process much, much faster. I'd also like an easy way to match the URLs back with the RGB vectors afterwards. I am very new to parallel processing, and everything I have read so far has only confused me more.

from PIL import Image
from io import BytesIO
import urllib.request
import requests
import numpy as np
import pandas as pd

def img_to_vec(jpg_url, resize=True, new_width=300, new_height=300):
    """ Takes a URL of an image, resizes it (optional), and converts it to a 
        vector representing RGB values.

    Parameters
    ----------
    jpg_url: String. A URL that points to a JPG image.
    resize: Boolean. Default True. Whether image should be resized before calculating RGB.
    new_width: Int. Default 300. New width to convert image to before calculating RGB.
    new_height: Int. Default 300. New height to convert image to before calculating RGB.

    Returns
    -------
    rgb_vec: Vector of size 3*new_width*new_height for the RGB values in each pixel of the image.


    """
    response = requests.get(jpg_url) # Download the raw image bytes
    img = Image.open(BytesIO(response.content)) # Open the downloaded bytes as a PIL image
    if resize:
        img = img.resize((new_width, new_height)) 
    rgb_img = np.array(img) # Create matrix of RGB values
    rgb_vec = rgb_img.ravel() # Flatten 3D matrix of RGB values to a vector
    return rgb_vec   



# Consider parallel processing here
def imgs_to_df(jpg_urls, common_width=300, common_height=300):
    """ Takes a list of jpg_urls and converts it to a dataframe of RGB values.

    Parameters
    ----------
    jpg_urls: A list of jpg_urls to be resized and converted to a dataframe of RGB values.
    common_width: Int. Default 300. New width to convert all images to before calculating RGB.
    common_height: Int. Default 300. New height to convert all images to before calculating RGB.

    Returns
    -------
    rgb_df: Pandas dataframe of dimensions len(jpg_urls) rows and common_width*common_height*3
        columns. Each row is a unique jpeg image, and each column is an R/G/B value of 
        a particular pixel of the resized image


    """
    assert common_width>0 and common_height>0, 'Error: invalid common_width or common_height dimensions'
    vecs = np.empty((0, common_width*common_height*3), dtype=int)  # Empty 2-D array, so a failure on the very first URL is handled too
    for url_idx in range(len(jpg_urls)):
        if url_idx % 100 == 0:
            print('Converting url number {urlnum} of {urltotal} to RGB'.format(urlnum=url_idx, urltotal=len(jpg_urls)))
        try:
            img_i = img_to_vec(jpg_urls[url_idx])
            try:
                vecs = np.vstack((vecs, img_i))
            except ValueError:
                # Vector length did not match (e.g. non-RGB image): insert a placeholder row
                vecs = np.vstack((vecs, np.array([-1]*common_width*common_height*3)))
                print('Warning: Error in converting {error_url} to RGB'.format(error_url=jpg_urls[url_idx]))
        except Exception:
            # Download or decode failed: insert a placeholder row
            vecs = np.vstack((vecs, np.array([-1]*common_width*common_height*3)))
            print('Warning: Error in converting {error_url} to RGB'.format(error_url=jpg_urls[url_idx]))

    rgb_df = pd.DataFrame(vecs)
    return rgb_df
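
For context on the -1 placeholder rows: the np.vstack failures happen, I believe, whenever an image doesn't decode to 3 channels (grayscale or CMYK JPEGs flatten to a different length). A minimal sketch of a guard for that, assuming PIL's convert('RGB') is acceptable for my use case; img_to_vec_rgb and the request timeout are illustrative additions, not part of the code above:

def img_to_vec_rgb(jpg_url, new_width=300, new_height=300):
    """ Same idea as img_to_vec, but forces 3-channel RGB so every vector
        has length 3*new_width*new_height and np.vstack cannot fail on shape.
    """
    response = requests.get(jpg_url, timeout=10)  # timeout added for illustration
    img = Image.open(BytesIO(response.content)).convert('RGB')  # Force 3 channels
    img = img.resize((new_width, new_height))
    return np.asarray(img).ravel()

With every vector guaranteed the same length, the -1 fallback would only be needed for download failures.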


1 Answer

You can use a thread pool, since your task is I/O-bound.

I'm using concurrent.futures. Your function needs to be rewritten so that it takes a single URL and turns it into a dataframe.

I added two snippets: one simply uses a loop, and the other uses threading. The second one is much, much faster.

from PIL import Image
from io import BytesIO
import urllib.request
import requests
import numpy as np
import pandas as pd

def img_to_vec(jpg_url, resize=True, new_width=300, new_height=300):
    """ Takes a URL of an image, resizes it (optional), and converts it to a 
        vector representing RGB values.

    Parameters
    ----------
    jpg_url: String. A URL that points to a JPG image.
    resize: Boolean. Default True. Whether image should be resized before calculating RGB.
    new_width: Int. Default 300. New width to convert image to before calculating RGB.
    new_height: Int. Default 300. New height to convert image to before calculating RGB.

    Returns
    -------
    rgb_vec: Vector of size 3*new_width*new_height for the RGB values in each pixel of the image.


    """
    response = requests.get(jpg_url) # Download the raw image bytes
    img = Image.open(BytesIO(response.content)) # Open the downloaded bytes as a PIL image
    if resize:
        img = img.resize((new_width, new_height)) 
    rgb_img = np.array(img) # Create matrix of RGB values
    rgb_vec = rgb_img.ravel() # Flatten 3D matrix of RGB values to a vector
    return rgb_vec   



# Re-written to handle a single URL so it can be mapped over a thread pool
def imgs_to_df(jpg_url, common_width=300, common_height=300):

    assert common_width>0 and common_height>0, 'Error: invalid common_width or common_height dimensions'

    try:
        img_i = img_to_vec(jpg_url)
        # Stacking the vector with itself produces two identical rows per image,
        # which is why each dataframe in the output below is 2 x 270000
        vecs = np.vstack((img_i, img_i))
    except Exception:
        # Download or decode failed: return a single placeholder row of -1s
        print('Warning: Error in converting {error_url} to RGB'.format(error_url=jpg_url))
        vecs = np.full((1, common_width*common_height*3), -1)

    rgb_df = pd.DataFrame(vecs)
    return rgb_df

img_urls = ['https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Flower_poster_2.jpg/1200px-Flower_poster_2.jpg', 'https://www.tiltedtulipflorist.com/assets/1/14/DimFeatured/159229xL_HR_fd_3_6_17.jpg?114702&value=217',
            'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Flower_poster_2.jpg/1200px-Flower_poster_2.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Flower_poster_2.jpg/1200px-Flower_poster_2.jpg']

import time
t1 = time.time()
dfs = []
for iu in img_urls:
    df = imgs_to_df(iu)
    dfs.append(df)
t2 = time.time()
print(t2-t1)
print(dfs)

# Approach with multi-threading

import concurrent.futures

t1 = time.time()
with concurrent.futures.ThreadPoolExecutor() as executor:
    dfs = list(executor.map(imgs_to_df, img_urls))

t2 = time.time()
print(t2-t1)
print(dfs)

Out:

3.540484666824341
[   0       1       2       3       ...  269996  269997  269998  269999
0     240     240     237     251  ...     247     243     243     243
1     240     240     237     251  ...     247     243     243     243

[2 rows x 270000 columns],    0       1       2       3       ...  269996  269997  269998  269999
0     255     255     255     255  ...      93     155     119      97
1     255     255     255     255  ...      93     155     119      97

[2 rows x 270000 columns],    0       1       2       3       ...  269996  269997  269998  269999
0     240     240     237     251  ...     247     243     243     243
1     240     240     237     251  ...     247     243     243     243

[2 rows x 270000 columns],    0       1       2       3       ...  269996  269997  269998  269999
0     240     240     237     251  ...     247     243     243     243
1     240     240     237     251  ...     247     243     243     243

[2 rows x 270000 columns]]
1.2170848846435547
[   0       1       2       3       ...  269996  269997  269998  269999
0     240     240     237     251  ...     247     243     243     243
1     240     240     237     251  ...     247     243     243     243

[2 rows x 270000 columns],    0       1       2       3       ...  269996  269997  269998  269999
0     255     255     255     255  ...      93     155     119      97
1     255     255     255     255  ...      93     155     119      97

[2 rows x 270000 columns],    0       1       2       3       ...  269996  269997  269998  269999
0     240     240     237     251  ...     247     243     243     243
1     240     240     237     251  ...     247     243     243     243

[2 rows x 270000 columns],    0       1       2       3       ...  269996  269997  269998  269999
0     240     240     237     251  ...     247     243     243     243
1     240     240     237     251  ...     247     243     243     243

[2 rows x 270000 columns]]
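
On the other part of the question, matching URLs back to the RGB vectors: executor.map returns results in the same order as the input iterable, so the URL list and the result list line up by position. A minimal sketch building on the snippet above; the max_workers bound and the use of URLs as a concat key are my choices, not something the approach requires:

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:  # max_workers value is illustrative
    dfs = list(executor.map(imgs_to_df, img_urls))  # Result order matches img_urls

# Attach each URL to its rows as the outer level of a hierarchical index
rgb_df = pd.concat(dfs, keys=img_urls, names=['url', 'row'])
print(rgb_df.index.get_level_values('url')[:4])

A plain dict(zip(img_urls, dfs)) also works, but the keys= form tolerates duplicate URLs like the repeated Wikipedia link in the example.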

2 Comments

Could you explain WHY you think a ThreadPool-based execution will be beneficial, given that Python (as-is in 2020/2Q) still uses the central GIL lock to re-[SERIAL]-ise the flow of code execution (across all such threads), so that they finally appear as a pure-[SERIAL] flow, paying all the costs of maintaining the GIL-based switching yet gaining zero performance benefit from doing so? Latency masking may be beneficial, but only for systems that permit the O/S scheduler to run multiple threads on multi-core silicon. Not Python, which avoids any & all [CONCURRENT] events by the GIL lock.
I assume OP didn't actually ask why something works, so I didn't get into any explanation. As a rule of thumb, I know that if a process is I/O-bound, running it asynchronously will result in a smaller execution time in most cases. Yes, there are cases where multi-threading doesn't result in better speed, but in OP's example it should improve the speed. Can you educate me on why you think multi-threading will not work here? From what I know it should work, and from some benchmarks that I ran, it does!
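
A note for readers following this exchange: CPython's GIL serializes bytecode execution, but blocking I/O calls release it while they wait (the socket reads inside requests.get do this), which is the latency masking mentioned above. A minimal illustrative sketch, using time.sleep as a stand-in for a network wait, that can be run to see the effect:

import concurrent.futures
import time

def fake_download(_):
    time.sleep(1)  # Like a blocking socket read, sleep releases the GIL while waiting

t1 = time.time()
for i in range(8):
    fake_download(i)
print('serial:   %.1fs' % (time.time() - t1))  # roughly 8s

t1 = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(fake_download, range(8)))
print('threaded: %.1fs' % (time.time() - t1))  # roughly 1s: the waits overlap despite the GIL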
