The code uses OCR to read text from the image URLs in the list 'url_list'. I am trying to append each result, the string 'txt', to an empty pandas column 'url_text'. However, the code does not append anything to the column 'url_text'.

df = pd.read_csv(r'path') # main dataframe

df['url_text'] = "" # create empty column that will later contain the text of the url_image
url_list = (df.iloc[:, 5]).tolist() # convert column with urls to a list 

print(url_list)

['https://pbs.twimg.com/media/ExwMPFDUYAEHKn0.jpg', 
'https://pbs.twimg.com/media/ExuBd4-WQAMgTTR.jpg', 
'https://pbs.twimg.com/media/ExuBd5BXMAU2-p_.jpg', 
' ',
'https://pbs.twimg.com/media/Ext0Np0WYAEUBXy.jpg', 
'https://pbs.twimg.com/media/ExsJrOtWUAMgVxk.jpg', 
'https://pbs.twimg.com/media/ExrGetoWUAEhOt0.jpg',
' ',
' ']
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format

        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

        txt = pytesseract.image_to_string(thr)  # read tweet image text

        df['url_text'].append(txt)

        print(txt)
    except: # ignore any errors. Some of the rows does not contain a URL causing the loop to fail
        pass

print(df)
  • You might have to adjust the Tesseract settings to pick up the text. Have you tried printing txt to see whether it contains anything? Commented Dec 30, 2021 at 16:48
  • Yes. If I comment out df['url_text'].append(txt), each txt is printed to the console one by one. However, when I add df['url_text'].append(txt) back, I cannot see txt in the console. The txt object is a string. Commented Dec 30, 2021 at 16:55

2 Answers

I couldn't test this, but please try it. You may need to build the list first and then add it to the df as a new column (here I convert the list to a DataFrame and concatenate it with the original df).

txtlst=[]
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format

        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

        txt = pytesseract.image_to_string(thr)  # read tweet image text
        txtlst.append(txt)


        print(txt)
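    except Exception: # some rows do not contain a URL, which makes the read fail
        txtlst.append("") # keep an empty entry so the list stays aligned with url_list
# if df already has a 'url_text' column, drop it first; concat would add a second column with the same name
dftxt = pd.DataFrame({"url_text": txtlst})
df = pd.concat([df, dftxt], axis=1)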
print(df)
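
The alignment this answer relies on can be sketched in plain Python: every URL yields exactly one list entry, with an empty string when the cell is blank or the read fails. Here read_text is a hypothetical stand-in for the download-and-OCR step, and fake_ocr exists only so the sketch runs without network access or Tesseract; neither name comes from the original code.

```python
def texts_for_urls(urls, read_text):
    """Return one text per URL, in order, using "" for rows that fail."""
    texts = []
    for url in urls:
        if not url.strip():       # blank cell: nothing to read
            texts.append("")
            continue
        try:
            texts.append(read_text(url))
        except Exception:         # failed read: keep the row, leave it empty
            texts.append("")
    return texts


def fake_ocr(url):
    """Hypothetical OCR stub; fails for one URL to exercise the fallback."""
    if "broken" in url:
        raise ValueError("unreadable image")
    return f"text from {url}"


urls = ["https://example.com/a.jpg", " ", "https://example.com/broken.jpg"]
texts = texts_for_urls(urls, fake_ocr)
print(texts)
```

Because the result always has the same length as the input, it can be attached to the dataframe row for row.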
4 Comments

Thank you! Your suggestion does append the 'txt' output to the 'url_text' column. However, since the url_list does contain some empty elements (as noted above) I end up with a mismatch between the 'image_url' column and the new 'url_text' column. I.e. there is text output in a number of rows in the 'url_text' column that do not have any URL in the corresponding row on the 'image_url' column and vice versa. Hope it makes sense.
That's why I also append an empty string to the list inside the except block. Have you included that line as well?
Thank you! I overlooked that line. My mistake. The code works perfectly now. The only thing was a missing ']' after 'dftxt' in the second to last line.
Thank you for pointing that out. Added the missing "]".
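As noted in the documentation for Series.append(), the call concatenates Series objects (it expects a Series or a list of Series) and returns a new Series; it does not modify the column in place. Passing a plain string such as txt is expected to raise an error, which the bare except in the question silently discards. Series.append() was also deprecated in pandas 1.4 and removed in pandas 2.0.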
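
One possible explanation for the silent failure, sketched without pandas or OCR: if the append call raises inside the try block, a bare except discards the error and the print on the following line never runs. The names failing_append and process are hypothetical stand-ins, not part of the original code.

```python
def failing_append(text):
    """Hypothetical stand-in for a call that rejects a plain string."""
    raise TypeError(f"cannot append {type(text).__name__}")


def process(items):
    """Run the failing call on each item and record what happened."""
    printed = []   # what a print() after the call would have shown
    errors = []    # errors that a bare `except: pass` would have hidden
    for item in items:
        try:
            failing_append(item)
            printed.append(item)  # never reached: the call above raises
        except TypeError as exc:  # catch narrowly and keep the message
            errors.append(str(exc))
    return printed, errors


printed, errors = process(["first tweet", "second tweet"])
print(printed)
print(errors)
```

Catching a specific exception and logging its message, rather than using except followed by pass, makes this kind of failure visible.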

A better approach is to create an empty list outside the loop, append each string to it inside the loop, and then assign the list to the column with df["url_text"] = url_texts. This is also much faster at runtime than repeatedly appending two Series together.

url_texts = []  # one entry per URL, in the same order as url_list

for ...:
    ...
    url_texts.append(txt)  # txt is the OCR result for the current URL

df["url_text"] = url_texts  # the list must have one entry per row
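
As a minimal sketch of the final assignment, assuming pandas is installed: a list with one entry per row can be assigned directly as a column. The column name image_url follows the comments above; the sample URLs and texts are made up for illustration.

```python
import pandas as pd

# Made-up sample data: two rows with URLs and one blank row.
df = pd.DataFrame({"image_url": ["https://example.com/a.jpg", " ",
                                 "https://example.com/b.jpg"]})

# One text per row, in row order ("" where there was nothing to read).
url_texts = ["first tweet", "", "second tweet"]

# The list length must equal the number of rows, otherwise pandas raises.
df["url_text"] = url_texts
print(df)
```

Assigning by column name also overwrites an existing url_text column instead of adding a second one with the same name.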
