I'm tracking Maersk cargo vessels and would like to automate the process. So far I can get the data, but cleaning it is the part that is killing me.

I use BeautifulSoup (bs4).

from bs4 import BeautifulSoup
import pandas as pd
import requests
import time

header = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"

# fetches a vessel page and returns the text of every <td> cell
def get_data(url):
    soup = BeautifulSoup(requests.get(url, headers={"User-Agent":header}).text, 'lxml')
    cells = soup.find_all("td")
    return [td.text for td in cells]

# pairs consecutive cells (label, value) into a dictionary that can easily be converted to a pandas dataframe
def Convert(cells):
    it = iter(cells)
    res_dct = dict(zip(it, it))
    return res_dct

# makes it a dataframe with the required columns
def make_df(url):
    todf = Convert(get_data(url))
    df = pd.DataFrame(todf, index=[0])
    # the 'Last report ' key carries a trailing space
    new_df = df[['Flag', 'ETA', 'Course / Speed', 'Last report ']].T
    return new_df

# how I print
urls = {
    "EMMA MAERSK": "https://www.vesselfinder.com/vessels/EMMA-MAERSK-IMO-9321483-MMSI-220417000",
    "MANILA MAERSK": "https://www.vesselfinder.com/vessels/MANILA-MAERSK-IMO-9780469-MMSI-219038000"
    }
for ele, url in urls.items():
    print(ele, make_df(url))

The output is this:


EMMA MAERSK                                       0
Flag                            Denmark
ETA                       Nov 24, 00:01
Course / Speed         232.0° / 11.7 kn
Last report      Nov 22, 2019 08:10 UTC
MANILA MAERSK                                       0
Flag                            Denmark
ETA                       Nov 23, 11:30
Course / Speed         182.4° / 13.4 kn
Last report      Nov 22, 2019 08:31 UTC

That's a nice format, but I'd like to know how to combine this into a single dataframe.

I tried this:

new_df = []
for ele, url in urls.items():
    data = ele, make_df()
    ddf = new_df.append(data)

appended_data = pd.DataFrame(new_df)
appended_data.to_excel('appended.xlsx')

But it doesn't give me the output I want.

I would like the two columns to be side by side instead of one below the other, so that Emma Maersk and Manila Maersk sit next to each other.

Thank you!

2 Answers

Using your own functions:

dictionary_list = []
for ele, url in urls.items():
    values_dict = Convert(get_data(url))
    values_dict["Name"] = ele
    dictionary_list.append(values_dict)

Creating a dataframe from dictionary_list:

pd.DataFrame(dictionary_list)[["Name", "Flag", "ETA", "Course / Speed", "Last report "]]

Returns:

            Name     Flag            ETA    Course / Speed             Last report
0    EMMA MAERSK  Denmark  Nov 24, 00:01  240.5° / 11.9 kn  Nov 22, 2019 08:59 UTC
1  MANILA MAERSK  Denmark  Nov 23, 11:30  179.6° / 14.1 kn  Nov 22, 2019 09:01 UTC

You can then use DataFrame.rename to rename the columns as you wish.
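As a minimal sketch of that renaming step: the sample rows below are copied from the output above rather than scraped, and the new column names (`Vessel`, `Last report` without the trailing space) are only illustrative choices. It assumes `DataFrame.rename` is called with a `columns` mapping.

```python
import pandas as pd

# Sample rows copied from the output above (not fetched live)
dictionary_list = [
    {"Name": "EMMA MAERSK", "Flag": "Denmark", "ETA": "Nov 24, 00:01",
     "Course / Speed": "240.5° / 11.9 kn", "Last report ": "Nov 22, 2019 08:59 UTC"},
    {"Name": "MANILA MAERSK", "Flag": "Denmark", "ETA": "Nov 23, 11:30",
     "Course / Speed": "179.6° / 14.1 kn", "Last report ": "Nov 22, 2019 09:01 UTC"},
]

df = pd.DataFrame(dictionary_list)[["Name", "Flag", "ETA", "Course / Speed", "Last report "]]

# Map old column names to new ones; columns not listed are left unchanged.
# This also drops the trailing space from the scraped "Last report " label.
renamed = df.rename(columns={"Name": "Vessel", "Last report ": "Last report"})
print(renamed.columns.tolist())
```

`rename` returns a new dataframe, so `df` itself keeps the original labels unless the result is assigned back.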


1 Comment

Exactly what I need, with the vessel names as well. Thanks!

Simply collect all your data in one list, then convert it to a dataframe:

from bs4 import BeautifulSoup
import pandas as pd
import requests
import time

header = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"

# fetches a vessel page and returns the text of every <td> cell
def get_data(url):
    soup = BeautifulSoup(requests.get(url, headers={"User-Agent":header}).text, 'lxml')
    cells = soup.find_all("td")
    return [td.text for td in cells]

# pairs consecutive cells (label, value) and appends the required fields to the global data list
def Convert(cells):
    it = iter(cells)
    res_dct = dict(zip(it, it))
    data.append({'flag' : res_dct.get('Flag',''),
    'ETA' : res_dct.get('ETA',''),
    'Course / Speed' : res_dct.get('Course / Speed',''),
    'Last report' : res_dct.get('Last report ','')})

# how I print    
urls = {
    "EMMA MAERSK": "https://www.vesselfinder.com/vessels/EMMA-MAERSK-IMO-9321483-MMSI-220417000",
    "MANILA MAERSK": "https://www.vesselfinder.com/vessels/MANILA-MAERSK-IMO-9780469-MMSI-219038000"
    }
data = []
for ele, url in urls.items():
    Convert(get_data(url))

df = pd.DataFrame(data)

Output:

      flag            ETA    Course / Speed             Last report
0  Denmark  Nov 24, 00:01  241.6° / 12.0 kn  Nov 22, 2019 09:04 UTC
1  Denmark  Nov 23, 11:30  184.8° / 13.9 kn  Nov 22, 2019 09:07 UTC
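
The question asked for the vessels side by side, and this table has one row per vessel with no names. One possible way to get that layout, sketched here under the assumption that the keys of `urls` serve as labels, is to index the rows by vessel name and transpose. The records are copied from the output above instead of being scraped, and `names` and `side_by_side` are illustrative names.

```python
import pandas as pd

# Stands in for list(urls.keys()); same order as the scraped records
names = ["EMMA MAERSK", "MANILA MAERSK"]

# Sample records copied from the output above (not fetched live)
data = [
    {"flag": "Denmark", "ETA": "Nov 24, 00:01",
     "Course / Speed": "241.6° / 12.0 kn", "Last report": "Nov 22, 2019 09:04 UTC"},
    {"flag": "Denmark", "ETA": "Nov 23, 11:30",
     "Course / Speed": "184.8° / 13.9 kn", "Last report": "Nov 22, 2019 09:07 UTC"},
]

# One row per vessel labelled by name; .T swaps rows and columns,
# so each vessel becomes a column and each field a row
side_by_side = pd.DataFrame(data, index=names).T
print(side_by_side)
```

From there, `side_by_side.to_excel('appended.xlsx')` should write this layout to the spreadsheet mentioned in the question, provided an Excel writer such as openpyxl is installed.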
