I'm analysing real-estate sales for some North American cities using k-means clustering. I have seven clusters, and for each observation I have the latitude, longitude, zipcode, and cluster_id. I'd like to plot this on a map to better visualize the clusters. I'm not sure what such a plot is called: a choropleth? A polygon map?

Most of the examples I've found use geoJSON files, but I only have a data.frame object from my k-means clustering.

Actual data:

https://www.kaggle.com/threnjen/portland-housing-prices-sales-jul-2020-jul-2021

Sample data:

> dput(dt[runif(n = 10,min = 1,max = 25000)])
structure(list(id = c(23126L, 15434L, 5035L, 19573L, NA, 24486L, 
NA, 14507L, 3533L, 20192L), zipcode = c(97224L, 97211L, 97221L, 
97027L, NA, 97078L, NA, 97215L, 97124L, 97045L), latitude = c(45.40525436, 
45.55965805, 45.4983139, 45.39398956, NA, 45.47454071, NA, 45.50736618, 
45.52812958, 45.34381485), longitude = c(-122.7599182, -122.6500015, 
-122.7288742, -122.591217, NA, -122.8898392, NA, -122.6084061, 
-122.91745, -122.5948334), lastSoldPrice = c(469900L, 599000L, 
2280000L, 555000L, NA, 370000L, NA, 605000L, 474900L, 300000L
), lotSize = c(5227L, 4791L, 64904L, 9147L, NA, 2178L, NA, 4356L, 
2613L, 6969L), livingArea = c(1832L, 2935L, 5785L, 2812L, NA, 
1667L, NA, 2862L, 1844L, 742L), cluster_id = c(7, 7, 2, 7, NA, 
4, NA, 7, 7, 4)), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7faa8000fee0>)

I've followed the example at https://gist.github.com/josecarlosgonz/8565908 to try to create a geoJSON file so that I can plot this data, but without success.

I'm not using markers because I have ~25,000 observations: plotting them all would be difficult, and the file would take forever to load.

EDIT:

observations by zipcode:

> dput(dat[, .N, by = .(`address/zipcode`)][(order(`address/zipcode`))])
structure(list(`address/zipcode` = c(7123L, 97003L, 97004L, 97005L, 
97006L, 97007L, 97008L, 97009L, 97015L, 97019L, 97023L, 97024L, 
97027L, 97030L, 97034L, 97035L, 97038L, 97045L, 97056L, 97060L, 
97062L, 97068L, 97070L, 97078L, 97080L, 97086L, 97089L, 97113L, 
97123L, 97124L, 97132L, 97140L, 97201L, 97202L, 97203L, 97204L, 
97205L, 97206L, 97209L, 97210L, 97211L, 97212L, 97213L, 97214L, 
97215L, 97216L, 97217L, 97218L, 97219L, 97220L, 97221L, 97222L, 
97223L, 97224L, 97225L, 97227L, 97229L, 97230L, 97231L, 97232L, 
97233L, 97236L, 97239L, 97266L, 97267L), N = c(1L, 352L, 9L, 
252L, 421L, 1077L, 357L, 1L, 31L, 2L, 4L, 159L, 239L, 525L, 640L, 
548L, 1L, 1064L, 5L, 353L, 471L, 736L, 6L, 403L, 866L, 913L, 
8L, 5L, 1113L, 776L, 3L, 543L, 219L, 684L, 463L, 1L, 57L, 809L, 
189L, 216L, 688L, 510L, 504L, 330L, 318L, 177L, 734L, 195L, 832L, 
305L, 276L, 589L, 688L, 716L, 286L, 83L, 1307L, 475L, 77L, 150L, 
382L, 444L, 290L, 423L, 430L)), row.names = c(NA, -65L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7f904781a6e0>)
  • Are your geocoordinates the locations of the houses, or are they already representative of the zipcode? If the latter: how would you like to aggregate the data, since there are possibly multiple observations per zipcode and we already have the cluster variable to be displayed? Commented Feb 5, 2022 at 3:43
  • @DPH They're the locations of the properties. In the entire dataset, 9 observations have more than 1 zipcode (repeat sales, bad data); all others have exactly one. zipcode is not part of the variables used for clustering. For aggregating, I want to use the clusters from the k-means algorithm instead of zipcode. I would, however, want to do a choropleth with zipcodes as well (part of the report). Commented Feb 5, 2022 at 16:56
  • @DPH I've edited the question to add a summary of the no. of observations by zipcode - hope that makes it a bit clearer. Commented Feb 5, 2022 at 17:00

1 Answer

I used the Kaggle data on a modest laptop (8th-gen i3) to build a ggplot2 object, with cluster IDs randomly sampled as placeholders, and converted it with the ggplotly() function. The resulting plotly object seems responsive enough for analysis, but I do not know your performance requirements:

library(dplyr)
library(ggplot2)
library(plotly)
library(rnaturalearth) # here we get the basic map data from

# read the CSV from the zip archive, keep only the columns needed,
# and assign random placeholder cluster IDs (replace with the real k-means result)
df <- readr::read_csv(unzip("path_to_zip/portland_housing.csv.zip")) %>%
    dplyr::select(az = `address/zipcode`, latitude, longitude) %>%
    dplyr::mutate(cluster_id = sample(1:7, n(), replace = TRUE))
# get the map data
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
# build the ggplot2 object (rings as shapes plus the alpha parameter reduce overplotting)
plt <- ggplot2::ggplot(data = world) +
    ggplot2::geom_sf() +
    ggplot2::geom_point(data = df, aes(x = longitude, y = latitude, color = factor(cluster_id)), size = 1, shape = 21, alpha = .7) + 
    ggplot2::coord_sf(xlim = c(-124.5, -122), ylim = c(45, 46), expand = FALSE)
# plot it:
plt

# plotly auto transform from ggplot2 object
plotly::ggplotly(plt)

EDIT

To include a map, you can use, for example, the ggmap package instead of the map data from rnaturalearth. I will only display the plotly result:

library(ggmap)

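# https://stackoverflow.com/questions/23130604/plot-coordinates-on-map
# note: recent ggmap versions need an API key for Google tiles,
# registered beforehand with ggmap::register_google()
sbbox <- ggmap::make_bbox(lon = c(-124.5, -122), lat = c(45, 46), f = .1)
myarea <- ggmap::get_map(location = sbbox, zoom = 10, maptype = "terrain")
myarea <- ggmap::ggmap(myarea)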

plt2 <- myarea +
    ggplot2::geom_point(data = df, mapping = aes(x = longitude, y = latitude, color = factor(cluster_id)), shape = 21, alpha = .7) 

plotly::ggplotly(plt2)
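
If ~25,000 individual points are too heavy for the report, one possible alternative is to draw only an outline per cluster. The following is a minimal sketch, not part of the approach above: it assumes a convex hull is an acceptable outline, and `cluster_hulls()` is a hypothetical helper name. It uses only base R (`grDevices::chull`) and the non-NA rows of the sample data from the question.

```r
# Hypothetical helper: reduce each cluster to the vertices of its convex hull.
# Column names default to those used in the question's data.
cluster_hulls <- function(df, lon = "longitude", lat = "latitude",
                          id = "cluster_id") {
  df <- as.data.frame(df)  # also accepts a data.table
  df <- df[stats::complete.cases(df[, c(lon, lat, id)]), ]
  pieces <- lapply(split(df, df[[id]]), function(g) {
    # chull() returns the row indices of the hull vertices, in order
    g[grDevices::chull(g[[lon]], g[[lat]]), c(id, lon, lat)]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Non-NA rows of the sample data posted in the question
sample_pts <- data.frame(
  latitude = c(45.40525436, 45.55965805, 45.4983139, 45.39398956,
               45.47454071, 45.50736618, 45.52812958, 45.34381485),
  longitude = c(-122.7599182, -122.6500015, -122.7288742, -122.591217,
                -122.8898392, -122.6084061, -122.91745, -122.5948334),
  cluster_id = c(7, 7, 2, 7, 4, 7, 7, 4)
)

hulls <- cluster_hulls(sample_pts)
```

The result could then be layered onto either map with something like `ggplot2::geom_polygon(data = hulls, aes(x = longitude, y = latitude, group = cluster_id, fill = factor(cluster_id)), alpha = .3)`. Note that k-means clusters built on non-spatial variables may overlap heavily in space, in which case hulls will be less informative than points.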

There are many other approaches for the map data, such as using the Mapbox API.
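
Since the question mentions that most examples expect a geoJSON file, here is a minimal sketch of one way to get from a data.frame to GeoJSON. It assumes the sf package is installed (it is already needed for `returnclass = "sf"` above) and that the coordinates are WGS84 longitude/latitude (EPSG:4326); the file name `clusters.geojson` is only an illustration. The three rows are copied from the sample data in the question.

```r
library(sf)

# Three rows copied from the question's sample data
pts <- data.frame(
  latitude = c(45.40525436, 45.55965805, 45.4983139),
  longitude = c(-122.7599182, -122.6500015, -122.7288742),
  zipcode = c(97224L, 97211L, 97221L),
  cluster_id = c(7, 7, 2)
)

# Turn the coordinate columns into point geometries (x = longitude, y = latitude)
pts_sf <- sf::st_as_sf(pts, coords = c("longitude", "latitude"), crs = 4326)

# Write GeoJSON; the driver is inferred from the .geojson extension
out_file <- file.path(tempdir(), "clusters.geojson")
sf::st_write(pts_sf, out_file, delete_dsn = TRUE, quiet = TRUE)
```

The remaining columns (`zipcode`, `cluster_id`) end up as feature properties, so tools that consume GeoJSON can colour the points by cluster.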

1 Comment

I'm able to follow the ggplot & plotly route but I need to plot this on a map for the report.
