I have a data set of the form shown in the minimum working example below, and I would like to change it in R using SQL as described in the list below.

I know that I could do this fairly simply with dplyr, but the point here is to learn to use SQL to create and manipulate a small relational database.

  • Price needs to be turned into a numeric value by removing the "R" and the spaces in between.

  • coordinates needs to be split into two numeric columns, Lat and Long.

  • floor size needs to be converted from a string to a numeric value by removing the space and the "m²" at the end.

Minimum working example

# Data to copy into sheet

       Price                            coordinates floor.size surburb       date
 R 1 750 000 -33.93082074573843, 18.857342125467635      68 m²     Jhb 2021-06-24
 R 1 250 000 -33.930077157927855, 18.85420954236195      56 m²     Jhb 2021-06-17
 R 2 520 000 -33.92954929205658, 18.857504799977896      62 m²     Jhb 2021-06-24

Code to manipulate in R markdown

```{r}
#install.packages("RSQLite", repos = "http://cran.us.r-project.org")

library(readxl)
library(dplyr)
library(RSQLite)
library(DBI)
library(knitr)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(connection = "db")


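# Import data and copy it into the in-memory database so SQL chunks can query it
dataH <- read_excel("C:/Users/Dell/Desktop/exampledata.xlsx")
dbWriteTable(db, "dataH", dataH)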

``` 

```{sql, connection = db}
-- SQL code passed directly
```

Edit 1:

The answer by @Onyambu almost works, but it produces an error with the coordinates. For example, the last two coordinates in my result are supposed to have a Long that starts with '18.85', but instead it starts with '.85' when the original coordinate was "-33.930989501123, 18.857270308516927". How would I fix this?

  • Have you looked at sqldf? It allows SQL queries against a data.frame(s). Commented Jun 27, 2021 at 19:04
  • Please post your data using dput(x), since as-is we cannot just copy and paste without a bit of manual extract. (The embedded spaces make it so that read.table and family cannot just parse it.) Commented Jun 27, 2021 at 19:06
  • Also, you need to state specifically which database engine you are using. MySQL, SQLite, PostgreSQL, etc. all implement different functions. Commented Jun 27, 2021 at 19:58

2 Answers

Using basic SQL functions, you could do:

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,connection = "db")
```

```{r}
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

txt <- "Price coordinates floor.size surburb date\n
     'R 1 750 000' '-33.93082074573843, 18.857342125467635' '68 m²' Jhb 2021-06-24\n
     'R 1 250 000' '-33.930077157927855, 18.85420954236195' '56 m²' Jhb 2021-06-17\n
     'R 2 520 000' '-33.92954929205658, 18.857504799977896' '62 m²' Jhb 2021-06-24"

dataH <- read.table(text = txt, header = TRUE) 
DBI::dbWriteTable(db, 'dataH', dataH)
```


```{sql}
SELECT REPLACE(SUBSTRING(price, 3, 100), ' ', '') price,
       replace(SUBSTRING(coordinates, 1, 20), ',', '') Lat,
       SUBSTRING(coordinates, 21, 255) Long,
       SUBSTRING(`floor.size`, 1, 2) floor_size,
       surburb,
       date
FROM dataH
```
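
One caveat with the fixed offsets above: they only give the right split when every latitude string has the same number of characters. As a rough check (a sketch only, assuming a throwaway table that I have called `coords`, holding the three sample coordinates plus the shorter one from the question's edit), SQLite's built-in `instr()` shows where the comma falls in each row:

```r
library(DBI)

# Throwaway in-memory database; "coords" is a hypothetical table name
db <- dbConnect(RSQLite::SQLite(), ":memory:")
coords <- data.frame(
  coordinates = c("-33.93082074573843, 18.857342125467635",
                  "-33.930077157927855, 18.85420954236195",
                  "-33.92954929205658, 18.857504799977896",
                  "-33.930989501123, 18.857270308516927"),  # from the question's edit
  stringsAsFactors = FALSE
)
dbWriteTable(db, "coords", coords)

# instr() returns the 1-based position of the first comma in each string
pos <- dbGetQuery(db, "
  SELECT coordinates, instr(coordinates, ',') AS comma_pos
  FROM coords
")
dbDisconnect(db)

pos$comma_pos
```

With this data the comma sits at positions 19, 20, 19 and 17, so a cut at character 20 or 21 lands in a different place in each row. That would explain a Long beginning with '.85' for the shortest latitude.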

11 Comments

For last chunk, use: {sql, connection = db}
Hi @Onyambu, there is something that is giving issues with the coordinates. I added it into the question as an edit.
Perhaps @Parfait might know how to fix the issue with the coordinates transformation?
Could it be that the decimals are not all the same length, @Onyambu?
@Parfait I placed connection = db within the setup chunk, so it is not really necessary to include it again, but there is no problem if it is included.

You can use charindex and substr to do what you need. I'll demo with sqldf, which uses SQLite's engine under the hood. (This query is very similar to Onyambu's, but it locates the comma in each string instead of assuming a fixed width, which solves the issue with text selection.)

dat <- structure(list(
  Price = c("R 1 750 000", "R 1 250 000", "R 2 520 000"),
  coordinates = c("-33.93082074573843, 18.857342125467635",
                  "-33.930077157927855, 18.85420954236195",
                  "-33.92954929205658, 18.857504799977896"),
  floor.size = c("68 m²", "56 m²", "62 m²"),
  surburb = c("Jhb", "Jhb", "Jhb"),
  date = c("2021-06-24", "2021-06-17", "2021-06-24")
), class = "data.frame", row.names = c(NA, -3L))

out <- sqldf::sqldf(
  "select cast(replace(substr(price,2,99),' ','') as real) as price,
          cast(substr(coordinates,1,charindex(',',coordinates)-1) as real) as lat,
          cast(substr(coordinates,charindex(',',coordinates)+1,99) as real) as long,
          cast(substr([floor.size],1,charindex('m',[floor.size])-1) as real) as [floor.size]
   from dat", method = "raw")

out
#     price       lat     long floor.size
# 1 1750000 -33.93082 18.85734         68
# 2 1250000 -33.93008 18.85421         56
# 3 2520000 -33.92955 18.85750         62

str(out)
# 'data.frame': 3 obs. of  4 variables:
#  $ price     : num  1750000 1250000 2520000
#  $ lat       : num  -33.9 -33.9 -33.9
#  $ long      : num  18.9 18.9 18.9
#  $ floor.size: num  68 56 62

(The number of digits shown when printing out is controlled by R's "digits" option; the columns are of class numeric, as the str output shows.)
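
To illustrate the point about printing, here is a small sketch (the variable `x` is just one of the sample longitudes, picked for illustration) showing that only the display is shortened, not the stored value:

```r
# One longitude from the sample data, stored as a double
x <- 18.857342125467635

# Default display uses 7 significant digits
short <- format(x, digits = 7)

# Asking for more digits reveals that the value itself was never truncated
long <- format(x, digits = 15)

short
long
```

The same idea applies to a whole data frame, for example `print(out, digits = 15)`.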

You can shorten that and remove all cast(.. as ..) if you change to sqldf(.., method="numeric").

out <- sqldf::sqldf(
  "select replace(substr(price,2,99),' ','') as price,
          substr(coordinates,1,charindex(',',coordinates)-1) as lat,
          substr(coordinates,charindex(',',coordinates)+1,99) as long,
          substr([floor.size],1,charindex('m',[floor.size])-1) as [floor.size]
   from dat", method = "numeric")
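
If you would rather stay with the DBI/RSQLite setup from the question instead of sqldf, a similar query can be written with SQLite's built-in `instr()` in place of `charindex` (which, as far as I know, comes from the extension functions that sqldf loads rather than from core SQLite). This is only a sketch under those assumptions: `floor_size` is my own column alias, and the third row uses the shorter coordinate from the question's edit to show that the split no longer depends on string length.

```r
library(DBI)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

dataH <- data.frame(
  Price = c("R 1 750 000", "R 1 250 000", "R 2 520 000"),
  coordinates = c("-33.93082074573843, 18.857342125467635",
                  "-33.930077157927855, 18.85420954236195",
                  "-33.930989501123, 18.857270308516927"),  # from the question's edit
  floor.size = c("68 m²", "56 m²", "62 m²"),
  stringsAsFactors = FALSE
)
dbWriteTable(db, "dataH", dataH)

# Split on the comma position rather than on a fixed character offset
out <- dbGetQuery(db, "
  SELECT CAST(REPLACE(SUBSTR(Price, 2), ' ', '') AS REAL) AS price,
         CAST(SUBSTR(coordinates, 1, INSTR(coordinates, ',') - 1) AS REAL) AS lat,
         CAST(TRIM(SUBSTR(coordinates, INSTR(coordinates, ',') + 1)) AS REAL) AS long,
         CAST(SUBSTR(`floor.size`, 1, INSTR(`floor.size`, ' ') - 1) AS REAL) AS floor_size
  FROM dataH
")
dbDisconnect(db)

str(out)
```

In an R Markdown document the same SELECT statement could go into a `{sql, connection = db}` chunk, provided `dataH` has been written to `db` in an earlier R chunk.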
