48

The goal is to convert a nested list, which sometimes contains missing records, into a data frame. An example of the structure when there are missing records is:

mylist <- list(
  list(
    Hit = "True",
    Project = "Blue",
    Year = "2011",
    Rating = "4",
    Launch = "26 Jan 2012",
    ID = "19",
    Dept = "1, 2, 4"
  ),
  list(Hit = "False", Error = "Record not found"),
  list(
    Hit = "True",
    Project = "Green",
    Year = "2004",
    Rating = "8",
    Launch = "29 Feb 2004",
    ID = "183",
    Dept = "6, 8"
  )
)

When there are no missing records, the list can be converted into a data frame using data.frame(do.call(rbind.data.frame, mylist)). However, when records are missing, this results in a column mismatch. I know there are functions to merge data frames with non-matching columns, but I have yet to find one that can be applied to lists. The ideal outcome would keep record 2 with NA for all variables. Hoping for some help.

7 Answers

45

You can also use rbindlist from the data.table package (v1.9.3 or later):

library(data.table)

rbindlist(mylist, fill=TRUE)

##      Hit Project Year Rating      Launch  ID    Dept            Error
## 1:  True    Blue 2011      4 26 Jan 2012  19 1, 2, 4               NA
## 2: False      NA   NA     NA          NA  NA      NA Record not found
## 3:  True   Green 2004      8 29 Feb 2004 183    6, 8               NA

4 Comments

1.9.4 is now available on CRAN (although it may take a day more for remaining binaries to be available).
@hrbrmstr are you aware of a workaround that permits a non-uniform list structure? I'm running into the error "rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table".
I get this error: Error in data.table::rbindlist(mylist, fill = TRUE) : Column 3 of item 1 is length 2 inconsistent with column 5 which is length 3. Only length-1 columns are recycled.
How about nested lists?
21

You could create a list of data.frames:

dfs <- lapply(mylist, data.frame, stringsAsFactors = FALSE)

Then use one of these:

library(plyr)
rbind.fill(dfs)

or the faster

library(dplyr)
bind_rows(dfs) # in earlier versions: rbind_all(dfs)

In the case of dplyr::bind_rows, I am surprised that it chooses to use "" instead of NA for missing data. If you remove stringsAsFactors = FALSE, you will get NA but at the cost of a warning... So suppressWarnings(rbind_all(lapply(mylist, data.frame))) would be an ugly but fast solution.
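If neither plyr nor dplyr is available, the same idea (one row per record, missing fields filled with NA) can be sketched in base R. This is only an illustration under the assumption that every field has length 1, as in the question's data; the helper name pad_record is made up for this example.

```r
# Example input from the question: record 2 is a "not found" stub.
mylist <- list(
  list(Hit = "True", Project = "Blue", Year = "2011", Rating = "4",
       Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"),
  list(Hit = "False", Error = "Record not found"),
  list(Hit = "True", Project = "Green", Year = "2004", Rating = "8",
       Launch = "29 Feb 2004", ID = "183", Dept = "6, 8")
)

# Union of all field names, in order of first appearance.
all_names <- unique(unlist(lapply(mylist, names)))

# pad_record() is a hypothetical helper: it returns a one-row data frame
# with every column in all_names, using NA where the record lacks a field.
# Assumption: each field is a length-1 character value.
pad_record <- function(rec) {
  row <- setNames(as.list(rep(NA_character_, length(all_names))), all_names)
  row[names(rec)] <- rec
  as.data.frame(row, stringsAsFactors = FALSE)
}

# Every padded row has identical columns, so plain rbind works.
df <- do.call(rbind, lapply(mylist, pad_record))
```

Because every row is padded to the same set of columns first, the column mismatch that breaks rbind.data.frame does not arise.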

3 Comments

rbind_all() is deprecated. Please use bind_rows() instead.
What if in some of the rows, there is missing data for some of the columns? Just empty in the database (no NA or NULL)
I get this error: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0
12

I just developed a solution for this question that is applicable here, so I'll provide it here as well:

tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));

extractLevelColumns <- function(
    nodes, ## current level node selection
    ..., ## additional arguments to data.frame()
    keyList=list(), ## current key path under main list
    sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
    mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
    cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
    if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
    tlList <- lapply(nodes,tl);
    typeList <- do.call(c,lapply(tlList,`[[`,'type'));
    if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
    type <- typeList[1L];
    if (type == 'namedlist') { ## hash; recurse
        allKeys <- unique(do.call(c,lapply(nodes,names)));
        ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
    } else if (type == 'list') { ## array; recurse
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=TRUE);
        allIndexes <- seq_len(maxLen);
        ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to translate out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
    } else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=TRUE);
        if (is.null(sep)) {
            ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
        } else {
            ## keep original type if maxLen is 1, IOW don't stringify
            ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
        }; ## end if
    } else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
    if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
    ret;
}; ## end extractLevelColumns()
## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));

Execution:

## define data
mylist <- list(structure(list(Hit='True',Project='Blue',Year='2011',Rating='4',Launch='26 Jan 2012',ID='19',Dept='1, 2, 4'),.Names=c('Hit','Project','Year','Rating','Launch','ID','Dept')),structure(list(Hit='False',Error='Record not found'),.Names=c('Hit','Error')),structure(list(Hit='True',Project='Green',Year='2004',Rating='8',Launch='29 Feb 2004',ID='183',Dept='6, 8'),.Names=c('Hit','Project','Year','Rating','Launch','ID','Dept')));

## run it
df <- flattenList(mylist);
## extractLevelColumns():
## extractLevelColumns(): Hit
## extractLevelColumns(): Project
## extractLevelColumns(): Year
## extractLevelColumns(): Rating
## extractLevelColumns(): Launch
## extractLevelColumns(): ID
## extractLevelColumns(): Dept
## extractLevelColumns(): Error

df;
##     Hit Project Year Rating      Launch   ID    Dept            Error
## 1  True    Blue 2011      4 26 Jan 2012   19 1, 2, 4             <NA>
## 2 False    <NA> <NA>   <NA>        <NA> <NA>    <NA> Record not found
## 3  True   Green 2004      8 29 Feb 2004  183    6, 8             <NA>

My function is more powerful than data.table::rbindlist() as of 1.9.6, in that it can handle any number of nesting levels and different vector lengths across branches. In the linked question, my function correctly flattens the OP's list to a data.frame, but data.table::rbindlist() fails with "Error in rbindlist(jsonRList, fill = T) : Column 4 of item 16 is length 2, inconsistent with first column of that item which is length 1. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table".
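The length mismatch behind that error can be reproduced on a small made-up input. The following base R sketch is not the function above; it only illustrates the idea behind its sep argument, namely joining multi-element vectors into one string so that every record becomes a uniform one-row list. The records and the helper name collapse_fields are assumptions for this example.

```r
# Made-up records: Dept has length 3 in the first and length 1 in the second.
recs <- list(
  list(ID = "19", Dept = c("1", "2", "4")),
  list(ID = "183", Dept = "6")
)

# collapse_fields() is a hypothetical helper that joins the elements of each
# field into a single string, so every field ends up with length 1.
# Note that paste() converts every field to character.
collapse_fields <- function(rec, sep = ", ") {
  lapply(rec, function(v) paste(v, collapse = sep))
}

flat <- lapply(recs, collapse_fields)

# Each flattened record is now a uniform one-row list and can be stacked.
df <- do.call(rbind, lapply(flat, as.data.frame, stringsAsFactors = FALSE))
```

The trade-off is that numeric fields are stringified, which the function above avoids when the maximum length at a node is 1.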

6 Comments

Wow, finally I found a solution to flatten the type of list I'm facing. Thank you.
tried this on a complicated list and got: Error in extractLevelColumns(lapply(nodes, function(node) if (length(node) < : error: inconsistent types () at /V1/2.
@GabrielFair (and @dca) If you post a link to your list (e.g. on GitHub) I might be able to debug and improve my code to handle your list, or at least improve the error message to make it more descriptive/clearer.
Thank you. Sorry, I should have been clearer: I'm getting the same error you are getting at the bottom of your post, where you say you can't flatten the OP's list. I'll create a new SO question if I can't get this working on my own. Thanks again.
I also get an error: Error in extractLevelColumns(lapply(nodes, `[[`, key), ..., keyList = c(keyList, : error: inconsistent types
5

Here's a solution that converts any nested/uneven list to a data frame. rbindlist doesn't work in many cases, especially for lists of lists, so I had to create something better than rbindlist.

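library(foreach)  # foreach() and %do%
library(dplyr)    # bind_rows()

rbindlist.v2 <- function(l)
{
   # keep only the elements whose class is "list"
   l <- l[lapply(l, class) == "list"]
   # turn each element into a one-row data frame and stack the rows;
   # bind_rows fills columns missing from a row with NA, and
   # .errorhandling = 'remove' drops elements that raise an error
   foreach(element = l, .combine = bind_rows, .errorhandling = 'remove') %do%
         as.data.frame(t(unlist(element)))
}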

For large lists you can speed up the process by replacing %do% with %dopar%, which requires a registered parallel backend (for example via the doParallel package). That was also something I needed for my case.
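The per-element step (unlist, transpose, convert to a data frame) can be looked at in isolation with base R only. This is a small sketch on a made-up element, not part of the function above; it shows how nested names turn into dotted column names.

```r
# Made-up element with one nested sublist.
element <- list(Hit = "True", Info = list(Project = "Blue", Year = "2011"))

# unlist() flattens all levels into a named character vector; names of
# nested entries are joined with a dot (Info.Project, Info.Year).
flat <- unlist(element)

# t() turns the named vector into a 1 x n matrix whose column names are the
# vector names; as.data.frame() then yields a one-row data frame.
row <- as.data.frame(t(flat), stringsAsFactors = FALSE)
```

Since unlist() creates one entry per leaf value, an element with many leaves yields a very wide row, and all values are coerced to a common type.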

3 Comments

This is a really cool function. Could you please kindly explain how the function works? I tried to understand how it works, but it is not exactly straightforward for me.
Where is the %do% function defined?
The %do%-operator seems to come from the library foreach. Anyway, the code doesn't work in my case -- it generates one row with > 489.000 columns.
2

And if you like purrr:

> te <- list(structure(list(Hit = "True", Project = "Blue", Year = "2011", 
Rating = "4", Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"), .Names = c("Hit", "Project", "Year", "Rating", "Launch", "ID", "Dept")), structure(list(
Hit = "False", Error = "Record not found"), .Names = c("Hit", 
"Error")), structure(list(Hit = "True", Project = "Green", Year = "2004", 
Rating = "8", Launch = "29 Feb 2004", ID = "183", Dept = "6, 8"), .Names = c("Hit", "Project", "Year", "Rating", "Launch", "ID", "Dept")))

> str(te)
List of 3
 $ :List of 7
  ..$ Hit    : chr "True"
  ..$ Project: chr "Blue"
  ..$ Year   : chr "2011"
  ..$ Rating : chr "4"
  ..$ Launch : chr "26 Jan 2012"
  ..$ ID     : chr "19"
  ..$ Dept   : chr "1, 2, 4"
 $ :List of 2
  ..$ Hit  : chr "False"
  ..$ Error: chr "Record not found"
 $ :List of 7
  ..$ Hit    : chr "True"
  ..$ Project: chr "Green"
  ..$ Year   : chr "2004"
  ..$ Rating : chr "8"
  ..$ Launch : chr "29 Feb 2004"
  ..$ ID     : chr "183"
  ..$ Dept   : chr "6, 8"
> purrr::map_dfr(te, tibble::as_tibble)
# A tibble: 3 × 8
  Hit   Project Year  Rating Launch      ID    Dept    Error           
  <chr> <chr>   <chr> <chr>  <chr>       <chr> <chr>   <chr>           
1 True  Blue    2011  4      26 Jan 2012 19    1, 2, 4 NA              
2 False NA      NA    NA     NA          NA    NA      Record not found
3 True  Green   2004  8      29 Feb 2004 183   6, 8    NA
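
Note that map_dfr() is marked as superseded as of purrr 1.0.0. A minimal sketch of the replacement pattern, assuming purrr 1.0.0 or later and the tibble package are installed, combines map() with list_rbind(); the shortened input below is made up for illustration.

```r
# Shortened, made-up version of the question's list.
te <- list(
  list(Hit = "True", Project = "Blue"),
  list(Hit = "False", Error = "Record not found")
)

# Convert each record to a one-row tibble, then stack the rows.
# list_rbind() fills columns that a row lacks with NA.
rows <- purrr::map(te, tibble::as_tibble)
res <- purrr::list_rbind(rows)
```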

   

1

An alternative to @ishonest's answer:

df <- purrr::map_dfr(l,function(y){
  y[[1]]
})

Here are some other methods, depending on how the lists are nested.

For a list of lists:

df <- purrr::map_dfr(r,function(x){
    unlist(x)
})

If the nested list is more complex, where some elements are themselves lists:

format_json_list <- function(r){
  purrr::map_dfr(r,function(x){
    #Base object, e.g. Vessel info
    b <- x[[1]]
    # object's events, e.g. places Vessel visited
    df <- purrr::map_dfr(x[[2]],function(y){
        v <- y[[1]]
        p <- y[2:length(y)]
        dplyr::bind_cols(v,p)
    })
    dplyr::bind_cols(b,df)
  })
}

For some complex JSON, duplicate variable names can be an issue. One fix is to specify the names explicitly. The code below hard-codes the names; I reckon this could be made dynamic.

library(dplyr)  # provides %>% used below

purrr::map_dfr(vo, function(vessels){
      if(is.list(vessels)){
        vinfo <- purrr::map_dfr(vessels, function(vessel){
          if(!is.list(vessel)){
            #print(vessel)
            vessel
          }
        }) %>%  dplyr::rename_all(~ paste0("Vessel.", .))
        calinfo <- purrr::map_dfr(vessels$Callings, function(calling){
          if(is.list(calling)){
            call <- purrr::map_dfr(calling, function(call){
              if(!is.list(call)){
                call
              }
            })
            callport <- purrr::map_dfr(calling$Port, function(port){
              if(!is.list(port)){
                port
              }
            }) %>% dplyr::rename_all(~ paste0("Port.", .))
            dplyr::bind_cols(call, callport)
          }
        }) %>%  dplyr::rename_all(~ paste0("Calling.", .))
        dplyr::bind_cols(vinfo, calinfo)
      } 
    }, .id ="Vessel" ) 

1

These days we have

> list2DF(collapse::rowbind(mylist, fill = TRUE))
    Hit Project Year Rating      Launch   ID    Dept            Error
1  True    Blue 2011      4 26 Jan 2012   19 1, 2, 4             <NA>
2 False    <NA> <NA>   <NA>        <NA> <NA>    <NA> Record not found
3  True   Green 2004      8 29 Feb 2004  183    6, 8             <NA>
