How to scan multiple CSV files (some) with missing columns using Polars?

Question

I’m trying to use the Polars library to scan multiple CSV files and select a set of columns from each file. However, some of the CSV files are missing some of the columns I want to select. Is there a way to handle this case and fill the missing columns with None values or some other default value?

queries = pl.LazyFrame()

for file in glob.glob("*.csv"):
    q = pl.scan_csv(file, ignore_errors=True).select(
        ['Date', 'ID', 'colA', 'Column A', 'columnA']
    )
    queries = pl.concat([queries, q], how="diagonal")

dataframes = pl.collect_all(queries)

jqurious · Accepted Answer · 2023-07-08 03:49:47Z

You can move the .select() to be the last operation before the collect:

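import glob

import polars as pl

columns = "a", "b", "c", "d"

# Scan every file lazily; no columns are selected yet.
dfs = [
   pl.scan_csv(file, ignore_errors=True)
   for file in glob.glob("*.csv")
]

# The diagonal concat fills absent columns with null, so the select
# sees every column that exists in at least one file.
pl.concat(dfs, how="diagonal").select(columns).collect()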

shape: (5, 4)
┌──────┬──────┬──────┬──────┐
│ a    ┆ b    ┆ c    ┆ d    │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 2    ┆ null ┆ null │
│ 4    ┆ 5    ┆ 6    ┆ null │
│ null ┆ null ┆ 3    ┆ 5    │
│ null ┆ 7    ┆ 8    ┆ 9    │
│ null ┆ 10   ┆ 11   ┆ 12   │
└──────┴──────┴──────┴──────┘

As for adding in missing columns, I'm not sure if there is anything other than "manually" determining the difference:

columns = "a", "b", "c", "d"

dfs = (
   df.with_columns(**missing).select(columns)
   for file    in glob.glob("*.csv")
   for df      in [ pl.scan_csv(file, ignore_errors=True) ]
   for diff    in [ set(columns).difference(df.columns) ]
   for missing in [ dict.fromkeys(diff) ]
)

pl.concat(dfs, how="vertical_relaxed").collect()

shape: (5, 4)
┌──────┬──────┬──────┬──────┐
│ a    ┆ b    ┆ c    ┆ d    │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 2    ┆ null ┆ null │
│ 4    ┆ 5    ┆ 6    ┆ null │
│ null ┆ null ┆ 3    ┆ 5    │
│ null ┆ 7    ┆ 8    ┆ 9    │
│ null ┆ 10   ┆ 11   ┆ 12   │
└──────┴──────┴──────┴──────┘

dict.fromkeys() builds the names and "nulls" to add as the missing columns:

>>> dict.fromkeys(["c", "d"])
{'c': None, 'd': None}
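The "difference plus `dict.fromkeys()`" step can be looked at on its own with plain Python. This is only a sketch: `missing_defaults` is a hypothetical helper name, and it uses an ordered list comprehension instead of a set difference so the result keeps the order of the wanted columns.

```python
def missing_defaults(present, wanted, default=None):
    """Map each wanted column that is absent from `present` to `default`."""
    present = set(present)
    # Preserve the order of `wanted`; a raw set difference has no stable order.
    absent = [name for name in wanted if name not in present]
    return dict.fromkeys(absent, default)


columns = ("a", "b", "c", "d")

print(missing_defaults(["a", "b"], columns))             # {'c': None, 'd': None}
print(missing_defaults(["c", "d"], columns, default=0))  # {'a': 0, 'b': 0}
```

The resulting dict can be unpacked into `with_columns(**missing)` exactly as in the snippet above.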

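A column added as a bare None has the Null dtype, which will differ from the dtype of the same column in files that actually contain it (i64 here). A plain vertical concat would reject that mismatch, so use the vertical_relaxed strategy, which casts the columns to a common supertype.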
