You can move the .select() to be the last operation before the collect:
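import glob
import polars as pl

columns = "a", "b", "c", "d"
dfs = [
    pl.scan_csv(file, ignore_errors=True)
    for file in glob.glob("*.csv")
]
# Select after the diagonal concat, once every column exists in the combined frame
pl.concat(dfs, how="diagonal").select(columns).collect()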
shape: (5, 4)
┌──────┬──────┬──────┬──────┐
│ a    ┆ b    ┆ c    ┆ d    │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 2    ┆ null ┆ null │
│ 4    ┆ 5    ┆ 6    ┆ null │
│ null ┆ null ┆ 3    ┆ 5    │
│ null ┆ 7    ┆ 8    ┆ 9    │
│ null ┆ 10   ┆ 11   ┆ 12   │
└──────┴──────┴──────┴──────┘
As for adding in missing columns, I'm not sure there is anything other than "manually" determining the difference:
columns = "a", "b", "c", "d"
dfs = (
    df.with_columns(**missing).select(columns)
    for file in glob.glob("*.csv")
    for df in [pl.scan_csv(file, ignore_errors=True)]
    for diff in [set(columns).difference(df.columns)]  # wanted columns absent from this file
    for missing in [dict.fromkeys(diff)]  # {name: None}, added as null columns
)
pl.concat(dfs, how="vertical_relaxed").collect()
shape: (5, 4)
┌──────┬──────┬──────┬──────┐
│ a    ┆ b    ┆ c    ┆ d    │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 2    ┆ null ┆ null │
│ 4    ┆ 5    ┆ 6    ┆ null │
│ null ┆ null ┆ 3    ┆ 5    │
│ null ┆ 7    ┆ 8    ┆ 9    │
│ null ┆ 10   ┆ 11   ┆ 12   │
└──────┴──────┴──────┴──────┘
dict.fromkeys() builds the names and "nulls" to add as the missing columns:
>>> dict.fromkeys(["c", "d"])
{'c': None, 'd': None}
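If it helps to see the "manual" step on its own, here is a minimal pure-Python sketch of it. The helper name `missing_columns` is made up for illustration and is not part of Polars. Unlike the set difference above, it keeps the missing names in the order of `columns`, which makes no difference to the final result because `.select(columns)` reorders anyway.

```python
columns = ("a", "b", "c", "d")


def missing_columns(present, wanted=columns):
    """Map each wanted column that is absent from `present` to None.

    Hypothetical helper: same idea as dict.fromkeys(set(wanted) - set(present)),
    but the keys follow the order of `wanted`.
    """
    present = set(present)
    return dict.fromkeys(name for name in wanted if name not in present)


# A file that only has columns "a" and "b" needs "c" and "d" filled in
print(missing_columns(["a", "b"]))

# A file that already has every column needs nothing
print(missing_columns(columns))
```

The resulting dict can be unpacked into `with_columns(**missing)` in the same way as in the generator above.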
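The added null columns have the Null dtype, so their dtypes will differ from the matching columns read from the other files. That is why the concat uses the vertical_relaxed strategy, which casts mismatched columns to a common supertype.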