I have a dataframe with just one column with content like:

view: meta_record_extract
dimension: e_filter
type: string
hidden: yes
sql: "SELECT * FROM files"
dimension: category
type: string
...

What I want to produce is a dataframe with columns and data like this:

view                | dimension | label | type   | hidden | sql
meta_record_extract | e_filter  | NaN   | string | yes    | "SELECT * FROM files"
NaN                 | category  | NaN   | string ...

Splitting a single string like this:

df.header[0].split(': ')[0]

gives me the label with [0] or the value with [1], so I tried this:

df.pivot_table(df, columns = df.header.str.split(': ')[0], values = df.header.str.split(': ')[1])

but it did not work and raised an error.

Can anyone help me to achieve the result I need?

1 Answer

Use str.findall() + map, as follows:

str.findall() extracts the keyword and value pairs of each row into a list of tuples. We then map each list of keyword-value pairs to a dict, and pd.DataFrame turns the dicts into a dataframe.

(Assuming your column is labeled Col1):

df_extract = df['Col1'].str.findall(r'(\w+):\s*(.*)')

df_result = pd.DataFrame(map(dict, df_extract))

Result:

print(df_result)



                  view dimension    type hidden                    sql
0  meta_record_extract       NaN     NaN    NaN                    NaN
1                  NaN  e_filter     NaN    NaN                    NaN
2                  NaN       NaN  string    NaN                    NaN
3                  NaN       NaN     NaN    yes                    NaN
4                  NaN       NaN     NaN    NaN  "SELECT * FROM files"
5                  NaN  category     NaN    NaN                    NaN
6                  NaN       NaN  string    NaN                    NaN
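
For reference, here is a self-contained sketch of the two steps above that can be run as is. The sample rows are reconstructed from the question, and the column name Col1 is the same assumption as before. The map object is wrapped in list() only to hand pd.DataFrame a plain list of dicts.

```python
import pandas as pd

# Sample data reconstructed from the question; the column name "Col1" is an assumption.
df = pd.DataFrame({"Col1": [
    "view: meta_record_extract",
    "dimension: e_filter",
    "type: string",
    "hidden: yes",
    'sql: "SELECT * FROM files"',
    "dimension: category",
    "type: string",
]})

# Each row becomes a list of (keyword, value) tuples,
# e.g. [("view", "meta_record_extract")].
df_extract = df["Col1"].str.findall(r"(\w+):\s*(.*)")

# dict() turns each list of pairs into one record; the DataFrame constructor
# aligns the records by keyword and fills the missing keywords with NaN.
df_result = pd.DataFrame(list(map(dict, df_extract)))

print(df_result)
```

Each input row yields exactly one keyword, which is why every row of df_result has a single non-NaN cell.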

Update

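To compress the rows and minimize the NaNs, we can further use .apply() with .dropna(), as follows. Note that this compression works column by column and ignores the relative sequence between columns: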

df_compressed = df_result.apply(lambda x: pd.Series(x.dropna().values))

Result:

print(df_compressed)


                  view dimension    type hidden                    sql
0  meta_record_extract  e_filter  string    yes  "SELECT * FROM files"
1                  NaN  category  string    NaN                    NaN
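
Because the compression above shifts each column up independently, a value such as hidden or sql can end up on the row of a different dimension when the keywords do not all appear for every dimension. If the rows must stay aligned, one alternative (not the method of this answer, just a sketch) is to group the lines into one record per dimension. The helper name to_records and its parameters record_key and context_keys are hypothetical, and it assumes every line has the form "key: value", that each dimension line starts a new record, and that view applies to all dimensions that follow it.

```python
import pandas as pd

def to_records(lines, record_key="dimension", context_keys=("view",)):
    """Group 'key: value' lines into one row per record_key occurrence.

    Hypothetical helper: the values of context_keys (e.g. the view) are
    repeated on every record that follows them.
    """
    records = []
    context = {}    # values shared by the following records
    current = None  # the record currently being filled
    for line in lines:
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip lines that are not 'key: value'
        if key in context_keys:
            context[key] = value
            current = None
        elif key == record_key:
            current = {**context, key: value}
            records.append(current)
        elif current is not None:
            current[key] = value
    return pd.DataFrame(records)

# Sample lines reconstructed from the question.
lines = [
    "view: meta_record_extract",
    "dimension: e_filter",
    "type: string",
    "hidden: yes",
    'sql: "SELECT * FROM files"',
    "dimension: category",
    "type: string",
]
df_records = to_records(lines)
print(df_records)
```

Unlike the desired output in the question, this repeats the view value on every row instead of leaving it as NaN, which keeps each dimension tied to its view when the data contains several views.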
18 Comments

@RandyMcKay We can do that. But since some keywords can appear more than once and others only once, it's inevitable that some NaNs will still be left. Anyway, we can minimize that. I will edit the solution for that. Stay tuned.
amazing! Thank you a lot!
@RandyMcKay Sorry, I don't quite understand what you mean, especially the statement "I see the dimensions for the first view is below under the other views now". Can you elaborate? Is that related to data not in the sample data?
@RandyMcKay Let me clarify a bit more. Is the relative sequence within one particular column retained or shuffled? I mean within one particular column, not between columns. This kind of compression works column by column. It simply ignores the relative sequence between columns.
@RandyMcKay Let's consider only one column, say view. For its values in df_result, assume 3 values in sequence: view1, view2, view3. Do you mean that after compression it becomes e.g. view1, view3, view2? Is that true? If so, it's weird, as the sorted function I just gave you provides a stable sort, which means it will maintain the sequence.