From the question "How to group data and construct a new column - python pandas?" I know how to group by multiple columns and construct a new unique ID with pandas. How can I achieve the same thing with Apache Beam in Python, and then write the result to a newline-delimited JSON file (each line holding one unique_id with an array of the objects that belong to that unique_id)?

Assume the dataset is stored in a CSV file.

I'm new to Apache Beam. Here's what I have so far:

import pandas
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    df = p | read_csv("example.csv", names=cols)
    agg_df = df.insert(0, 'unique_id',
          df.groupby(['postcode', 'house_number'], sort=False).ngroup())
    agg_df.to_csv('test_output')        

This gave me an error:

NotImplementedError: 'ngroup' is not yet supported (BEAM-9547)

I'm not very familiar with Apache Beam. Can someone help, please?

(ref: https://beam.apache.org/documentation/dsls/dataframes/overview/)

1 Answer

Assigning consecutive integers to a set of groups is not very amenable to parallel computation. It's also not very stable, since the numbering depends on the order in which the groups are encountered. Is there any reason another identifier (e.g. the tuple (postcode, house_number), or its hash) would not be suitable?
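To illustrate the idea of an identifier that is a function of the key alone, here is a minimal sketch using only the standard library. The name stable_id and the choice of a truncated SHA-1 digest are my own assumptions, not something from the question. Note that a truncated hash is not strictly injective; if collisions must be ruled out, use the tuple itself.

```python
import hashlib


def stable_id(postcode, house_number):
    """Derive a deterministic identifier from the grouping key alone.

    Hypothetical helper: every worker can compute this independently,
    without knowing how many groups other workers have seen.
    """
    # The unit-separator character keeps ("12", "3") and ("1", "23") distinct.
    key = f"{postcode}\x1f{house_number}"
    # Truncating the digest keeps the id short, at the cost of a small collision risk.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
```

The same input always yields the same ID, regardless of row order or how the data is split across workers, which is what consecutive integers cannot offer.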

4 Comments

Hi, I don't follow your question. Are you asking why I need to construct another identifier?
Yes, or why this identifier can't be an (injective) function of postcode and house_number.
The value for the identifier doesn't matter, it's only used to identify the same property.
In that case, simply concatenate the postcode and house number and use that as the identifier (e.g. df['unique_id'] = df['postcode'] + '-' + df['house_number']).
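For the second half of the question (one JSON line per unique_id with an array of its rows), a rough sketch with the core Beam transforms rather than the DataFrame API might look like the following. Assumptions on my part: the CSV has no header row, the column list COLUMNS is a made-up placeholder for the real columns, no field contains an embedded newline, and the output field name "records" is arbitrary.

```python
import csv
import json

# Hypothetical column order; replace with the real columns of example.csv.
COLUMNS = ["postcode", "house_number", "price"]


def parse_line(line):
    """Parse one CSV line into a dict keyed by COLUMNS."""
    values = next(csv.reader([line]))
    return dict(zip(COLUMNS, values))


def key_by_property(row):
    """Pair each row with an identifier built from postcode and house number."""
    unique_id = f"{row['postcode']}-{row['house_number']}"
    return unique_id, row


def to_json_line(keyed_rows):
    """Format one (unique_id, rows) group as a single JSON line."""
    unique_id, rows = keyed_rows
    return json.dumps({"unique_id": unique_id, "records": list(rows)})


def run(input_path="example.csv", output_prefix="test_output"):
    # Imported here so the helpers above can be tried without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Parse" >> beam.Map(parse_line)
            | "KeyByProperty" >> beam.Map(key_by_property)
            | "Group" >> beam.GroupByKey()
            | "ToJson" >> beam.Map(to_json_line)
            | "Write" >> beam.io.WriteToText(output_prefix, file_name_suffix=".jsonl")
        )
```

Calling run() executes the pipeline on the default runner. GroupByKey does not guarantee the order of rows within a group, and WriteToText may produce several sharded output files.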
