I'm working on a project that will determine whether or not I land an internship. The project focuses on stream processing and is due in 2 weeks. It's pretty simple: derive some statistics from a CSV file and display them in a GUI. The project looks something like this:

A provided CSV is formatted as

ID: int, OperatingSystem: str, Date: str, Score: int

I'm supposed to track the lowest, highest, and median scores

  • per OS,
  • per date, and
  • across the entire dataset

Then I'm supposed to define a data structure for creating a histogram, also per date, OS, and entire dataset. I can use any language that I want, but I'd prefer Python if possible.

The problem is that I've never done any stream processing work before, and I'm having trouble finding resources on how to actually put it into code. I've watched videos explaining Kafka and looked into the docs and code samples for the Faust and Maki Nage frameworks, but I've only gotten as far as crashing the program right off the bat and staring at doc pages scratching my head.

Are there any simple, well-documented stream processing libraries that I should look into? Additionally, are there any resources that demonstrate how to actually write code for these libraries? YouTube seems to focus only on architectures and UML diagrams without any practical demonstrations, and I'm beginning to worry that I'll never understand how to build this project.

Thanks, Geisha

  • I like to use gRPC, which is used by Spotify and Netflix. It is developed by Google and widely used for communication between microservices. Commented Nov 21, 2021 at 22:09
  • Are you supposed to stream over network or directly from a file? Commented Nov 21, 2021 at 22:10
  • If you are supposed to stream from a file it is very simple Commented Nov 21, 2021 at 22:11
  • Since you can read files lazily and handle each line in real time: stackoverflow.com/a/519653/12868928 Commented Nov 21, 2021 at 22:16
  • Yes, it's streaming directly from a csv file Commented Nov 21, 2021 at 22:54

1 Answer

This is just a pointer in the right direction. It doesn't have to be a class; you could also write a function with inner functions. You just need to persist some state between lines.

The class below runs the calculations for each line as it is read.

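import csv

class Streamer:
    def __init__(self, path='file.csv'):
        # Per-instance state, so separate Streamer objects do not share it
        self.data = []
        self.date = {'high': 0, 'low': 0, 'median': 0}
        self.os = {'high': 0, 'low': 0, 'median': 0}
        self.score = {'high': 0, 'low': 0, 'median': 0}

        with open(path, 'r', newline='') as f:
            reader = csv.reader(f)
            next(reader, None)  # strip the header first

            for fields in reader:
                if not fields:
                    continue  # ignore blank lines

                fields = [field.strip() for field in fields]

                x = {
                    'id'    : int(fields[0]),
                    'os'    : fields[1],
                    'date'  : fields[2],
                    'score' : int(fields[3]),
                }

                # Each group needs the score as well as its own key
                self.calculate_os_median_high_low(x['os'], x['score'])
                self.calculate_date_median_high_low(x['date'], x['score'])
                self.calculate_score_median_high_low(x['score'])

                self.data.append(x)

    def calculate_os_median_high_low(self, os, score):
        pass  # update the running statistics for this OS

    def calculate_date_median_high_low(self, date, score):
        pass  # update the running statistics for this date

    def calculate_score_median_high_low(self, score):
        pass  # update the statistics for the entire dataset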
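One way the empty calculation methods could be filled in is sketched below. It is only an illustration, not the one right way to do it: it assumes the CSV header names match the format in the question (ID, OperatingSystem, Date, Score), and the names GroupStats and process are made up for this example. Each group keeps a histogram of its integer scores, and the low, high, and median values are read off that histogram, so the same structure also covers the histogram requirement.

```python
import csv
import io
from collections import Counter, defaultdict


class GroupStats:
    """Histogram of integer scores that also yields low, high and median."""

    def __init__(self):
        self.histogram = Counter()  # score -> number of occurrences
        self.count = 0

    def add(self, score):
        self.histogram[score] += 1
        self.count += 1

    def low(self):
        return min(self.histogram)

    def high(self):
        return max(self.histogram)

    def median(self):
        # Walk the scores in ascending order until the middle position(s)
        # of the sorted data are reached.
        lower_pos = (self.count - 1) // 2
        upper_pos = self.count // 2
        seen = 0
        lower = None
        for score in sorted(self.histogram):
            seen += self.histogram[score]
            if lower is None and seen > lower_pos:
                lower = score
            if seen > upper_pos:
                return (lower + score) / 2
        raise ValueError("no scores recorded")


def process(lines):
    """Consume CSV rows one at a time and update every group's statistics."""
    overall = GroupStats()
    per_os = defaultdict(GroupStats)
    per_date = defaultdict(GroupStats)
    # Assumed header names; DictReader consumes the header row itself.
    for row in csv.DictReader(lines):
        score = int(row["Score"])
        overall.add(score)
        per_os[row["OperatingSystem"]].add(score)
        per_date[row["Date"]].add(score)
    return overall, per_os, per_date


if __name__ == "__main__":
    sample = io.StringIO(
        "ID,OperatingSystem,Date,Score\n"
        "1,Linux,2021-11-01,10\n"
        "2,Windows,2021-11-01,40\n"
    )
    overall, per_os, per_date = process(sample)
    print(overall.low(), overall.high(), overall.median())
```

Because process only iterates over its argument, an open file object can be passed in place of the StringIO sample and the rows are still handled one at a time. Memory use grows with the number of distinct scores per group, not with the number of rows.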

If you want to be clever, you could hand the parsed list for each line to a separate worker and run the reading concurrently, so that the calculation functions are called outside the read loop, which can save a lot of processing time. (In that case I would use Go instead, since concurrency is much easier and safer in Go than in Python.)
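For anyone who wants to stay in Python, the idea of separating the reading from the calculations can be sketched with a thread and a queue. This is a rough illustration under some assumptions: the header row has already been removed, the score is the fourth field, and the names reader, consumer, run, and SENTINEL are invented for the example. In standard CPython, threads mostly help by overlapping file I/O with computation, so do not expect a large speedup from this alone.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def reader(lines, out_queue):
    """Producer: push each parsed row onto the queue as soon as it is read."""
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        out_queue.put(fields)
    out_queue.put(SENTINEL)


def consumer(in_queue, results):
    """Consumer: update running statistics for each row taken off the queue."""
    # iter(callable, sentinel) keeps calling get() until it returns SENTINEL.
    for fields in iter(in_queue.get, SENTINEL):
        score = int(fields[3])
        results["low"] = min(results.get("low", score), score)
        results["high"] = max(results.get("high", score), score)
        results["count"] = results.get("count", 0) + 1


def run(lines):
    # Bounded queue so the reader cannot run far ahead of the calculations.
    rows = queue.Queue(maxsize=100)
    results = {}
    worker = threading.Thread(target=consumer, args=(rows, results))
    worker.start()
    reader(lines, rows)
    worker.join()
    return results
```

Calling run on an iterable of data lines (for example, an open file after skipping its header) returns a dictionary with the lowest score, the highest score, and the number of rows seen.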
