280 questions
0
votes
2
answers
89
views
Is it safe to mutate, emit, and snapshot the same operator-state instance in Apache Flink?
I'm building a single global Topology object in a non-keyed ProcessFunction with parallelism = 1. I keep it as a local mutable object and update it for every input event using topology.apply(GnmiEvent)...
-1
votes
1
answer
40
views
Does Flink throughput decrease proportionally with the number of side outputs?
I have a Flink job with multiple downstream operators I want to route tuples to based on a condition. Side outputs are advertised for this use case in the Flink documentation.
However, when sending ...
0
votes
0
answers
58
views
Azure Event Hubs + Apache Flink + Cassandra – how to handle downtime without losing events (one Replay Hub vs multiple Replay Hubs)?
We use Azure Event Hubs (Kafka API) with Apache Flink consumers, and a shared Cassandra DB as the sink.
There are 7 Event Hubs (one per application) → each has its own Flink consumer writing to the ...
0
votes
1
answer
57
views
How to handle multiple aggregations on the same Kafka event stream when each needs a different partitioning key?
I'm working on a system that processes events from Kafka, and I'm running into a design problem related to scaling aggregations.
The events contain fields like this (purchased SKUs in an order):
{
"...
0
votes
1
answer
35
views
How does Kafka Streams' KTable.groupBy + aggregate guarantee correct order of the re-partitioned messages?
If I have a Kafka input topic with multiple partitions and then in Kafka Streams I use kStream.map to change the key of each record and write that to an output topic, I will face the problem that ...
0
votes
0
answers
49
views
Average aggregation of data stream in bytewax
I want to aggregate the values of my DataStream in tumbling windows of 10 seconds.
Unfortunately, the Bytewax documentation is very limited, and I can't find any other source where an average of ...
1
vote
1
answer
77
views
Docker connection between Kafka and Bytewax refused
I want to consume a stream from Kafka using Bytewax to perform aggregations. Unfortunately I'm not able to connect to Kafka and the connection is always refused. I assume something with the port setup ...
0
votes
1
answer
123
views
Apache Flink Branching
Problem Overview:
I am working on a Flink application that allows users to design dataflows dynamically. The core engine is built around stages, where a DataStream is passed sequentially through these ...
0
votes
1
answer
32
views
Timestamps of result elements of window function in Apache Flink
The Flink documentation says "The only relevant information that is set on the result elements is the element timestamp [...] which is [set to] end timestamp - 1 [...]"
In case I have to ...
0
votes
1
answer
48
views
Flink State Processor API with Windowed CoGroup and Unaligned Checkpoints
How can I use the Flink State Processor API to process the state of a windowed CoGroup (or Join) function? The documentation does not give such an example.
Is there a way to use the Flink State ...
1
vote
0
answers
40
views
ksqldb explode multiple list elements
I have data like this:
One Kafka Message:
[{"source": "858256_6052+571", "numericValue": null, "created": 1725969039288, "textValue": "mytestData"...
0
votes
1
answer
345
views
Kafka Streams Processor has no access to StateStore null the store is not connected to the processor
I have the code in A. After we call builtTopology = builder.build, the call to new org.apache.kafka.streams.TopologyTestDriver(builtTopology, properties) gives me the error in B.
I've combed through ...
1
vote
1
answer
118
views
Is there any way to limit database sessions when using JDBC sinks with Apache Flink?
We are using JDBC sinks (Apache Flink) and are hitting the database's maximum session count, especially when we increase parallelism.
Our tests showed that if we increase our default ...
0
votes
1
answer
45
views
Flink GlobalWindow Trigger only processes the trigger event
I have a DataStream keyed (keyBy) by an event property that is then passed to a GlobalWindow, triggered when a specific event comes in. The issue is that when the window is triggered to process the events, it only ...
0
votes
1
answer
139
views
How to correlate HTTP request to its corresponding response with Apache Flink?
I'm looking for the best practice to correlate requests and their corresponding responses in an Apache Flink stream processing job. The key attributes of the problem are:
Conditions:
Each request and ...
0
votes
2
answers
101
views
Slowly Updating Side Inputs & Session Windows - Transform node AppliedPTransform was not replaced as expected
In my Apache Beam streaming pipeline, I have an unbounded Pub/Sub source which I use with session windows.
There is some bounded configuration data which I need to pass into some of the DoFns of the ...
2
votes
1
answer
72
views
Why does PyFlink give me a time in the past?
I have a Kafka topic to which I produce an entry every 2-3 seconds.
Then I have a PyFlink job that formats the entries and sends them to a database.
Here's my Flink env setup:
env = StreamExecutionEnvironment....
0
votes
1
answer
145
views
Order Preserving Union/Merge Executor in Flink
I want to merge two (or more) streams using Flink. Both streams are themselves ordered, and I want the merged result to be ordered as well. As an example,
[1,2,4,5,7,8, ...] and [2,3,6,7, ..]
should produce ...
2
votes
0
answers
379
views
Apache Beam portable runner to run a Go-based job on Flink
I am trying to learn Apache Beam and am creating a sample project to learn stream processing. For now, I want to read from a Kafka topic "word" and print the data to the console.
I have ...
1
vote
1
answer
1k
views
Are checkpoints needed for a stream processed in a Databricks job running via continuous trigger?
We have a requirement to process a data stream in a Databricks notebook job and load it into a Delta table.
I noted that there was a new "Continuous" trigger available for Databricks jobs and we ...
0
votes
2
answers
3k
views
Apache Flink job not printing anything to stdout and not sending to Kafka sink topic
I am experimenting with Apache Flink for a personal project and I have been struggling to make the resulting stream output to StdOut and to send it to a Kafka topic orders-output.
My goal is to ...
0
votes
1
answer
46
views
How to access the results from ProcessFunction?
I am running a query against two topics and calculating the results.
In the main class:
tableEnv.createTemporaryView("tbl1", stream1);
tableEnv.createTemporaryView("tbl2", stream2);...
2
votes
2
answers
1k
views
Good way to parallel process elements of 'stream' while keeping output in order
I have an app that receives a stream of XML events from Kafka. These events have to be deserialized/parsed and otherwise converted, before being handed in-order to some business logic. (This logic ...
0
votes
1
answer
78
views
Unable to perform 'maxBy' on a Flink windowed stream of type <SampleClass> using a SampleClass field as the parameter
Let's say this is my sample stream:
SingleOutputStreamOperator<Tuple2<String, SampleClass>> sampleStream = previousStream
.keyBy(value -> ...
2
votes
1
answer
2k
views
Is there a suitable python library for doing stream processing with Kafka topics? [duplicate]
I am trying to find a suitable Python library to do stream processing with Kafka topics and Kafka Streams. Specifically, I am looking for libraries that support the following operations.
KStream-...
0
votes
1
answer
1k
views
In Kafka Streams, how do you parallelize complex operations (or sub-topologies) using multiple topics and partitions?
I am currently trying to understand how Kafka Streams achieves parallelism. My main concern boils down to three questions:
Can multiple sub-topologies read from the same partition?
How can you ...
0
votes
0
answers
179
views
Is it possible to stream images to a PDF using tesseract?
I need to stream images from a scanner to a PDF document. Ultimately I want to OCR the images and save the text to the PDF as well, but I'm more concerned with getting the streaming working first. ...
1
vote
0
answers
214
views
Pipelined vs blocking data exchange in a Flink job
I've been reading about pipelined region scheduling in Flink and am a bit confused about what it means. My understanding of it is that a Streaming job is always pipelined whereas a Batch job can ...
1
vote
0
answers
138
views
Event-Driven-API with Real Time Streaming Analytics from Datastream? (Kappa-Architecture, IoT)
I've recently read up on common Big Data architectures (Lambda and Kappa) and I'm trying to put them into practice in the context of an IoT application.
As of right now, events are produced, ingested into ...
1
vote
0
answers
243
views
Upgrade Apache Storm from 1.2.3 to 2.2.0
We are seeing the error below when we migrate to Apache Storm 2.2.0 from 1.2.3. We also do not see any permission issues on the file it is unable to find. Also, the file exists but just the /data ...
0
votes
1
answer
195
views
How to configure logback-based logging to handle log masking with Apache Storm?
I am trying to configure logback-based log masking for Apache Storm topologies.
When I try to replace the logback.xml file inside the Apache Storm log4j2- directory and update the worker.xml and cluster.xml files, ...
4
votes
1
answer
864
views
One consumer to multiple tables or many consumers per table
I have a Kafka topic with millions of sale events. I have a consumer which, on every message, inserts the data into 4 tables:
1 for the raw sales,
1 for the sales sum by date by product category (...
0
votes
2
answers
78
views
Is it possible to implement a timer in an Apache Storm bolt?
A Bolt's code is triggered when data arrives (an input tuple). How can we program code inside a Bolt to run even in the case of missing input data? I mean, if no tuple arrives how can we force an ...
0
votes
1
answer
328
views
Azure Stream Analytics - JSON Array, windowing and previous value
I receive containers of sensor data as an input for ASA. Containers look like this.
{
"data": [
{
"sensor_id": 55,
"timestamp": 1663075725000,
"value"...
2
votes
2
answers
3k
views
Apache Flink vs Apache Beam (with Flink runner)
I am considering using Flink or Apache Beam (with the Flink runner) for different stream processing applications. I am trying to compare the two options and make the better choice. Here are the ...
2
votes
1
answer
660
views
How to apply an offset to Tumbling Window, in order to delay the starting of Windows<TimeWindow> in Kafka Streams
I'm calculating a simple mean on a dataset with values for May 2022, using different window sizes. Using 1 hour windows there are no problems, while using 1 week and 1 month windows, records are not ...
2
votes
3
answers
865
views
Flink AggregateFunction in TumblingWindow is automatically split into two windows for big window size
I'm calculating a simple mean on some records, using different window sizes. Using 1 hour and 1 week windows there are no problems, and the results are computed correctly.
var keyed = src
....
1
vote
1
answer
69
views
Sink for user activity data stream to build Online ML model
I am writing a consumer that consumes user activity data (activityid, userid, timestamp, cta, duration) from Google Pub/Sub, and I want to create a sink for it so that I can train my ML model in ...
2
votes
1
answer
511
views
How to drain the window after a Flink join using coGroup()?
I'd like to join data coming in from two Kafka topics ("left" and "right").
Matching records are to be joined using an ID, but if a "left" or a "right" record ...
3
votes
3
answers
1k
views
How to do stream processing with Redpanda?
Redpanda seems easy to work with, but how would one process streams in real-time?
We have a few thousand IoT devices that send us data every second. We would like to get the running average of the ...
1
vote
0
answers
772
views
Use Redis streams for MQTT queue
Redis has changed a lot in recent years, and it's difficult to keep up with the latest features.
We have a few thousand IoT devices that all send MQTT messages every second. We want different ...
3
votes
4
answers
270
views
Stream processing alternatives
We have a few thousand IoT devices that send us their temperature every second. The input source can be MQTT or JSON (or a queue if needed).
Our goal is to near continuously process data for each of ...
2
votes
1
answer
368
views
How to inject delay between the window and sink operator?
Context - Application
We have an Apache Flink application which processes events
The application uses event time characteristics
The application shards (keyBy) events based on the sessionId field
The ...
1
vote
1
answer
382
views
Python program to use Elasticsearch as sink in Apache Flink
I am trying to read data from a Kafka topic, do some processing, and dump the data into Elasticsearch. But I could not find an example in Python that uses Elasticsearch as a sink. Can anyone help me with a ...
4
votes
3
answers
5k
views
Flink Python Datastream API Kafka Consumer
I'm new to PyFlink. I'm trying to write a Python program to read data from a Kafka topic and print it to stdout. I followed the link Flink Python Datastream API Kafka Producer Sink Serializaion. But I ...
0
votes
2
answers
57
views
Are Hazelcast Jet Reliable Topic Sinks idempotent? (Hazelcast fault-tolerance of a websocket source)
I cannot find this in the Hazelcast Jet 5.0 (or 4.x) documentation, so I hope someone can answer this here - can a reliable topic be used as an idempotent sink, for example to de-duplicate events ...
0
votes
1
answer
513
views
Easiest stream processing framework for python?
I'm working on a project that will determine whether or not I score an internship. The project focuses on stream processing and is due in 2 weeks. It's pretty simple, just deriving some statistics ...
0
votes
1
answer
436
views
How to scale Flink on an un-keyed stream
I have a relatively basic use case. My data lives in a few hundred Kafka partitions and I need to pass the events through a map operator before I send them to a custom HTTP sink.
For performance reasons ...
3
votes
0
answers
356
views
Using Java Stream without the filter() operation to block the stream
I'm working on a service that needs to do some stream processing for products.
Given a Company we can use getProducts(Company company) to get List<Product>.
The next thing I'd like to do is to ...
0
votes
1
answer
553
views
Why does my flink window trigger when I have set watermark to be a high number?
I would expect windows to trigger only after we wait until the maximum possible time as defined by the max lateness for watermark.
.assignTimestampsAndWatermarks(
WatermarkStrategy....