Increasingly, it’s not enough to process a “big data” dataset offline; data must be processed as it arrives, in a streaming (on-line) fashion. This brings a whole new set of challenges.
Rarely can actions be taken based on single data elements; we need to aggregate the results somehow to get valuable insights. In the case of data streams, this usually means making decisions based on data received in a time window.
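To make the idea concrete, here is a minimal, framework-free sketch of windowed aggregation: counting events per fixed time window, keyed by the window’s start timestamp. The class name, the 60-second window length, and the use of raw timestamps instead of real event objects are all assumptions made for the illustration.

```java
import java.util.*;

// Illustrative sketch, not any framework's API: count events per fixed
// (tumbling) one-minute window, keyed by the window's start timestamp.
public class TumblingCount {
    // Window length is an assumption for this example: 60 seconds.
    static final long WINDOW_MILLIS = 60_000;

    static Map<Long, Integer> countPerWindow(List<Long> eventTimestamps) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : eventTimestamps) {
            // align the timestamp down to the start of its window
            long windowStart = ts - (ts % WINDOW_MILLIS);
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> events = List.of(1_000L, 30_000L, 61_000L, 62_000L, 125_000L);
        // windows: [0, 60s) -> 2 events, [60s, 120s) -> 2, [120s, 180s) -> 1
        System.out.println(countPerWindow(events));
    }
}
```

Real streaming engines differ mainly in what replaces this batch-style loop: state stores, watermarks, and triggers that decide when a window’s aggregate may be emitted.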
There are many choices to be made when partitioning data into windows: should they be sliding or tumbling; should boundaries be determined by event time or processing time; should the grouping be by time or by sessions? And finally, how to handle late data points, if at all?
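The sliding-vs-tumbling distinction can be shown in a few lines: with overlapping (sliding) windows, a single event belongs to several windows at once. The sketch below computes which window start times contain a given event timestamp; the names and the size/slide parameters are illustrative, not taken from any of the discussed frameworks.

```java
import java.util.*;

// Illustrative sketch of sliding-window assignment: with window size 60s
// and slide 30s, each event falls into two overlapping windows.
public class SlidingAssign {
    static List<Long> windowStartsFor(long ts, long sizeMillis, long slideMillis) {
        List<Long> starts = new ArrayList<>();
        // the latest window containing ts starts at the slide-aligned point
        long lastStart = ts - (ts % slideMillis);
        // walk backwards over all windows that still cover ts
        for (long start = lastStart; start > ts - sizeMillis; start -= slideMillis) {
            if (start >= 0) starts.add(start);
        }
        Collections.sort(starts);
        return starts;
    }

    public static void main(String[] args) {
        // event at t=70s, size 60s, slide 30s -> windows starting at 30s and 60s
        System.out.println(windowStartsFor(70_000, 60_000, 30_000));
    }
}
```

With slide equal to size this degenerates to tumbling windows (exactly one window per event), which is one way to see tumbling windows as a special case of sliding ones.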
With so many variables, each tool solves the problem differently. In the presentation, we’ll first define the possible characteristics of a data windowing mechanism, and then we’ll compare the approaches taken by Apache Kafka, Flink, Spark and Akka Streams, with code examples.