Reference:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Event time vs. processing time
Event time:
which is the time at which events actually occurred.
Processing time:
which is the time at which events are observed in the system.
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Event time vs. processing time
Event time:
which is the time at which events actually occurred.
Processing time:
which is the time at which events are observed in the system.
- Time-agnostic
- Approximation
- https://pkghosh.wordpress.com/2014/09/10/realtime-trending-analysis-with-approximate-algorithms/
- https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
- Windowing by processing time
- Fixed windows
- Sliding windows
- Sessions
- Windowing by event time
Watermarks:
- A watermark is a notion of input completeness with respect to event times.
- A watermark with a value of time X makes the statement: “all input data with event times less than X have been observed.”
- As such, watermarks act as a metric of progress when observing an unbounded data source with no known end.
Triggers:
- A trigger is a mechanism for declaring when the output for a window should be materialized relative to some external signal.
- Triggers provide flexibility in choosing when outputs should be emitted.
- They also make it possible to observe the output for a window multiple times as it evolves.
- This in turn opens up the door to refining results over time, which allows for providing speculative results as data arrive as well as dealing with changes in upstream data (revisions) over time or data which arrive late relative to the watermark
(e.g., mobile scenarios, where someone’s phone records various actions and their event times while the person is offline, then proceeds to upload those events for processing upon regaining connectivity).
Accumulation:
- An accumulation mode specifies the relationship between multiple results that are observed for the same window. Those results might be completely disjointed, i.e., representing independent deltas over time, or there may be overlap between them.
- Different accumulation modes have different semantics and costs associated with them, and thus find applicability across a variety of use cases.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.