Odds are the dataset you are working with is some combination of timestamped events and observations of entities and their relationships at various points in time. Activity Streams provides a simple yet powerful standard format for these types of data, regardless of their origin, publisher, or specific details. Activity Streams is a community-driven specification designed for interoperability and flexibility. By supporting Activity Streams, you maximize the chance that a new data source of interest to you will be compatible with your existing data, and that your data will be compatible with that of other communities working on similar projects.
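For example, a single activity in the Activity Streams JSON format pairs an actor, a verb, and an object with a timestamp. The field names below follow the Activity Streams 1.0 JSON serialization; the values are hypothetical:

```json
{
  "id": "id:example:post:123456789",
  "verb": "post",
  "published": "2016-01-15T12:34:56Z",
  "actor": {
    "id": "id:example:987654321",
    "objectType": "person",
    "displayName": "Example User"
  },
  "object": {
    "objectType": "note",
    "content": "Hello, world."
  }
}
```

Because every source is normalized to this shape, a processor or index built against one provider's output works unchanged against another's.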
A short list of organizations and products that support activity streams format is compiled here.
If your organization supports activity streams, please let us know on the project mailing list.
If you are working with structured event and/or entity data that fits the Activity Streams model, and you are working in a JVM language, Apache Streams can simplify many of the challenging aspects of these types of projects. For example:
Apache Streams is
Apache Streams is not
The primary Streams git repository apache-streams (org.apache.streams:apache-streams) contains
Similar modules can also be hosted externally; so long as they publish Maven artifacts compatible with your version of Streams, you can import and use them in your streams easily.
The streams community also supports a separate repository apache-streams-examples (org.apache.streams:apache-streams-examples), which contains a library of simple streams that are 'ready-to-run'. Look here to see what Streams user code looks like.
Why use Postgres, Elasticsearch, Cassandra, Hadoop, Linux, or Java?
Frameworks make important but boring parts of systems and code just work so your team can focus on features important to your users.
If you are sure you can write code that is some combination of faster, more readable, better tested, easier to learn, easier to build with, or more maintainable than any existing framework (including Streams), maybe you should.
But you are probably underestimating how difficult it will be to optimize across all of these considerations, stay current with upgrades to underlying libraries, and fix whatever bugs are discovered.
Or maybe you are capable of doing it all flawlessly, but your time is just more valuable focused on your product rather than on plumbing.
By joining forces with others who care about clean running water, everyone can run better, faster, stronger code assembled with more diverse expertise, tested and tuned under more use cases.
You don’t have to look hard to find great data processing frameworks for batch or for real-time. Pig, Hive, Storm, Spark, Samza, Flink, and Google Cloud Dataflow (soon-to-be Apache Beam) are all great. Apex and NiFi are interesting newer options. This list only includes Apache Foundation JVM projects!
At their core, these platforms help you connect inputs and outputs to a directed graph of computation, and run your code at scale.
Streams uses this computational model as well, but is more focused on intelligently and correctly modeling the data that will flow through the stream than on stream execution. In this sense Streams is an alternative to Avro or Protocol Buffers, one which prioritizes flexibility, expressivity, interoperability, and tooling ahead of speed or compute efficiency.
Streams seeks to make it easy to design and evolve streams, and to configure complex streams sensibly. Where many processing frameworks leave all business logic and configuration issues to the developer, streams modules are designed to mix-and-match. Streams modules expect to be embedded with other frameworks and are organized to make that process painless.
Streams also contains a library of plug-and-play data providers to collect and normalize data from a variety of popular sources.
Currently you cannot deploy Streams (uppercase). Streams has no shrink-wrapped ready-to-run server process. You can however deploy streams (lowercase). The right method for packaging, deploying, and running streams depends on what runtime you are going to use.
Streams includes a local runtime that uses multi-threaded execution and blocking queues within a single process. In this scenario you build an uberjar with few exclusions and ship it to a target environment however you want: Maven, scp, Docker, etc. You launch the stream process with an appropriate configuration and watch the magic / catastrophic fail.
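The uberjar step is typically handled by a standard shaded-jar build. A minimal sketch using the Maven Shade plugin follows; the main class name is a hypothetical placeholder, and for the embedded scenario described below you would additionally mark platform dependencies as `provided` so they are excluded:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- set the entry point of the stream process -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.MyStream</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```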
Alternatively, components written to streams interfaces can be bound within other platforms such as Pig or Spark. In this scenario, you build an uberjar that excludes the platform parts of the classpath and launch your stream using the launch style of that platform.
Absolutely - and that will work great right up until the point where the requirements, the tools, or the way you want to index your data need to change.
No problem - anyone can write a Streams provider. The project contains providers that use a variety of strategies to generate near-real-time data streams, including:
Providers can run continuously and pass-through new data, or they can work sequentially through a backlog of items. If you need to collect so many items that you can’t fit all of their ids in the memory available to your stream, it’s pretty simple to sub-divide your backlog into small batches and launch a series of providers for collection using frameworks such as Flink or Spark Streaming.
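The batching idea itself is independent of any particular runtime. A sketch in plain Java of splitting a backlog of ids into fixed-size batches, each of which could seed its own provider run (the id values and batch size are arbitrary):

```java
import java.util.ArrayList;
import java.util.List;

public class BacklogBatcher {

    // Split a backlog of ids into fixed-size batches, so that each batch
    // can be handed to a separate provider (e.g. one task per batch when
    // running under Flink or Spark Streaming).
    public static List<List<String>> partition(List<String> ids, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            int end = Math.min(i + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> backlog = List.of("id-1", "id-2", "id-3", "id-4", "id-5");
        List<List<String>> batches = partition(backlog, 2);
        System.out.println(batches.size() + " batches"); // prints "3 batches"
    }
}
```

Only the ids for the current batch need to be resident in memory, which keeps each provider's footprint bounded regardless of the total backlog size.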
No problem - anyone can write a Streams persist reader or persist writer. The project contains persist writers that:
If you just want to use streams providers to collect and feed incoming data into a queueing system to work with outside of streams, that's just fine.
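The handoff itself can be as simple as pushing each collected document onto a queue that downstream consumers drain. A minimal in-process sketch, using a plain `BlockingQueue` as a stand-in for an external queueing system such as Kafka, with hypothetical Activity Streams payloads:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueHandoff {

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for an external queue; a real deployment would write
        // to Kafka, SQS, etc. instead of an in-memory queue.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        // Producer side: the provider pushes collected documents.
        queue.put("{\"verb\":\"post\",\"actor\":{\"id\":\"id:example:1\"}}");
        queue.put("{\"verb\":\"share\",\"actor\":{\"id\":\"id:example:2\"}}");

        // Consumer side: whatever system sits outside the stream drains it.
        while (!queue.isEmpty()) {
            System.out.println("consumed: " + queue.take());
        }
    }
}
```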
Describe any specific data collection, processing, or storage function and there are probably several if not tens of basic implementations on GitHub. There may even be language-specific libraries published by a vendor with a commercial interest in a related technology.
However, in general there are a set of tradeoffs involved when relying on these packages.
Streams goes to great lengths to regularize many of these issues so that they are uniform across project modules, and easy to reuse within new and external modules.
Where quality Java libraries exist, their most useful parts may be included within a streams module, with unnecessary or difficult parts of their dependency tree excluded.
Work your way through the ‘Tutorial’ menu to get up and running with streams.
Then browse the ‘Other Resources’ menu to learn more about how streams works and why.