I'm evaluating Apache Spark to see if it's a good platform for the following requirements:
- Cloud computing environment.
- Commodity hardware.
- Distributed DB (e.g. HBase) with possibly a few petabytes of data.
- Lots of simultaneous small computations that need to complete fast (within seconds). Small means 1-100 MB of data.
- A few large computations that don't need to complete fast (hours is fine). Large means 10-1000 GB of data.
- Very rarely, very large computations that don't need to complete fast (days is fine). Very large means 10-100 TB of data.
- All computations are mutually independent.
- Real-time data stream incoming for some of the computations.
- Machine learning involved.
Having read a bit about Spark, I see the following advantages:
- Runs well on commodity hardware and with HBase/Cassandra.
- MLlib for machine learning.
- Spark Streaming for real-time data (see the sketch after this list for roughly what I have in mind).
- While MapReduce-style processing doesn't seem strictly necessary, it might speed things up and would let us adapt if the requirements tightened in the future.
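For the streaming requirement, this is a minimal sketch of what I picture using Spark Streaming's micro-batch model (so latency is bounded below by the batch interval); the socket source, host/port, and word-count logic are just placeholders, not our actual pipeline:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch")
    // 1-second micro-batches: results arrive at best once per batch interval
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source: a plain text socket (our real source would differ)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Placeholder computation: word count per micro-batch
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```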
These are the main questions I still have:
- Can it do small computations very fast?
- Will it load-balance a large number of simultaneous small computations? (See the sketch below for the kind of setup I'm imagining.)
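To make the second question concrete, here is a rough sketch (under my assumptions, not a working design) of how I imagine serving many independent small computations from one long-running SparkContext: cache the working set in executor memory, enable the FAIR scheduler, and submit each small job from its own thread. The dataset path, pool name, and per-job computation are all hypothetical:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}

object ManySmallJobsSketch {
  def main(args: Array[String]): Unit = {
    // FAIR scheduling lets concurrent jobs within one SparkContext share the
    // cluster instead of queueing FIFO behind one another.
    val conf = new SparkConf()
      .setAppName("ManySmallJobsSketch")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Hypothetical working set, cached in executor memory so repeated small
    // queries don't re-read it from the distributed store each time.
    val data = sc.textFile("hdfs:///data/working-set").cache()
    data.count() // first action materializes the cache

    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

    // Each Future submits an independent Spark job from its own thread;
    // the "small-jobs" pool name is an assumption (pools are created on demand).
    val results = (1 to 100).map { key =>
      Future {
        sc.setLocalProperty("spark.scheduler.pool", "small-jobs")
        data.filter(_.startsWith(key.toString)).count() // placeholder computation
      }
    }
    results.foreach(r => println(Await.result(r, 60.seconds)))

    sc.stop()
  }
}
```

My (unverified) understanding is that once the data is cached, each small job over an MB-scale slice should finish in well under a second, but I'd like confirmation that the scheduler copes with hundreds of such jobs running concurrently.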
I also wonder whether I'm trying to use Spark for a purpose it wasn't designed for, without taking advantage of its main strengths: MapReduce-style processing and in-memory RDDs. If so, I'd also welcome suggestions for an alternative. Many thanks!