
I'm evaluating Apache Spark to see whether it's a good platform for the following requirements:

  • Cloud computing environment.
  • Commodity hardware.
  • Distributed DB (e.g. HBase) with possibly a few petabytes of data.
  • Lots of simultaneous small computations that need to complete fast (within seconds). Small means 1-100 MB of data.
  • A few large computations that don't need to complete fast (hours is fine). Large means 10-1000 GB of data.
  • Very rarely, very large computations that don't need to complete fast (days is fine). Very large means 10-100 TB of data.
  • All computations are mutually independent.
  • Real-time data stream incoming for some of the computations.
  • Machine learning involved.

Having read a bit about Spark, I see the following advantages:

  • Runs well on commodity hardware and with HBase/Cassandra.
  • MLlib for machine learning.
  • Spark Streaming for real-time data.
  • While MapReduce doesn't seem strictly necessary, maybe it could speed things up, and would let us adapt if the requirements became tighter in the future.

These are the main questions I still have:

  • Can it do small computations very fast?
  • Will it load-balance a large number of simultaneous small computations?

I also wonder if I'm generally not trying to use Spark for a purpose it wasn't designed for, not using the main advantages: MapReduce and in-memory RDDs. If so, I'd also welcome a suggestion for an alternative. Many thanks!

  • @samthebest: "Yes" to Spark being good for this use case, or "yes" to my using it for something it wasn't designed for? Commented Jul 14, 2014 at 17:31
  • "Yes" to the title of the question :) – samthebest Commented Jul 14, 2014 at 18:42

3 Answers

Answer 1 (score 7)

Small computations fast

We do use Spark in an interactive setting, as the backend of a web interface. Sub-second latencies are possible, but not easy. Some tips:

  • Create the SparkContext on startup. It takes a few seconds to connect and get the executors started on the workers.
  • You mention many simultaneous computations. Instead of each user having their own SparkContext and their own set of executors, have just one that everyone can share. In our case multiple users can use the web interface concurrently, but there's only one web server.
  • Operate on memory-cached RDDs. Serialization is probably too slow, so use the default caching, not Tachyon. If you cannot avoid serialization, use Kryo; it is much faster than stock Java serialization.
  • Use RDD.sample liberally. An unbiased sample is often good enough for interactive exploration. (A sketch combining these tips follows right after this list.)
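
For illustration, here is a minimal sketch of how these tips fit together: one long-lived shared SparkContext created at startup, Kryo enabled for the cases where serialization cannot be avoided, the hot dataset cached in memory, and sampling used for quick answers. The master URL, path, and object names are placeholders, not our actual setup.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SharedSpark {
  // One long-lived context, created when the web server starts and shared by all requests.
  // The master URL and app name are placeholders.
  lazy val sc: SparkContext = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("interactive-backend")
      // Kryo is much faster than stock Java serialization when data does get serialized.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    new SparkContext(conf)
  }
}

object InteractiveQueries {
  import SharedSpark.sc

  // Cache the hot dataset in memory once; every subsequent query reuses it.
  lazy val events = sc.textFile("hdfs:///data/events").persist(StorageLevel.MEMORY_ONLY)

  // An unbiased 1% sample is often good enough for interactive exploration.
  def approxMatches(needle: String): Long =
    events.sample(withReplacement = false, fraction = 0.01)
      .filter(_.contains(needle))
      .count()
}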

Load balancing

Load balancing of operations is a good question. We will have to tackle this as well, but have not done it yet. In the default setup everything is processed in a first-in-first-out manner. Each operation gets the full resources of the cluster and the next operation has to wait. This is fine if each operation is fast, but what if one isn't?

The alternative fair scheduler likely solves this issue, but I have not tried it yet.
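
For reference, enabling it appears to be mostly configuration (again, untested on our side): set spark.scheduler.mode to FAIR and assign each thread's jobs to a pool. A sketch, with the pool name and the toy job made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

// Enable fair scheduling on the shared context.
val conf = new SparkConf()
  .setMaster("local[*]")  // placeholder; use the cluster master URL in practice
  .setAppName("interactive-backend")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Per thread: route a user's small jobs into their own pool so a long-running
// job in another pool cannot starve them.
sc.setLocalProperty("spark.scheduler.pool", "interactive")
sc.parallelize(1 to 1000).count()                  // runs in the "interactive" pool
sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool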

Spark can also offload scheduling to YARN or Mesos, but I have no experience with this, and I doubt they are compatible with your latency requirements.

  • I would try to give small jobs few partitions so Spark cannot use the whole cluster to run them; having several contexts might do. Commented Jul 14, 2014 at 8:11
  • Pardon me for the novice question, but does this mean that the web server has to be present on the cluster itself to kick off computations? Is there a way to have an independent web server running as a client and firing jobs on the cluster remotely? This is important, as I don't want to put the additional load of serving web requests on the cluster. Thanks! – jatinpreet Commented Aug 26, 2014 at 7:14
  • Sure, you can have the web server outside of the Spark machines. When you create the SparkContext you have to specify the master URL, and it can easily be a different machine. But the web server that connects to Spark has to be a single instance; if you have high web load, you need multiple web server instances. I would recommend having two layers: one web server connects to Spark (call it the Spark proxy), and many web servers serve user-facing requests and talk to the Spark proxy. Commented Aug 26, 2014 at 10:14
Answer 2 (score 3)

I think the short answer is "yes". Spark advertises itself as "near real-time". The one or two papers I've read describe throughput latency as either one second or several seconds. For best performance, look at combining it with Tachyon, an in-memory distributed file system.

As for load-balancing, later releases of Spark can use a round-robin scheduler so that both large and small jobs can co-exist in the same cluster:

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

However, I am not clear on what you mean by "...also wonder if I'm generally not trying to use Spark for a purpose it wasn't designed for, not using the main advantages: MapReduce and in-memory RDDs."

I might be wrong, but I don't think you can use Spark without RDDs; Spark is distributed computing with RDDs. What kind of jobs are you trying to run, if not MapReduce-style jobs? If your use cases are not a good fit for the examples provided in the Spark documentation or tutorials, what do they fit? Hadoop/Spark shine when there are tons of data and very little iterative computation on it. For example, solving systems of equations is not a traditional use case for these technologies.

Is there a need to distribute jobs that only involve 1-100 MB? Such small amounts of data are often processed most quickly on a single powerful node. If there is a reason to distribute the computation, look at running MPI under Mesos. A lot of jobs that fall under the name "scientific computing" continue to use MPI as the distributed computing model.

If the jobs are about crunching numbers (e.g. matrix multiplication), then small-medium jobs can be handled quickly with GPU computing on a single node. I've used Nvidia's CUDA programming environment. It rocks for computationally intensive tasks like computer vision.

What is the nature of the jobs that will run in your environment?

  • Many thanks for the answer. The nature of my jobs is statistics and prediction on business processes. What I meant by not using MapReduce and RDDs is that they don't seem strictly necessary to solve the problem. The size of each job and corresponding time constraint doesn't necessitate distributed computing (MapReduce). The non-iterativeness and independence of jobs doesn't necessitate in-memory caching (RDDs). That's how I understand these things - clarifications welcome! Commented Jul 12, 2014 at 20:52
Answer 3 (score 1)

If I understand the comments you provided correctly: if you are into statistics and prediction, then MLlib, with its distributed machine learning algorithms, will be very helpful. For non-distributed work it uses the Breeze library for linear algebra calculations, which is very fast. So you can have both kinds of implementation, though not in the same file; if you have prior information on your input size, you can switch to either implementation.
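
To illustrate that switch, here is a rough sketch. The size threshold, the object name, and the toy computation (a column mean over rows of equal length) are made up for illustration: the same result is computed locally with Breeze when the input is small and through an RDD when it is large.

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.SparkContext

object MeanVector {
  // Illustrative cutoff: below roughly this size, a single node with Breeze tends to
  // beat the overhead of scheduling a distributed job.
  val distributedThresholdBytes = 100L * 1024 * 1024

  def compute(sc: SparkContext, rows: Seq[Array[Double]], estimatedBytes: Long): BDV[Double] =
    if (estimatedBytes <= distributedThresholdBytes) {
      // Local path: plain Breeze linear algebra on the driver.
      rows.map(arr => BDV(arr)).reduce(_ + _) / rows.size.toDouble
    } else {
      // Distributed path: the same arithmetic as an RDD reduce.
      sc.parallelize(rows).map(arr => BDV(arr)).reduce(_ + _) / rows.size.toDouble
    }
}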

As other answers have also pointed out, creating a common SparkContext is a good idea. We use something similar for testing algorithms with ScalaTest:

import org.scalatest.Suite
import org.scalatest.BeforeAndAfterAll

import org.apache.spark.{SparkConf, SparkContext}

// Mix this trait into a ScalaTest suite to get one SparkContext
// shared by all tests in the suite.
trait LocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _

  // Create the context once, before any test runs.
  override def beforeAll() {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("test")
    sc = new SparkContext(conf)
    super.beforeAll()
  }

  // Stop the context after the last test, releasing its resources.
  override def afterAll() {
    if (sc != null) {
      sc.stop()
    }
    super.afterAll()
  }
}

Something along these lines will help you create a SparkContext shared by many users.
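
For completeness, a hypothetical test suite using the trait above might look like this (the suite name and data are made up):

import org.scalatest.FunSuite
import org.apache.spark.SparkContext._  // pair-RDD implicits, needed on older Spark versions

class WordCountSuite extends FunSuite with LocalSparkContext {
  test("counts words with the shared context") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") == 2)
  }
}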
