Scala Optimization Pitfalls (part I)

I. Introduction

As programming languages and optimization techniques become more and more advanced, we often ignore the performance cost when using advanced language features thinking the optimiser will be smart enough to make it fast. Unfortunately, this is not always the case. Take Scala, a language runs on JVM which has decades of optimizations and a reputation of running fast, there are still many cases in which performance will suffer needlessly due to unexpected implementation details. This blog will examine one of those case.

II. The problem

So let’s start with one common optimization case: Loop invariant code motion. The compiler will check for any invariant and move it out of the loop so we don’t have to recalculate for each iteration. So we will need one simple case where the cost of recalculating the invariant outweight the primary calculation. Here is one  naive and very simple example:

  def intersect1(s1: Seq[Int], s2: Seq[Int]): Seq[Int] =
    s2.filter(elem => s1.toSet.contains(elem))

Here we will calculate the intersection between two Seq by casting one to Set and use .contains() method on it. Without any optimization the .toSet() will be called every iteration and it is much more expensive than member testing .contains(). It’s easy to see that (s1.toSet) is a loop invariant and can be move out of the loop. We will make another version for that:

  def intersect2(s1: Seq[Int], s2: Seq[Int]): Seq[Int] = {
    val set = s1.toSet
    s2.filter(elem => set.contains(elem))
  }

The following code snippet is used for benchmarking:

  def benchmark(block: => Unit): (Double, Double) = {
    (1 to 3).foreach(_ => block) // Let the JIT engine optimize the hotspot
    val results = (1 to 31).map{_ => time(block)}
    (mean(results), variance(results))
  }

  private def time(block: => Unit): Long = {
    val t0 = System.nanoTime()
    block
    val result = System.nanoTime() - t0
    System.gc()
    result
  }

And the test data:

    val s1 = (1 to testSize).filter(_ % 2 == 0).toSeq
    val s2 = (1 to testSize).filter(_ % 7 == 0).toSeq

With testSize = 10000, here is the result:
– Intersect1: 1768ms
– Intersect2: 1.327ms

So by moving the .toSet() out of the hot loop we have sped it up more 1000 times! Examining the bytecode generated by scalac confirms that Scala wasn’t able to move it out of the loop by itself:

         0: aload_0
         1: getfield      #22                 // Field s1$3:Lscala/collection/Seq;
         4: invokeinterface #28,  1           // InterfaceMethod scala/collection/Seq.toSet:()Lscala/collection/immutable/Set;
         9: iload_1
        10: invokestatic  #34                 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
        13: invokeinterface #40,  2           // InterfaceMethod scala/collection/immutable/Set.contains:(Ljava/lang/Object;)Z
        18: ireturn

On line 3, invokeinterface call Seq.toSet() method on anonymous field #22 of the class. And this happens every iterations. Building a Set from Seq will takes log(n) while look up a member is effectively constant time (see Scala’s Collection Performance Characteristics)

One has to wonder, why didn’t scalac optimisation in this case, s2 is immutable and .toSet is also a pure function. This might be a very simple example and it’s unlikely that anyone will make this mistake, but it is not so obvious in real world codes. Aside from this there are many pitfalls such as unintended closure creation (especially in for comprehension) or too many temporary objects in functional style code (scala will create an anonymous class for every function literal), this will bog down the GC greatly and can kill performance completely. So despite all the advances in compiler technology and techniques, it’s still up to the developer to ensure that the performance

In the next post we will examine other ways of implementing above function and how scala handles them. We will examine the generated bytecode by scalac further to see the details of Scala implementation. All the code used in this post can be found on my Github, any comment/correction is welcomed.

Add a Comment

Scroll Up