May 30, 2015

1 Click from Code to Prod: Spark, Scala, sbt, Intellij and Hadoop

You will probably find dozens of Q&A articles on how to create a new Scala project using Intellij and submit it to a remote Hadoop-based Spark cluster. However, none of them is actually complete or shows the full picture.
This post is going to save you a lot of time, so stay tuned...

Expected Outcome: an environment that lets you move, in one click, from writing your Scala code to submitting it to a remote YARN-based Spark cluster.

Note: some issues like "no spaces" may be overcome using double quotes or other methods, but we recommend that you follow the process as described and keep things simple to avoid unexpected outcomes.

  1. Install JDK. Make sure JDK is placed in a folder w/o spaces (for example C:\Java)
  2. Configure the JAVA_HOME environment variable to point to your installation location
  3. Download and install the latest Intellij IDE
  4. Install the Intellij Scala plugin
  5. Download and install Scala. Again, make sure Scala is placed in a folder w/o spaces (for example C:\scala)
  6. Set the environment variable for Scala
  7. Download and install sbt 0.13.8. Again make sure sbt is placed in a folder w/o spaces (for example C:\sbt)
  8. Add the sbt folder to the PATH environment variable
  9. On Windows, you will need to 1) download winutils.exe; 2) create a Hadoop folder; and 3) place winutils.exe in a bin folder inside it, in order to avoid the following error:
    Failed to locate the winutils binary in the hadoop binary path Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  10. If your environment connects to the internet through a proxy, add the following proxy parameters for both HTTP and HTTPS to the JAVA_OPTIONS environment variable: -Dhttp.proxyPort=8080 -Dhttps.proxyPort=8080
Note: when working w/ Intellij, consider using the Early Access Program builds to get fixes quickly, especially in a fast-moving environment like Scala and Spark.
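
For step 9, Hadoop's client code looks for winutils.exe under the folder named by the hadoop.home.dir JVM system property or, failing that, the HADOOP_HOME environment variable; the "null" in the error message is what shows up when neither is set. The following is a minimal sketch of setting the property from code, before any Spark or Hadoop class is used. The object name WinutilsSetup and the folder C:\hadoop are assumptions made for illustration, so replace the folder with the one you created.

```scala
object WinutilsSetup {
  // Hypothetical location: the Hadoop folder from step 9,
  // which is expected to contain bin\winutils.exe
  val hadoopHome = "C:\\hadoop"

  // Sets hadoop.home.dir only when neither the system property
  // nor the HADOOP_HOME environment variable is already defined
  def configure(): Unit = {
    val alreadySet =
      sys.props.get("hadoop.home.dir").isDefined ||
        sys.env.get("HADOOP_HOME").isDefined
    if (!alreadySet) {
      System.setProperty("hadoop.home.dir", hadoopHome)
    }
  }

  def main(args: Array[String]): Unit = {
    configure()
    println("Hadoop home is configured: " +
      (sys.props.contains("hadoop.home.dir") || sys.env.contains("HADOOP_HOME")))
  }
}
```

Calling WinutilsSetup.configure() as the first line of your main method should have the same effect as defining HADOOP_HOME in the Windows environment settings; the environment variable is the simpler choice when you control the machine.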

Create the Initial Project
  1. Create a new Scala (and not an sbt one) project in Intellij
  2. Select the right JDK version (based on the installation you made before)
  3. Select the Scala SDK that matches the cluster (see below): click Create and select the right one (or click Download to get it).
Create the Basic Project Files
This should be done in the file system and not inside the Intellij IDE to avoid surprises.
  1. Create a build.sbt file in the project root. Pay attention to:

    1. Matching the Spark client and cluster versions.
    2. Matching the Hadoop client and cluster versions.
    3. Matching the Scala and Spark versions by looking for the spark-core package in Maven Central. In our case, look for the Spark cluster version (1.2.0) and then get the matching Scala version (2.10) from the ArtifactId. The minor version can be found on the Scala site.

      (Figure: the various spark-core versions and the matching Spark and Scala versions)
    4. Adding the "provided" keyword to the library dependencies in order to avoid jar clashes when building the project:
      [error] (*:assembly) deduplicate: different file contents found in the following:
      [error] \.ivy2\cache\javax.activation\activation\jars\activation-1.1.jar:javax/activation/ActivationDataFlavor.class
      [error] \.ivy2\cache\org.eclipse.jetty.orbit\javax.activation\orbits\javax.activation-1.1.0.v201105071233.jar:javax/activation/ActivationDataFlavor.class
    5. Re-including the "provided" dependencies when running: dependencies marked "provided" are left out of the run classpath, so sbt assembly may run correctly while make (or sbt run) does not. To avoid this, include the following line in your build.sbt (and not in your assembly.sbt): run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
    6. Excluding the javax.servlet files to avoid the following error:
      [error] (run-main-0) java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
      java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
              at java.lang.ClassLoader.checkCerts(
              at java.lang.ClassLoader.preDefineClass(
              at java.lang.ClassLoader.defineClass(
    7. Keeping a blank line between every two settings in build.sbt.
    8. From the command line in the project root run:
      1. sbt
      2. sbt update
      3. sbt assembly
      4. sbt run
    9. If you get the following exception while running, do not worry: it is just a cleanup issue and you can disregard it:
      ERROR Utils: Uncaught exception in thread SparkListenerBus
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(
  2. Create a /project/assembly.sbt file and run sbt assembly to verify that the project jar is being created. This file lets you avoid the following errors:
    [error] Not a valid command: assembly
    [error] Not a valid project ID: assembly
    [error] Not a valid configuration: assembly
    [error] Not a valid key: assembly
    [error] assembly
  3. Create a /src/main/scala/SimpleApp.scala file (or any other original name for your project main file). Make sure to include:
    1. Your Spark master, which will serve your jar, set in the Spark conf using setMaster, in order to avoid the following error:
      "A master URL must be set in your configuration"
    2. The number of cores to use and the memory allocated per executor, if you have limited resources (and you will have).
    3. The path where your compiled jar is located, using setJars. After building your project for the first time, you will find the jar inside the target folder in your project. If you do not configure setJars, you will get messages that Spark cannot find your jar.
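
As a rough sketch of the three points above, a Spark conf that sets the master, the resource limits and the jar path could look like the following. The object name SubmitConf is hypothetical. The master string "yarn-client" is an assumption for a YARN cluster on Spark 1.x (a standalone cluster would use a spark://host:port URL instead), and the jar path assumes the default sbt-assembly output name for a project called SimpleApp with no explicit version, so check your own target folder for the real file name.

```scala
import org.apache.spark.SparkConf

object SubmitConf {
  // Assumption: Spark 1.x master string for submitting to YARN from the client
  val master = "yarn-client"

  // Assumption: default sbt-assembly jar name and location; verify in your target folder
  val assemblyJar = "target/scala-2.10/SimpleApp-assembly-0.1-SNAPSHOT.jar"

  // Builds a conf with the master, the jar to ship and the resource limits
  def build(): SparkConf =
    new SparkConf()
      .setAppName("Simple Application")
      .setMaster(master)
      .setJars(Seq(assemblyJar))
      .set("spark.executor.memory", "64m")
      .set("spark.cores.max", "4")
}
```

Passing SubmitConf.build() to new SparkContext(...) would then replace a conf that sets only the application name and resource limits.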
The /build.sbt file:
name := "SimpleApp"

scalaVersion := "2.10.4"

run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.5.0" % "provided" excludeAll ExclusionRule(organization = "javax.servlet")

libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.5.0" % "provided" excludeAll ExclusionRule(organization = "javax.servlet")

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
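
The %% in the spark-core line asks sbt to append the project's Scala binary version to the artifact name, which is why scalaVersion has to match the Scala version the cluster's Spark was built with. As a sketch, assuming scalaVersion is 2.10.x as above, the two declarations below should resolve the same artifact (use only one of them in a real build.sbt):

```scala
// %% appends the Scala binary version (_2.10) to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

// Equivalent explicit form with a single % and the full ArtifactId
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0" % "provided"
```

The hadoop-client and hadoop-hdfs lines use a single % because those are plain Java artifacts with no Scala version in their names.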

The /project/assembly.sbt file:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

The /src/main/scala/SimpleApp.scala file:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // HDFS path of the file to scan
    val logFile = "hdfs://*"
    // Limit the executor memory and the total number of cores used
    val conf = new SparkConf()
        .setAppName("Simple Application")
        .set("spark.executor.memory", "64m")
        .set("spark.cores.max", "4")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    // Count the lines containing "a" and the lines containing "b"
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    // Release the cluster resources once the job is done
    sc.stop()
  }
}

Arrange the IDE and Submit your First Job
  1. Create a new configuration.
  2. Add sbt assembly to the "Before Launch" configuration in order to generate a jar file:
    1. Add a new task to "Before Launch"
    2. Add a new external tool
    3. Set the sbt location as the program and "assembly" as the parameters
Bottom Line
It may take a little time to set up a proper Scala and Spark configuration in Intellij, but the result is worth it!

Keep Performing,
Moshe Kaplan

