Nov 22, 2014

Analyzing Twitter Streams in Real Time using Spark

One of the most interesting features in Apache Spark (and, some people say, one that does not work) is the ability to analyze the Twitter stream in real time.

DZone released the new Spark RefCard earlier this week. Since I peer-reviewed it, this is a good time to discuss the topic.

Tuning an Out-of-the-Box Solution
Spark ships with an out-of-the-box example that, with some tuning, prints the top ten trending Twitter hashtags over 10-second and 60-second windows.
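
To make concrete what "top ten trending tags" means, here is a rough shell sketch of the same idea on a static sample: split the text into words, keep the ones starting with #, count them, and print the ten most frequent. It is not how Spark implements it (the Spark example works on a live stream and a sliding time window), and the three sample tweets and the top_tags function name are made up for illustration.

```shell
#!/usr/bin/env bash
# Use byte-wise ordering so sort and uniq behave the same on every machine.
export LC_ALL=C

# Hypothetical sample input: three made-up tweets, one per line.
tweets='Watching the show #MTVStars #music
Voting again #MTVStars
Match day #mufc #MTVStars'

# Read tweets on stdin and print the ten most frequent hashtags.
top_tags() {
  tr -s '[:space:]' '\n' |      # one word per line
    grep '^#' |                 # keep only hashtags
    sort | uniq -c |            # count occurrences of each tag
    sort -k1,1nr -k2 |          # highest count first, ties by tag name
    head -n 10 |                # top ten
    awk '{print $2 " (" $1 " tweets)"}'
}

printf '%s\n' "$tweets" | top_tags
```

The output uses the same "#tag (N tweets)" shape as the real example, so it is easy to compare against what Spark prints once the steps below are done.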

  1. Create a new Twitter App or use your existing app credentials at Twitter Apps.
  2. Download and install Java, Scala and Spark.
  3. Set the environment variables:
    1. Scala home
      export SCALA_HOME=/usr/lib/scala
    2. Add Scala to PATH so that the scala command can be run
      export PATH=$PATH:$SCALA_HOME/bin
    3. Add the location of spark-streaming-twitter_2.10-1.0.0.jar, twitter4j-core-3.0.3.jar and twitter4j-stream-3.0.3.jar to CLASSPATH
      export CLASSPATH=$CLASSPATH:/root/spark/lib/twitter4j/
  4. Run the code after tuning some parameters:
    1. Go to the Spark folder:
      cd /var/lib/spark/spark-1.0.1/
    2. If you are running on a single-core machine (or you just want to make sure you get results and not only "WARN BlockManager: Block input-0-XXXXXXXX already exists on this machine; not re-adding it"), change the master in ./bin/run-example from local[*] to local[2], so that one thread can run the receiver while another processes the data:
      sudo sed -i 's/local\[\*\]/local\[2\]/g' ./bin/run-example
    3. Run the example. Remember to write the class name including the streaming. prefix, and leave out the path, the .scala extension and any other fancy stuff:
      sudo ./bin/run-example streaming.TwitterPopularTags
  5. The result will be shown after several seconds:
    Popular topics in last 60 seconds (194 total):
    #MTVStars (42 tweets)
    #NashsNewVideo (9 tweets)
    #IShipKarma (6 tweets)
    #SledgehammerSaturday (6 tweets)
    #NoKiam (5 tweets)
    #mufc's (3 tweets)
    #gameinsight (3 tweets)
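
If you would rather not run sed with sudo on a real file before seeing what it does, the substitution from step 4.2 can be tried on a scratch file first. This is only a sketch: the MASTER_DEFAULT line is a made-up stand-in for a line containing local[*], not a copy of the actual launcher script.

```shell
#!/usr/bin/env bash
# Scratch file standing in for the launcher script. Its single line is an
# assumption about what a "local[*]" setting might look like, not a copy.
scratch="$(mktemp)"
printf '%s\n' 'MASTER_DEFAULT="local[*]"' > "$scratch"

# Same substitution as in step 4.2, applied to the scratch file only.
# -i.bak keeps a backup and works with both GNU and BSD sed.
sed -i.bak 's/local\[\*\]/local[2]/g' "$scratch"

result="$(cat "$scratch")"
echo "$result"

# Confirm nothing still asks for local[*].
if grep -q 'local\[\*\]' "$scratch"; then
  status="substitution did not apply"
else
  status="substitution applied"
fi
echo "$status"

# Clean up the scratch file and its backup.
rm -f "$scratch" "$scratch.bak"
```

The brackets and the asterisk are escaped in the search pattern because sed would otherwise read them as a character class and a repetition operator; in the replacement they are plain text.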

Bottom Line
Spark is an amazing platform. With a few small adjustments, you can be up and running with it in a few minutes.

Keep Performing,
Moshe Kaplan
