CommunityData:Hyak Spark

Apache Spark is a powerful system for writing programs that deal with large datasets. The most likely reason for using Spark on Hyak is that you run into memory limitations when building variables. For example, suppose you want to compute:

  • The number of prior edits a Wikipedia editor has made, for every editor and every edit.
  • The number of views for every Wikipedia page for every month and every edit.
  • The number of times every Reddit user has commented on every thread on every subreddit for every week.

You might try writing a program to build these variables using a data science tool like pandas in Python, or data.table or plyr in R. These common data science tools are powerful, expressive, and fast, but they do not work when the data does not fit in memory. When a table does not fit in memory but the computation you want to do only requires operating on one row at a time (such as a simple transformation or aggregation), you can often work around this limitation by writing a simple custom program that operates in a streaming fashion. However, when the computation cannot be done one row at a time, such as a sort, group by, or join, a streaming solution will not work.

In this case your options are limited. One option is writing bespoke code to perform the required operations and build the variables. However, this can be technically challenging and time-consuming work. Moreover, your eventual solution is likely to be relatively slow and difficult to extend or maintain compared to a solution built using Spark. A number of us (Nate, Jeremy, Kaylea) have at some point written bespoke code for computing user-level variables on Wikipedia data. The infamous "million file problem" resulted from abusing the filesystem to perform a massive group by.
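
For a concrete sense of what this looks like, here is a minimal PySpark sketch of the first example above: counting each editor's prior edits at every edit. The file name and the editor and timestamp columns are hypothetical placeholders, not a real dataset on Hyak.

   from pyspark.sql import SparkSession, functions as f
   from pyspark.sql.window import Window

   # start (or reuse) a local Spark session
   spark = SparkSession.builder.appName("prior_edits").getOrCreate()

   # hypothetical input: one row per edit, with editor and timestamp columns
   edits = spark.read.csv("enwiki_edits.tsv", sep="\t", header=True)

   # rank each editor's edits by time; prior edits = rank - 1
   w = Window.partitionBy("editor").orderBy("timestamp")
   edits = edits.withColumn("prior_edits", f.row_number().over(w) - 1)

   edits.select("editor", "timestamp", "prior_edits").show()

Spark partitions the data and spills to disk as needed, so the same code works whether or not the table fits in memory.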

This page will help you decide if you should use Spark on Hyak for your problem and provide instructions on how to get started.

Pros and Cons of Spark

The main advantages of Spark on Hyak:

  1. Work with "big data" without ever running out of memory.
  2. You get very good parallelism for free.
  3. Distribute computational work across many Hyak nodes so your programs run faster.
  4. Common database operations (select, join, group by, filter) are pretty easy (see the sketch after this list).
  5. Spark supports common statistical and analytical tasks (stratified sampling, summary and pairwise statistics, common and simple models).
  6. Spark is a trendy technology that lots of people know or want to learn.
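
As an example of item 4, the usual relational operations are one-liners on a Spark DataFrame. This sketch reuses the hypothetical edits table from above; the column names are still placeholders.

   from pyspark.sql import functions as f

   # per-editor edit counts (group by + aggregate)
   per_editor = edits.groupBy("editor").agg(f.count("*").alias("n_edits"))

   # filter, then join the counts back on
   recent = edits.filter(f.col("timestamp") >= "2018-01-01")
   joined = recent.join(per_editor, on="editor")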

The main disadvantages of Spark are:

  1. It takes several steps to get the cluster up and running.
  2. The programming paradigm is not super intuitive, especially if you are not familiar with SQL databases or lazy evaluation (see the sketch after this list).
  3. Effectively developing your Spark code means getting it set up on your own laptop, which isn't trivial.
  4. Doing more advanced things requires programming in Scala.
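
On the lazy evaluation point in item 2: Spark transformations only describe a computation, and no work happens until an action forces it, which can be surprising when debugging. A small sketch, again with the hypothetical edits table:

   from pyspark.sql import functions as f

   # filter() is a transformation: this line returns instantly and does no real work
   active = edits.filter(f.col("prior_edits") > 100)

   # count() is an action: only here does Spark actually read and process the data
   n_active = active.count()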

Getting Started with Spark

The first thing to do is to get a working standalone Spark installation on your laptop. You should develop your Spark code on your laptop before running it on Hyak. To get Spark working on your laptop, you first need to install the Oracle Java Development Kit (Oracle JDK) and then install Spark.

Installing Java

To install Java, all that should be required is to download and unzip the software, set the $JAVA_HOME environment variable, and add the Java programs to your $PATH. We have Java 8 on Hyak.

  1. Download the Java JDK appropriate for your operating system from http://www.oracle.com/technetwork/pt/java/javase/downloads/jdk8-downloads-2133151.html.
  2. Unpack the archive where you want, for example /home/you/Oracle_JDK.
  3. Edit your environment variables (e.g., in your .bashrc) to:
   export JAVA_HOME=/home/you/Oracle_JDK
   export PATH=$JAVA_HOME/bin:$PATH
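
To check that this worked, open a new shell (or re-source your .bashrc) and confirm that the java binary resolves:

   source ~/.bashrc
   java -version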


Installing Spark

Now we can install Spark.
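
The steps mirror the Java install: download a prebuilt Spark release, unpack it, and point your environment at it. Below is a minimal sketch; the 2.3.1 release and the paths are just examples, so substitute a current download from https://spark.apache.org/downloads.html.

   # download and unpack a prebuilt release (version and paths are examples)
   wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
   tar xzvf spark-2.3.1-bin-hadoop2.7.tgz -C /home/you/

Then, as with Java, add Spark to your environment (e.g., in your .bashrc):

   export SPARK_HOME=/home/you/spark-2.3.1-bin-hadoop2.7
   export PATH=$SPARK_HOME/bin:$PATH

Once that is done, running pyspark should start a local Spark session.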