1.1 RDD Basics

This notebook demonstrates the basics of RDDs (Resilient Distributed Datasets) - the low-level programming abstraction of Apache Spark.

RDDs can be thought of as immutable, possibly very long lists of elements, which can be transformed into other RDDs by operations like map or filter, or aggregated with operations such as reduce.

Spark takes care of efficiently running these operations in a distributed manner on a cluster.

sc, the SparkContext instance, is the main entry point for RDD operations.
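
To get a first feel for how this looks in practice, here is a minimal sketch (the values and variable names are arbitrary, chosen only for illustration):

numbers = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a Python list
doubled = numbers.map(lambda x: x * 2)      # transformation: double each element
large = doubled.filter(lambda x: x > 4)     # transformation: keep only elements greater than 4
total = large.reduce(lambda a, b: a + b)    # action: sum the remaining elements (6 + 8 + 10 = 24)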

Let's create a simple RDD and look at some basic operations:

In [1]:
# The most direct way to create an `RDD` is to convert an existing Python `list` to 
# an `RDD` using `parallelize`. After all, RDDs are a kind of list.

textRDD = sc.parallelize([
    "The numbers: 10 20 30 are chosen.", 
    "hello 11 there.",
    "it's 12 pm."
])

# We can also do the opposite, that is convert an `RDD` to a `list` using `collect`.
# Please be mindful however of the size of the RDD: the resulting Python `list` must fit into 
# the memory of your (driver) machine, and the whole point of using Spark is to deal with data sets 
# that are larger than that!

# It is, however, a useful function for small RDDs.

textRDD.collect()
Out[1]:
['The numbers: 10 20 30 are chosen.', 'hello 11 there.', "it's 12 pm."]
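
As a side note, parallelize also accepts an optional number of partitions to split the data into. A minimal sketch (the value 2 is an arbitrary choice here):

lettersRDD = sc.parallelize(["a", "b", "c", "d"], 2)   # explicitly request 2 partitions
print(lettersRDD.getNumPartitions())                   # -> 2
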
In [2]:
# Here are some other basic functions:

# `count` - returns the number of elements in the `RDD`

print("RDD size: %s" % textRDD.count())


# `first` - gets the first element 

print("RDD first element: %s" % textRDD.first())

# `take` - takes the first n elements as a Python `list` - useful to preview an RDD of any size
print("RDD first 2 elements:  %s" % textRDD.take(2))

# `foreach` - applies a function to each element of an RDD 
# The code below prints all the elements. 
# The output however does not appear in the notebook: it goes to the stdout of the process that 
# executes each partition (the driver's console when running locally) - check the console window or `My Cluster/View Driver Logs`

from __future__ import print_function
textRDD.foreach(print)
RDD size: 3
RDD first element: The numbers: 10 20 30 are chosen.
RDD first 2 elements:  ['The numbers: 10 20 30 are chosen.', 'hello 11 there.']
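
If you would rather see the lines in the notebook output itself, a simple alternative (a minimal sketch, suitable only for small RDDs for the same memory reasons as collect) is to bring the elements to the driver first:

for line in textRDD.collect():   # collect to the driver, then print locally
    print(line)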

We can also save the entire RDD to disk:

In [3]:
%%sh 

# Spark by default will fail if the output file exists, so we need to make sure to remove it first.
# (It might have been created by the previous run of this notebook)

rm -rf output/sample.txt
In [4]:
# save the RDD to a text file

textRDD.saveAsTextFile('output/sample.txt')
In [5]:
%%sh 

# Let's preview the results.

ls -lh output/sample.txt


# The output is a directory with the actual content partitioned amongst multiple `part-*` files.
# The simplistic explanation for this is that the RDD itself is partitioned amongst the many  
# executors (e.g. nodes in the cluster) and each executor writes its own partition.
# The `_SUCCESS` file indicates that the complete output has been successfully produced. 


# Now let's display the actual content:

echo 
echo "Content:"
cat output/sample.txt/part-*
total 24
-rw-r--r--  1 szu004  staff     0B 10 Jul 10:45 _SUCCESS
-rw-r--r--  1 szu004  staff     0B 10 Jul 10:45 part-00000
-rw-r--r--  1 szu004  staff    34B 10 Jul 10:45 part-00001
-rw-r--r--  1 szu004  staff    16B 10 Jul 10:45 part-00002
-rw-r--r--  1 szu004  staff    12B 10 Jul 10:45 part-00003

Content:
The numbers: 10 20 30 are chosen.
hello 11 there.
it's 12 pm.
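
As an aside, the number of `part-*` files corresponds to the number of partitions of the RDD, and the saved directory can be read back as a single RDD. A minimal sketch (run in a Python cell rather than a `%%sh` cell):

print(textRDD.getNumPartitions())                # number of partitions, and hence of part-* files

reloadedRDD = sc.textFile('output/sample.txt')   # Spark reads all the part-* files in the directory
print(reloadedRDD.collect())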

Let's try to solve the following simple problem using the Spark RDD API:

Find the sum of all the numbers in the given text. For example in the text:

The numbers: 10 20 30 are chosen. 
hello 11 there.
it's 12 pm.

we would identify 10, 20, 30, 11 and 12 as numbers and produce the sum of 83.

We will first go through all the transformations step by step and then combine them together.

We have already loaded the text into the textRDD:

In [6]:
# this is our initial RDD - each element is a line of text
textRDD.collect()
Out[6]:
['The numbers: 10 20 30 are chosen.', 'hello 11 there.', "it's 12 pm."]
In [7]:
# in the first step we will split each line into tokens 
# and combine all the tokens into a single RDD (list)

textRDD.flatMap(lambda line:line.split())
Out[7]:
PythonRDD[10] at RDD at PythonRDD.scala:48

Please note the output above - the result of flatMap is another RDD, and no actual computation has been performed yet.

Rather, a formula for how to transform one RDD into another has been created. It will be executed only when the actual data of the RDD is needed.

Spark makes the distinction between transformations, which lazily transform RDDs, and actions, which produce results.

Here are some examples of both:

  • transformations: map, flatMap, filter
  • actions: collect, count, reduce, saveAsTextFile

For more information on transformations vs actions please check the Spark Programming Guide.
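
One quick way to see the laziness in action is a minimal sketch like the one below (the broken lambda is deliberate and purely illustrative):

# no error is raised here, even though the lambda calls a method that does not exist,
# because the transformation is only recorded, not executed
brokenRDD = textRDD.map(lambda line: line.no_such_method())

# only an action forces the execution and surfaces the error:
# brokenRDD.count()   # uncommenting this line would raise an exception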

To preview the result of each transformation step we will call the collect action at the end:

In [8]:
textRDD \
    .flatMap(lambda line:line.split()) \
    .collect()
Out[8]:
['The',
 'numbers:',
 '10',
 '20',
 '30',
 'are',
 'chosen.',
 'hello',
 '11',
 'there.',
 "it's",
 '12',
 'pm.']
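
It is worth pausing on why flatMap (rather than map) is used here: map would produce one list of tokens per input line, whereas flatMap flattens those per-line lists into a single RDD of tokens. A minimal sketch of the map variant for comparison:

textRDD.map(lambda line: line.split()).collect()   # -> three nested lists of tokens, one per line
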
In [9]:
import re

# a simple function to check if a string is a number
# params: 
#    s : a string
# returns:
#    True if `s` starts with a number and False otherwise
def is_number(s):
    return re.match(r'[0-9]+', s) is not None


# in the second step we produce the RDD, which only includes words which are numbers
textRDD.flatMap(lambda line:line.split()) \
    .filter(is_number) \
    .collect()
Out[9]:
['10', '20', '30', '11', '12']
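
A small aside on is_number: re.match only anchors the pattern at the beginning of the string, so a token like "10x" would also pass the check. That is fine for this text, but a stricter variant (hypothetical, not used below) could anchor the end of the pattern as well:

def is_strict_number(s):
    # anchor the pattern at both ends so tokens such as "10x" are rejected
    return re.match(r'[0-9]+$', s) is not None
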
In [10]:
# in the third step we convert each `number word` to the actual `int` number

textRDD.flatMap(lambda line:line.split()) \
    .filter(is_number) \
    .map(int) \
    .collect()
Out[10]:
[10, 20, 30, 11, 12]
In [11]:
# in the last step we `reduce` the RDD by adding all its elements together 

import operator
textRDD.flatMap(lambda line:line.split()) \
    .filter(is_number) \
    .map(int) \
    .reduce(operator.add)
Out[11]:
83

We have obtained the expected result of 83.

Please note that reduce is an action (we need the actual data to add the numbers together), so it will trigger the execution of all the preceding transformation steps.
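
As a small variation, the same result can also be obtained with the RDD's built-in sum action in place of reduce with operator.add:

textRDD.flatMap(lambda line: line.split()) \
    .filter(is_number) \
    .map(int) \
    .sum()   # also returns 83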

You can now play around modifying pieces of the code.

When you are done, and if you are running on your local machine, remember to close the notebook with File/Close and Halt.