Welcome to the Apache Spark tutorial notebooks.
This very simple notebook is designed to test that your environment is set up correctly.
Please run all cells.
The notebook should run without errors and you should see a histogram plot at the end.
(You can also check the expected output here)
Let's check that some input data is available:
%%sh
head -n 10 data/prince_by_machiavelli.txt
Let's check whether Spark is available and which version we are using (should be 2.1+):
spark.version
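`spark.version` is a plain string, so if you want the notebook to fail fast on an installation that is too old, you can compare the version numerically. A minimal sketch (the helper name `version_at_least` is ours, not part of the Spark API):

```python
def version_at_least(version, minimum=(2, 1)):
    # Compare the leading 'major.minor' part of a version string
    # like '2.4.8' against a (major, minor) tuple.
    parts = tuple(int(p) for p in version.split('.')[:2])
    return parts >= minimum

# In the notebook you could then do: assert version_at_least(spark.version)
print(version_at_least('2.4.8'), version_at_least('1.6.3'))
```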
Let's run a simple Spark program to count the occurrences of words in Machiavelli's "Prince" and display the 10 most frequent ones:
import operator
import re
wordCountRDD = sc.textFile('data/prince_by_machiavelli.txt') \
    .flatMap(lambda line: re.split(r'[^a-z\-\']+', line.lower())) \
    .filter(lambda word: len(word) > 0) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(operator.add)

top10Words = wordCountRDD \
    .map(lambda kv: (kv[1], kv[0])) \
    .sortByKey(ascending=False) \
    .take(10)
print(top10Words)
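The tokenization and counting logic above can be sanity-checked locally in plain Python, without Spark. A minimal sketch (the sample text is ours, chosen only for illustration):

```python
import re
from collections import Counter

def tokenize(line):
    # Same tokenizer as the Spark job: split on anything that is not
    # a lowercase letter, hyphen, or apostrophe; drop empty strings.
    return [w for w in re.split(r"[^a-z\-']+", line.lower()) if w]

sample = "The Prince, the prince, o'er and o'er again"
counts = Counter(tokenize(sample))

# Most frequent first, mirroring the sortByKey(ascending=False).take(10) step.
print(counts.most_common(10))
```

Note that the regex keeps hyphens and apostrophes inside tokens, so words like `o'er` survive intact.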
Let's use Spark SQL to display a table with the 10 least frequent words:
wordCountDF = spark.createDataFrame(wordCountRDD, ['word', 'count'])
bottom10Words = wordCountDF.sort('count').limit(10)
display(bottom10Words)
Let's save the results as CSV in the output directory:
wordCountDF.write.csv('output/prince-word-count.csv', mode='overwrite', header=True)
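Note that Spark writes `output/prince-word-count.csv` as a *directory* containing one `part-*` file per partition, which is why the preview below globs for `part-00000-*.csv`. If you ever need a single local CSV, you can merge the parts yourself. A minimal sketch, using synthetic part files and made-up counts purely for illustration:

```python
import csv
import glob
import os
import tempfile

# Build a fake Spark-style output directory with two part files,
# each carrying its own header row (header=True writes one per part).
outdir = tempfile.mkdtemp()
parts = [
    [('word', 'count'), ('the', '841'), ('and', '571')],
    [('word', 'count'), ('prince', '190'), ('of', '701')],
]
for i, rows in enumerate(parts):
    with open(os.path.join(outdir, 'part-%05d.csv' % i), 'w', newline='') as f:
        csv.writer(f).writerows(rows)

# Merge the parts, keeping the header from the first file only.
merged = []
for n, path in enumerate(sorted(glob.glob(os.path.join(outdir, 'part-*.csv')))):
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    merged.extend(rows if n == 0 else rows[1:])

print(merged)
```

(Alternatively, `wordCountDF.coalesce(1)` before writing produces a single part file, at the cost of collecting everything onto one executor.)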
Let's preview the output:
%%sh
head -n 10 output/prince-word-count.csv/part-00000-*.csv
Let's use matplotlib to plot a histogram of the distribution of word counts:
import matplotlib.pyplot as plt
wordCountPDF = wordCountDF.toPandas()
plt.hist(wordCountPDF['count'], bins = 10, log = True)
plt.show()
display()