CommunityData:Hyak Spark
To gain access to various useful SparkContext functions, you need to instantiate a pointer to the context that encloses your session. It is conventional for Spark users to call this pointer sc, e.g. after you do

 spark = SparkSession.builder.getOrCreate()

add a line like

 sc = spark.sparkContext

and then you can use sc to access the functions described here: [http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext].
One way to create an empty dataframe is to generate a schema as in the example script, and then pass the schema into the createDataFrame method, with an empty RDD object as the data.

 myAwesomeDataset = spark.createDataFrame(data=sc.emptyRDD(), schema=myGroovySchema)
==== Pyspark string slicing seems to be non-pythonic ====
You can access 2015 with

 articleDF.timestamp[1:4]

And to get 07:

 articleDF.timestamp[5:2]
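What is happening under the hood: a slice on a pyspark Column is translated to Column.substr(startPos, length) — a 1-based start position followed by a length, not Python's 0-based start:stop. A pure-Python helper mirroring that behavior (the 14-digit MediaWiki-style timestamp value is an assumption for illustration):

```python
def spark_style_slice(s, start, length):
    """Mimic pyspark's Column[start:length], i.e. Column.substr(startPos, length):
    1-indexed start position, and the second number is a length, not a stop index."""
    return s[start - 1 : start - 1 + length]

ts = "20150701123456"  # MediaWiki-style 14-digit timestamp (assumed format)
print(spark_style_slice(ts, 1, 4))  # -> 2015 (year)
print(spark_style_slice(ts, 5, 2))  # -> 07   (month)
```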
==== When Reading In Multiple Files with Different Schemas ====
Make sure you re-instantiate your reader object, e.g.

 sparkReader = spark.read

when changing to a new file. The reader may cache the schema of the previous file and fail to detect the new schema. To make sure you have what you're expecting, try a

 if DEBUG:
     yourDataset.show()

This gets you behavior analogous to pandas' print(yourDataset.head()): the first 20 rows, nicely formatted on your stdout.