To gain access to various useful SparkContext functions, you need a reference to the context that encloses your session. Spark users commonly call this reference sc; for example, after you do


<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
</syntaxhighlight>


add a line like


<syntaxhighlight lang="python">
sc = spark.sparkContext
</syntaxhighlight>


and then you can use sc to access the functions described here: [http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext].
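
For example (a minimal sketch, not from the page's example script), sc lets you build RDDs and inspect cluster settings directly:

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Build a small RDD straight from a Python list
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.count())            # 4
print(sc.defaultParallelism)  # default number of partitions Spark will use
</syntaxhighlight>
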
One way to create an empty dataframe is to generate a schema as in the example script, and then pass the schema to createDataFrame along with an empty RDD as the data.


<syntaxhighlight lang="python">
myAwesomeDataset = spark.createDataFrame(data=sc.emptyRDD(), schema=myGroovySchema)
</syntaxhighlight>
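
As a self-contained sketch (the two-column schema here is hypothetical, standing in for whatever myGroovySchema holds in the example script):

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical schema for illustration
myGroovySchema = StructType([
    StructField("articleid", LongType(), True),
    StructField("timestamp", StringType(), True),
])

myAwesomeDataset = spark.createDataFrame(data=sc.emptyRDD(), schema=myGroovySchema)
myAwesomeDataset.printSchema()  # zero rows, but the columns and types are in place
</syntaxhighlight>
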


==== Pyspark string slicing seems to be non-pythonic ====


Slicing a pyspark Column with [a:b] does not follow Python's [start:stop] convention; it compiles down to Column.substr(a, b), i.e. a 1-based start position and a length. Assuming the timestamp column holds compact MediaWiki timestamps such as 20150701000000, you can access 2015 with
<syntaxhighlight lang="python">
articleDF.timestamp[1:4]
</syntaxhighlight>


And to get 07:
<syntaxhighlight lang="python">
articleDF.timestamp[5:2]
</syntaxhighlight>
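
To see both slices side by side, here is a runnable sketch (the one-row DataFrame and the compact timestamp value are assumptions for illustration):

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
articleDF = spark.createDataFrame([("20150701000000",)], ["timestamp"])

# Column slices mean substr(startPos, length), 1-based -- not Python's [start:stop]
articleDF.select(
    articleDF.timestamp[1:4].alias("year"),   # substr(1, 4) -> '2015'
    articleDF.timestamp[5:2].alias("month"),  # substr(5, 2) -> '07'
).show()
</syntaxhighlight>
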


==== When Reading In Multiple Files with Different Schemas ====


Make sure you re-instantiate your reader object, e.g.
<syntaxhighlight lang="python">
sparkReader = spark.read
</syntaxhighlight>


when changing to a new file. The reader may cache the schema of the previous file and fail to detect the new schema. To make sure you have what you're expecting, try:
 
 
<syntaxhighlight lang="python">
if DEBUG:
    yourDataset.show()
</syntaxhighlight>


This gives you behavior much like pandas' print(yourDataset.head()): the first 20 rows, nicely formatted on stdout.
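
Putting the two tips together, a sketch of reading two differently shaped files (the file names and TSV reader options here are hypothetical):

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
DEBUG = True

# Fresh reader for the first file
sparkReader = spark.read
datasetA = sparkReader.csv("fileA.tsv", sep="\t", header=True, inferSchema=True)

# Re-instantiate the reader before the second file so no schema carries over
sparkReader = spark.read
datasetB = sparkReader.csv("fileB.tsv", sep="\t", header=True, inferSchema=True)

if DEBUG:
    datasetA.show()
    datasetB.show()
</syntaxhighlight>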