= Tips, Recipes, and Resources =

To gain access to various useful SparkContext functions, you need a reference to the context that encloses your session. Spark users conventionally call this reference <code>sc</code>; after you do

<syntaxhighlight lang="python">
spark = SparkSession.builder.getOrCreate()
</syntaxhighlight>

add a line like

<syntaxhighlight lang="python">
sc = spark.sparkContext
</syntaxhighlight>

and then you can use <code>sc</code> to access the functions described here: [http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext].

==== To create an empty dataframe ====

One way to create an empty dataframe is to generate a schema as in the example script, and then pass the schema into <code>createDataFrame</code> with an empty RDD as the data:

<syntaxhighlight lang="python">
myAwesomeDataset = spark.createDataFrame(data=sc.emptyRDD(), schema=myGroovySchema)
</syntaxhighlight>

==== Pyspark string slicing seems to be non-pythonic ====

# Strings begin at 1, not 0: <code>[0:5]</code> will yield the same result as <code>[1:5]</code>.
# <code>[x:y]</code> means "give me y total characters, starting with the character in position x."

So, given a column in <code>articleDF</code> called <code>timestamp</code> with contents like <code>20150701000000</code>, you can access <code>2015</code> with

<syntaxhighlight lang="python">
articleDF.timestamp[1:4]
</syntaxhighlight>

and get <code>07</code> with

<syntaxhighlight lang="python">
articleDF.timestamp[5:2]
</syntaxhighlight>

==== When Reading In Multiple Files with Different Schemas ====

Make sure you re-instantiate your reader object, e.g.

<syntaxhighlight lang="python">
sparkReader = spark.read
</syntaxhighlight>

when changing to a new file. The reader may cache the schema of the previous file and fail to detect the new one. To make sure you have what you're expecting, try a

<syntaxhighlight lang="python">
if DEBUG:
    yourDataset.show()
</syntaxhighlight>

This gives you the same behavior as pandas' <code>print(yourDataset.head())</code>: 20 rows, nicely formatted on stdout. Note that calling <code>show()</code>, while it displays only 20 rows, still causes all steps of the job to execute and can take a long time.

==== Getting Help & Useful Links ====

Getting help with Spark in the usual fora, such as Stack Exchange or even a straight-up Google search, seems to be a less effective strategy for pyspark than it is for ordinary Python questions and errors. Specifying pyspark in your search terms helps in getting only Python answers, but debugging an error may require looking at Java documentation, and some online recipe blogs speak of writing code "in Spark", presumably meaning the interactive console language. The Apache Spark site at [https://spark.apache.org/ https://spark.apache.org/] is useful, but not all of the example code is available in Python.

==== Join help ====

There's some good info about joins here: [http://www.learnbymarketing.com/1100/pyspark-joins-by-example/ http://www.learnbymarketing.com/1100/pyspark-joins-by-example/]
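For quick reference, here is a minimal sketch of a pyspark join. The dataframes, column names, and values below are made up for illustration; see the link above for a fuller treatment of the join types.

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two tiny, made-up dataframes.
editsDF = spark.createDataFrame(
    [("alice", 10), ("bob", 3)], ["editor", "n_edits"])
groupsDF = spark.createDataFrame(
    [("alice", "admins")], ["editor", "group"])

# A left outer join keeps every row of editsDF; editors with no
# matching row in groupsDF get null in the group column.
joinedDF = editsDF.join(groupsDF, on="editor", how="left_outer")
joinedDF.show()
</syntaxhighlight>

Passing a column name (or list of names) as <code>on</code> gives you a single copy of the join column in the result; passing an explicit condition like <code>editsDF.editor == groupsDF.editor</code> keeps both columns.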
=== Java Errors and Responses ===

When I got

 spark java.io.IOException: No subfolder can be created in .

the University of Google told me it was a disk-space issue, but shutting down the cluster and restarting it solved the problem. Maybe it hadn't been shut down cleanly the last time someone used Spark.

=== Slurm kills my job! Encountering Memory Limits ===

In theory, Spark enables you to run computations on data of any size without memory limitations. In practice, memory management issues occur. We are trying to understand these issues and to learn how to write Spark scripts that don't overuse memory.

==== Things to try if you run out of memory ====

# The 'out of memory' may be ephemeral, or due to memory management issues in some layer other than your code: try starting up a new cluster and running the same job unchanged.
# Repartition your data: increasing the number of partitions should make it easier for the Spark scheduler to avoid exceeding its memory limits (a sketch appears at the bottom of this page).
# Increase the number of nodes. This can solve the problem essentially by giving Spark more RAM to work with.
# Be careful moving data out of distributed Spark objects into normal python objects. Work your way up from a small sample to a larger one.
# You might try tweaking the memory management options in <code>$SPARK_HOME/conf/spark-env.sh</code> and <code>$SPARK_HOME/conf/spark-defaults.conf</code>. Decreasing the number of executors and the total memory allocated to them should make Spark more resilient, at the cost of performance.

=== Errors While Starting Cluster ===

Sometimes I get errors like this:

 n0650: failed to launch: nice -n 0 /com/local/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://n0649:18899

It usually seems to happen if I relinquish my spark cluster (whether I use the kill script or not) and then immediately start a new one. The error goes away if I shut down and wait a minute or two before re-launching; my assumption is that there's some cleanup work happening behind the scenes that the scheduler doesn't know about, and I need to let it finish.

And sometimes I get errors like this when trying to launch spark:

 scontrol: error: host list is empty

This means I'm in a session that doesn't know which nodes are assigned to the spark cluster, and it will launch a dysfunctional cluster. Run the stop-all script and then <code>start_spark_cluster</code> from the same session you landed in when you ran <code>get_spark_nodes</code>.

When I get errors about full logs, I go in and clean up the temporary-files folder they refer to.
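As a footnote to the memory tips above, here is a minimal sketch of the repartitioning and small-sample advice (items 2 and 4 of that list). It is not a recipe from the cluster documentation; the dataframe <code>wikiDF</code> and its columns are made up for illustration.

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A made-up dataframe standing in for your real data.
wikiDF = spark.createDataFrame(
    [(i, i % 7) for i in range(1000)], ["rev_id", "namespace"])

# Tip 2: spread the data over more partitions so that no single
# task has to hold too much of it in memory at once.
wikiDF = wikiDF.repartition(500)

# Tip 4: pull only a small sample back into ordinary python objects,
# then scale the fraction up once the small run works.
sampleRows = wikiDF.sample(False, 0.01, seed=1234).collect()
print(len(sampleRows))
</syntaxhighlight>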