Editing CommunityData:Hyak

From CommunityData

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then publish the changes below to finish undoing the edit.

Latest revision Your text
Line 144: Line 144:


More information on parallelizing your R code can be found in the [https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf <code>parallel</code> package documentation].
More information on parallelizing your R code can be found in the [https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf <code>parallel</code> package documentation].
<!-- The hyak machines have 16 cpu cores.  The Mox machines will have 28! Running your program on all the cores can speed things up a lot! We make heavy use of R for building datasets and for fitting models. Like most programming languages, R uses only one cpu by default. However, for typical computation-heavy data science tasks it is pretty easy to make R use all the cores.
For fitting models, the R installed in Gentoo should use all cores automatically. This is thanks to OpenBlas, which is a numerical library that implements and parallelizes linear algebra routines like matrix factorization, matrix inversion, and other operations that bottleneck model fitting.
However, for building datasets, you need to do a little extra work. One common strategy is to break up the data into independent chunks (for example, when building wikia datasets there is one input file for each wiki) and then use <code>mcapply</code> from <code>library(parallel)</code> to build variables from each chunk. Here is an example:
    library(parallel)
    options(mc.cores=detectCores())  ## tell R to use all the cores
   
    mcaffinity(1:detectCores()) ## required and explained below
 
    library(data.table) ## for rbindlist, which concatenates a list of data.tables into a single data.table
   
    ## imagine defining a list of wikis to analyze
    ## and a function to build variables for each wiki
    source("wikilist_and_buildvars")
   
    dataset <- rbindlist(mclapply(wikilist,buildvars))
   
    mcaffinity(rep(1,detectCores())) ## return processor affinities to the status preferred by OpenBlas
A working example can be found in the [[Message Walls]] git repository.
<code>mcaffinity(1:detectCores())</code> is required for the gentoo R <code>library(parallel)</code> to use multiple cores. The reason is technical and has to do with OpenBlas. Essentially, OpenBlas changes settings that govern how R assigns processes to cores. OpenBlas wants all processes assigned to the same core, so that the other cores do not interfere with it's fancy multicore linear algebra. However, when building datasets, the linear algebra is not typically the bottleneck. The bottleneck is instead operations like sorting and merging that OpenBlas does not parallelize.
The important thing to know is that if you want to use mclapply, you need to do <code>mcaffinity(1:detectCores())</code>. If you want to then fit models you should do <code>mcaffinity(rep(1,detectCores())</code> so that OpenBlas can do its magic. -->


=== Using the Checkpoint Queue ===
=== Using the Checkpoint Queue ===
Please note that all contributions to CommunityData are considered to be released under the Attribution-Share Alike 3.0 Unported (see CommunityData:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. Do not submit copyrighted work without permission!

To protect the wiki against automated edit spam, we kindly ask you to solve the following CAPTCHA:

Cancel Editing help (opens in new window)

Templates used on this page: