== Building Parquet Datasets ==

Pushshift dumps are huge compressed JSON files with a lot of metadata that we may not need. The dumps aren't indexed, so it's expensive to pull data for just a handful of subreddits, and it also turns out to be a pain to read the compressed files straight into Spark. Extracting useful variables from the dumps and building parquet datasets makes them easier to work with. This happens in two steps (sketched below):

# Extracting JSON into (temporary, unpartitioned) parquet files using pyarrow.
# Repartitioning and sorting the data using pyspark.

The final datasets are in <code>/gscratch/comdata/output</code>:

* <code>reddit_comments_by_author.parquet</code> has comments partitioned and sorted by username (lowercase).
* <code>reddit_comments_by_subreddit.parquet</code> has comments partitioned and sorted by subreddit name (lowercase).
* <code>reddit_submissions_by_author.parquet</code> has submissions partitioned and sorted by username (lowercase).
* <code>reddit_submissions_by_subreddit.parquet</code> has submissions partitioned and sorted by subreddit name (lowercase).

Breaking this down into two steps is useful because it allows us to decompress and parse the dumps in the backfill queue and then sort them in Spark. Partitioning the data makes it possible to efficiently read data for specific subreddits or authors. Sorting it means that you can efficiently compute aggregations at the subreddit or user level. More documentation on using these files is available [https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets here].
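A minimal sketch of the first step, assuming a zstd-compressed dump; the file names, field list, schema, and chunk size here are illustrative placeholders, not the pipeline's actual values:

<syntaxhighlight lang="python">
# Step 1 sketch: stream a zstd-compressed Pushshift dump and write selected
# fields to a temporary, unpartitioned parquet file with pyarrow.
# RC_2023-01.zst, FIELDS, and the 100k chunk size are assumptions.
import io
import json

import pyarrow as pa
import pyarrow.parquet as pq
import zstandard as zstd

FIELDS = ["id", "author", "subreddit", "created_utc", "body", "score"]
SCHEMA = pa.schema([
    ("id", pa.string()),
    ("author", pa.string()),
    ("subreddit", pa.string()),
    ("created_utc", pa.int64()),
    ("body", pa.string()),
    ("score", pa.int64()),
])

def rows(path):
    """Yield one dict per comment, keeping only the fields we need."""
    with open(path, "rb") as f:
        # Pushshift zst files need a large decompression window.
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(f)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            obj = json.loads(line)
            yield {k: obj.get(k) for k in FIELDS}

with pq.ParquetWriter("comments_tmp.parquet", SCHEMA) as writer:
    batch = []
    for row in rows("RC_2023-01.zst"):
        batch.append(row)
        if len(batch) >= 100_000:  # flush in chunks to bound memory use
            writer.write_table(pa.Table.from_pylist(batch, schema=SCHEMA))
            batch = []
    if batch:
        writer.write_table(pa.Table.from_pylist(batch, schema=SCHEMA))
</syntaxhighlight>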
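A minimal sketch of the second step for the comments-by-subreddit dataset; the temporary input path, partition count, and secondary sort key are illustrative assumptions:

<syntaxhighlight lang="python">
# Step 2 sketch: repartition the temporary parquet by (lowercased) subreddit
# and sort within each partition, then write the final dataset.
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/gscratch/comdata/tmp/comments_tmp.parquet")
df = df.withColumn("subreddit", f.lower(f.col("subreddit")))

(df.repartition(2400, "subreddit")            # cluster rows with the same key
   .sortWithinPartitions("subreddit", "created_utc")
   .write
   .mode("overwrite")
   .parquet("/gscratch/comdata/output/reddit_comments_by_subreddit.parquet"))
</syntaxhighlight>

The by-author datasets would follow the same pattern with the lowercased author column as the partition and sort key.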
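As an illustration of why partitioning pays off, a single subreddit can be read without scanning the whole dataset. This sketch assumes the output opens as a pyarrow dataset and has a lowercase <code>subreddit</code> column; see the linked documentation above for the supported ways of reading these files:

<syntaxhighlight lang="python">
# Read one subreddit's comments; parquet row-group statistics let pyarrow
# skip everything else.
import pyarrow.dataset as ds

dataset = ds.dataset(
    "/gscratch/comdata/output/reddit_comments_by_subreddit.parquet",
    format="parquet",
)
askreddit = dataset.to_table(filter=ds.field("subreddit") == "askreddit")
</syntaxhighlight>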