CommunityData:Hyak Datasets
Parquet is a [https://en.wikipedia.org/wiki/Column-oriented_DBMS column-oriented format], which means that each column can be read independently of the others. This confers two key advantages over formats that must be read row by row (such as CSV), and these can make reads very fast. First, the <code>filter</code> runs only on the <code>subreddit</code> column to figure out which rows need to be read for the other fields. Second, only the columns that are selected in <code>columns=</code> need to be read at all. This is why Arrow can pull data from Parquet so quickly.
=== Streaming parquet datasets ===
If the data you want to pull exceed available memory, you have a few options.