Parquet writers offer a range of encoding and compression options that are turned off by default. Enabling these options can deliver better lossless compression for your data, but understanding which options to use is key to optimal performance, according to the NVIDIA Technical Blog.
Understanding Parquet Encoding and Compression
Parquet's encoding step reorganizes data to reduce its size while preserving access to each data point. The compression step further reduces the total size in bytes but requires decompression before the data can be accessed again. The Parquet format includes two delta encodings designed to optimize string data storage: DELTA_LENGTH_BYTE_ARRAY (DLBA) and DELTA_BYTE_ARRAY (DBA).
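As a minimal sketch of how a writer can be asked to use one of these delta encodings, the example below writes a small string column with DELTA_BYTE_ARRAY using pyarrow. The table contents and file name are placeholders, and pyarrow's column_encoding option only applies to columns with dictionary encoding disabled.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny example table; the benchmark data was far larger.
table = pa.table({"city": ["amsterdam", "amstelveen", "amersfoort", "arnhem"]})

# Request DELTA_BYTE_ARRAY for the string column. Alternative encodings can
# only be set for columns that have dictionary encoding disabled.
pq.write_table(
    table,
    "cities_dba.parquet",
    use_dictionary=False,
    column_encoding={"city": "DELTA_BYTE_ARRAY"},
)
```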
RAPIDS libcudf and cudf.pandas
RAPIDS is a suite of open-source GPU-accelerated data science libraries. In this context, libcudf is the CUDA C++ library for columnar data processing. It supports GPU-accelerated readers, writers, relational algebra functions, and column transformations. The Python cudf.pandas library accelerates existing pandas code by up to 150x.
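A minimal sketch of enabling the accelerator in a script, assuming a CUDA-capable GPU and a RAPIDS cudf installation; the file names are illustrative:

```python
# Install the cudf.pandas accelerator before importing pandas; pandas
# operations are then dispatched to the GPU via libcudf, with automatic
# fallback to CPU pandas for anything unsupported.
import cudf.pandas
cudf.pandas.install()

import pandas as pd

df = pd.read_parquet("strings.parquet")  # GPU-accelerated read
df.to_parquet("strings_zstd.parquet", compression="zstd")
```

The same effect can be achieved without code changes by launching an existing script with `python -m cudf.pandas script.py`.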
Benchmarking with Kaggle String Data
A dataset of 149 string columns, totaling 4.6 GB in file size and 12 billion characters, was used to compare encoding and compression methods. The study found less than a 1% difference in encoded size between libcudf and arrow-cpp, and a 3-8% increase in file size when using the ZSTD implementation in nvCOMP 3.0.6 compared to libzstd 1.4.8+dfsg-3build1.
String Encodings in Parquet
String data in Parquet is represented using the byte array physical type. Most writers default to RLE_DICTIONARY encoding for string data, which uses a dictionary page to map string values to integers. If the dictionary page grows too large, the writer falls back to PLAIN encoding.
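As an illustration of that fallback threshold, pyarrow exposes a dictionary_pagesize_limit knob; the sketch below raises it so a high-cardinality string column stays dictionary-encoded longer. The data and the chosen limit are arbitrary, and the exact fallback behavior depends on the writer implementation.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# High-cardinality string column: every value is distinct.
table = pa.table({"name": [f"user_{i:07d}" for i in range(100_000)]})

# RLE_DICTIONARY is the default for strings. Raising the dictionary page
# size limit delays the fallback to PLAIN encoding for large dictionaries.
pq.write_table(
    table,
    "names_dict.parquet",
    use_dictionary=True,
    dictionary_pagesize_limit=4 * 1024 * 1024,  # 4 MiB instead of the default
)
```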
Total File Size by Encoding and Compression
For the 149 string columns in the dataset, the default setting of dictionary encoding with SNAPPY compression yields a total file size of 4.6 GB. ZSTD compression outperforms SNAPPY, and both outperform the uncompressed option. The best single setting for the dataset is default-ZSTD, with further reductions possible using delta encoding in specific cases.
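A quick way to reproduce this kind of comparison on your own data is to write the same DataFrame with each codec and compare on-disk sizes; this sketch assumes pandas with the pyarrow engine and uses placeholder file names:

```python
import os
import pandas as pd

df = pd.read_parquet("strings.parquet")

# Write the same data uncompressed and with each codec, then report sizes.
for codec in (None, "snappy", "zstd"):
    out = f"strings_{codec or 'uncompressed'}.parquet"
    df.to_parquet(out, compression=codec)
    print(f"{codec or 'uncompressed'}: {os.path.getsize(out) / 1e6:.1f} MB")
```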
When to Select Delta Encoding
Delta encoding is beneficial for data with high cardinality or long string lengths, often achieving smaller file sizes. For string columns with fewer than 50 characters, DBA encoding can provide significant file size reductions, especially for sorted or semi-sorted data.
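The heuristic below sketches how candidate columns might be flagged along those lines; the cardinality threshold is an illustrative assumption rather than a value from the benchmark, and the input file is a placeholder.

```python
import pandas as pd

df = pd.read_parquet("strings.parquet")

# Flag string columns likely to benefit from delta encodings: high
# cardinality, or short (< 50 character) strings that are already sorted.
for col in df.select_dtypes(include="object").columns:
    mean_len = df[col].str.len().mean()
    cardinality = df[col].nunique() / len(df)
    is_sorted = df[col].is_monotonic_increasing
    if cardinality > 0.5 or (mean_len < 50 and is_sorted):
        print(f"{col}: candidate for DELTA_BYTE_ARRAY")
```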
Reader and Writer Performance
The GPU-accelerated cudf.pandas library showed impressive performance compared to pandas, with 17-25x faster Parquet read speeds. Using cudf.pandas with an RMM pool further improved throughput to 552 MB/s for reads and 263 MB/s for writes.
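A minimal sketch of enabling an RMM memory pool before reading with cudf; the pool size and file names are illustrative choices, not values from the benchmark:

```python
import rmm
import cudf

# Pre-allocate a GPU memory pool so repeated Parquet reads and writes
# reuse device memory instead of paying per-allocation overhead.
rmm.reinit(pool_allocator=True, initial_pool_size=2**30)  # 1 GiB pool

df = cudf.read_parquet("strings.parquet")
df.to_parquet("strings_out.parquet", compression="ZSTD")
```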
Conclusion
RAPIDS libcudf offers flexible, GPU-accelerated tools for reading and writing columnar data in formats such as Parquet, ORC, JSON, and CSV. For those looking to leverage GPU acceleration for Parquet processing, RAPIDS cudf.pandas and libcudf provide significant performance benefits.