Parquet writers offer a range of encoding and compression options that are turned off by default. Enabling these options can deliver better lossless compression for your data, but understanding which options to use is key to optimal performance, according to the NVIDIA Technical Blog.
Understanding Parquet Encoding and Compression
Parquet's encoding step reorganizes data to reduce its size while preserving access to each data point. The compression step further reduces the total size in bytes but requires decompression before the data can be accessed again. The Parquet format includes two delta encodings designed to optimize string data storage: DELTA_LENGTH_BYTE_ARRAY (DLBA) and DELTA_BYTE_ARRAY (DBA).
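As a minimal sketch of how a writer can be asked to use one of these delta encodings, the example below writes a small string column with DELTA_BYTE_ARRAY using pyarrow. The table contents and file name are placeholders, and pyarrow's column_encoding option only applies to columns with dictionary encoding disabled.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny example table; the benchmark data was far larger.
table = pa.table({"city": ["amsterdam", "amstelveen", "amersfoort", "arnhem"]})

# Request DELTA_BYTE_ARRAY for the string column. Alternative encodings can
# only be set for columns that have dictionary encoding disabled.
pq.write_table(
    table,
    "cities_dba.parquet",
    use_dictionary=False,
    column_encoding={"city": "DELTA_BYTE_ARRAY"},
)
```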
RAPIDS libcudf and cudf.pandas
RAPIDS is a suite of open-source GPU-accelerated data science libraries. In this context, libcudf is the CUDA C++ library for columnar data processing. It supports GPU-accelerated readers, writers, relational algebra functions, and column transformations. The Python cudf.pandas library accelerates existing pandas code by up to 150x.
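A minimal sketch of enabling the accelerator in a script, assuming a CUDA-capable GPU and a RAPIDS cudf installation; the file names are illustrative:

```python
# Install the cudf.pandas accelerator before importing pandas; pandas
# operations are then dispatched to the GPU via libcudf, with automatic
# fallback to CPU pandas for anything unsupported.
import cudf.pandas
cudf.pandas.install()

import pandas as pd

df = pd.read_parquet("strings.parquet")  # GPU-accelerated read
df.to_parquet("strings_zstd.parquet", compression="zstd")
```

The same effect can be achieved without code changes by launching an existing script with `python -m cudf.pandas script.py`.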
Benchmarking with Kaggle String Data
A dataset of 149 string columns, totaling 4.6 GB in file size and 12 billion characters, was used to compare encoding and compression methods. The study found less than a 1% difference in encoded size between libcudf and arrow-cpp, and a 3-8% increase in file size when using the ZSTD implementation in nvCOMP 3.0.6 compared to libzstd 1.4.8+dfsg-3build1.
String Encodings in Parquet
String data in Parquet is represented using the byte array physical type. Most writers default to RLE_DICTIONARY encoding for string data, which uses a dictionary page to map string values to integers. If the dictionary page grows too large, the writer falls back to PLAIN encoding.
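As an illustration of that fallback threshold, pyarrow exposes a dictionary_pagesize_limit knob; the sketch below raises it so a high-cardinality string column stays dictionary-encoded longer. The data and the chosen limit are arbitrary, and the exact fallback behavior depends on the writer implementation.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# High-cardinality string column: every value is distinct.
table = pa.table({"name": [f"user_{i:07d}" for i in range(100_000)]})

# RLE_DICTIONARY is the default for strings. Raising the dictionary page
# size limit delays the fallback to PLAIN encoding for large dictionaries.
pq.write_table(
    table,
    "names_dict.parquet",
    use_dictionary=True,
    dictionary_pagesize_limit=4 * 1024 * 1024,  # 4 MiB instead of the default
)
```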
Total File Size by Encoding and Compression
For the 149 string columns in the dataset, the default setting of dictionary encoding with SNAPPY compression yields a total file size of 4.6 GB. ZSTD compression outperforms SNAPPY, and both outperform the uncompressed option. The best single setting for the dataset is default-ZSTD, with further reductions possible using delta encoding in specific cases.
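A quick way to reproduce this kind of comparison on your own data is to write the same DataFrame with each codec and compare on-disk sizes; this sketch assumes pandas with the pyarrow engine and uses placeholder file names:

```python
import os
import pandas as pd

df = pd.read_parquet("strings.parquet")

# Write the same data uncompressed and with each codec, then report sizes.
for codec in (None, "snappy", "zstd"):
    out = f"strings_{codec or 'uncompressed'}.parquet"
    df.to_parquet(out, compression=codec)
    print(f"{codec or 'uncompressed'}: {os.path.getsize(out) / 1e6:.1f} MB")
```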
When to Select Delta Encoding
Delta encoding is beneficial for data with high cardinality or long string lengths, often achieving smaller file sizes. For string columns with fewer than 50 characters, DBA encoding can provide significant file size reductions, especially for sorted or semi-sorted data.
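The heuristic below sketches how candidate columns might be flagged along those lines; the cardinality threshold is an illustrative assumption rather than a value from the benchmark, and the input file is a placeholder.

```python
import pandas as pd

df = pd.read_parquet("strings.parquet")

# Flag string columns likely to benefit from delta encodings: high
# cardinality, or short (< 50 character) strings that are already sorted.
for col in df.select_dtypes(include="object").columns:
    mean_len = df[col].str.len().mean()
    cardinality = df[col].nunique() / len(df)
    is_sorted = df[col].is_monotonic_increasing
    if cardinality > 0.5 or (mean_len < 50 and is_sorted):
        print(f"{col}: candidate for DELTA_BYTE_ARRAY")
```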
Reader and Writer Performance
The GPU-accelerated cudf.pandas library showed impressive performance compared to pandas, with 17-25x faster Parquet read speeds. Using cudf.pandas with an RMM pool further improved throughput to 552 MB/s for reads and 263 MB/s for writes.
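A minimal sketch of enabling an RMM memory pool before reading with cudf; the pool size and file names are illustrative choices, not values from the benchmark:

```python
import rmm
import cudf

# Pre-allocate a GPU memory pool so repeated Parquet reads and writes
# reuse device memory instead of paying per-allocation overhead.
rmm.reinit(pool_allocator=True, initial_pool_size=2**30)  # 1 GiB pool

df = cudf.read_parquet("strings.parquet")
df.to_parquet("strings_out.parquet", compression="ZSTD")
```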
Conclusion
RAPIDS libcudf offers flexible, GPU-accelerated tools for reading and writing columnar data in formats such as Parquet, ORC, JSON, and CSV. For those looking to leverage GPU acceleration for Parquet processing, RAPIDS cudf.pandas and libcudf provide significant performance benefits.