Optimizing Parquet String Data Compression with RAPIDS

July 17, 2024


Jessie A Ellis
Jul 17, 2024 17:53

Discover how to optimize encoding and compression for Parquet string data using RAPIDS, leading to significant performance improvements.

Parquet writers offer a variety of encoding and compression options that are turned off by default. Enabling these options can provide better lossless compression for your data, but knowing which options to use is essential for optimal performance, according to the NVIDIA Technical Blog.

Understanding Parquet Encoding and Compression

Parquet's encoding step reorganizes data to reduce its size while preserving access to each data point. The compression step further reduces the total size in bytes, but requires decompression before the data can be accessed again. The Parquet format includes two delta encodings designed to optimize string data storage: DELTA_LENGTH_BYTE_ARRAY (DLBA) and DELTA_BYTE_ARRAY (DBA).
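The idea behind DLBA can be illustrated with a pure-Python sketch. This is a simplification, not the on-disk format: real Parquet packs the delta-encoded lengths into binary miniblocks, and the function names here are my own. The core trick is the same, though: store all string bytes in one contiguous buffer, and store the lengths separately as a first value plus successive differences.

```python
def dlba_encode(strings):
    """Sketch of DELTA_LENGTH_BYTE_ARRAY: string bytes are concatenated
    into one buffer, and the lengths are stored delta-encoded."""
    if not strings:
        return [], b""
    lengths = [len(s.encode("utf-8")) for s in strings]
    # First length as-is, then the difference from the previous length.
    deltas = [lengths[0]] + [b - a for a, b in zip(lengths, lengths[1:])]
    data = b"".join(s.encode("utf-8") for s in strings)
    return deltas, data

def dlba_decode(deltas, data):
    """Rebuild the strings by re-accumulating lengths and slicing the buffer."""
    out, pos, running = [], 0, 0
    for d in deltas:
        running += d
        out.append(data[pos:pos + running].decode("utf-8"))
        pos += running
    return out
```

Because similar-length strings produce small deltas, the lengths column compresses well even when the string bytes themselves do not.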

RAPIDS libcudf and cudf.pandas

RAPIDS is a suite of open-source accelerated data science libraries. In this context, libcudf is the CUDA C++ library for columnar data processing. It provides GPU-accelerated readers, writers, relational algebra functions, and column transformations. The Python cudf.pandas library accelerates existing pandas code by up to 150x.
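cudf.pandas is designed as a drop-in accelerator: it is enabled before pandas is imported and falls back to CPU pandas for unsupported operations. A minimal sketch of the documented enablement modes follows; it requires a CUDA-capable GPU and the cudf package installed, so treat it as environment configuration rather than a runnable test.

```python
# In a Jupyter notebook, load the extension before importing pandas:
#   %load_ext cudf.pandas

# From the command line, run an unmodified pandas script under the accelerator:
#   python -m cudf.pandas my_script.py

# Programmatically, install the accelerator before the pandas import:
import cudf.pandas
cudf.pandas.install()

import pandas as pd  # now GPU-backed by cudf where supported
```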

Benchmarking with Kaggle String Data

A dataset of 149 string columns, comprising 4.6 GB total file size and a 12 billion total character count, was used to test encoding and compression methods. The study found less than a 1% difference in encoded size between libcudf and arrow-cpp, and a 3-8% increase in file size when using the ZSTD implementation in nvCOMP 3.0.6 compared to libzstd 1.4.8+dfsg-3build1.

String Encodings in Parquet

String data in Parquet is represented using the byte array physical type. Most writers default to RLE_DICTIONARY encoding for string data, which uses a dictionary page to map string values to integers. If the dictionary page grows too large, the writer falls back to PLAIN encoding.
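The dictionary-with-fallback behavior can be sketched in a few lines of Python. This is an illustration of the mechanism, not Parquet's actual page layout (the real writer also run-length-encodes the indices, and the size threshold and function names here are my own):

```python
def dict_encode(values, max_dict_size=1024 * 1024):
    """Sketch of RLE_DICTIONARY with PLAIN fallback: map each distinct
    string to an integer index; if the dictionary page would grow past
    the limit, give up and return the values unencoded."""
    dictionary, indices = {}, []
    dict_bytes = 0
    for v in values:
        if v not in dictionary:
            dict_bytes += len(v.encode("utf-8"))
            if dict_bytes > max_dict_size:
                return "PLAIN", values          # dictionary page too large
            dictionary[v] = len(dictionary)     # next available index
        indices.append(dictionary[v])
    return "RLE_DICTIONARY", (list(dictionary), indices)
```

Low-cardinality columns stay comfortably under the limit and encode as small integer indices; high-cardinality columns blow past it, which is exactly the situation where the delta encodings below become attractive.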

Total File Size by Encoding and Compression

For the 149 string columns in the dataset, the default setting of dictionary encoding and SNAPPY compression yields a total 4.6 GB file size. ZSTD compression outperforms SNAPPY, and both outperform the uncompressed option. The best single setting for the dataset is default-ZSTD, with further reductions possible using delta encoding for specific cases.

When to Choose Delta Encoding

Delta encoding is useful for data with high cardinality or long string lengths, often achieving smaller file sizes. For string columns with fewer than 50 characters, DBA encoding can provide significant file size reductions, especially for sorted or semi-sorted data.
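Why sorted data favors DBA is easy to see in a simplified sketch: each value stores only the length of the prefix it shares with the previous value plus its remaining suffix (in the real format those suffixes are then themselves DLBA-encoded; the function names are my own).

```python
import os

def dba_encode(strings):
    """Sketch of DELTA_BYTE_ARRAY: per value, store the shared-prefix
    length with the previous value and the remaining suffix."""
    prefix_lens, suffixes, prev = [], [], ""
    for s in strings:
        p = len(os.path.commonprefix([prev, s]))
        prefix_lens.append(p)
        suffixes.append(s[p:])
        prev = s
    return prefix_lens, suffixes

def dba_decode(prefix_lens, suffixes):
    """Rebuild each string from the previous one's prefix plus the suffix."""
    out, prev = [], ""
    for p, suf in zip(prefix_lens, suffixes):
        prev = prev[:p] + suf
        out.append(prev)
    return out
```

On sorted identifiers like `user_0001, user_0002, …` adjacent values share long prefixes, so almost the entire string collapses into the prefix length and only a character or two remains as suffix; on shuffled data the shared prefixes vanish and DBA loses its advantage.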

Reader and Writer Performance

The GPU-accelerated cudf.pandas library showed impressive performance compared to pandas, with 17-25x faster Parquet read speeds. Using cudf.pandas with an RMM pool further improved throughput, to 552 MB/s read and 263 MB/s write speeds.
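The RMM pool mentioned above pre-allocates a block of GPU memory so that repeated allocations during reads and writes are served from the pool instead of hitting the CUDA allocator each time. A minimal configuration sketch using RMM's documented reinitialize call (the 4 GiB pool size is an illustrative value, not taken from the benchmark, and this requires a CUDA GPU with the rmm package installed):

```python
import rmm

# Pre-allocate a 4 GiB memory pool on the current GPU; subsequent cudf
# allocations draw from this pool rather than calling cudaMalloc.
rmm.reinitialize(pool_allocator=True, initial_pool_size=4 * 1024**3)
```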

Conclusion

RAPIDS libcudf offers flexible, GPU-accelerated tools for reading and writing columnar data in formats such as Parquet, ORC, JSON, and CSV. For those looking to leverage GPU acceleration for Parquet processing, RAPIDS cudf.pandas and libcudf provide significant performance benefits.

Image source: Shutterstock

