Parquet
1. Parquet
ID: 07fcb963-c208-4735-9abb-3c55c2340a5f CREATED: <2025-02-06 Thu 18:43>
1.1. glossary
ID: e71f388c-9ed1-4862-8890-7f74271e8df0 CREATED: <2025-02-06 Thu 18:43>
- block
  - the same as an HDFS block; the term keeps its HDFS meaning
- file
  - a file must contain the file metadata; it need not contain any data
- row-group
  - a logical horizontal partitioning of the data into rows. No physical structure is guaranteed for a row-group
- column-chunk
  - a chunk of the data for a particular column
- page
  - column chunks are divided into pages. A page is conceptually indivisible in terms of compression and encoding. Multiple page types can be interleaved within a column chunk
A file consists of one or more row-groups. A row-group contains exactly one column chunk per column. A column chunk contains one or more pages.
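The containment rules above can be written down as a small data model. This is only an illustrative sketch: the class names (`ParquetFile`, `RowGroup`, `ColumnChunk`, `Page`) and the helper `check_hierarchy` are hypothetical and do not correspond to any Parquet library API. It checks just the three rules stated in the paragraph above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Page:
    kind: str  # hypothetical label, e.g. "data" or "dictionary"


@dataclass
class ColumnChunk:
    column: str
    pages: list[Page] = field(default_factory=list)


@dataclass
class RowGroup:
    chunks: list[ColumnChunk] = field(default_factory=list)


@dataclass
class ParquetFile:
    columns: list[str]
    row_groups: list[RowGroup] = field(default_factory=list)


def check_hierarchy(f: ParquetFile) -> None:
    """Raise ValueError if the file breaks the containment rules from the notes."""
    # Rule 1: a file consists of one or more row-groups.
    if not f.row_groups:
        raise ValueError("file has no row-groups")
    for i, rg in enumerate(f.row_groups):
        # Rule 2: exactly one column chunk per column in every row-group.
        names = sorted(chunk.column for chunk in rg.chunks)
        if names != sorted(f.columns):
            raise ValueError(f"row-group {i} does not have one chunk per column")
        # Rule 3: every column chunk contains one or more pages.
        for chunk in rg.chunks:
            if not chunk.pages:
                raise ValueError(f"row-group {i}, column {chunk.column!r} has no pages")
```

A file with columns `a` and `b` passes only if every row-group holds one chunk for `a` and one for `b`, each with at least one page.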
1.2. format summary
ID: ae54516c-c8a8-49f8-aac6-a95c18f5de8e CREATED: <2025-02-06 Thu 18:43>
4-byte magic number "PAR1"
<Column 1 Chunk 1>
<Column 2 Chunk 1>
...
<Column N Chunk 1>
<Column 1 Chunk 2>
<Column 2 Chunk 2>
...
<Column N Chunk 2>
...
<Column 1 Chunk M>
<Column 2 Chunk M>
...
<Column N Chunk M>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
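Because the metadata length and the closing magic number sit at the very end, a reader can locate the file metadata by looking at the last 8 bytes. Below is a minimal sketch of that lookup using only the Python standard library. The function name `read_footer_location` is hypothetical, the length is assumed to be an unsigned 32-bit integer, and the sketch only finds the metadata bytes; it does not decode them.

```python
import io
import struct

MAGIC = b"PAR1"


def read_footer_location(f):
    """Return (metadata_offset, metadata_length) for a seekable binary file."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    # Smallest possible layout: leading magic + length field + trailing magic.
    if size < 12:
        raise ValueError("file too small to follow the layout")

    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing leading magic number")

    # The last 8 bytes are the 4-byte metadata length, then the 4-byte magic.
    f.seek(size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing magic number")

    # "<I" = little-endian unsigned 32-bit (unsigned is an assumption here).
    (meta_len,) = struct.unpack("<I", tail[:4])
    meta_offset = size - 8 - meta_len
    if meta_offset < 4:
        raise ValueError("metadata length exceeds file size")
    return meta_offset, meta_len
```

Usage: open a file with `open(path, "rb")`, call `read_footer_location(f)`, then `f.seek(offset)` and `f.read(length)` to get the raw file metadata bytes.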