data engineering overview
Row based vs Column based
Rows on disk stored in —- line by line, new rows are written efficiently but for select statement that target just a subset of columns (or fields), reading is slow because you have to scan all then filter. compression is inefficient because data type are scattered all around the files. OLTP, avro
Column on disk stored in | | | by column new row are written slowly, because one record need to update different parts of file. but for select statement that target just a subset of columns (or fields), read are faster b/c avoid scan the entire file OLAP, Parquet, ORC
Written on July 16, 2023