Introduction
As data engineers, we constantly deal with files: receiving them from external systems, processing them through pipelines, and storing them efficiently. Yet the choice of file format is often an afterthought, despite having significant implications for storage costs, query performance and pipeline reliability.
This guide covers the formats you’ll encounter as sources, the formats you should choose for storage and when to use each.
Part 1: Source Formats (What You Receive)
These are formats you typically receive from external systems, APIs, or business users. You rarely choose them; you adapt to them.
CSV (Comma-Separated Values)
The universal exchange format. Everyone can produce it, everyone can read it.
```
order_id,customer_id,order_date,total_amount,status
1001,C-0042,2024-01-05,129.90,paid
1002,C-0081,2024-01-06,58.50,refunded
1003,C-0107,2024-01-06,12.00,paid
```

```python
import pandas as pd

# Basic reading
df = pd.read_csv("sales.csv")

# Real-world CSV: handling quirks
df = pd.read_csv(
    "messy_data.csv",
    sep=";",                  # European systems love semicolons
    encoding="latin-1",       # Because UTF-8 would be too easy
    decimal=",",              # 1.234,56 instead of 1,234.56
    thousands=".",            # needed alongside decimal="," for European numbers
    na_values=["", "NULL", "N/A", "-"],
    dtype={"zip_code": str},  # Prevent 01234 → 1234
)
```

The good:
- Human-readable, easy to debug
- Universal compatibility
- Small files compress well
The bad:
- No schema enforcement, types are guessed or lost
- No native support for nested data
- Encoding hell (UTF-8? Latin-1? Windows-1252?)
- Delimiter conflicts (what if your data contains commas?)
- Large files are slow to parse
Watch out for:
- Leading zeros getting stripped (zip codes, IDs)
- Date format ambiguity (01/02/2024: January 2nd or February 1st?)
- Newlines inside quoted fields
- Mixed encodings within the same file (yes, it happens)
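For the leading-zero and date-ambiguity gotchas above, the safest fix is to be explicit rather than letting pandas guess. A minimal sketch (column names are illustrative):

```python
import pandas as pd

# Force string dtypes for identifier-like columns and pin the date format
# instead of letting pandas infer it. Column names here are illustrative.
df = pd.read_csv(
    "orders.csv",
    dtype={"zip_code": str, "order_id": str},  # keep leading zeros
    parse_dates=["order_date"],
    date_format="%d/%m/%Y",                    # 01/02/2024 means 1 February
)
```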
JSON (JavaScript Object Notation)
The API standard. If you’re pulling data from a REST API, you’re getting JSON.
```json
{
  "order_id": 1001,
  "customer": {"id": "C-0042", "name": "Alice"},
  "items": [
    {"sku": "BK-001", "qty": 1, "price": 29.90},
    {"sku": "AC-123", "qty": 2, "price": 50.00}
  ]
}
```

```python
import json
import pandas as pd

# Single JSON object
with open("config.json") as f:
    config = json.load(f)

# JSON Lines (JSONL) - one object per line
df = pd.read_json("events.jsonl", lines=True)

# Nested JSON requiring flattening
data = {"user": {"name": "Alice", "address": {"city": "Paris"}}}
# You'll need to decide: flatten or keep nested?
```

The good:
- Self-describing with keys
- Handles nested/hierarchical data naturally
- Native support in most programming languages
- Human-readable
The bad:
- Verbose: keys repeated for every record
- No native date/datetime type
- Large files are memory-intensive to parse
- Schema can vary between records
JSON vs JSON Lines:
| Aspect | JSON | JSON Lines (.jsonl) |
|---|---|---|
| Structure | Single array/object | One JSON object per line |
| Streaming | Must load entire file | Can process line by line |
| Use case | API responses, configs | Log files, data exports |
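When a nested payload like the order above has to become a flat table, pandas can do the flattening for you. A minimal sketch with pd.json_normalize, reusing the sample order:

```python
import pandas as pd

# record_path explodes the items array into one row per item;
# meta carries order-level fields (including the nested customer id) onto each row.
order = {
    "order_id": 1001,
    "customer": {"id": "C-0042", "name": "Alice"},
    "items": [
        {"sku": "BK-001", "qty": 1, "price": 29.90},
        {"sku": "AC-123", "qty": 2, "price": 50.00},
    ],
}

df = pd.json_normalize(
    order,
    record_path="items",
    meta=["order_id", ["customer", "id"]],
)
# Columns: sku, qty, price, order_id, customer.id
```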
XML (eXtensible Markup Language)
Still prevalent in enterprise systems, SOAP APIs, and legacy integrations.
<order id="1001" status="paid">
<customer id="C-0042">Alice</customer>
<items>
<item sku="BK-001" qty="1" price="29.90" />
<item sku="AC-123" qty="2" price="50.00" />
</items>
</order>import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
# Navigating XML is verbose
for item in root.findall(".//order"):
order_id = item.find("id").text
amount = item.find("amount").textThe good:
- Rich schema validation (XSD)
- Handles complex hierarchical data
- Mature tooling in enterprise environments
The bad:
- Extremely verbose
- Painful to parse compared to JSON
- Overkill for simple data structures
When you’ll encounter it: Bank feeds, government APIs, older ERP systems, SOAP services.
Excel (.xlsx, .xls)
The format that refuses to die. Business users love it.
```python
import pandas as pd

# Simple read
df = pd.read_excel("report.xlsx", sheet_name="Data")

# Real-world Excel: multiple sheets, specific ranges
df = pd.read_excel(
    "report.xlsx",
    sheet_name="Q4 Sales",
    skiprows=3,                 # Skip header rows
    usecols="B:F",              # Only specific columns
    dtype={"Product ID": str},  # Prevent type coercion
)

# Reading all sheets
all_sheets = pd.read_excel("report.xlsx", sheet_name=None)  # Returns dict
```

The good:
- Familiar to business users
- Can contain multiple sheets, formatting, formulas
- Widely supported
The bad:
- Not designed for large datasets (slow, memory-intensive)
- Hidden formatting issues (merged cells, formulas returning errors)
- Type coercion surprises
- Binary format makes debugging harder
Watch out for:
- Merged cells breaking your parsing
- Dates stored as numbers (Excel’s date serial format)
- Formulas that return #REF! or #N/A
- Hidden rows/columns
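If a date column arrives as raw Excel serial numbers, it can usually be recovered. A minimal sketch, assuming a hypothetical serial_date column:

```python
import pandas as pd

# Excel stores dates as days since 1899-12-30 (the "serial" format).
# The serial_date column and its values are hypothetical.
df = pd.DataFrame({"serial_date": [45292, 45323]})
df["date"] = pd.to_datetime(df["serial_date"], unit="D", origin="1899-12-30")
# 45292 -> 2024-01-01, 45323 -> 2024-02-01
```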
Fixed-Width Files
Common in banking, mainframe exports, and government data.
```python
import pandas as pd

# You need to know the exact column positions
df = pd.read_fwf(
    "mainframe_export.txt",
    colspecs=[(0, 10), (10, 30), (30, 40), (40, 50)],
    names=["account_id", "name", "balance", "date"],
)
```

When you’ll encounter it: COBOL system exports, ACH files, census data.
A Note on Unstructured Data
While this guide focuses on structured and semi-structured data, you may occasionally need to handle:
- Images/Video: Usually stored as binary blobs, metadata extracted separately
- Audio: Similar to video, store raw, process with specialized tools
- PDFs: Extract text with libraries like pdfplumber or pymupdf, but expect inconsistencies
- Documents (.docx): Use python-docx for extraction
For these formats, the pattern is typically: extract what you can into structured format, store the original as a reference.
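For PDFs specifically, a minimal extraction sketch with pdfplumber (the file name is illustrative, and expect to clean up the output afterwards):

```python
import pdfplumber

# Pull raw text page by page; keep the original PDF as the source of truth.
# Layout-heavy documents (tables, multi-column pages) will need extra handling.
with pdfplumber.open("invoice.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

print(text[:500])
```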
Part 2: Storage Formats (What You Choose)
This is where you have control. The right choice depends on your query patterns, storage constraints, and tooling.
The Big Picture
| Format | Type | Best For | Avoid When |
|---|---|---|---|
| Parquet | Columnar | Analytics, large datasets | Row-level operations |
| Avro | Row-based | Streaming, schema evolution | Ad-hoc analytics |
| ORC | Columnar | Hive ecosystem | Non-Hadoop environments |
| CSV | Row-based | Small files, compatibility | Large datasets, typed data |
| JSON | Row-based | Semi-structured, flexibility | Large-scale analytics |
Columnar vs Row-Based Storage
Before diving into specific formats, it’s worth understanding the two fundamental storage approaches.
Row-based storage (CSV, JSON, Avro) stores data record by record. To read a single column, you must scan through every row.
Columnar storage (Parquet, ORC) stores data column by column. Queries that only need specific columns skip the rest entirely.

When row-based wins:
- Writing complete records frequently (streaming, event logging)
- Reading entire rows (fetch user by ID, get full order details)
- Fast appends (logs, message queues)
- Processing records one at a time
When columnar wins:
- Reading few columns across many rows (analytical queries)
- Aggregations (SUM, AVG, COUNT)
- Storage efficiency (similar values compress better together)
- Predicate pushdown (skip irrelevant data at storage level)
Think of it this way:
- Row-based = optimized for “give me everything about this one record”
- Columnar = optimized for “give me this field across millions of records”
That’s why Kafka + Avro is a classic combo (streaming = lots of writes, full records), while data warehouses use Parquet (analytics = lots of reads, few columns).
Parquet
The de facto standard for analytical workloads. If you’re building a data lake, you’re probably using Parquet.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Writing Parquet
df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.3, 30.1]})
df.to_parquet("data.parquet")

# With compression options
df.to_parquet("data.parquet", compression="snappy")  # Fast, widely supported
df.to_parquet("data.parquet", compression="gzip")    # Better compression, slower
df.to_parquet("data.parquet", compression="zstd")    # Best compression/speed ratio (check tool compatibility)

# Reading specific columns (the magic of columnar storage)
df = pd.read_parquet("data.parquet", columns=["id", "value"])

# Partitioned writes for large datasets (assumes the table has year/month columns)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    root_path="data/",
    partition_cols=["year", "month"],
)
```

Why Parquet wins for analytics:
- Columnar storage: Only reads the columns you need
- Compression: Similar data in columns compresses extremely well
- Predicate pushdown: Filters applied at storage level
- Embedded schema metadata: Types are stored in the file metadata to guide readers and reduce type ambiguity
- Ecosystem support: Spark, Pandas, DuckDB, BigQuery, Snowflake…
How predicate pushdown works:
Parquet files are organized into row groups, and each group stores metadata, specifically the min and max values for every column. When you run a query like SELECT salary WHERE age > 30, the reader checks this metadata first. If a row group’s max age is 25, the engine skips that entire block without reading the actual data. This drastically reduces I/O on large datasets.
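A minimal sketch of that query with PyArrow datasets, combining column projection with a filter that can be pushed down to the row-group level (the people.parquet file and its columns are assumptions):

```python
import pyarrow.dataset as ds

# Only the salary column is read, and row groups whose age statistics
# cannot satisfy age > 30 are skipped without touching their data pages.
dataset = ds.dataset("people.parquet", format="parquet")
table = dataset.to_table(columns=["salary"], filter=ds.field("age") > 30)
```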
The tradeoffs:
- Not human-readable
- Appending to a single file requires rewriting it (though you can append new files to a dataset)
- Overkill for tiny files
Avro
Row-based format with strong schema support. Popular in streaming and event-driven architectures.
```python
import fastavro

# Define schema
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None}
    ]
}

# Write Avro
records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": None}
]
with open("users.avro", "wb") as f:
    fastavro.writer(f, schema, records)

# Read Avro
with open("users.avro", "rb") as f:
    reader = fastavro.reader(f)
    for record in reader:
        print(record)
```

Why Avro for streaming:
- Schema evolution: Add/remove fields without breaking consumers
- Schema registry integration: Kafka + Avro is a classic combo
- Row-based: Efficient for writing complete records
- Compact binary format: Smaller than JSON
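A minimal sketch of the schema-evolution point above, reusing the users.avro file written earlier: the reader schema adds a field (country, my addition) with a default, and records written under the old schema resolve against it without breaking.

```python
import fastavro

# Reader schema adds a "country" field with a default. Records written with
# the old schema simply pick up the default instead of failing.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

with open("users.avro", "rb") as f:
    for record in fastavro.reader(f, reader_schema=reader_schema):
        print(record["country"])  # "unknown" for every record written before the change
```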
The tradeoffs:
- Poor for analytical queries (must read all columns)
- Less ecosystem support for analytical querying than Parquet
- Requires schema management
ORC (Optimized Row Columnar)
Columnar format optimized for Hive and the Hadoop ecosystem.
```python
import pyarrow.orc as orc
import pyarrow as pa

# Writing ORC
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
orc.write_table(table, "data.orc")

# Reading ORC
table = orc.read_table("data.orc")
```

ORC vs Parquet:
- ORC has better compression in some cases
- Parquet has broader ecosystem support
- ORC is native to Hive; Parquet is more universal
Practical advice: Unless you’re deep in the Hadoop/Hive ecosystem, default to Parquet.
Delta Lake / Iceberg / Hudi
Not file formats per se, but table formats built on Parquet that add:
- ACID transactions
- Time travel (query historical versions)
- Schema evolution
- Efficient upserts
If you’re building a production data lake on Databricks (Delta) or in a cloud-agnostic way (Iceberg), these are worth exploring.
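A minimal Delta Lake sketch with the deltalake (delta-rs) Python package, assuming it is installed and using illustrative paths: every write becomes a new table version on top of plain Parquet files, and time travel is just a version number.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.3, 30.1]})

write_deltalake("sales_delta", df)                 # creates version 0
write_deltalake("sales_delta", df, mode="append")  # creates version 1

latest = DeltaTable("sales_delta").to_pandas()               # 6 rows
as_of_v0 = DeltaTable("sales_delta", version=0).to_pandas()  # 3 rows (time travel)
```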
Part 3: Comparison Deep Dive
Storage Efficiency
Testing with a 1M row dataset (7 columns: id, value, category, description, date, is_active, quantity).
Benchmark run on Python 3.12, pandas 2.2, pyarrow 19. Results will vary based on hardware, data characteristics, and library versions. The benchmark script is available if you want to reproduce with your own data.
The chart below compares file size, write time, and read time for the same 1M-row dataset across formats:

| Format | File Size | Write Time | Read Time |
|---|---|---|---|
| CSV | 65.9 MB | 1.14s | 0.22s |
| CSV (gzip) | 11.7 MB | 2.70s | 0.25s |
| JSON Lines | 150.8 MB | 0.79s | 1.47s |
| Parquet (snappy) | 9.3 MB | 0.13s | 0.05s |
| Parquet (zstd) | 6.3 MB | 0.13s | 0.03s |
| Parquet (none) | 13.4 MB | 0.11s | 0.03s |
| Avro | 66.2 MB | 2.63s | 0.77s |
Key observations:
- Parquet (zstd) achieves the best compression (6.3 MB) while being the fastest to read
- JSON Lines is the most verbose at 150.8 MB; repeating the keys for every record adds up
- CSV (gzip) offers decent compression but doubles the write time
- Avro write performance suffers from the Python implementation (fastavro); JVM-based systems perform better
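If you want to sanity-check these numbers on your own data, a minimal timing sketch along these lines (not the full benchmark script mentioned above) is enough to compare two formats:

```python
import time

import numpy as np
import pandas as pd

# Generate a sample DataFrame and time write/read for CSV vs. Parquet.
n = 1_000_000
df = pd.DataFrame({
    "id": np.arange(n),
    "value": np.random.rand(n),
    "category": np.random.choice(["a", "b", "c"], size=n),
})

benchmarks = [
    ("bench.csv", lambda p: df.to_csv(p, index=False), pd.read_csv),
    ("bench.parquet", lambda p: df.to_parquet(p, compression="zstd"), pd.read_parquet),
]

for path, write, read in benchmarks:
    start = time.perf_counter()
    write(path)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    read(path)
    read_s = time.perf_counter() - start

    print(f"{path}: write {write_s:.2f}s, read {read_s:.2f}s")
```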
Column Projection
Reading only 2 of 7 columns from the same 1M row dataset:
| Format | Time |
|---|---|
| CSV | 0.13s |
| Parquet (snappy) | 0.01s |
This is where columnar formats shine: Parquet is 13x faster because it only reads the columns you need, while CSV must parse the entire file regardless.
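Reproducing that comparison is a one-liner per format (reusing the hypothetical bench files from the sketch above): usecols still forces a full scan of the CSV, while columns lets Parquet read just the two column chunks it needs.

```python
import pandas as pd

# Both calls return only two columns, but the CSV reader still scans and
# tokenizes every byte of the file; the Parquet reader touches only the
# requested column chunks.
df_csv = pd.read_csv("bench.csv", usecols=["id", "value"])
df_parquet = pd.read_parquet("bench.parquet", columns=["id", "value"])
```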
Schema Evolution Support
| Format | Add Column | Remove Column | Rename Column | Change Type |
|---|---|---|---|---|
| CSV | N/A | N/A | N/A | N/A |
| JSON | ✅ | ✅ | ❌ | ⚠️ |
| Parquet | ✅ | ✅ | ❌* | ⚠️ |
| Avro | ✅ | ✅ | ✅ (aliases) | ⚠️ |
⚠️ = Possible with compatible types (e.g., int → long)
* Note on Renaming: While raw Parquet files do not natively support column renaming (it requires a file rewrite), modern Table Formats like Apache Iceberg or Delta Lake handle this via metadata mapping. They allow you to rename columns logically without touching the underlying Parquet files.
Part 4: Decision Framework
For Ingestion Pipelines
Source format is CSV/Excel/JSON?
↓
Validate and convert to Parquet early
↓
Store raw files in a "raw" zone (for debugging/replay)
↓
Process and store in Parquet in "processed" zone
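A minimal sketch of that flow (paths and column names are illustrative): keep the original file for replay, validate and type it explicitly, then write Parquet to the processed zone.

```python
import shutil
from pathlib import Path

import pandas as pd

# Paths and column names below are illustrative.
src = Path("incoming/orders.csv")
raw_path = Path("raw") / src.name
processed_path = Path("processed/orders.parquet")

raw_path.parent.mkdir(parents=True, exist_ok=True)
processed_path.parent.mkdir(parents=True, exist_ok=True)

# 1. Keep the untouched original for debugging/replay
shutil.copy(src, raw_path)

# 2. Validate and type the data explicitly
df = pd.read_csv(raw_path, dtype={"customer_id": str}, parse_dates=["order_date"])
if df["order_id"].isna().any():
    raise ValueError("order_id must not be null")

# 3. Store as Parquet in the processed zone
df.to_parquet(processed_path, compression="zstd")
```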
Choosing Your Storage Format
Use Parquet when:
- Building a data lake or warehouse
- Running analytical queries (aggregations, filters)
- Storage cost matters
- Using Spark, Pandas, DuckDB, or any modern data tool
Use Avro when:
- Streaming data (Kafka)
- Schema evolution is critical
- Row-level operations dominate
- You need a schema registry
Use CSV/JSON when:
- Exchanging data with external systems
- Files are small (<10MB) and infrequently accessed
- Human readability is important
- Quick debugging or exploration
Use ORC when:
- You’re in a Hive-heavy environment
- Already invested in Hadoop ecosystem
Partitioning Strategy & The Small File Problem
For large datasets, your partitioning strategy is as critical as your format choice. However, engineers often fall into the trap of over-partitioning, which leads to the Small File Problem.
```
# Good: Partition by commonly filtered columns
# data/market=FR/date=2024-01-15/data.parquet

# Bad: Over-partitioning creates too many small files
# data/market=FR/date=2024-01-15/hour=14/minute=30/data.parquet
```

Parquet and Avro have metadata overhead. If you partition so granularly that you create thousands of tiny files, the metadata becomes larger than the actual data. Your query engine will spend more time opening and closing files than actually processing records.
In my tests, splitting 1M rows into 1,000 small files instead of a single file added 32% storage overhead and 1.4x read latency, and this penalty grows significantly with cloud storage (S3 list operations) and distributed engines (Spark task scheduling). See the benchmark script to test with your own data.
Rules of thumb:
- Target Size: Aim for individual files between 128MB and 1GB.
- Filter Frequently: Only partition by columns you use in nearly every WHERE clause (e.g., date or region).
- Avoid High Cardinality: Never partition by a column like user_id or order_timestamp.
Key Takeaways
- Separate source from storage: Accept the messy JSON/CSV you receive, but convert to Parquet or Avro as soon as it enters your “Bronze” or “Raw” layer.
- Parquet is your default for Analytics: For 90% of data lake use cases, Parquet’s columnar compression and predicate pushdown make it the winner for both speed and storage efficiency.
- Efficiency = Cost Savings: Choosing Parquet isn’t just a technical preference, it’s a financial one. Because cloud data warehouses (BigQuery, Athena, Snowflake) often charge based on the amount of data scanned, reducing your I/O via columnar storage can reduce your cloud bill by 80-90% compared to scanning raw JSON.
- Keep raw data: Always store the original files before transformation for debugging and potential re-processing.
- Schema matters: Use schema-on-write formats (Avro, Parquet) to catch data quality issues at the source rather than letting them break your downstream dashboards.
File formats are the foundation. Get them right, and your pipelines will be faster, cheaper, and more reliable. Get them wrong, and you’ll spend your time fighting tooling instead of solving problems.