A comprehensive hands-on learning repository for Apache Iceberg, designed for data engineers and professionals working with modern data lake architectures, with a special focus on MinIO object storage integration.
Apache Iceberg is a table format for large analytical datasets. Think of it as a specification for organizing collections of files in object storage to behave like a proper database table with ACID transactions, schema evolution, and time travel capabilities.
Iceberg's architecture is remarkably similar to container images:

```
Docker Container Image           │  Iceberg Table
├── manifest.json (metadata)     │  ├── metadata.json (table schema, snapshots)
├── config.json (configuration)  │  ├── version-hint.text (current version)
└── layers/ (tar.gz files)       │  └── data/ (parquet files)
```
Both use:
- Layered architecture with file reuse
- Immutable artifacts (files never change)
- Metadata-driven assembly
- Content addressing for integrity
- Incremental updates without rebuilding everything
Common Confusion: Are Parquet and Iceberg competing formats?
Reality: They work at different layers:
- Parquet = File format (how individual files store data efficiently)
- Iceberg = Table format (how collections of files are organized into logical tables)
```
Iceberg Table
├── metadata/ (JSON files tracking schema, partitions, snapshots)
└── data/
    ├── file1.parquet ← Parquet handles efficient columnar storage
    ├── file2.parquet ← Parquet handles efficient columnar storage
    └── file3.parquet ← Parquet handles efficient columnar storage
```
Analogy: Parquet is like individual books, Iceberg is the library catalog system.
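You can see the two layers directly from PyIceberg: Iceberg's metadata decides which files belong to the table, while the rows live in ordinary Parquet files. A minimal sketch, assuming a configured catalog and a hypothetical `logs.app_logs` table:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")            # assumes a catalog configured in .pyiceberg.yaml
table = catalog.load_table("logs.app_logs")  # hypothetical table name

# The metadata layer plans the scan; each task points at a plain Parquet file.
for task in table.scan().plan_files():
    print(task.file.file_path)
```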
The combination of object storage (like MinIO) with Iceberg creates a powerful modern data architecture:
- Massive scalability - Store petabytes cost-effectively
- Decoupled compute and storage - Scale independently
- Multi-engine access - Same data, different processing engines
- Cloud-native design - Works across on-premises and cloud
Iceberg brings:
- ACID transactions - Atomic, consistent, isolated, durable operations
- Schema evolution - Add/modify columns without breaking existing data
- Time travel - Query historical versions of your data
- Snapshot isolation - Consistent reads even during writes
- Performance optimizations - File pruning, predicate pushdown
MinIO adds:
- S3 compatibility - Works with existing tools and workflows
- High performance - Optimized for analytical workloads
- On-premises control - Keep sensitive data in-house
- Cost efficiency - Much cheaper than traditional data warehouses
- Kubernetes native - Easy container orchestration
Every change to an Iceberg table creates an immutable snapshot - like Git commits for data. This enables (see the sketch after this list):
- Time travel queries - Query your data as it existed at any point
- Rollback capabilities - Safely revert problematic changes
- Audit trails - Complete history of all data modifications
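A minimal time-travel sketch with PyIceberg, reusing the assumed catalog and hypothetical `logs.app_logs` table:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")            # assumes a catalog configured in .pyiceberg.yaml
table = catalog.load_table("logs.app_logs")  # hypothetical table name

# One history entry per committed change, oldest first.
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)

# Read the table as it existed at its first snapshot.
first = table.history()[0]
old_rows = table.scan(snapshot_id=first.snapshot_id).to_arrow()
print(old_rows.num_rows)
```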
Add new columns or modify existing ones without breaking existing queries (see the sketch after this list):
- Safe changes - Old queries continue to work
- Backward compatibility - New fields get NULL values in old data
- No downtime - Schema changes are instantaneous
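A minimal schema-evolution sketch with PyIceberg, under the same assumptions (the `user_agent` column is illustrative):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("default")
table = catalog.load_table("logs.app_logs")  # hypothetical table name

# Add an optional column; the change commits atomically as new metadata.
# Existing data files are untouched - old rows simply read back as NULL here.
with table.update_schema() as update:
    update.add_column("user_agent", StringType())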
Iceberg tracks everything through JSON metadata files (see the sketch after this list):
- Table schema - Field definitions with unique IDs
- Partition information - How data is organized
- File statistics - Enable query optimization
- Snapshot history - Complete change tracking
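You can poke at these files with nothing but the standard library; the metadata path below is hypothetical and depends on your catalog's warehouse layout:

```python
import json

# Current metadata file for a table (hypothetical warehouse path).
with open("warehouse/logs/app_logs/metadata/v2.metadata.json") as f:
    metadata = json.load(f)

# The active schema is the one whose schema-id matches current-schema-id.
current = next(
    s for s in metadata["schemas"]
    if s["schema-id"] == metadata["current-schema-id"]
)
for field in current["fields"]:
    print(field["id"], field["name"], field["type"])

print(f"{len(metadata['snapshots'])} snapshots recorded")
```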
This repository contains progressive learning projects:
Focus: Fundamentals of Iceberg table operations
- Create your first Iceberg table
- Load data from CSV to Parquet
- Schema evolution in practice
- Time travel queries
- CLI tools for exploration
Perfect for: Understanding core concepts and hands-on practice
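As a taste of that first project, here is a minimal table creation with PyIceberg, assuming a configured catalog (the namespace, table, and field names are illustrative):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestampType

catalog = load_catalog("default")  # assumes a catalog configured in .pyiceberg.yaml

schema = Schema(
    NestedField(field_id=1, name="event_time", field_type=TimestampType(), required=False),
    NestedField(field_id=2, name="level", field_type=StringType(), required=False),
    NestedField(field_id=3, name="message", field_type=StringType(), required=False),
)

table = catalog.create_table("logs.app_logs", schema=schema)
print(table.location())
```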
Focus: Production deployment patterns (a MinIO catalog configuration is sketched after this list)
- Connect Iceberg to MinIO object storage
- S3-compatible configuration and bucket management
- Local development vs production patterns
- Performance optimization and monitoring
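A minimal sketch of pointing PyIceberg at MinIO; the REST catalog URI, bucket name, and credentials below are all assumptions for a local development setup (the keys shown are MinIO's well-known default dev credentials):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "minio",
    **{
        "uri": "http://localhost:8181",          # assumed local REST catalog
        "warehouse": "s3://warehouse/",          # assumed bucket name
        "s3.endpoint": "http://localhost:9000",  # MinIO's default API port
        "s3.access-key-id": "minioadmin",        # default dev credentials -
        "s3.secret-access-key": "minioadmin",    # never use these in production
    },
)
```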
Focus: Modern data pipeline architectures (an atomic micro-batch append is sketched after this list)
- Stream processing with Iceberg
- Late-arriving data handling
- Exactly-once semantics
- Integration with Kafka/Kinesis
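A property worth calling out: PyIceberg's `Table.append` commits each write as one atomic snapshot, which is what makes retry-safe micro-batch ingestion possible. A minimal sketch, reusing the hypothetical `logs.app_logs` table (the in-memory batch stands in for one poll of a Kafka/Kinesis consumer):

```python
from datetime import datetime

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")            # assumes a configured catalog
table = catalog.load_table("logs.app_logs")  # hypothetical table name

# A micro-batch; column names and types must match the table schema.
batch = pa.table({
    "event_time": pa.array([datetime(2024, 1, 1, 12, 0)], type=pa.timestamp("us")),
    "level": pa.array(["INFO"]),
    "message": pa.array(["user logged in"]),
})

# The append commits atomically as a single new snapshot: a failed batch
# can simply be retried without readers ever seeing partial data.
table.append(batch)
```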
Focus: Multi-engine data analysis (a DuckDB query sketch follows this list)
- Query same data with DuckDB, Spark, Trino
- Performance comparisons
- Query optimization techniques
- Data visualization integration
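As one example, DuckDB can read an Iceberg table in place through its iceberg extension; the table path below is a hypothetical local warehouse location:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Query the table's current snapshot straight from its directory.
result = con.execute(
    "SELECT COUNT(*) FROM iceberg_scan('warehouse/logs/app_logs')"
).fetchone()
print(result)
```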
- Python 3.12+
- uv package manager (recommended) - `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Basic SQL knowledge - for querying examples
- Understanding of data formats - CSV, JSON, Parquet basics
```bash
# Clone the repository
git clone https://github.com/your-username/learn-iceberg-python.git
cd learn-iceberg-python

# Start with the ETL demo
cd iceberg-etl-demo
uv sync

# Generate sample data and run first tutorial
uv run src/generate_logs.py
uv run src/01_create_table.py
```
Iceberg represents a paradigm shift in how we think about data storage. Before Iceberg, teams typically dealt with:
- Expensive data warehouses with vendor lock-in
- Complex ETL pipelines that are hard to debug
- Schema migrations that require downtime
- No version control for data changes
- Difficult multi-engine access - each tool needs its own copy
With Iceberg, you instead get:
- Open table format - works with any storage or compute engine
- ACID transactions - reliable, consistent data operations
- Time travel - built-in versioning and audit capabilities
- Schema evolution - safe, backward-compatible changes
- Performance optimization - automatic file pruning and statistics
- Multi-engine compatibility - one table, many analysis tools
Typical use cases:
- Transform existing data lakes into reliable, ACID-compliant systems without vendor lock-in.
- Handle complex audit requirements with immutable snapshots and complete change history.
- Efficiently manage high-volume sensor data with automatic file organization and query optimization.
- Enable reproducible analysis with time travel queries and schema evolution for changing models.
- Meet regulatory requirements with immutable audit trails and point-in-time data reconstruction.
- Apache Iceberg - Official project site
- PyIceberg - Python library documentation
- MinIO Documentation - Object storage documentation
- Apache Iceberg Slack - Join the community discussions
- MinIO Community - Connect with MinIO users and developers
- Iceberg Table Format Specification - Deep technical details
- Data Engineering Best Practices - Production guidance
Found an issue or want to improve the tutorials? Contributions are welcome!
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-tutorial`)
- Commit your changes (`git commit -m 'Add amazing tutorial'`)
- Push to the branch (`git push origin feature/amazing-tutorial`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Start your Iceberg journey → Begin with `iceberg-etl-demo/` for hands-on learning!