This project contains one script: `etl.py`.
This script performs the following tasks (a minimal sketch of the flow follows the list):
- Extract data from S3.
- Use Spark to process the data into a star-schema Parquet layout.
- Load the processed data back into S3.
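In rough terms, the script could be organised like the minimal sketch below; this is an assumption about a typical PySpark layout, and the function names, dataset paths, and output bucket are illustrative rather than taken from `etl.py`.

```python
# Minimal sketch of how etl.py could be organised (names/paths are illustrative assumptions).
from pyspark.sql import SparkSession


def create_spark_session():
    # hadoop-aws is assumed so Spark can read and write s3a:// paths.
    return (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .getOrCreate()
    )


def process_data(spark, input_data, output_data):
    # Extract: read a raw JSON dataset from S3 into a DataFrame.
    df = spark.read.json(input_data + "some_dataset/*.json")  # hypothetical path

    # Transform + Load: star-schema tables are built and written back as Parquet
    # (see the partitioned-write sketch near the end of this README).
    df.dropDuplicates().write.mode("overwrite").parquet(output_data + "some_table/")


def main():
    spark = create_spark_session()
    # Output bucket is hypothetical; the input bucket is the one named in this README.
    process_data(spark, "s3a://udacity-dend/", "s3a://my-output-bucket/")


if __name__ == "__main__":
    main()
```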
`dl.cfg` is a config file that contains the credentials needed to access the remote AWS services.
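The exact layout of `dl.cfg` is not shown here, but a common pattern is to keep the AWS key pair in it and load it with Python's `configparser`; the section and key names below are assumptions, not the file's actual contents.

```python
# Sketch of reading dl.cfg; the [AWS] section and key names are assumed, not confirmed.
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

# Exporting the key pair lets the hadoop-aws S3 connector pick up the credentials.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```

Since `dl.cfg` holds credentials, it should stay out of version control.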
Data for this project can be found in two places (an illustrative path-selection sketch follows the list):
- In the local `data` directory (a smaller subset of the dataset)
- On the provided S3 bucket: `s3a://udacity-dend/`
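Since the same datasets exist in both locations, one illustrative way to switch between the local subset and the full S3 data is an environment-driven input root; this is a sketch of the idea, not necessarily how `etl.py` is written.

```python
# Hypothetical toggle between the local subset and the full S3 dataset.
import os

# Use the small local subset when USE_LOCAL_DATA=1 is set, otherwise the S3 bucket.
use_local = os.environ.get("USE_LOCAL_DATA") == "1"
input_data = "data/" if use_local else "s3a://udacity-dend/"
```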
To run the pipeline, type the following in the command line:

`python etl.py`
This loads the datasets into Spark DataFrames, processes them, and writes the resulting tables back to S3 as Parquet files.
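To illustrate the final "write back as Parquet" step, here is a hedged sketch of building one star-schema dimension table and writing it as partitioned Parquet; the column names, partition keys, and bucket names are assumptions, not the project's actual schema.

```python
# Sketch: build one star-schema dimension table and write it as partitioned Parquet.
# Column names, partition keys, and the output bucket are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("s3a://udacity-dend/some_dataset/*.json")  # hypothetical input path

# Pick the dimension's columns and drop duplicate rows.
dim_table = df.select("id", "name", "year", "category").dropDuplicates()

# Partitioning by frequently-filtered columns keeps the Parquet output easy to prune.
(
    dim_table.write
    .mode("overwrite")
    .partitionBy("year", "category")
    .parquet("s3a://my-output-bucket/dim_table/")
)
```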