Data Ingestion Pipeline for New York City Data | Preprocessed and Deployed to Docker | Data Warehouse | Data Lake | PSQL | SQL
Automation saves countless hours of debugging and prevents tedious, repetitive work. This project is a data ingestion pipeline that takes open city data, preprocesses it, and ingests it into a data warehouse.
- Create a real, usable and scalable data ingestion pipeline
- Handle both .csv and .parquet file formats for robustness (a sketch of the format handling follows this list)
- Deploy the pipeline script in Docker for reliability across different environments
- Dockerize the Data Warehouse
- Dockerize a data management solution to ensure the services and data all work together nicely
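For illustration, a minimal sketch of the format handling, assuming the ingest script is written in Python with pandas (the function name `read_trip_data` is illustrative, not taken from the project):

```python
import os

import pandas as pd


def read_trip_data(path: str) -> pd.DataFrame:
    """Dispatch on the file extension so .csv and .parquet inputs are handled alike."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path)
    if ext == ".parquet":
        return pd.read_parquet(path)  # needs pyarrow or fastparquet installed
    raise ValueError(f"Unsupported file format: {ext}")
```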
```bash
export URL="https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"

docker build -t taxi_ingest:v1 .

docker run -it \
  --network=pg-network \
  --name=data_ingest \
  taxi_ingest:v1 \
    --user=root \
    --password=root \
    --host=pg-db \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --url=${URL}
```
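For context, a hedged sketch of how the script might consume these flags, assuming Python with argparse, pandas, and SQLAlchemy (the real script's internals may differ):

```python
import argparse

import pandas as pd
from sqlalchemy import create_engine

# Mirrors the flags passed after the image name in the docker run command above.
parser = argparse.ArgumentParser(description="Ingest NYC taxi data into Postgres")
for flag in ("--user", "--password", "--host", "--port",
             "--db", "--table_name", "--url"):
    parser.add_argument(flag, required=True)
args = parser.parse_args()

# Build the connection string and load the data into the warehouse.
engine = create_engine(
    f"postgresql://{args.user}:{args.password}@{args.host}:{args.port}/{args.db}"
)
df = pd.read_parquet(args.url)  # pandas can read the parquet file straight from the URL
df.to_sql(args.table_name, engine, if_exists="replace", chunksize=100_000)
```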
Ideally these values would be set in an environment file and supplied through the Docker Compose service.
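One way that could look inside the script, assuming Python; the variable names and defaults below are illustrative, not taken from the project:

```python
import os

# Read connection settings from the environment (e.g. injected by Docker
# Compose from a .env file) instead of passing credentials as CLI flags.
PG_USER = os.environ.get("PG_USER", "root")
PG_PASSWORD = os.environ.get("PG_PASSWORD", "root")
PG_HOST = os.environ.get("PG_HOST", "pg-db")
PG_PORT = os.environ.get("PG_PORT", "5432")
PG_DB = os.environ.get("PG_DB", "ny_taxi")

CONNECTION_URI = f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"
```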
This is a robust data ingestion pipeline that can handle and preprocess data in multiple formats (.csv and .parquet).
This serves as a proof of concept and lets the pipeline steps be thoroughly tested before a final product is deployed.
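As one example of a pipeline step that can be tested in isolation, a sketch of a preprocessing function, assuming pandas; the timestamp column names come from the public yellow-taxi schema, while the function itself is illustrative:

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Normalise column names and parse the pickup/dropoff timestamps."""
    df = df.rename(columns=str.lower)
    for col in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col])
    return df
```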
- postgres: the Postgres database
- pgadmin: a web-based data management tool for Postgres
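A quick connectivity smoke test can confirm the services talk to each other; this sketch assumes Postgres is published on localhost:5432 with the root/root credentials used above:

```python
from sqlalchemy import create_engine, text

# Verify the Dockerized Postgres accepts connections before pointing
# pgAdmin or the ingest container at it.
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")
with engine.connect() as conn:
    print(conn.execute(text("SELECT version();")).scalar())
```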
Contains the instructions to build and run the data ingestion script.