HealthVerity
- Automated big data normalization pipelines processing 5-10 terabytes of healthcare data per day. Most normalization logic was written in Spark SQL with custom UDFs (UDF sketch below).
- Built utility libraries and tooling around PySpark and AWS EMR to accelerate normalization debugging and development for my team and others.
- Managed the data lake and data warehouse using Hadoop and Hive.
- Improved an internal ETL pipeline library built on top of Airflow to handle a wider range of data cases (DAG sketch below).
- Troubleshot data pipeline issues ranging from data sizing problems to inconsistent data delivery, schemas, and formats. Improved checks, logging, and notification systems around ingestion and pre-normalization data validation (validation sketch below).
- Expanded tooling around AWS EMR cluster management for one-off normalization jobs.
- Improved data normalization testing through schema-based representative data generation, Docker Compose orchestration, and Zeppelin integration (data-generation sketch below).
- Migrated legacy normalization jobs off Redshift and optimized Spark configurations (Spark config sketch below).
- Led efforts to develop and launch an internal wiki for knowledge sharing within the organization.
- Assisted the security team with penetration testing of systems and other security efforts.
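
The first bullet mentions Spark SQL normalization backed by custom UDFs. The following is a minimal sketch of that pattern; the table, column, and function names are illustrative assumptions, not taken from the actual codebase.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("normalization-example").getOrCreate()

# Toy input standing in for a raw healthcare delivery (hypothetical columns).
raw = spark.createDataFrame(
    [("p1", " male "), ("p2", "F"), ("p3", "unknown")],
    ["patient_id", "gender"],
)
raw.createOrReplaceTempView("raw_claims")

def normalize_gender(value):
    """Map free-form gender codes onto a small canonical set."""
    if value is None:
        return None
    cleaned = value.strip().upper()
    return {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F"}.get(cleaned, "U")

# Register the Python function so Spark SQL normalization queries can call it.
spark.udf.register("normalize_gender", normalize_gender, StringType())

spark.sql("""
    SELECT patient_id,
           normalize_gender(gender) AS gender
    FROM raw_claims
""").show()
```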
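For the Airflow bullet, a rough sketch of the kind of pipeline the internal library orchestrates (ingest, validate, normalize). The DAG id and task bodies are placeholders, not the library's actual API.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(**context):
    print("pull raw files from the landing location")

def validate(**context):
    print("check schemas, row counts, and file formats")

def normalize(**context):
    print("submit the Spark normalization job")

with DAG(
    dag_id="example_normalization_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    normalize_task = PythonOperator(task_id="normalize", python_callable=normalize)

    # Simple linear dependency: ingest -> validate -> normalize.
    ingest_task >> validate_task >> normalize_task
```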
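For the ingestion and pre-normalization validation bullet, a hedged sketch of the kinds of checks involved (expected columns, row counts, null rates). Thresholds and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("validation-example").getOrCreate()

# Toy delivery with a missing patient_id to exercise the null-rate check.
df = spark.createDataFrame(
    [("p1", "2023-01-02"), ("p2", None), (None, "2023-01-05")],
    ["patient_id", "service_date"],
)

expected_columns = {"patient_id", "service_date"}
problems = []

# Schema check: fail fast if the delivery is missing expected columns.
missing = expected_columns - set(df.columns)
if missing:
    problems.append(f"missing columns: {sorted(missing)}")

# Volume check: flag suspiciously small deliveries.
row_count = df.count()
if row_count < 2:
    problems.append(f"row count too low: {row_count}")

# Null-rate check on a key field.
null_ids = df.filter(F.col("patient_id").isNull()).count()
if null_ids / max(row_count, 1) > 0.10:
    problems.append(f"patient_id null rate too high: {null_ids}/{row_count}")

# In the real pipeline this would feed logging and notifications; here we print.
print(problems or "all checks passed")
```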
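For the schema-based representative data generation bullet, a sketch of generating plausible test rows directly from a Spark schema. The generator mapping and sample schema are assumptions, not the production test framework.

```python
import datetime
import random
import string

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    DateType, IntegerType, StringType, StructField, StructType,
)

spark = SparkSession.builder.appName("testdata-example").getOrCreate()

# Hypothetical schema standing in for a normalized healthcare table.
schema = StructType([
    StructField("patient_id", StringType()),
    StructField("age", IntegerType()),
    StructField("service_date", DateType()),
])

def random_value(data_type):
    """Produce a plausible value for a handful of Spark types."""
    if isinstance(data_type, StringType):
        return "".join(random.choices(string.ascii_uppercase, k=8))
    if isinstance(data_type, IntegerType):
        return random.randint(0, 99)
    if isinstance(data_type, DateType):
        return datetime.date(2023, 1, 1) + datetime.timedelta(days=random.randint(0, 364))
    return None

# Build representative rows from the schema and materialize a test DataFrame.
rows = [tuple(random_value(f.dataType) for f in schema.fields) for _ in range(10)]
test_df = spark.createDataFrame(rows, schema)
test_df.show()
```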
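For the Spark configuration tuning mentioned alongside the Redshift migration, a small sketch of the kind of settings typically adjusted; the specific keys shown are standard Spark options, but the values are illustrative, not the ones actually used.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-normalization-job")
    # Right-size shuffle parallelism for the job's data volume (placeholder value).
    .config("spark.sql.shuffle.partitions", "400")
    # Let Spark adapt shuffle partitioning and join strategies at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    # Use Kryo serialization for faster, more compact shuffles.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```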