Data Engineer


  • Automated big data normalization pipelines processing 5-10 terabytes of healthcare data per day. Most normalization logic was written in Spark SQL using custom UDFs.
  • Built utility libraries and tooling around PySpark and AWS EMR to accelerate normalization debugging and development for my team and others.
  • Managed the data lake and data warehouse built on Hadoop and Hive.
  • Extended an internal ETL pipeline library built on top of Airflow to handle a wider range of data cases.
  • Troubleshot data pipeline issues ranging from data sizing problems to inconsistent data delivery, schemas, and formats. Improved checks, logging, and notification systems around ingestion and pre-normalization data validation.
  • Expanded tooling around AWS EMR cluster management for one-off normalization jobs.
  • Improved data normalization testing through schema-based representative data generation, Docker Compose orchestration, and Zeppelin integration.
  • Migrated legacy normalization jobs off Redshift and optimized Spark configurations.
  • Led efforts to develop and launch an internal wiki for knowledge sharing across the organization.
  • Assisted the security team with penetration testing of systems and other security efforts.