HealthVerity
- Automated big data normalization pipelines processing 5-10 terabytes of healthcare data per day. Most normalization logic was written in Spark SQL with custom UDFs (UDF sketch below).
- Built utility libraries and tooling around PySpark and AWS EMR to accelerate normalization debugging and development for my team and others.
- Managed the data lake and data warehouse using Hadoop and Hive.
- Improved an internal ETL pipeline library built on top of Airflow to handle a wider range of data cases (DAG sketch below).
- Troubleshot data pipeline issues ranging from data sizing problems to inconsistent data delivery, schemas, and formats. Improved checks, logging, and notification systems around ingestion and pre-normalization data validation (validation sketch below).
- Expanded tooling around AWS EMR cluster management for one-off normalization jobs.
- Improved data normalization testing through schema-based representative data generation, Docker Compose orchestration, and Zeppelin integration (data-generation sketch below).
- Migrated legacy normalization jobs off Redshift and optimized Spark configurations (Spark config sketch below).
- Led efforts to develop and launch an internal wiki for knowledge sharing within the organization.
- Assisted the security team with penetration testing of systems and other security efforts.
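
The first bullet mentions Spark SQL normalization backed by custom UDFs. The following is a minimal sketch of that pattern; the table, column, and function names are illustrative assumptions, not taken from the actual codebase.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("normalization-example").getOrCreate()

# Toy input standing in for a raw healthcare delivery (hypothetical columns).
raw = spark.createDataFrame(
    [("p1", " male "), ("p2", "F"), ("p3", "unknown")],
    ["patient_id", "gender"],
)
raw.createOrReplaceTempView("raw_claims")

def normalize_gender(value):
    """Map free-form gender codes onto a small canonical set."""
    if value is None:
        return None
    cleaned = value.strip().upper()
    return {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F"}.get(cleaned, "U")

# Register the Python function so Spark SQL normalization queries can call it.
spark.udf.register("normalize_gender", normalize_gender, StringType())

spark.sql("""
    SELECT patient_id,
           normalize_gender(gender) AS gender
    FROM raw_claims
""").show()
```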
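For the Airflow bullet, a rough sketch of the kind of pipeline the internal library orchestrates (ingest, validate, normalize). The DAG id and task bodies are placeholders, not the library's actual API.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(**context):
    print("pull raw files from the landing location")

def validate(**context):
    print("check schemas, row counts, and file formats")

def normalize(**context):
    print("submit the Spark normalization job")

with DAG(
    dag_id="example_normalization_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    normalize_task = PythonOperator(task_id="normalize", python_callable=normalize)

    # Simple linear dependency: ingest -> validate -> normalize.
    ingest_task >> validate_task >> normalize_task
```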
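For the ingestion and pre-normalization validation bullet, a hedged sketch of the kinds of checks involved (expected columns, row counts, null rates). Thresholds and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("validation-example").getOrCreate()

# Toy delivery with a missing patient_id to exercise the null-rate check.
df = spark.createDataFrame(
    [("p1", "2023-01-02"), ("p2", None), (None, "2023-01-05")],
    ["patient_id", "service_date"],
)

expected_columns = {"patient_id", "service_date"}
problems = []

# Schema check: fail fast if the delivery is missing expected columns.
missing = expected_columns - set(df.columns)
if missing:
    problems.append(f"missing columns: {sorted(missing)}")

# Volume check: flag suspiciously small deliveries.
row_count = df.count()
if row_count < 2:
    problems.append(f"row count too low: {row_count}")

# Null-rate check on a key field.
null_ids = df.filter(F.col("patient_id").isNull()).count()
if null_ids / max(row_count, 1) > 0.10:
    problems.append(f"patient_id null rate too high: {null_ids}/{row_count}")

# In the real pipeline this would feed logging and notifications; here we print.
print(problems or "all checks passed")
```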
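For the schema-based representative data generation bullet, a sketch of generating plausible test rows directly from a Spark schema. The generator mapping and sample schema are assumptions, not the production test framework.

```python
import datetime
import random
import string

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    DateType, IntegerType, StringType, StructField, StructType,
)

spark = SparkSession.builder.appName("testdata-example").getOrCreate()

# Hypothetical schema standing in for a normalized healthcare table.
schema = StructType([
    StructField("patient_id", StringType()),
    StructField("age", IntegerType()),
    StructField("service_date", DateType()),
])

def random_value(data_type):
    """Produce a plausible value for a handful of Spark types."""
    if isinstance(data_type, StringType):
        return "".join(random.choices(string.ascii_uppercase, k=8))
    if isinstance(data_type, IntegerType):
        return random.randint(0, 99)
    if isinstance(data_type, DateType):
        return datetime.date(2023, 1, 1) + datetime.timedelta(days=random.randint(0, 364))
    return None

# Build representative rows from the schema and materialize a test DataFrame.
rows = [tuple(random_value(f.dataType) for f in schema.fields) for _ in range(10)]
test_df = spark.createDataFrame(rows, schema)
test_df.show()
```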
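For the Spark configuration tuning mentioned alongside the Redshift migration, a small sketch of the kind of settings typically adjusted; the specific keys shown are standard Spark options, but the values are illustrative, not the ones actually used.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-normalization-job")
    # Right-size shuffle parallelism for the job's data volume (placeholder value).
    .config("spark.sql.shuffle.partitions", "400")
    # Let Spark adapt shuffle partitioning and join strategies at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    # Use Kryo serialization for faster, more compact shuffles.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```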