Data Engineer

HealthVerity

  • Automated big data normalization pipelines processing 5-10 terabytes of healthcare data per day.
  • Built utility libraries and tooling around PySpark and AWS EMR to accelerate normalization debugging and development.
  • Improved internal data pipeline library built on top of Airflow to handle a wider range of data cases.
  • Improved data normalization testing through schema-based representative data generation.
  • Migrated legacy normalization jobs from Redshift and optimized Spark configurations.
  • Led the development and launch of an internal wiki for knowledge sharing within the organization.
  • Assisted the security team with penetration testing of systems and with other security efforts.