- Automated big data normalization pipelines processing 5-10 terabytes of healthcare data per day.
- Built utility libraries and tooling around PySpark and AWS EMR to accelerate normalization debugging and development.
- Extended the internal data pipeline library, built on top of Airflow, to handle a wider range of data cases.
- Improved data normalization testing through schema-based representative data generation.
- Migrated legacy normalization jobs from Redshift. Optimized Spark configurations.
- Led the development and launch of an internal wiki for knowledge sharing within the organization.
- Assisted the security team with penetration testing of systems and with various other security efforts.