Title: Data Scientist III
Location: Cambridge, MA | Hybrid
Duration: 2 months
Top 3-5 Skills Needed:
- Strong programming and data engineering skills (Python, SQL, R)
- Experience with large-scale omics data management and integration
- Knowledge of metadata standards and ontologies for biological data
- Experience designing or maintaining bioinformatics data pipelines or repositories
- Understanding of data governance, permissions, and FAIR data principles
Job Description:
- We are seeking a highly motivated Data Scientist to design and implement an internal GEO-like system for managing the Immune Discovery omics data assets.
- The successful candidate will build a centralized platform that integrates raw, processed, and metadata layers of multi-omics datasets (e.g., bulk and single-cell RNA-seq, spatial omics, CyTOF) and ensures that they are findable, accessible, well-documented, and permission-controlled.
- This role bridges bioinformatics, data engineering, and data governance, enabling researchers to efficiently submit, query, and reuse internal datasets while maintaining data quality and compliance.
Key Responsibilities:
- Design and implement scalable pipelines for ingestion, curation, and storage of raw and processed omics data.
- Build and maintain a searchable data catalog or portal to enable dataset discovery and visualization of metadata and QC metrics.
- Implement access controls and permission management systems to ensure appropriate data security and compliance.
- Work closely with Immunology Discovery and IR teams to integrate the system with existing compute and storage infrastructure.
- Develop and enforce metadata standards, ontologies, and schemas to ensure consistency and interoperability across studies.
Impact:
By developing this internal data platform, the candidate will transform how omics data are organized and shared across the client. The system will improve data visibility and reuse, enhance reproducibility, and accelerate scientific insights by enabling streamlined access to all relevant data layers: raw, processed, and annotated.
Qualifications:
- BS (5+ years) or MS (0-3 years) in Bioinformatics, Computational Biology, Data Science, Computer Science, or related field.
- Proficiency in Python and SQL, with experience in data wrangling, ETL pipelines, and automation.
- Hands-on experience managing large omics datasets.
- Strong understanding of metadata models, data provenance, and FAIR data principles.
- Excellent communication skills and ability to collaborate with cross-functional teams.
Preferred Technical Skills:
- Experience with cloud storage or compute environments (AWS, GCP, or on-prem HPC).
- Experience with workflow orchestration tools (Nextflow, Snakemake).
- Familiarity with relational and NoSQL databases (e.g., PostgreSQL).
- Familiarity with public repositories such as GEO or SRA and their metadata standards.
- Proficiency with Git for version control and collaboration.
Additional Technical Skills (a plus):
- Experience with containerization (Docker/Singularity) and CI/CD workflows.
- Understanding of web application frameworks or dashboarding tools for data portals.
- Exposure to single-cell or multi-omics integration workflows.
- Experience implementing data access and permission systems integrated with organizational identity management.