Data Science & Machine Learning Intern



Software Engineering, Data Science
South San Francisco, CA, USA
Posted on Sunday, October 22, 2023

The Opportunity

Global drug development productivity is declining exponentially, with an overall failure to develop effective treatments for many increasingly prevalent complex diseases affecting millions of patients per year. We seek to tackle this by combining innovative machine learning techniques with pioneering technologies that measure multiple cellular aspects, aiming to drastically improve and accelerate how drugs are discovered and developed.

We are looking for highly motivated interns to join the data science & machine learning (DSML) team for our Summer 2024 cohort and work at the intersection of machine learning and life sciences.

Over the course of the summer (11 weeks), you will partner directly with a DSML team mentor to develop and/or apply ML methods that process and analyze large-scale datasets from multiple modalities. The DSML team is a diverse group that works across the company, spanning imaging, omics, statistical genetics, small molecule discovery, clinical research, and research software engineering.

Examples of areas & topics you will be working on:

  • Computational Biology:
    • Perform single cell transcriptomics data analysis, including cell type annotation and modeling of differentiation trajectories using RNA velocity;
    • Use bioinformatic methods to perform downstream analysis in order to extract insights about disease mechanisms, such as genes and pathways that are relevant to the therapeutic areas;
  • Methods for Omics & Imaging data modalities:
    • Develop, productionize, and deploy cutting-edge ML approaches to analyze and integrate large-scale multi-modal phenotypic datasets, including multi-omic modalities (single-cell (sc) transcriptomics, sc-ATAC-seq) and imaging (e.g. brightfield, histopathology).
    • Develop ML methods to process and analyze images from multiple microscopy modalities and integrate our in-vitro imaging data to extract insights about disease mechanisms.
  • Research Engineering:
    • Explore several recent papers on self-supervised learning for images and demonstrate whether they provide practical benefits when applied to insitro’s internal biological datasets, compared to our current algorithms;
    • Help us integrate new large language models into our analysis tools to help our analysts get more out of our experimental data, faster.
  • Statistical and Translational Genetics:
    • Develop workflows to enable post-GWAS (Genome-Wide Association Study) analysis of results, e.g. fine-mapping
    • Translational genetics deep dives: enabling higher throughput annotation and exploration of candidate genes from our discovery efforts
    • Pipelines to better derive and leverage metadata from sequenced cell lines and to incorporate this into image-based ML feature extraction
    • Design of statistical methods to improve rare variant burden tests, and methods to improve power in longitudinal phenotypes
  • Clinical Machine Learning:
    • Develop ML models for imputing disease-relevant phenotypes from high-content clinical imaging or time series data (e.g., histopathology, MRI/PET-CT, EEG, EKG)
    • Develop ML methods for disentangling axes of variation in complex phenotypes
    • Use LLMs to extract disease-relevant information from medical records
  • Small Molecule Machine Learning:
    • Build rich embedding models using DNA-Encoded Library (DEL) data, and use these representations for downstream drug discovery tasks such as hit discovery.
    • Explore generative models of small molecules in various data modalities such as 2D and 3D representations for hit-to-lead drug discovery efforts.
    • Develop new geometric deep learning methods to better characterize nuanced molecular properties and relationships.

What you will learn through this experience:

  • In the course of the internship you will learn diverse machine learning techniques, rigorously analyze complex datasets, and design metrics to ensure the robustness of our methods.
  • You can expect to develop and prototype solutions to enable ML-based decisions in our automated workflows.
  • You will work closely with machine learning engineers and scientists, biologists, chemists, microscopy experts, and automation engineers.
  • You will be mentored by one of our senior researchers, who has significant experience in machine learning and computational biology.
  • You will also attend our machine learning team meetings and will be exposed to a diverse set of novel technologies and machine learning concepts that tackle various biological questions.

In return, we will support you by:

  • Placing a high degree of trust in your ideas and execution
  • Bringing you up to speed in the domain of drug development
  • Striving to provide a low-stress work environment
  • Making ourselves available for collaboration
  • Caring about you as a whole person - not a resource
  • Being a well-funded startup with a conservative runway

About You

  • Working towards a BS, MS, or Ph.D. in engineering, computational biology, systems biology, computer science, mathematics, statistics, life science, chemistry, physics, or a related field
  • Proficiency in one or more general-purpose programming languages. We primarily use Python
  • Interest in using and developing brand new statistical and machine learning methods inspired by real problems
  • Curiosity about human physiology or disease biology
  • Committed to writing high-quality, well-commented code and documentation
  • Ability to communicate effectively and collaborate with people from a diverse range of backgrounds and job functions
  • Passion for making a difference in the world

Nice to Have

  • First-hand experience with biological data, preferably using computational approaches
  • Passion for learning how to work with diverse functional genomic assays (RNA/DNase/ATAC/ChIP-seq, etc)
  • Interest in learning how to analyze single-cell RNA-seq data
  • Solid understanding of computational chemistry, including virtual screening (classic QSAR modeling, structure-based drug discovery), library design, etc.
  • Demonstrated ability to use and develop cutting-edge statistical and machine learning methods inspired by real problems
  • Experience with Machine and Deep Learning frameworks (e.g., scikit-learn, PyTorch, etc.)
  • Demonstrated ability to write high-quality, production-ready code (readable, well-tested, with well-designed APIs)
  • Experience with Linux environments, database languages (e.g., SQL, NoSQL), and version control practices and tools such as Git or Mercurial
  • Publications of high-quality work in relevant computational biology, bioinformatics, systems biology, life sciences, or biomedical venues, including journals and conferences
  • Passion for solving problems, asking questions, and learning independently
  • Familiarity with the SciPy/PyData ecosystem (numpy, pandas, scipy, dask etc.)
  • Familiarity with cloud computing services (AWS or GCP)
  • Familiarity with statistical analysis software, e.g. R

Compensation & Benefits at insitro

Our target starting salary for successful US-based applicants for this role is $55/hr - $65/hr. To determine starting pay, we consider multiple job-related factors including a candidate’s skills, education and experience, market demand, business needs, and internal parity. We may also adjust this range in the future based on market data.

In addition, insitro provides our interns with:

  • Excellent medical, dental, and vision coverage; insitro pays 100% of premiums for employees
  • Excellent mental health and well-being support
  • Access to free onsite baristas and cafe with daily lunch and breakfast
  • Access to free onsite fitness center
  • Commuter benefits

About insitro

insitro is a drug discovery and development company using machine learning (ML) and data at scale to decode biology for transformative medicines. At the core of insitro’s approach is the convergence of in-house generated multi-modal cellular data and high-content phenotypic human cohort data. We rely on these data to develop ML-driven, predictive disease models that uncover underlying biologic state and elucidate critical drivers of disease. These powerful models rely on extensive biological and computational infrastructure and allow insitro to advance novel targets and patient biomarkers, design therapeutics and inform clinical strategy. insitro is advancing a wholly owned and partnered pipeline of insights and therapeutics in neuroscience, oncology and metabolism. Since launching in 2018, insitro has raised over $700 million from top tech, biotech and crossover investors, and from collaborations with pharmaceutical partners. For more information on insitro, please visit