General usage

General structure

The pipeline enriches cases obtained from Face2Gene with data from additional sources. Each case contains genomic variants, phenotypic features, and scores inferred from those features and from a provided photograph. The additional data include an alternative interpretation of the phenotypic features by Phenomizer and VCF files produced by parsing and converting the variant HGVS strings.

This process requires the following steps:

  • Downloading Face2Gene cases
  • Parsing the JSON cases into an internal representation
  • Adding Phenomizer data
  • Creating VCF files from HGVS information
  • Checking whether data quality requirements are met*

* For performance and parser implementation reasons, an initial check is already done at the JSON level.
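
The order of these steps, including the early JSON-level check, can be sketched as follows. This is an illustration only: every function name and every check criterion here is a hypothetical stand-in rather than the pipeline's actual API, and the enrichment steps are stubs.

```python
# Hypothetical sketch of the step order; none of these names come from the pipeline.

def json_level_check(raw):
    """Cheap early check on the raw JSON dict (assumed criterion)."""
    return bool(raw.get("features"))

def parse(raw):
    """Stand-in for parsing raw JSON into an internal case."""
    return {"case_id": raw["case_id"], "features": raw["features"], "steps": ["parsed"]}

def add_phenomizer(case):
    case["steps"].append("phenomized")  # stub for the Phenomizer step
    return case

def create_vcf(case):
    case["steps"].append("vcf")  # stub for the HGVS-to-VCF conversion
    return case

def final_check(case):
    """Full quality check on the finished case (assumed criterion)."""
    return "vcf" in case["steps"]

def run_pipeline(raw_cases):
    accepted = []
    for raw in raw_cases:
        # Reject obviously unusable cases before the expensive steps run.
        if not json_level_check(raw):
            continue
        case = create_vcf(add_phenomizer(parse(raw)))
        if final_check(case):
            accepted.append(case)
    return accepted

downloaded = [
    {"case_id": "1", "features": ["HP:0000252"]},
    {"case_id": "2", "features": []},  # fails the early check
]
result = run_pipeline(downloaded)
```

Filtering at the JSON level means that cases which can never pass are dropped before any Phenomizer or VCF work is spent on them.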

Downloading

Data is acquired from an AWS S3 bucket, which is updated periodically with new cases and recalculations of existing cases.

JSON Parsing

The raw JSON files provided by Face2Gene are processed into an internal case representation, which aims to bundle information from various JSON fields into a simpler structure.
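
As a minimal sketch of this idea, the snippet below flattens a nested JSON document into a small case object. The JSON field names (case_id, features, hpo_id, present, genomic_entries, variants, hgvs) and the Case class are assumptions made for illustration; the real Face2Gene format and the pipeline's own case class may differ.

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class Case:
    """Hypothetical simplified case; not the pipeline's actual class."""
    case_id: str
    features: List[str] = field(default_factory=list)
    hgvs: List[str] = field(default_factory=list)

def parse_case(raw: dict) -> Case:
    # Keep only features marked as present (assumed field layout).
    features = [
        f["hpo_id"] for f in raw.get("features", []) if f.get("present", True)
    ]
    # Collect non-empty HGVS strings from all genomic entries.
    hgvs = [
        variant["hgvs"]
        for entry in raw.get("genomic_entries", [])
        for variant in entry.get("variants", [])
        if variant.get("hgvs")
    ]
    return Case(case_id=str(raw["case_id"]), features=features, hgvs=hgvs)

raw_text = """
{
  "case_id": 1234,
  "features": [
    {"hpo_id": "HP:0000252", "present": true},
    {"hpo_id": "HP:0001250", "present": false}
  ],
  "genomic_entries": [
    {"variants": [{"hgvs": "NM_004333.4:c.1799T>A"}, {"hgvs": ""}]}
  ]
}
"""
example_case = parse_case(json.loads(raw_text))
```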

Phenomization

Phenomizer information is added to cases by calling the phenomizer object.

VCF Generation

VCF files are generated by calling the vcf function of the case. Jannovar can be run as an individual process, or in server mode by calling run_jannovar.sh first.

Quality Check functions

Cases can be checked against certain criteria using the check function, which is available on both the json object and the case object.

The quality check results are saved to a JSON-format log, located by default at quality_check.log.
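
A rough sketch of how check results could be collected and written as a JSON log is shown below. The criteria, the function names, and the log layout are all assumptions for illustration; only the default file name quality_check.log is taken from the text above. The demonstration writes into a temporary directory so that no real log is overwritten.

```python
import json
import tempfile
from pathlib import Path

def check_case(case):
    """Return a list of issues; an empty list means the case passed.

    The two criteria here are assumed examples, not the pipeline's real ones.
    """
    issues = []
    if not case.get("features"):
        issues.append("no phenotypic features")
    if not case.get("hgvs"):
        issues.append("no HGVS variants")
    return issues

def write_quality_log(cases, log_path="quality_check.log"):
    """Check every case and save the results as JSON (hypothetical layout)."""
    results = {}
    for case in cases:
        issues = check_case(case)
        results[case["case_id"]] = {"passed": not issues, "issues": issues}
    Path(log_path).write_text(json.dumps(results, indent=2))
    return results

cases = [
    {"case_id": "1", "features": ["HP:0000252"], "hgvs": ["NM_004333.4:c.1799T>A"]},
    {"case_id": "2", "features": ["HP:0001250"], "hgvs": []},
]

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "quality_check.log"
    results = write_quality_log(cases, log_file)
    logged = json.loads(log_file.read_text())
```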

Error correction overrides

There are multiple possibilities to provide corrections for erroneous input data.

Complete sections of the raw JSON file can be replaced by placing corrected files in the correction folder (see config.ini for the location of the correction directory). These files have to follow the folder hierarchy of the base JSON structure, which is separated into cases and genomics_entries.
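
The lookup logic behind such an override might look like the sketch below, which prefers a corrected file when one exists at the same relative path and otherwise falls back to the original. The function name, the file naming by case ID, and the whole-file granularity of the replacement are assumptions; only the cases and genomics_entries folder names come from the text above.

```python
import json
import tempfile
from pathlib import Path

def load_json_with_override(base_dir, correction_dir, section, filename):
    """Return the corrected file's content if one exists, else the original."""
    relative = Path(section) / filename
    corrected = Path(correction_dir) / relative
    source = corrected if corrected.exists() else Path(base_dir) / relative
    with open(source) as handle:
        return json.load(handle)

# Demonstration in a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "raw"
    corrections = Path(tmp) / "corrections"
    for root in (base, corrections):
        (root / "cases").mkdir(parents=True)
        (root / "genomics_entries").mkdir(parents=True)

    # Case 1 has a corrected version; case 2 does not.
    (base / "cases" / "1.json").write_text(
        json.dumps({"case_id": "1", "features": []})
    )
    (base / "cases" / "2.json").write_text(json.dumps({"case_id": "2"}))
    (corrections / "cases" / "1.json").write_text(
        json.dumps({"case_id": "1", "features": ["HP:0000252"]})
    )

    overridden = load_json_with_override(base, corrections, "cases", "1.json")
    untouched = load_json_with_override(base, corrections, "cases", "2.json")
```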

Alternatively, gene information and HGVS strings can be inserted into an error dictionary, found by default in hgvs_errors.json. This file can be versioned, with version enforcement configurable in lib/constants.py.
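
To illustrate the idea of a versioned error dictionary, the sketch below loads such a dictionary, refuses it when the version does not match, and uses it to override HGVS strings. The file layout (a version field next to a data mapping keyed by entry ID), the REQUIRED_VERSION constant, and both function names are assumptions; the actual structure of hgvs_errors.json and the settings in lib/constants.py may look different.

```python
import json

# Hypothetical constant; stands in for a setting in lib/constants.py.
REQUIRED_VERSION = 2

def load_error_dict(text, required_version=REQUIRED_VERSION, enforce=True):
    """Parse the error dictionary and optionally enforce its version."""
    content = json.loads(text)
    if enforce and content.get("version") != required_version:
        raise ValueError(
            "error dictionary version %r does not match required version %r"
            % (content.get("version"), required_version)
        )
    return content.get("data", {})

def corrected_hgvs(errors, entry_id, original):
    """Return the override for an entry if present, else the original strings."""
    return errors.get(str(entry_id), original)

example = json.dumps(
    {"version": 2, "data": {"4711": ["NM_004333.4:c.1799T>A"]}}
)
errors = load_error_dict(example)

fixed = corrected_hgvs(errors, 4711, ["c.1799T>A"])
unchanged = corrected_hgvs(errors, 42, ["NM_000546.5:c.215C>G"])

# A mismatching version is rejected when enforcement is on.
try:
    load_error_dict(example, required_version=3)
    version_rejected = False
except ValueError:
    version_rejected = True
```

Enforcing the version guards against running the pipeline with an outdated set of manual corrections.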