Assembly Management Tutorial
An assembly represents a group of related genomic sequences and the annotated features on those sequences. The assembly itself has a name and a few configuration options for representing this dataset. Assemblies can also be used to share datasets with a group of users.
A name and a version are required for every assembly. After you submit the form, your new assembly will be listed in the assemblies table with links to its details page and edit form.
GxSeq stores NCBI taxonomy names for use with assemblies. The name field will autocomplete from this dataset. If you can't find the species you are studying, or this is a multi-organism assembly, you can type any descriptive name in the field.
The same organism may be uploaded to the site multiple times. To help differentiate the uploads, each assembly requires a version along with the name. For example, Arabidopsis thaliana might have TAIR version 9 and TAIR version 10. The version should be concise, but it can be any text you choose. Versions are displayed in parentheses after the assembly name: TAIR (10)
If you want to share this assembly and its data with other users, assign it to a group. All of the users in this group will have access to the assembly.
You can update your assemblies at any time by clicking the Edit link in the assembly listing. All assembly attributes can be changed.
Sequence can be any string of nucleotides applicable to your experiment. Chromosomes, scaffolds and de-novo assembled transcripts are all valid. New sequence must be uploaded to an existing Assembly in FASTA format.
Start by clicking 'Add Sequence' on the assembly details page.
After you select a FASTA file from your local system, a preview of the sequence will be displayed. The 'Check Format' button below the preview shows how the sequence accession and description will be parsed by the database.
If you are loading sequence for de novo contigs from an RNA-Seq experiment, or another source of sequence with no annotation, you may want to add a feature to each contig. Entering a feature name into the Feature Type field will create these features for you automatically. For transcriptome studies, we suggest using mRNA as the feature type. These features will be used to upload expression and functional annotation data for the transcriptome.
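As a rough illustration of what a format check like this might report, the sketch below splits a FASTA header into an accession and a description. It assumes the common convention that the first whitespace-delimited token after '>' is the accession and the remainder is the description; the parsing rules GxSeq actually applies may differ, and the function names here are hypothetical.

```python
def parse_fasta_header(line):
    """Split one FASTA header line into (accession, description).

    Assumption: the accession is the first whitespace-delimited token
    after '>', and everything after it is the description.
    """
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    parts = line[1:].strip().split(None, 1)
    if not parts:
        raise ValueError("FASTA header has no accession")
    accession = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return accession, description


def preview_fasta(lines):
    """Yield (accession, description) for every header in an iterable of lines."""
    for line in lines:
        if line.startswith(">"):
            yield parse_fasta_header(line)
```

Running preview_fasta over the first few lines of a file before uploading is a quick way to spot headers whose accession would not be what you expect.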
If you want to rename the contigs, you can enter a prefix in the 'Re-number Prefix' field. Contigs will be enumerated and given unique names made up of your prefix plus the enumeration. For example, the prefix "Contig" will create the names:
Contig000001, Contig000002 ...
The format used to pad zeros can also be changed. By default, a name is your prefix followed by a six-digit, zero-padded decimal number.
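The default naming scheme can be reproduced outside the site, for example to rename the same contigs in a companion file. This is a minimal sketch, assuming numbering starts at 1 as in the example above; the function and its parameters are hypothetical and not part of GxSeq.

```python
def renumber(names, prefix="Contig", width=6, start=1):
    """Map each original contig name to prefix + zero-padded counter.

    Assumptions: numbering follows input order and starts at `start`;
    `width` is the number of digits used for zero padding.
    """
    return {
        name: f"{prefix}{index:0{width}d}"
        for index, name in enumerate(names, start=start)
    }
```

For instance, renumber(["comp1_c0", "comp2_c0"]) maps the two names to Contig000001 and Contig000002, and passing width=4 shortens the padding to four digits.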
Features represent interesting regions of genomic sequence. They may have functional annotations, multiple locations, and expression data assigned to them. Generally expression data is assigned to features of type Gene or mRNA. Features must be uploaded onto existing sequence in the GFF format.
Start by clicking 'Add Features' on the assembly details page.
It is important to match the sequence identifiers in the GFF file with the sequence identifiers in GX. To assist with this, GX stores Concordance Sets, or alias files, of sequence IDs. A default concordance set is created when sequence is uploaded. Additional concordance sets can be created with a simple cut and paste interface as described below.
GFF files have one start and stop position per line, but annotations in GxSeq can have multiple locations. To help convert GFF entries, you can select an ID attribute. Only the first entry for each unique value of that attribute will be entered as a feature. Subsequent GFF entries with the same ID value will only have their locations recorded.
If some feature types in the file are not important, or would add clutter to visual representations, they can be skipped during load.
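The grouping described above can be sketched as follows. The code assumes GFF3-style attributes (key=value pairs separated by semicolons) and nine tab-separated columns; how GxSeq treats lines that lack the chosen attribute is not stated here, so this sketch simply keeps each such line as its own feature. All names are hypothetical.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def collapse_gff(lines, id_attribute="ID"):
    """Group GFF lines that share an ID value into multi-location features.

    The first line seen for an ID supplies the feature's type and attributes;
    later lines with the same ID only contribute a (start, end, strand) location.
    """
    features = {}  # insertion-ordered: ID value -> feature record
    for lineno, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF record
        seqid, _source, ftype, start, end, _score, strand, _phase, attr_field = cols
        attrs = parse_attributes(attr_field)
        # Assumption: a line without the ID attribute becomes its own feature.
        key = attrs.get(id_attribute, ("line", lineno))
        location = (int(start), int(end), strand)
        if key not in features:
            features[key] = {
                "seqid": seqid,
                "type": ftype,
                "attributes": attrs,
                "locations": [location],
            }
        else:
            features[key]["locations"].append(location)
    return list(features.values())
```

Two CDS lines that both carry ID=cds1 would come back as a single feature with two locations, keeping the attributes of the first line.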
After you select your GFF file, a preview and the results of the database lookup will be displayed. It is important to check these results and address any issues before uploading.
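A similar check can be run locally before uploading. The sketch below is only an approximation of such a lookup: it compares the first column of each GFF line against a set of sequence IDs you already know are in the database (or in your concordance set) and ignores any feature types you plan to skip. The function name and return shape are assumptions made for illustration.

```python
def check_seqids(gff_lines, known_ids, skip_types=()):
    """Compare GFF sequence identifiers against a set of known IDs.

    Returns (found, missing): `found` maps each matched sequence ID to the
    number of feature lines on it, and `missing` is a sorted list of IDs
    that were not found. Lines whose type is in `skip_types` are ignored.
    """
    found = {}
    missing = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3 or cols[2] in skip_types:
            continue
        seqid = cols[0]
        if seqid in known_ids:
            found[seqid] = found.get(seqid, 0) + 1
        else:
            missing.add(seqid)
    return found, missing_sorted(missing)


def missing_sorted(ids):
    """Return the unmatched IDs in a stable, readable order."""
    return sorted(ids)
```

Any ID reported as missing needs either a corrected GFF file or a concordance set that maps it to the database ID.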
After upload, the features listing will include your new data. You can view details pages for each feature or visualize them in the genomic context. The features will also be available for further annotation and expression upload.
You may want to load data that has sequence identifiers different from the database accessions. Concordance sets allow you to do this. You will need to enter a table of IDs with one row per sequence. Each row should contain the current database ID, followed by the new alias used in your file.
Start by clicking 'New Concordance' on the concordance set listing.
Aliases can be comma, tab or whitespace delimited. After creating a new concordance, you can use it to upload feature data or sample files such as aligned reads in BAM format.
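To check a table before pasting it in, the row format described above can be parsed with a few lines of Python. This is a sketch under the assumption that each row holds exactly two columns (database ID first, alias second) and that neither value contains commas or spaces; the function name is hypothetical and the site's own validation may be stricter or looser.

```python
import re


def parse_concordance(text):
    """Parse pasted rows into a dict of {alias_in_file: database_id}.

    Each non-empty row must hold two values separated by a comma, a tab,
    or other whitespace: the database ID, then the alias.
    """
    mapping = {}
    for lineno, row in enumerate(text.splitlines(), start=1):
        row = row.strip()
        if not row:
            continue  # ignore blank rows
        fields = [f for f in re.split(r"[,\s]+", row) if f]
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 columns, got {len(fields)}"
            )
        database_id, alias = fields
        mapping[alias] = database_id
    return mapping
```

The resulting dictionary can be used to translate the identifiers in a GFF or BAM header back to database IDs, or simply to confirm that every row parses into two columns.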