# Import

## Import results

Import functions allow users to import one or more `JSON` files (extension `.json`) containing their own plasmid data. These files are generated by [pATLASflow](https://github.com/tiagofilipe12/pATLASflow), a pipeline that runs mapping, mash screen and assembly methods for pATLAS. They can also be generated through [FlowCraft](https://flowcraft.readthedocs.io/en/latest/) recipes.

The `JSON` files can be imported using the `Upload file...` button or by dragging and dropping the files onto the text box to the right of this button.

To generate these files, you can use two different programs:

* [pATLASflow](#patlasflow) - This approach assumes that you have already performed QC analysis, assembly and any other required steps before running the mash dist, mash screen and mapping approaches provided here.
* [FlowCraft](#flowcraft) - Here you can start from raw reads and feed them to an assembly, mapping or mash screen approach. The pipeline handles QC analysis and trimming with the default parameters described in the FlowCraft documentation and then performs the desired analysis (either mash dist / assembly, mash screen or mapping).

**Note**: Check also the [redundancy removal](#redundancy-removal) rules described at the end of this file.

## pATLASflow

### Download and install requirements to run the pipeline

[**pATLASflow**](https://github.com/tiagofilipe12/pATLASflow) **is a** [**Nextflow**](https://www.nextflow.io/) **pipeline.**

#### Requirements

* [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) or higher.
* [Docker](https://docs.docker.com/install/) or [Singularity](http://singularity.lbl.gov/install-linux).
* [Nextflow](https://www.nextflow.io/docs/latest/getstarted.html#installation)

#### Conda recipe for nextflow

Nextflow can be installed through bioconda: [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square)](http://bioconda.github.io/recipes/nextflow/README.html)

```
conda install nextflow
```

### Mapping

The mapping pipeline can be run with the following command:

`nextflow run tiagofilipe12/pATLASflow --mapping --reads "your_folder/*.fastq"`

The resulting `JSON` file can then be provided to pATLAS in the **Mapping** menu.

### Mash screen

The mash screen pipeline can be run with the following command:

`nextflow run tiagofilipe12/pATLASflow --mash_screen --reads "your_folder/*.fastq"`

The resulting `JSON` file can then be provided to pATLAS in the **Mash screen** menu.

### Assembly

The assembly pipeline can be run with the following command:

`nextflow run tiagofilipe12/pATLASflow --assembly --fasta "your_folder/*.fasta"`

The resulting `JSON` file can then be provided to pATLAS in the **Assembly** menu.

### Consensus

The consensus approach combines the Mash screen and Mapping results. To generate this `JSON` input, run the following command:

`nextflow run tiagofilipe12/pATLASflow --mapping --mash_screen --reads "your_folder/*.fastq"`

The resulting `JSON` file can then be provided to pATLAS in the **Consensus** menu.
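The four pATLASflow commands above differ only in their flags and in whether they take reads or FASTA files. As a minimal sketch, assuming you want to assemble these commands from a script, the helper below builds the argument list for each mode. The function name and the mode labels are hypothetical; only the flags themselves come from the commands shown above.

```python
import shlex

# Flags taken from the commands shown above; the mode names are labels
# chosen for this sketch only.
MODE_FLAGS = {
    "mapping": ["--mapping"],
    "mash_screen": ["--mash_screen"],
    "assembly": ["--assembly"],
    "consensus": ["--mapping", "--mash_screen"],
}


def build_patlasflow_command(mode, input_glob):
    """Return the nextflow argument list for one pATLASflow mode."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assembly takes FASTA files; every other mode takes reads.
    input_flag = "--fasta" if mode == "assembly" else "--reads"
    return ["nextflow", "run", "tiagofilipe12/pATLASflow",
            *MODE_FLAGS[mode], input_flag, input_glob]


if __name__ == "__main__":
    cmd = build_patlasflow_command("consensus", "your_folder/*.fastq")
    # shlex.join quotes the glob so that the shell does not expand it.
    print(shlex.join(cmd))
    # To launch the pipeline (requires nextflow on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the list directly to `subprocess.run` avoids shell expansion of the glob, so Nextflow receives the pattern unchanged, as it does with the quoted form used above.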

## FlowCraft

### Download and install requirements

In order to download and install FlowCraft please follow the [official instructions](https://flowcraft.readthedocs.io/en/latest/getting_started/installation.html).

### Use FlowCraft recipes

To run the pATLAS analyses with FlowCraft, there are four recipes that you can use:

* **Mapping**

First build the pipeline script with this command:

```
flowcraft.py build -r plasmids_mapping -o pipeline
```

And then execute the pipeline by running Nextflow on the generated script:

```
nextflow run pipeline.nf
```

* **Assembly / Mash Dist**

First build the pipeline script with this command:

```
flowcraft.py build -r plasmids_assembly -o pipeline
```

And then execute the pipeline by running Nextflow on the generated script:

```
nextflow run pipeline.nf
```

* **Mash Screen**

First build the pipeline script with this command:

```
flowcraft.py build -r plasmids_mash -o pipeline
```

And then execute the pipeline by running Nextflow on the generated script:

```
nextflow run pipeline.nf
```

* **All**

This runs all of the above pipelines with a single command and generates a separate output for each approach.

First build the pipeline script with this command:

```
flowcraft.py build -r plasmids -o pipeline
```

And then execute the pipeline by running Nextflow on the generated script:

```
nextflow run pipeline.nf
```

#### Import results from FlowCraft

Results will be available within the current working directory in a folder named: `results`. These files can be uploaded to their respective menus within the pATLAS sidebar menu.

You can also use the [`flowcraft.py report` module](https://flowcraft.readthedocs.io/en/latest/user/basic_usage.html#reports) to generate interactive reports that can send requests to pATLAS directly, without importing a file into pATLAS.
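If you prefer to upload the files manually, a small script can list the `JSON` files produced in the `results` folder. This is only a sketch: the function name is hypothetical, and the internal layout of `results` is an assumption, so the search simply descends into every subfolder.

```python
from pathlib import Path


def find_json_results(results_dir="results"):
    """Return all .json files under results_dir, sorted by path."""
    root = Path(results_dir)
    if not root.is_dir():
        return []
    # rglob searches subfolders too; the exact layout of `results`
    # is an assumption here, not something documented above.
    return sorted(p for p in root.rglob("*.json") if p.is_file())


if __name__ == "__main__":
    for path in find_json_results():
        print(path)
```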

## Redundancy removal

After loading the files through any of these popup menus and setting the desired cutoffs, a new popup will appear asking if the user wants to use the redundancy option for importing results into the pATLAS matrix.

#### The rationale

This option was created because plasmids are highly chimeric and modular by nature, which means that results often contain redundant information. Consider the following examples:

* Two plasmids are highly related (and thus they are linked in pATLAS) and results show that HTS data has 100% identity with both, but one of them is larger than the other (say, one is 5 kb and the other is 50 kb). In this case, the larger plasmid is the more likely to be present in the data, since it has the same % identity over a longer sequence.
* HTS data suggest that we may have:
  * one plasmid with 100% identity and sequence length of 5kb.
  * another plasmid with 90% identity and sequence length of 50kb.
  * both plasmids are highly related (and thus they are linked in the pATLAS matrix).

In the second case, although the first plasmid has a higher identity, the second plasmid has a larger overall sequence similarity, and thus the second plasmid is the more likely to be contained in the sequencing data.

Hence, this option was added to help deal with this problem and to make a "guess" of the most likely plasmids instead of reporting all hits from the pipelines described [above](#import-results).
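The second example can be checked numerically. The sketch below assumes that "overall sequence similarity" is approximated by identity multiplied by sequence length, which is the weighting used in the calculations described later on this page. The dictionary keys and the function name are illustrative only.

```python
# Figures from the second example: identity as a fraction, length in bp.
small = {"identity": 1.00, "length": 5_000}
large = {"identity": 0.90, "length": 50_000}


def similar_bases(plasmid):
    """Approximate amount of matching sequence: identity x length."""
    return plasmid["identity"] * plasmid["length"]


if __name__ == "__main__":
    print("5 kb plasmid: ", similar_bases(small))
    print("50 kb plasmid:", similar_bases(large))
    # The larger plasmid wins despite its lower identity.
    print("larger is more likely:", similar_bases(large) > similar_bases(small))
```

Under this assumption the 5 kb plasmid accounts for about 5,000 matching bases and the 50 kb plasmid for about 45,000, which is why the second plasmid is preferred.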

#### The calculation

All linked plasmids are compared with each other in order to know which one is the best hit from a given **group of linked plasmids**. If they are not linked, they will not be compared. So, if we have two different groups of plasmids it is likely that HTS data contain two plasmids.

However, each import type uses a different calculation to "guess" the best hit for the plasmids within each group, since the results are generated by different approaches and pipelines.

Therefore each pair of linked plasmids will be compared as described below for each one of the imports:

* **Mapping**

```
plasmid1 percentage * plasmid1 length - plasmid2 percentage * plasmid2 length
```

**Interpretation**: If this result is `> 0`, plasmid1 is the "best hit". If it is `< 0`, plasmid2 is a "better hit" than plasmid1. If it is `= 0`, both are "best hits".

Note: `percentage` is the percentage of the queried plasmid that is covered by HTS data, resulting from mapping.
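As an illustration of the mapping comparison and its interpretation, here is a minimal sketch. The function names, the dictionary keys and the example values are hypothetical; only the formula and the sign rules are taken from the description above.

```python
def mapping_difference(p1, p2):
    """percentage * length of plasmid1 minus the same product for plasmid2."""
    return p1["percentage"] * p1["length"] - p2["percentage"] * p2["length"]


def best_hit(difference):
    """Translate the sign of the difference into the interpretation above."""
    if difference > 0:
        return "plasmid1"
    if difference < 0:
        return "plasmid2"
    return "both"


if __name__ == "__main__":
    # Hypothetical values: fraction of the plasmid covered and length in bp.
    plasmid1 = {"percentage": 0.8, "length": 10_000}
    plasmid2 = {"percentage": 0.5, "length": 12_000}
    print(best_hit(mapping_difference(plasmid1, plasmid2)))
```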

* **Mash screen**

```
plasmid1 identity * plasmid1 length - plasmid2 identity * plasmid2 length
```

Note: `identity` is the percentage identity, from the mash screen output, of the queried plasmid and the HTS data.

**Interpretation**: If this result is `> 0`, plasmid1 is the "best hit". If it is `< 0`, plasmid2 is a "better hit" than plasmid1. If it is `= 0`, both are "best hits".

* **Assembly**

```
plasmid1 identity * plasmid1 shared hashes * plasmid1 length - plasmid2 identity * plasmid2 shared hashes * plasmid2 length
```

Note: `identity` is the percentage identity, from the mash dist output, of the queried plasmid and the HTS data.

Note 2: `shared hashes` is a measure of the percentage of sequence that is shared between the HTS data and the plasmid. This is useful because mash dist reports the identity of the smallest sequence against the larger sequence.

**Interpretation**: If this result is `> 0`, plasmid1 is the "best hit". If it is `< 0`, plasmid2 is a "better hit" than plasmid1. If it is `= 0`, both are "best hits".
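Putting the pieces together, the sketch below shows one way the redundancy removal could be reproduced outside pATLAS. It is not the pATLAS implementation. It assumes that a "group of linked plasmids" is a connected component of the link graph, and that comparing every pair in a group is equivalent to keeping the plasmid(s) with the highest product. All names (`SCORES`, `linked_groups`, `best_hits`, the dictionary keys) are hypothetical.

```python
from collections import defaultdict

# One scoring function per import type, following the formulas above.
SCORES = {
    "mapping": lambda p: p["percentage"] * p["length"],
    "mash_screen": lambda p: p["identity"] * p["length"],
    "assembly": lambda p: p["identity"] * p["shared_hashes"] * p["length"],
}


def linked_groups(names, links):
    """Split plasmid names into groups of directly or indirectly linked plasmids."""
    known = set(names)
    neighbours = defaultdict(set)
    for a, b in links:
        # Ignore links to plasmids that are not among the hits.
        if a in known and b in known:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, groups = set(), []
    for name in names:
        if name in seen:
            continue
        stack, group = [name], set()
        while stack:  # depth-first walk over the links
            current = stack.pop()
            if current not in group:
                group.add(current)
                stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(group)
    return groups


def best_hits(hits, links, import_type):
    """Keep, for each group, the plasmid(s) with the highest score (ties kept)."""
    score = SCORES[import_type]
    kept = []
    for group in linked_groups(list(hits), links):
        top = max(score(hits[name]) for name in group)
        kept.extend(sorted(n for n in group if score(hits[n]) == top))
    return kept


if __name__ == "__main__":
    hits = {
        "A": {"identity": 1.00, "length": 5_000},
        "B": {"identity": 0.90, "length": 50_000},
        "C": {"identity": 0.95, "length": 10_000},
    }
    print(best_hits(hits, [("A", "B")], "mash_screen"))
```

In this example A and B are linked, so only the better of the two is kept, while C is unlinked and is always reported.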
