STRprofiler
STRprofiler is a Python package, CLI tool, and Shiny application to compare short tandem repeat (STR) profiles. In particular, it is designed to aid research labs in comparing models (e.g. cell lines or xenografts) generated from primary tissue samples to ensure authenticity. It includes basic checks for sample mixing and contamination.
STRprofiler is intended only for research purposes.
For each STR profile provided, STRprofiler will generate a sample-specific report that includes the following similarity scores as compared to every other profile:
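Tanabe, AKA the Sørensen-Dice coefficient: 2 × (shared alleles) / (query alleles + reference alleles)
Masters (vs. query): (shared alleles) / (query alleles)
Masters (vs. reference): (shared alleles) / (reference alleles)
Amelogenin is not included in the score computation by default but can be included by passing the --score_amel flag.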
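As a rough illustration of how these scores relate to one another, the sketch below computes the Tanabe score together with the two Masters scores reported in the output files. The function name `similarity_scores`, the input layout (marker name mapped to a set of allele strings), the percentage scaling, and the choice to count only markers populated in both profiles are assumptions made for this example; they are not taken from STRprofiler's source.

```python
def similarity_scores(query, reference, use_amel=False, amel_col="AMEL"):
    """Score two profiles given as {marker: set of allele strings}.

    Hypothetical helper for illustration only, not STRprofiler's API.
    """
    n_shared = n_query = n_ref = 0
    # Assumption: only markers present in both profiles are compared.
    for marker in sorted(query.keys() & reference.keys()):
        if marker == amel_col and not use_amel:
            continue  # Amelogenin is skipped unless explicitly requested.
        q_alleles, r_alleles = set(query[marker]), set(reference[marker])
        if not q_alleles or not r_alleles:
            continue  # Skip markers with no call in either profile.
        n_shared += len(q_alleles & r_alleles)
        n_query += len(q_alleles)
        n_ref += len(r_alleles)
    if n_query == 0 or n_ref == 0:
        return None  # Nothing comparable between the two profiles.
    return {
        "tanabe": 100 * 2 * n_shared / (n_query + n_ref),
        "masters_query": 100 * n_shared / n_query,
        "masters_ref": 100 * n_shared / n_ref,
    }


query = {"D3S1358": {"16", "18"}, "TH01": {"7", "9.3"}, "AMEL": {"X"}}
reference = {"D3S1358": {"16"}, "TH01": {"7", "9.3"}, "AMEL": {"X", "Y"}}

scores = similarity_scores(query, reference)
scores_with_amel = similarity_scores(query, reference, use_amel=True)
print(scores)
print(scores_with_amel)
```

In this toy pair the query carries one allele the reference lacks, so the Masters (vs. query) score drops while the Masters (vs. reference) score stays at its maximum; the Tanabe score sits between the two.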
Installation
STRprofiler is available on PyPI and can be installed with pip:
pip install strprofiler
Usage
strprofiler
strprofiler [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
app
STRprofiler shiny application for interactive comparisons & querying of STR profiles.
strprofiler app [OPTIONS]
Options
- -db, --database <database>
Path to an STR database file in csv, xlsx, tsv, or txt format.
- --version
Show the version and exit.
clastr
clastr compares STR profiles to the human Cellosaurus knowledge base via the CLASTR REST API.
strprofiler clastr [OPTIONS] INPUT_FILES...
Options
- -sa, --search_algorithm <search_algorithm>
Search algorithm to use in the CLASTR query. 1 - Tanabe, 2 - Masters (vs. query), 3 - Masters (vs. reference).
- Default:
1
- -sm, --scoring_mode <scoring_mode>
Search mode to account for missing alleles in query or reference. 1 - Non-empty markers, 2 - Query markers, 3 - Reference markers.
- Default:
1
- -sf, --score_filter <score_filter>
Minimum score to report as potential matches in summary table.
- Default:
80
- -mr, --max_results <max_results>
Filter defining the maximum number of results to be returned.
- Default:
200
- -mm, --min_markers <min_markers>
Filter defining the minimum number of markers for matches to be reported.
- Default:
8
- -sm, --sample_map <sample_map>
Path to sample map in csv format for renaming. First column should be sample names as given in STR file(s), second should be new names to assign. No header.
- -scol, --sample_col <sample_col>
Name of sample column in STR file(s).
- Default:
'Sample'
- -mcol, --marker_col <marker_col>
Name of marker column in STR file(s). Only used if format is ‘wide’.
- Default:
'Marker'
- -pfix, --penta_fix <penta_fix>
Whether to try to harmonize PentaE/D allele spelling.
- Default:
True
- -amel, --score_amel <score_amel>
Use Amelogenin for similarity scoring.
- Default:
False
- -o, --output_dir <output_dir>
Path to the output directory.
- Default:
'./STRprofiler'
- --version
Show the version and exit.
Arguments
- INPUT_FILES
Required argument(s)
compare
STRprofiler compares STR profiles to each other.
strprofiler compare [OPTIONS] INPUT_FILES...
Options
- -tanth, --tan_threshold <tan_threshold>
Minimum Tanabe score to report as potential matches in summary table.
- Default:
80
- -masqth, --mas_q_threshold <mas_q_threshold>
Minimum Masters (vs. query) score to report as potential matches in summary table.
- Default:
80
- -masrth, --mas_r_threshold <mas_r_threshold>
Minimum Masters (vs. reference) score to report as potential matches in summary table.
- Default:
80
- -mix, --mix_threshold <mix_threshold>
Number of markers with > 2 alleles allowed before a sample is flagged for potential mixing.
- Default:
3
- -sm, --sample_map <sample_map>
Path to sample map in csv format for renaming. First column should be sample names as given in STR file(s), second should be new names to assign. No header.
- -db, --database <database>
Path to an STR database file in csv, xlsx, tsv, or txt format.
- -acol, --amel_col <amel_col>
Name of Amelogenin column in STR file(s).
- Default:
'AMEL'
- -scol, --sample_col <sample_col>
Name of sample column in STR file(s).
- Default:
'Sample'
- -mcol, --marker_col <marker_col>
Name of marker column in STR file(s). Only used if format is ‘wide’.
- Default:
'Marker'
- -pfix, --penta_fix <penta_fix>
Whether to try to harmonize PentaE/D allele spelling.
- Default:
True
- -amel, --score_amel <score_amel>
Use Amelogenin for similarity scoring.
- Default:
False
- -o, --output_dir <output_dir>
Path to the output directory.
- Default:
'./STRprofiler'
- --version
Show the version and exit.
Arguments
- INPUT_FILES
Required argument(s)
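The mixing check controlled by --mix_threshold can be pictured as counting the markers that carry more than two distinct alleles. The sketch below is a minimal illustration under stated assumptions: the function name `is_potentially_mixed` and the comma-separated allele strings are hypothetical, and it assumes a sample is flagged only once the count exceeds the allowed number. STRprofiler's own `mixing_check` may treat the boundary differently.

```python
def is_potentially_mixed(alleles, three_allele_threshold=3):
    """Flag a profile given as {marker: "allele,allele,..."}.

    Hypothetical helper for illustration only, not STRprofiler's API.
    """
    n_multi = 0
    for marker, calls in alleles.items():
        # Distinct, non-empty allele calls for this marker.
        distinct = {a.strip() for a in calls.split(",") if a.strip()}
        if len(distinct) > 2:
            n_multi += 1
    # Assumption: flag only when the count exceeds the allowed number.
    return n_multi > three_allele_threshold


sample2 = {
    "D3S1358": "14,15,19",
    "D18S51": "12,18,19",
    "Penta D": "10,11,12",
    "vWA": "17,18",
    "FGA": "21",
}

flag_default = is_potentially_mixed(sample2)
flag_strict = is_potentially_mixed(sample2, three_allele_threshold=2)
print(flag_default, flag_strict)
```

Lowering the threshold makes the check stricter: the same profile, with three markers showing three alleles each, passes at the default of 3 but is flagged at 2.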
Querying CLASTR
STRprofiler can also be used to directly query CLASTR via their API.
This can be done from within the Shiny application, from the command line via the strprofiler clastr subcommand, or from Python by calling the clastr_query function directly.
Input File(s)
An example input file and reference database are available on GitHub.
STRprofiler can take either a single STR file or multiple STR files as input. These files can be csv, tsv, tab-separated text, or xlsx (first sheet used) files. The STR file(s) should be in either ‘wide’ or ‘long’ format. The long format expects all columns to map to the markers except for the designated sample name column with each row reflecting a different profile, e.g.:
| Sample | D1S1656 | DYS391 | D3S1358 | D2S441 | D16S539 | D5S818 |
|---|---|---|---|---|---|---|
| Line1 | 12,14 | 12 | 13 | 12,14 | 17.3 | 16,17 |
| Line2 | 12,14 | 11.3,12 | 13,15 | 12,14 | 17.3 | 16,17 |
| … | | | | | | |
The wide format expects a line for each marker for each sample, e.g.:
| Sample Name | Marker | Allele 1 | Size 1 | Height 1 | Allele 2 | Size 2 | Height 2 | Allele 3 |
|---|---|---|---|---|---|---|---|---|
| Sample1 | DYS391 | | | | | | | |
| Sample1 | D3S1358 | 16 | 128.29 | 8268 | 18 | 136.84 | 5467 | 16 |
| Sample1 | D16S539 | 12 | 110.7 | 9660 | 13 | 115.17 | 5215 | |
| Sample1 | Penta D | 9 | 415.04 | 5099 | 13 | 435.88 | 9426 | |
| Sample1 | D22S1045 | 15 | 455.95 | 13504 | 17 | 462.06 | 6186 | |
| Sample1 | Penta E | 11 | 397.7 | 7420 | 14 | 412.02 | 5986 | |
| Sample1 | D18S51 | 12 | 153.72 | 9134 | 16 | 170.48 | 10501 | |
| Sample1 | D2S1338 | 20 | 263.91 | 3209 | 21 | 267.97 | 3834 | |
| Sample1 | TH01 | 7 | 85.33 | 8305 | 9.3 | 97.43 | 7853 | |
| Sample1 | D7S820 | 10 | 292.51 | 12340 | 14 | 308.71 | 11784 | |
| Sample1 | D12S391 | 15 | 141.53 | 12870 | 18.3 | 157.12 | 13731 | |
| Sample1 | AMEL | X | 81.97 | 16696 | | | | |
| Sample1 | D10S1248 | 16 | 283.82 | 8469 | | | | |
| Sample1 | D13S317 | 12 | 328.21 | 7079 | | | | |
| Sample1 | D21S11 | 32.2 | 239.67 | 19231 | | | | |
| Sample1 | TPOX | 11 | 424.02 | 12239 | | | | |
| Sample1 | D19S433 | 14 | 228.37 | 14273 | | | | |
| Sample1 | FGA | 23 | 302.23 | 14599 | | | | |
| Sample2 | D16S539 | 9 | 97.59 | 9286 | 11 | 106.43 | 8592 | |
| Sample2 | TH01 | 9.3 | 97.45 | 5920 | | | | |
| Sample2 | D8S1179 | 13 | 101.1 | 26414 | | | | |
| Sample2 | AMEL | X | 82.1 | 7476 | Y | 88.34 | 8029 | |
| Sample2 | D3S1358 | 14 | 119.87 | 10146 | 15 | 124.14 | 10160 | 19 |
| Sample2 | D18S51 | 12 | 153.8 | 9316 | 18 | 178.79 | 9182 | 19 |
| Sample2 | Penta D | 10 | 420.13 | 7693 | 11 | 425.25 | 7945 | 12 |
| Sample2 | vWA | 17 | 156.9 | 7953 | 18 | 160.86 | 8230 | |
| Sample2 | TPOX | 9 | 416 | 6596 | 11 | 424.02 | 5304 | |
| Sample2 | D12S391 | 21 | 166.75 | 13481 | 22 | 170.9 | 14232 | |
| Sample2 | D22S1045 | 15 | 455.95 | 14310 | 17 | 462.06 | 10898 | |
| Sample2 | D2S441 | 14 | 236.24 | 18628 | | | | |
| Sample2 | DYS391 | 10 | 468.83 | 6722 | | | | |
| Sample2 | FGA | 21 | 294.67 | 11941 | | | | |
In this format, the marker_col must be specified.
Only columns beginning with “Allele” will be used to parse the alleles for each sample/marker.
Any other size or height columns will be ignored.
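To make the column-selection rule concrete, the sketch below collapses a small wide-format file into one comma-separated allele string per sample and marker, reading only columns whose names begin with "Allele". It is an illustration built on the standard library, not STRprofiler's own parser: the function name `parse_wide`, the inline CSV, and the removal of repeated allele calls are assumptions made for the example.

```python
import csv
import io

# A few rows in the wide layout; Size and Height columns are present
# but should not contribute to the parsed alleles.
WIDE_CSV = """Sample Name,Marker,Allele 1,Size 1,Height 1,Allele 2,Size 2,Height 2,Allele 3
Sample1,D3S1358,16,128.29,8268,18,136.84,5467,16
Sample2,AMEL,X,82.1,7476,Y,88.34,8029,
Sample2,D3S1358,14,119.87,10146,15,124.14,10160,19
Sample2,FGA,21,294.67,11941,,,,
"""


def parse_wide(handle, sample_col="Sample", marker_col="Marker"):
    """Return {sample: {marker: "allele,allele,..."}} from a wide file.

    Hypothetical helper for illustration only, not STRprofiler's API.
    """
    profiles = {}
    for row in csv.DictReader(handle):
        alleles = [
            value.strip()
            for column, value in row.items()
            if column and column.startswith("Allele") and value and value.strip()
        ]
        # Assumption: repeated calls are collapsed, keeping first-seen order.
        unique = list(dict.fromkeys(alleles))
        profiles.setdefault(row[sample_col], {})[row[marker_col]] = ",".join(unique)
    return profiles


# The example file names its sample column "Sample Name", so it is passed explicitly.
profiles = parse_wide(io.StringIO(WIDE_CSV), sample_col="Sample Name")
print(profiles)
```

Because the sample column in the example table is "Sample Name" rather than the default "Sample", the column name has to be supplied explicitly, which mirrors the role of the --sample_col and --marker_col options.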
Output Files
STRprofiler generates two types of output files. The first is a summary file, which contains the top hits for each sample above the specified scoring thresholds. This file provides a useful overview and includes a flag identifying samples with potential mixing for closer inspection. In the output directory, this file is named full_summary.strprofiler.YYYYMMDD.HH_MM_SS.csv, where the timestamp records when the program was run.
In addition to the marker columns, the summary file contains the following columns:
| Column Name | Description |
|---|---|
| mixed | Flag to indicate sample mixing. |
| top_hit | Name and Tanabe score of top match. |
| next_best | Name and Tanabe score of next best match. |
| tanabe_matches | Name and Tanabe score of matches above scoring threshold. |
| masters_query_matches | Name and Masters (vs. query) score of matches above scoring threshold. |
| masters_ref_matches | Name and Masters (vs. reference) score of matches above scoring threshold. |
The second is a sample-specific comparison file, which contains the results of the comparison between the query sample and all other provided samples. These files are generated for each STR profile provided in the input file(s) and named after the query sample in question. For example, if the input file contains a sample named Sample1, the output file will be named Sample1.strprofiler.YYYYMMDD.HH_MM_SS.csv.
In addition to the marker columns, this output contains the following columns:
| Column Name | Description |
|---|---|
| mixed | Flag to indicate sample mixing. |
| query_sample | Flag to indicate query sample. |
| n_shared_markers | Number of shared markers between query and reference sample. |
| n_shared_alleles | Number of shared alleles between query and reference sample. |
| n_query_alleles | Total number of alleles in query sample. |
| n_reference_alleles | Total number of alleles in reference sample. |
| tanabe_score | Tanabe similarity score. |
| masters_query_score | Masters (vs query) similarity score. |
| masters_ref_score | Masters (vs reference) similarity score. |
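The file-naming pattern used for both output types can be reproduced with a standard timestamp format, which is handy when locating or matching outputs in scripts. The helper name `output_name` is hypothetical, and the sketch assumes the pattern is exactly PREFIX.strprofiler.YYYYMMDD.HH_MM_SS.csv as described in this section.

```python
from datetime import datetime


def output_name(prefix, when=None):
    """Build PREFIX.strprofiler.YYYYMMDD.HH_MM_SS.csv for a given time.

    Hypothetical helper for illustration only, not STRprofiler's API.
    """
    when = when or datetime.now()
    return f"{prefix}.strprofiler.{when.strftime('%Y%m%d.%H_%M_%S')}.csv"


# A fixed timestamp keeps the example reproducible.
stamp = datetime(2024, 3, 5, 14, 7, 9)
summary_name = output_name("full_summary", stamp)
sample_name = output_name("Sample1", stamp)
print(summary_name)
print(sample_name)
```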
Database Comparison
STRprofiler can also be used to compare batches of samples against a larger database of samples.
strprofiler compare -db ExampleSTR_database.csv -o ./strprofiler_output STR1.xlsx
In this mode, inputs are compared against the database samples only, and not among themselves. Outputs will be as described above for sample input(s).
Database Format
The database should be formatted as a samples-by-markers matrix and saved as a csv file, e.g.:
| Sample | Amelogenin | CSF1PO | D13S317 | D16S539 | D18S51 | D19S433 | D21S11 | D2S1338 | D3S1358 |
|---|---|---|---|---|---|---|---|---|---|
| sample1 | X,Y | 12 | 8 | 13 | 14 | 14 | 31,31.2 | 17,19 | 15 |
| sample2 | X | 10 | 9 | 13 | 16 | 12,14 | 29 | 20,23 | 15,16 |
Optionally, one may provide two metadata columns - “Center” and “Passage”, which will be recognized as non-marker columns.
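The sketch below shows one way such a database file could be read so that the "Center" and "Passage" columns are kept apart from the marker columns. It is a standard-library illustration rather than STRprofiler's loader: the function name `load_database`, the inline CSV, and the metadata values in it are made up for the example. Note that cells holding several alleles contain commas, so they must be quoted in a csv file.

```python
import csv
import io

# Columns treated as metadata rather than markers.
METADATA_COLS = {"Center", "Passage"}

# Illustrative database; the Center and Passage values are invented.
DB_CSV = """Sample,Center,Passage,Amelogenin,CSF1PO,D21S11
sample1,SiteA,3,"X,Y",12,"31,31.2"
sample2,SiteB,5,X,10,29
"""


def load_database(handle, sample_col="Sample"):
    """Split each row into marker alleles and metadata.

    Hypothetical helper for illustration only, not STRprofiler's API.
    """
    profiles, metadata = {}, {}
    for row in csv.DictReader(handle):
        name = row[sample_col]
        metadata[name] = {c: row[c] for c in METADATA_COLS if c in row}
        profiles[name] = {
            column: value
            for column, value in row.items()
            if column != sample_col and column not in METADATA_COLS
        }
    return profiles, metadata


profiles, metadata = load_database(io.StringIO(DB_CSV))
print(profiles["sample1"])
print(metadata["sample1"])
```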
The STRprofiler App
New in v0.2.0 is strprofiler app, a subcommand that launches a Shiny application that allows for user queries against an uploaded or pre-defined database (provided with the -db parameter) of STR profiles.
This application can provide a convenient portal to a group’s STR database and can be hosted on standard Shiny servers, Posit Connect instances, or ShinyApps.io.
An example of the application can be seen here.
Deploying an STRprofiler App
Building an app for deployment to any of the above options is simple.
First, make your app.py file:
from strprofiler.shiny_app.shiny_app import create_app
database = "./tester_db.csv"
app = create_app(db=database)
If no database is provided, an example database included with the package will be used. The database file should be a csv file with the same format as described above.
Then create a requirements.txt file in the same directory that lists strprofiler as a dependency.
This app can then be deployed to any of the above endpoints as one would with any other Shiny app.
Alternatively, one could export it as a shinylive app and host it on GitHub Pages or similar.
Other Functions
- strprofiler.utils.make_summary(samp_df, alleles, tan_threshold, mas_q_threshold, mas_r_threshold, mixed, s_name)
Generate summary line from full sample-specific output.
- Parameters:
samp_df (pandas.DataFrame) – Sample-specific output DataFrame containing all comparisons to other samples.
alleles (dict) – Alleles for sample.
tan_threshold (float) – Tanabe score threshold to report matching samples.
mas_q_threshold (float) – Masters (query) score threshold to report matching samples.
mas_r_threshold (float) – Masters (reference) score threshold to report matching samples.
mixed (bool) – Flag for whether sample is potentially mixed.
s_name (str) – Sample name.
- Returns:
Dictionary of summary line output for sample.
- Return type:
OrderedDict
- strprofiler.utils.mixing_check(alleles, three_allele_threshold=3)
Checks for potential sample mixing.
- Parameters:
alleles (dict) – Alleles for sample.
three_allele_threshold (int, optional) – Number of markers with >2 alleles allowed before sample is flagged for potential mixing, defaults to 3
- Returns:
Whether sample is potentially mixed.
- Return type:
bool
- strprofiler.utils.score_query(query, reference, use_amel=False, amel_col='AMEL')
Calculates the Tanabe and Masters scores for a query sample against a reference sample.
- Parameters:
query (dict) – Alleles for query sample.
reference (dict) – Alleles for reference sample.
use_amel (bool, optional) – Whether to include amelogenin in scoring, defaults to False
amel_col (str, optional) – Name of amelogenin column, defaults to “AMEL”
- Returns:
Dictionary of scores for query sample against reference sample.
- Return type:
dict
- strprofiler.utils.str_ingress(paths, sample_col='Sample', marker_col='Marker', sample_map=None, penta_fix=True)
Reads in a list of paths and returns a pandas DataFrame of STR alleles in long format.
- Parameters:
paths (list of pathlib.Path) – STR profile files to read in.
sample_col (str, optional) – Name of sample column in each STR profile, defaults to “Sample”
marker_col (str, optional) – Name of marker identifier column in each STR profile, defaults to “Marker”. Ignored if file in long format.
sample_map (pandas.DataFrame, optional) – Two column DataFrame containing sample identifiers in first column and new sample names to apply in second column, defaults to None
penta_fix (bool, optional) – Whether to try to coerce “Penta” alleles to a common spelling, defaults to True
- Returns:
A pandas DataFrame of STR alleles in long format.
- Return type:
pandas.DataFrame
- strprofiler.utils.validate_api_markers(markers)
Compare list of markers against controlled list of marker names from CLASTR.
- Parameters:
markers (list) – List of markers to compare against controlled marker name list.
- Returns:
List of non-compliant marker names.
- Return type:
list
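The idea behind validate_api_markers can be sketched as a simple membership check that returns the names not found in a controlled list. Everything named below is hypothetical: `find_noncompliant` is not part of STRprofiler, and `CONTROLLED_MARKERS` is a small illustrative subset, whereas the real controlled list is defined by CLASTR.

```python
# Illustrative subset only; the real controlled list comes from CLASTR.
CONTROLLED_MARKERS = {"Amelogenin", "CSF1PO", "D3S1358", "TH01", "TPOX", "vWA"}


def find_noncompliant(markers, controlled=CONTROLLED_MARKERS):
    """Return the marker names that are absent from the controlled list.

    Hypothetical helper for illustration only, not STRprofiler's API.
    """
    return [marker for marker in markers if marker not in controlled]


bad = find_noncompliant(["TH01", "vWA", "AMEL", "D3S1358", "PentaE"])
print(bad)
```

Names returned by such a check would need to be renamed to the controlled spelling before a profile is submitted.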
Contributing
You can contribute by creating issues to highlight bugs and make suggestions for additional features. Pull requests are also very welcome.
License
STRprofiler is released under the MIT license. You are free to use, modify, or redistribute it in almost any way, provided you retain the original copyright and license notice. It is released with zero warranty for any purpose and the authors accept no liability for its use. Read the full license for additional details.
Reference
If you use STRprofiler in your research, please cite the following: Jared Andrews, Mike Lloyd, & Sam Culley. (2024). j-andrews7/strprofiler: v0.4.0 (v0.4.0). Zenodo. https://doi.org/10.5281/zenodo.7348386