SimilaritySearchingFingerprints.pl - Perform similarity search using fingerprints strings data in SD, FP and CSV/TSV text file(s)
SimilaritySearchingFingerprints.pl ReferenceFPFile DatabaseFPFile
SimilaritySearchingFingerprints.pl [--alpha number] [--beta number] [-b, --BitVectorComparisonMode TanimotoSimilarity | TverskySimilarity | ...] [--DatabaseColMode ColNum | ColLabel] [--DatabaseCompoundIDCol col number | col name] [--DatabaseCompoundIDPrefix text] [--DatabaseCompoundIDField DataFieldName] [--DatabaseCompoundIDMode DataField | MolName | LabelPrefix | MolNameOrLabelPrefix] [--DatabaseDataCols "DataColNum1, DataColNum2,... " | "DataColLabel1, DataColLabel2,... "] [--DatabaseDataColsMode All | Specify | CompoundID] [--DatabaseDataFields "FieldLabel1, FieldLabel2,... "] [--DatabaseDataFieldsMode All | Common | Specify | CompoundID] [--DatabaseFingerprintsCol col number | col name] [--DatabaseFingerprintsField FieldLabel] [--DistanceCutoff number] [-d, --detail InfoLevel] [-f, --fast] [--FingerprintsMode AutoDetect | FingerprintsBitVectorString | FingerprintsVectorString] [-g, --GroupFusionRule Max, Mean, Median, Min, Sum, Euclidean] [--GroupFusionApplyCutoff Yes | No] [-h, --help] [--InDelim comma | semicolon] [-k, --KNN all | number] [-m, --mode IndividualReference | MultipleReferences] [-n, --NumOfSimilarMolecules number] [--OutDelim comma | tab | semicolon] [--output SD | text | both] [-o, --overwrite] [-p, --PercentSimilarMolecules number] [--precision number] [-q, --quote Yes | No] [--ReferenceColMode ColNum | ColLabel] [--ReferenceCompoundIDCol col number | col name] [--ReferenceCompoundIDPrefix text] [--ReferenceCompoundIDField DataFieldName] [--ReferenceCompoundIDMode DataField | MolName | LabelPrefix | MolNameOrLabelPrefix] [--ReferenceFingerprintsCol col number | col name] [--ReferenceFingerprintsField FieldLabel] [-r, --root RootName] [-s, --SearchMode SimilaritySearch | DissimilaritySearch] [--SimilarCountMode NumOfSimilar | PercentSimilar] [--SimilarityCutoff number] [-v, --VectorComparisonMode TanimotoSimilarity | ... | ManhattanDistance | ...] [--VectorComparisonFormulism AlgebraicForm | BinaryForm | SetTheoreticForm] [-w, --WorkingDir dirname] ReferenceFingerprintsFile DatabaseFingerprintsFile
Perform molecular similarity search [ Ref 94-113 ] using fingerprint bit-vector or vector strings data in SD, FP, or CSV/TSV text files corresponding to ReferenceFingerprintsFile and DatabaseFingerprintsFile, and generate SD and CSV/TSV text file(s) containing database molecules which are similar to reference molecule(s). The reference molecules are also referred to as query or seed molecules and database molecules as target molecules in the literature.
The current release of MayaChemTools supports two types of similarity search modes: IndividualReference or MultipleReferences. For default value of MultipleReferences for -m, --mode option, reference molecules are considered as a set and -g, --GroupFusionRule is used to calculate similarity of a database molecule against reference molecules set. The group fusion rule is also referred to as data fusion or consensus scoring in the literature. However, for IndividualReference value of -m, --mode option, reference molecules are treated as individual molecules and each reference molecule is compared against a database molecule by itself to identify similar molecules.
The molecular dissimilarity search can also be performed using DissimilaritySearch value for -s, --SearchMode option. During dissimilarity search or usage of distance comparison coefficient in similarity search, the meaning of fingerprints comparison value is automatically reversed as shown below:
During IndividualReference value of -m, --mode option for similarity search, fingerprints bit-vector or vector string of each reference molecule is compared with database molecules using specified similarity or distance coefficients to identify most similar molecules for each reference molecule. Based on value of --SimilarCountMode, up to -n, --NumOfSimilarMolecules or -p, --PercentSimilarMolecules at specified --SimilarityCutoff or --DistanceCutoff are identified for each reference molecule.
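The IndividualReference flow described above can be sketched as follows. This is a minimal illustration, not the script's implementation: fingerprints are represented as Python sets of on-bit positions, the helper names are hypothetical, and the inclusive cutoff test (`>=`) is an assumption.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity for fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def individual_reference_search(references, database,
                                num_similar=10, similarity_cutoff=0.75):
    """For each reference molecule, rank database molecules and keep the top hits."""
    results = {}
    for ref_id, ref_fp in references.items():
        scored = [(db_id, tanimoto(ref_fp, db_fp))
                  for db_id, db_fp in database.items()]
        # Assumption: a hit must reach the cutoff (inclusive comparison).
        hits = [(db_id, s) for db_id, s in scored if s >= similarity_cutoff]
        # Most similar first, then truncate to the requested number.
        hits.sort(key=lambda pair: pair[1], reverse=True)
        results[ref_id] = hits[:num_similar]
    return results


references = {"Ref1": {1, 2, 3, 4}}
database = {"Db1": {1, 2, 3, 4}, "Db2": {1, 2, 3}, "Db3": {7, 8}}
hits = individual_reference_search(references, database, num_similar=2)
```

Each reference molecule gets its own ranked hit list, which is the key difference from the MultipleReferences mode, where one fused value is computed per database molecule.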
During MultipleReferences value of -m, --mode option for similarity search, all reference molecules are considered as a set and -g, --GroupFusionRule is used to calculate similarity of a database molecule against reference molecules set either using all reference molecules or number of k-nearest neighbors (k-NN) to a database molecule specified using -k, --kNN. The fingerprints bit-vector or vector string of each reference molecule in a set is compared with a database molecule using a similarity or distance coefficient specified via -b, --BitVectorComparisonMode or -v, --VectorComparisonMode. The reference molecules whose comparison values with a database molecule fall outside specified --SimilarityCutoff or --DistanceCutoff are ignored during Yes value of --GroupFusionApplyCutoff. The specified -g, --GroupFusionRule is applied to -k, --kNN reference molecules to calculate final similarity value between a database molecule and reference molecules set.
The input fingerprints SD, FP, or Text (CSV/TSV) files for ReferenceFingerprintsFile and DatabaseFingerprintsFile must contain valid fingerprint bit-vector or vector strings data corresponding to same type of fingerprints.
The valid fingerprints SDFile extensions are .sdf and .sd. The valid fingerprints FPFile extensions are .fpf and .fp. The valid fingerprints TextFile (CSV/TSV) extensions are .csv and .tsv for comma/semicolon and tab delimited text files respectively. The --indelim option determines the format of TextFile. Any file which doesn't correspond to the format indicated by --indelim option is ignored.
Example of FP file containing fingerprints bit-vector string data:
Example of FP file containing fingerprints vector string data:
Example of SD file containing fingerprints bit-vector string data:
Example of CSV TextFile containing fingerprints bit-vector string data:
The current release of MayaChemTools supports the following types of fingerprint bit-vector and vector strings:
Value of alpha parameter for calculating Tversky similarity coefficient specified for -b, --BitVectorComparisonMode option. It corresponds to weights assigned for bits set to "1" in a pair of fingerprint bit-vectors during the calculation of similarity coefficient. Possible values: 0 to 1. Default value: <0.5>.
Value of beta parameter for calculating WeightedTanimoto and WeightedTversky similarity coefficients specified for -b, --BitVectorComparisonMode option. It is used to weight the contributions of bits set to "0" during the calculation of similarity coefficients. Possible values: 0 to 1. Default value of <1> makes WeightedTanimoto and WeightedTversky equivalent to Tanimoto and Tversky.
Specify what similarity coefficient to use for calculating similarity between fingerprints bit-vector string data values in ReferenceFingerprintsFile and DatabaseFingerprintsFile during similarity search. Possible values: TanimotoSimilarity | TverskySimilarity | .... Default: TanimotoSimilarity
The current release supports the following similarity coefficients: BaroniUrbaniSimilarity, BuserSimilarity, CosineSimilarity, DiceSimilarity, DennisSimilarity, ForbesSimilarity, FossumSimilarity, HamannSimilarity, JacardSimilarity, Kulczynski1Similarity, Kulczynski2Similarity, MatchingSimilarity, McConnaugheySimilarity, OchiaiSimilarity, PearsonSimilarity, RogersTanimotoSimilarity, RussellRaoSimilarity, SimpsonSimilarity, SkoalSneath1Similarity, SkoalSneath2Similarity, SkoalSneath3Similarity, TanimotoSimilarity, TverskySimilarity, YuleSimilarity, WeightedTanimotoSimilarity, WeightedTverskySimilarity. These similarity coefficients are described below.
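For two fingerprint bit-vectors A and B of same size, let:
Na = Number of bits set to "1" in A
Nb = Number of bits set to "1" in B
Nc = Number of bits set to "1" in both A and B
Nd = Number of bits set to "0" in both A and B
Nt = Number of bits set to "1" or "0" in A or B (size of A or B) = Na + Nb - Nc + Nd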
Then, various similarity coefficients [ Ref. 40 - 42 ] for a pair of bit-vectors A and B are defined as follows:
BaroniUrbaniSimilarity: ( SQRT( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as Buser )
BuserSimilarity: ( SQRT ( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as BaroniUrbani )
CosineSimilarity: Nc / SQRT ( Na * Nb ) (same as Ochiai)
DiceSimilarity: (2 * Nc) / ( Na + Nb )
DennisSimilarity: ( Nc * Nd - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Nt * Na * Nb)
ForbesSimilarity: ( Nt * Nc ) / ( Na * Nb )
FossumSimilarity: ( Nt * ( ( Nc - 1/2 ) ** 2 ) ) / ( Na * Nb )
HamannSimilarity: ( ( Nc + Nd ) - ( Na - Nc ) - ( Nb - Nc ) ) / Nt
JaccardSimilarity: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Tanimoto)
Kulczynski1Similarity: Nc / ( ( Na - Nc ) + ( Nb - Nc) ) = Nc / ( Na + Nb - 2Nc )
Kulczynski2Similarity: ( ( Nc / 2 ) * ( 2 * Nc + ( Na - Nc ) + ( Nb - Nc) ) ) / ( ( Nc + ( Na - Nc ) ) * ( Nc + ( Nb - Nc ) ) ) = 0.5 * ( Nc / Na + Nc / Nb )
MatchingSimilarity: ( Nc + Nd ) / Nt
McConnaugheySimilarity: ( Nc ** 2 - ( Na - Nc ) * ( Nb - Nc) ) / ( Na * Nb )
OchiaiSimilarity: Nc / SQRT ( Na * Nb ) (same as Cosine)
PearsonSimilarity: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Na * Nb * ( Na - Nc + Nd ) * ( Nb - Nc + Nd ) )
RogersTanimotoSimilarity: ( Nc + Nd ) / ( ( Na - Nc) + ( Nb - Nc) + Nt) = ( Nc + Nd ) / ( Na + Nb - 2Nc + Nt)
RussellRaoSimilarity: Nc / Nt
SimpsonSimilarity: Nc / MIN ( Na, Nb)
SkoalSneath1Similarity: Nc / ( Nc + 2 * ( Na - Nc) + 2 * ( Nb - Nc) ) = Nc / ( 2 * Na + 2 * Nb - 3 * Nc )
SkoalSneath2Similarity: ( 2 * Nc + 2 * Nd ) / ( Nc + Nd + Nt )
SkoalSneath3Similarity: ( Nc + Nd ) / ( ( Na - Nc ) + ( Nb - Nc ) ) = ( Nc + Nd ) / ( Na + Nb - 2 * Nc )
TanimotoSimilarity: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Jaccard)
TverskySimilarity: Nc / ( alpha * ( Na - Nc ) + ( 1 - alpha) * ( Nb - Nc) + Nc ) = Nc / ( alpha * ( Na - Nb ) + Nb)
YuleSimilarity: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / ( ( Nc * Nd ) + ( ( Na - Nc ) * ( Nb - Nc ) ) )
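A few of the coefficients above can be checked numerically with a short sketch. It assumes the usual reading of the counts (Na and Nb are the on-bit counts of A and B, Nc the bits set to "1" in both, Nd the bits set to "0" in both, Nt the vector size) and represents bit-vectors as plain "0"/"1" strings. The function names are illustrative and are not part of MayaChemTools, and degenerate vectors with no bits set are not handled.

```python
import math


def bit_counts(a: str, b: str):
    """Return (Na, Nb, Nc, Nd, Nt) for two equal-length '0'/'1' strings."""
    if len(a) != len(b):
        raise ValueError("bit-vectors must have the same size")
    na = a.count("1")
    nb = b.count("1")
    nc = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    nd = sum(1 for x, y in zip(a, b) if x == "0" and y == "0")
    return na, nb, nc, nd, len(a)


def tanimoto(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / (na + nb - nc)


def dice(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return (2 * nc) / (na + nb)


def cosine(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / math.sqrt(na * nb)


def tversky(a, b, alpha=0.5):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / (alpha * (na - nb) + nb)


def matching(a, b):
    _, _, nc, nd, nt = bit_counts(a, b)
    return (nc + nd) / nt


A = "11010010"
B = "10011010"
print(bit_counts(A, B), tanimoto(A, B), dice(A, B))
```

For this pair, Na = Nb = 4, Nc = 3, Nd = 3 and Nt = 8, so Tanimoto is 3 / 5 and Dice is 6 / 8.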
Values of Tanimoto/Jaccard and Tversky coefficients are dependent on only those bits which are set to "1" in both A and B. In order to take into account all bit positions, modified versions of Tanimoto [ Ref. 42 ] and Tversky [ Ref. 43 ] have been developed.
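Let:
Na' = Number of bits set to "0" in A
Nb' = Number of bits set to "0" in B
Nc' = Number of bits set to "0" in both A and B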
Tanimoto': Nc' / ( ( Na' - Nc') + ( Nb' - Nc' ) + Nc' ) = Nc' / ( Na' + Nb' - Nc' )
Tversky': Nc' / ( alpha * ( Na' - Nc' ) + ( 1 - alpha) * ( Nb' - Nc' ) + Nc' ) = Nc' / ( alpha * ( Na' - Nb' ) + Nb')
Then:
WeightedTanimotoSimilarity = beta * Tanimoto + (1 - beta) * Tanimoto'
WeightedTverskySimilarity = beta * Tversky + (1 - beta) * Tversky'
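The weighted form can be illustrated with a small sketch, assuming the primed counts refer to bits set to "0" in the same way the unprimed counts refer to bits set to "1". The helper names are hypothetical, and vectors lacking the counted symbol entirely are not handled.

```python
def tanimoto_for_symbol(a: str, b: str, symbol: str) -> float:
    """Tanimoto computed over positions holding `symbol` ('1' or '0')."""
    na = a.count(symbol)
    nb = b.count(symbol)
    nc = sum(1 for x, y in zip(a, b) if x == symbol and y == symbol)
    return nc / (na + nb - nc)


def weighted_tanimoto(a: str, b: str, beta: float = 1.0) -> float:
    """beta * Tanimoto + (1 - beta) * Tanimoto'."""
    on_bits = tanimoto_for_symbol(a, b, "1")   # Tanimoto
    off_bits = tanimoto_for_symbol(a, b, "0")  # Tanimoto'
    return beta * on_bits + (1 - beta) * off_bits


A = "1100"
B = "1000"
print(weighted_tanimoto(A, B, beta=1.0))  # plain Tanimoto: 1 / 2
print(weighted_tanimoto(A, B, beta=0.5))  # 0.5 * 1/2 + 0.5 * 2/3
```

With beta set to 1 the off-bit term vanishes, which is why the default reproduces the unweighted coefficient.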
Specify how columns are identified in database fingerprints TextFile: using column number or column label. Possible values: ColNum or ColLabel. Default value: ColNum.
This value is --DatabaseColMode mode specific. It specifies column to use for retrieving compound ID from database fingerprints TextFile during similarity and dissimilarity search for output SD and CSV/TSV text files. Possible values: col number or col label. Default value: first column containing the word compoundID in its column label or sequentially generated IDs.
This is only used for CompoundID value of --DatabaseDataColsMode option.
Specify compound ID prefix to use during sequential generation of compound IDs for database fingerprints SDFile and TextFile. Default value: Cmpd. The default value generates compound IDs which look like Cmpd<Number>.
For database fingerprints SDFile, this value is only used during LabelPrefix | MolNameOrLabelPrefix values of --DatabaseCompoundIDMode option; otherwise, it's ignored.
Examples for LabelPrefix or MolNameOrLabelPrefix value of --DatabaseCompoundIDMode:
The values specified above generate compound IDs which correspond to Compound<Number> instead of default value of Cmpd<Number>.
Specify database fingerprints SDFile datafield label for generating compound IDs. This value is only used during DataField value of --DatabaseCompoundIDMode option.
Examples for DataField value of --DatabaseCompoundIDMode:
Specify how to generate compound IDs from database fingerprints SDFile during similarity and dissimilarity search for output SD and CSV/TSV text files: use a SDFile datafield value; use molname line from SDFile; generate a sequential ID with specific prefix; use combination of both MolName and LabelPrefix with usage of LabelPrefix values for empty molname lines.
Possible values: DataField | MolName | LabelPrefix | MolNameOrLabelPrefix. Default: LabelPrefix.
For MolNameOrLabelPrefix value of --DatabaseCompoundIDMode, molname line in SDFile takes precedence over sequential compound IDs generated using LabelPrefix and only empty molname values are replaced with sequential compound IDs.
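The four compound ID modes can be summarized in a short sketch. The function and its parameters are hypothetical; in particular, returning an empty string for a missing data field is an assumption, not documented behavior.

```python
def compound_id(mode, index, mol_name="", data_fields=None,
                id_field="", prefix="Cmpd"):
    """Pick a compound ID for the molecule at 1-based position `index`."""
    data_fields = data_fields or {}
    if mode == "DataField":
        # Assumption: a missing field yields an empty ID.
        return data_fields.get(id_field, "")
    if mode == "MolName":
        return mol_name
    if mode == "LabelPrefix":
        return f"{prefix}{index}"
    if mode == "MolNameOrLabelPrefix":
        # The molname line wins; only empty names fall back to the prefix.
        return mol_name.strip() or f"{prefix}{index}"
    raise ValueError(f"unknown compound ID mode: {mode}")


print(compound_id("MolNameOrLabelPrefix", 3, mol_name="Aspirin"))
print(compound_id("MolNameOrLabelPrefix", 3, mol_name=""))
print(compound_id("LabelPrefix", 7, prefix="Compound"))
```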
This is only used for CompoundID value of --DatabaseDataFieldsMode option.
This value is --DatabaseColMode mode specific. It is a comma delimited list of database fingerprints TextFile data column numbers or labels to extract and write to SD and CSV/TSV text files along with other information for SD | text | both values of --output option.
This is only used for Specify value of --DatabaseDataColsMode option.
Examples:
Specify how data columns from database fingerprints TextFile are transferred to output SD and CSV/TSV text files along with other information for SD | text | both values of --output option: transfer all data columns; extract specified data columns; generate a compound ID using the database compound ID prefix. Possible values: All | Specify | CompoundID. Default value: CompoundID.
Comma delimited list of database fingerprints SDFile data fields to extract and write to SD and CSV/TSV text files along with other information for SD | text | both values of --output option.
This is only used for Specify value of --DatabaseDataFieldsMode option.
Examples:
Specify how data fields from database fingerprints SDFile are transferred to output SD and CSV/TSV text files along with other information for SD | text | both values of --output option: transfer all SD data fields; transfer SD data fields common to all compounds; extract specified data fields; generate a compound ID using molname line, a compound prefix, or a combination of both. Possible values: All | Common | Specify | CompoundID. Default value: CompoundID.
This value is --DatabaseColMode specific. It specifies fingerprints column to use during similarity and dissimilarity search for database fingerprints TextFile. Possible values: col number or col label. Default value: first column containing the word Fingerprints in its column label.
Fingerprints field label to use during similarity and dissimilarity search for database fingerprints SDFile. Default value: first data field label containing the word Fingerprints in its label.
Distance cutoff value to use during comparison of distance value between a pair of database and reference molecule calculated by distance comparison methods for fingerprints vector string data values. Possible values: Any valid number. Default value: 10.
The comparison value between a pair of database and reference molecule must meet the cutoff criterion as shown below:
This option is only used during distance coefficients values of -v, --VectorComparisonMode option.
This option is ignored during No value of --GroupFusionApplyCutoff for MultipleReferences -m, --mode.
Level of information to print about lines being ignored. Default: 1. Possible values: 1, 2 or 3.
In this mode, fingerprints columns specified using --FingerprintsCol for reference and database fingerprints TextFile(s), and --FingerprintsField for reference and database fingerprints SDFile(s) are assumed to contain valid fingerprints data and no checking is performed before performing similarity and dissimilarity search. By default, fingerprints data is validated before computing pairwise similarity and distance coefficients.
Format of fingerprint strings data in reference and database fingerprints SD, FP, or Text (CSV/TSV) files: automatically detect format of fingerprints string created by MayaChemTools fingerprints generation scripts or explicitly specify its format. Possible values: AutoDetect | FingerprintsBitVectorString | FingerprintsVectorString. Default value: AutoDetect.
Specify what group fusion [ Ref 94-97, Ref 100, Ref 105 ] rule to use for calculating similarity of a database molecule against a set of reference molecules during MultipleReferences value of similarity search -m, --mode. Possible values: Max, Min, Mean, Median, Sum, Euclidean. Default value: Max. Mean value corresponds to average or arithmetic mean. The group fusion rule is also referred to as data fusion or consensus scoring in the literature.
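For a reference molecules set and a database molecule, let:
N = Number of reference molecules in the set
i = ith reference molecule in the set
n = Nth (last) reference molecule in the set
d = Database molecule
Cid = Fingerprints comparison value between the ith reference molecule and database molecule d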
Then, various group fusion rules to calculate fused similarity between a database molecule and reference molecules set are defined as follows:
Max: MAX ( C1d, C2d, ..., Cid, ..., Cnd )
Min: MIN ( C1d, C2d, ..., Cid, ..., Cnd )
Mean: SUM ( C1d, C2d, ..., Cid, ..., Cnd ) / N
Median: MEDIAN ( C1d, C2d, ..., Cid, ..., Cnd )
Sum: SUM ( C1d, C2d, ..., Cid, ..., Cnd )
Euclidean: SQRT( SUM( C1d ** 2, C2d ** 2, ..., Cid ** 2, ..., Cnd ** 2) )
The fingerprints bit-vector or vector string of each reference molecule in a set is compared with a database molecule using a similarity or distance coefficient specified via -b, --BitVectorComparisonMode or -v, --VectorComparisonMode. The reference molecules whose comparison values with a database molecule fall outside specified --SimilarityCutoff or --DistanceCutoff are ignored during Yes value of --GroupFusionApplyCutoff. The specified -g, --GroupFusionRule is applied to -k, --kNN reference molecules to calculate final fused similarity value between a database molecule and reference molecules set.
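The fusion rules and the cutoff and k-NN steps can be sketched as below for a similarity coefficient, where higher values mean more similar. Two points are assumptions rather than documented behavior: the cutoff filter is applied before the k nearest references are selected, and Mean divides by the number of values actually fused. All names are illustrative.

```python
import math
import statistics

# Group fusion rules applied to the comparison values C1d ... Cnd.
FUSION_RULES = {
    "Max": max,
    "Min": min,
    "Mean": statistics.mean,
    "Median": statistics.median,
    "Sum": sum,
    "Euclidean": lambda values: math.sqrt(sum(v * v for v in values)),
}


def fused_similarity(values, rule="Max", knn=None, cutoff=None):
    """Fuse similarity values between one database molecule and a reference set.

    values: one similarity value per reference molecule.
    knn:    number of nearest references to keep, or None for all.
    cutoff: similarity cutoff, or None when --GroupFusionApplyCutoff is No.
    """
    kept = list(values)
    if cutoff is not None:
        # Assumption: filter first, using an inclusive comparison.
        kept = [v for v in kept if v >= cutoff]
    kept.sort(reverse=True)  # nearest (most similar) references first
    if knn is not None:
        kept = kept[:knn]
    if not kept:
        return None  # no reference molecule qualified
    return FUSION_RULES[rule](kept)


values = [0.9, 0.8, 0.5, 0.2]
print(fused_similarity(values, "Max", cutoff=0.75))
print(fused_similarity(values, "Mean", cutoff=0.75))
print(fused_similarity(values, "Sum", knn=3))
```

Keeping the rules in a lookup table makes it easy to see that only the final aggregation differs between rules; the filtering and neighbor selection are shared.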
During dissimilarity search or usage of distance comparison coefficient in similarity search, the meaning of fingerprints comparison value is automatically reversed as shown below:
Consequently, Max implies highest and lowest comparison value for usage of similarity and distance coefficient respectively during similarity search. And it corresponds to lowest and highest comparison value for usage of similarity and distance coefficient respectively during dissimilarity search. During Min fusion rule, the highest and lowest comparison values are appropriately reversed.
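The reversal described in this paragraph reduces to one rule: a higher comparison value is "better" exactly when the search mode and the coefficient type agree. The sketch below encodes that rule; the function names and the "Similarity"/"Distance" labels for the coefficient type are illustrative assumptions.

```python
def higher_is_better(search_mode: str, coefficient_kind: str) -> bool:
    """True when a larger comparison value ranks first.

    search_mode:      "SimilaritySearch" or "DissimilaritySearch"
    coefficient_kind: "Similarity" or "Distance" (hypothetical labels)
    """
    is_similarity_search = search_mode == "SimilaritySearch"
    is_similarity_coefficient = coefficient_kind == "Similarity"
    return is_similarity_search == is_similarity_coefficient


def fuse_max(values, search_mode, coefficient_kind):
    """Max rule: pick the best value under the current interpretation."""
    if higher_is_better(search_mode, coefficient_kind):
        return max(values)
    return min(values)


def fuse_min(values, search_mode, coefficient_kind):
    """Min rule: the reverse of the Max rule."""
    if higher_is_better(search_mode, coefficient_kind):
        return min(values)
    return max(values)


values = [0.2, 0.9, 0.5]
print(fuse_max(values, "SimilaritySearch", "Similarity"))     # highest value
print(fuse_max(values, "SimilaritySearch", "Distance"))       # lowest value
print(fuse_max(values, "DissimilaritySearch", "Similarity"))  # lowest value
```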
Specify whether to apply --SimilarityCutoff or --DistanceCutoff values during application of -g, --GroupFusionRule to reference molecules set. Possible values: Yes or No. Default value: Yes.
During Yes value of --GroupFusionApplyCutoff, the reference molecules whose comparison values with a database molecule fall outside specified --SimilarityCutoff or --DistanceCutoff are not used to calculate final fused similarity value between a database molecule and reference molecules set.
Print this help message.
Input delimiter for reference and database fingerprints CSV TextFile(s). Possible values: comma or semicolon. Default value: comma. For TSV files, this option is ignored and tab is used as a delimiter.
Number of k-nearest neighbors (k-NN) reference molecules to use during -g, --GroupFusionRule for calculating similarity of a database molecule against a set of reference molecules. Possible values: all | positive integers. Default: all.
After ranking similarity values between a database molecule and reference molecules during MultipleReferences value of similarity search -m, --mode option, the top -k, --KNN reference molecules are selected and used during -g, --GroupFusionRule.
This option is -s, --SearchMode dependent: It corresponds to dissimilar molecules during DissimilaritySearch value of -s, --SearchMode option.
Specify how to treat reference molecules in ReferenceFingerprintsFile during similarity search: Treat each reference molecule individually during similarity search or perform similarity search by treating multiple reference molecules as a set. Possible values: IndividualReference | MultipleReferences. Default value: MultipleReferences.
During IndividualReference value of -m, --mode for similarity search, fingerprints bit-vector or vector string of each reference molecule is compared with database molecules using specified similarity or distance coefficients to identify most similar molecules for each reference molecule. Based on value of --SimilarCountMode, up to -n, --NumOfSimilarMolecules or -p, --PercentSimilarMolecules at specified --SimilarityCutoff or --DistanceCutoff are identified for each reference molecule.
During MultipleReferences value of -m, --mode for similarity search, all reference molecules are considered as a set and -g, --GroupFusionRule is used to calculate similarity of a database molecule against reference molecules set either using all reference molecules or number of k-nearest neighbors (k-NN) to a database molecule specified using -k, --kNN. The fingerprints bit-vector or vector string of each reference molecule in a set is compared with a database molecule using a similarity or distance coefficient specified via -b, --BitVectorComparisonMode or -v, --VectorComparisonMode. The reference molecules whose comparison values with a database molecule fall outside specified --SimilarityCutoff or --DistanceCutoff are ignored. The specified -g, --GroupFusionRule is applied to rest of -k, --kNN reference molecules to calculate final similarity value between a database molecule and reference molecules set.
The meaning of similarity and distance is automatically reversed during DissimilaritySearch value of -s, --SearchMode along with appropriate handling of --SimilarityCutoff or --DistanceCutoff values.
Maximum number of most similar database molecules to find for each reference molecule or set of reference molecules based on IndividualReference or MultipleReferences value of similarity search -m, --mode option. Default: 10. Valid values: positive integers.
This option is ignored during PercentSimilar value of --SimilarCountMode option.
This option is -s, --SearchMode dependent: It corresponds to dissimilar molecules during DissimilaritySearch value of -s, --SearchMode option.
Delimiter for output CSV/TSV text file. Possible values: comma, tab, or semicolon. Default value: comma.
Type of output files to generate. Possible values: SD, text, or both. Default value: text.
Overwrite existing files
Maximum percent of most similar database molecules to find for each reference molecule or set of reference molecules based on IndividualReference or MultipleReferences value of similarity search -m, --mode option. Default: 1 percent of database molecules. Valid values: non-zero values between 0 and 100.
This option is ignored during NumOfSimilar value of --SimilarCountMode option.
During PercentSimilar value of --SimilarCountMode option, the number of molecules in DatabaseFingerprintsFile is counted and number of similar molecules correspond to --PercentSimilarMolecules of the total number of database molecules.
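The conversion from a percentage to a molecule count can be sketched as follows. The rounding behavior is not stated in this document, so rounding down with a minimum of one molecule is an assumption, as is treating 100 as a valid upper bound. The function name is hypothetical.

```python
def num_similar_from_percent(total_database_molecules: int,
                             percent: float = 1.0) -> int:
    """Translate --PercentSimilarMolecules into a number of molecules."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    count = int(total_database_molecules * percent / 100)  # assumed: round down
    return max(1, count)                                   # assumed: at least one


print(num_similar_from_percent(2500))       # 1 percent of 2500 molecules
print(num_similar_from_percent(200, 2.5))   # 2.5 percent of 200 molecules
```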
This option is -s, --SearchMode dependent: It corresponds to dissimilar molecules during DissimilaritySearch value of -s, --SearchMode option.
Precision of calculated similarity values for comparison and generating output files. Default: up to 2 decimal places. Valid values: positive integers.
Put quote around column values in output CSV/TSV text file. Possible values: Yes or No. Default value: Yes.
Specify how columns are identified in reference fingerprints TextFile: using column number or column label. Possible values: ColNum or ColLabel. Default value: ColNum.
This value is --ReferenceColMode mode specific. It specifies column to use for retrieving compound ID from reference fingerprints TextFile during similarity and dissimilarity search for output SD and CSV/TSV text files. Possible values: col number or col label. Default value: first column containing the word compoundID in its column label or sequentially generated IDs.
Specify compound ID prefix to use during sequential generation of compound IDs for reference fingerprints SDFile and TextFile. Default value: Cmpd. The default value generates compound IDs which look like Cmpd<Number>.
For reference fingerprints SDFile, this value is only used during LabelPrefix | MolNameOrLabelPrefix values of --ReferenceCompoundIDMode option; otherwise, it's ignored.
Examples for LabelPrefix or MolNameOrLabelPrefix value of --ReferenceCompoundIDMode:
The values specified above generate compound IDs which correspond to Compound<Number> instead of default value of Cmpd<Number>.
Specify reference fingerprints SDFile datafield label for generating compound IDs. This value is only used during DataField value of --ReferenceCompoundIDMode option.
Examples for DataField value of --ReferenceCompoundIDMode:
Specify how to generate compound IDs from reference fingerprints SDFile during similarity and dissimilarity search for output SD and CSV/TSV text files: use a SDFile datafield value; use molname line from SDFile; generate a sequential ID with specific prefix; use combination of both MolName and LabelPrefix with usage of LabelPrefix values for empty molname lines.
Possible values: DataField | MolName | LabelPrefix | MolNameOrLabelPrefix. Default: LabelPrefix.
For MolNameOrLabelPrefix value of --ReferenceCompoundIDMode, molname line in SDFile takes precedence over sequential compound IDs generated using LabelPrefix and only empty molname values are replaced with sequential compound IDs.
This value is --ReferenceColMode specific. It specifies fingerprints column to use during similarity and dissimilarity search for reference fingerprints TextFile. Possible values: col number or col label. Default value: first column containing the word Fingerprints in its column label.
Fingerprints field label to use during similarity and dissimilarity search for reference fingerprints SDFile. Default value: first data field label containing the word Fingerprints in its label.
New file name is generated using the root: <Root>.<Ext>. Default for new file name: <ReferenceFileName>SimilaritySearching.<Ext>. The output file type determines <Ext> value. The sdf, csv, and tsv <Ext> values are used for SD, comma/semicolon, and tab delimited text files respectively.
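The naming rule can be illustrated with a small sketch. It assumes that <ReferenceFileName> means the reference file name without its directory and extension, and that option values are matched exactly as written here; the function itself is hypothetical.

```python
import os


def output_file_names(reference_file, root=None,
                      output="text", out_delim="comma"):
    """Return the output file names implied by --root, --output and --OutDelim."""
    if root is None:
        # Assumption: default root is the reference file name minus extension.
        base = os.path.splitext(os.path.basename(reference_file))[0]
        root = f"{base}SimilaritySearching"
    names = []
    if output in ("SD", "both"):
        names.append(f"{root}.sdf")
    if output in ("text", "both"):
        # csv covers both comma and semicolon delimiters; tsv is tab only.
        ext = "tsv" if out_delim == "tab" else "csv"
        names.append(f"{root}.{ext}")
    return names


print(output_file_names("ReferenceSample.fpf"))
print(output_file_names("ReferenceSample.fpf", root="Hits",
                        output="both", out_delim="tab"))
```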
Specify how to find molecules from database molecules for individual reference molecules or set of reference molecules: Find similar molecules or dissimilar molecules from database molecules. Possible values: SimilaritySearch | DissimilaritySearch. Default value: SimilaritySearch.
During DissimilaritySearch value of -s, --SearchMode option, the meaning of the following options is switched and they correspond to dissimilar molecules instead of similar molecules: --SimilarCountMode, -n, --NumOfSimilarMolecules, --PercentSimilarMolecules, -k, --kNN.
Specify method used to count similar molecules found from database molecules for individual reference molecules or set of reference molecules: Find number of similar molecules or percent of similar molecules from database molecules. Possible values: NumOfSimilar | PercentSimilar. Default value: NumOfSimilar.
The values for number of similar molecules and percent similar molecules are specified using options -n, --NumOfSimilarMolecules and -p, --PercentSimilarMolecules.
This option is -s, --SearchMode dependent: It corresponds to dissimilar molecules during DissimilaritySearch value of -s, --SearchMode option.
Similarity cutoff value to use during comparison of similarity value between a pair of database and reference molecules calculated by similarity comparison methods for fingerprints bit-vector or vector strings data values. Possible values: Any valid number. Default value: 0.75.
The comparison value between a pair of database and reference molecule must meet the cutoff criterion as shown below:
This option is ignored during No value of --GroupFusionApplyCutoff for MultipleReferences -m, --mode.
This option is -s, --SearchMode dependent: It corresponds to dissimilar molecules during DissimilaritySearch value of -s, --SearchMode option.
Specify what similarity or distance coefficient to use for calculating similarity between fingerprint vector strings data values in ReferenceFingerprintsFile and DatabaseFingerprintsFile during similarity search. Possible values: TanimotoSimilarity | ... | ManhattanDistance | .... Default value: TanimotoSimilarity.
The value of -v, --VectorComparisonMode, in conjunction with --VectorComparisonFormulism, decides which type of similarity and distance coefficient formulism gets used.
The current release supports the following similarity and distance coefficients: CosineSimilarity, CzekanowskiSimilarity, DiceSimilarity, OchiaiSimilarity, JaccardSimilarity, SorensonSimilarity, TanimotoSimilarity, CityBlockDistance, EuclideanDistance, HammingDistance, ManhattanDistance, SoergelDistance. These similarity and distance coefficients are described below.
FingerprintsVector.pm module, used to calculate similarity and distance coefficients, provides support to perform comparison between vectors containing three different types of values:
Type I: OrderedNumericalValues
Type II: UnorderedNumericalValues
Type III: AlphaNumericalValues
Before performing similarity or distance calculations between vectors containing UnorderedNumericalValues or AlphaNumericalValues, the vectors are transformed into vectors containing unique OrderedNumericalValues using value IDs for UnorderedNumericalValues and the values themselves for AlphaNumericalValues.
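One way to picture this transformation is to align both vectors on the union of their keys, filling in zero where a key is absent. The sketch below is an interpretation rather than the FingerprintsVector.pm implementation: vectors are modeled as dictionaries mapping a value ID (or the alphanumerical value itself) to a count, and sorted key order is an arbitrary choice made only to keep the alignment deterministic.

```python
def to_ordered_vectors(values_a: dict, values_b: dict):
    """Align two key -> count mappings onto a shared, ordered key list."""
    keys = sorted(set(values_a) | set(values_b))
    ordered_a = [values_a.get(key, 0) for key in keys]
    ordered_b = [values_b.get(key, 0) for key in keys]
    return ordered_a, ordered_b, keys


# Hypothetical alphanumerical values with their counts.
a = {"C.H": 3, "N.O": 1}
b = {"C.H": 2, "O.H": 4}
xa, xb, keys = to_ordered_vectors(a, b)
print(keys, xa, xb)
```

After alignment both vectors have the same size and the same meaning per position, which is what the OrderedNumericalValues formulas require.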
Three forms of similarity and distance calculation between two vectors, specified using --VectorComparisonFormulism option, are supported: AlgebraicForm, BinaryForm or SetTheoreticForm.
For BinaryForm, the ordered list of processed final vector values containing the value or count of each unique value type is simply converted into a binary vector containing 1s and 0s corresponding to presence or absence of values before calculating similarity or distance between two vectors.
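To make the difference between the three forms concrete, the sketch below evaluates the Tanimoto/Jaccard formulas listed in this section for each form on the same pair of ordered count vectors. The function names are illustrative, and treating any non-zero value as "present" in the binary conversion is an assumption consistent with the description above.

```python
def to_binary(x):
    """Convert counts to presence/absence flags."""
    return [1 if value else 0 for value in x]


def tanimoto_algebraic(xa, xb):
    # SUM(Xai * Xbi) / (SUM(Xai ** 2) + SUM(Xbi ** 2) - SUM(Xai * Xbi))
    dot = sum(a * b for a, b in zip(xa, xb))
    return dot / (sum(a * a for a in xa) + sum(b * b for b in xb) - dot)


def tanimoto_binary(xa, xb):
    # Nc / (Na + Nb - Nc) on the binary versions of the vectors
    ba, bb = to_binary(xa), to_binary(xb)
    na, nb = sum(ba), sum(bb)
    nc = sum(a & b for a, b in zip(ba, bb))
    return nc / (na + nb - nc)


def tanimoto_set_theoretic(xa, xb):
    # SUM(MIN(Xai, Xbi)) / (SUM(Xai) + SUM(Xbi) - SUM(MIN(Xai, Xbi)))
    intersection = sum(min(a, b) for a, b in zip(xa, xb))
    return intersection / (sum(xa) + sum(xb) - intersection)


xa = [2, 1, 0]
xb = [1, 0, 3]
print(tanimoto_algebraic(xa, xb))      # 2 / 13
print(tanimoto_binary(xa, xb))         # 1 / 3
print(tanimoto_set_theoretic(xa, xb))  # 1 / 6
```

The three forms generally give different values for the same vectors, so results are only comparable when the same formulism is used throughout a search.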
For two fingerprint vectors A and B of same size containing OrderedNumericalValues, let:
For SetTheoreticForm of calculation between two vectors, let:
For BinaryForm of calculation between two vectors, let:
Additionally, for BinaryForm various values also correspond to:
Various similarity and distance coefficients [ Ref 40, Ref 62, Ref 64 ] for a pair of vectors A and B in AlgebraicForm, BinaryForm and SetTheoreticForm are defined as follows:
CityBlockDistance: ( same as HammingDistance and ManhattanDistance)
AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )
BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
CosineSimilarity: ( same as OchiaiSimilarity)
AlgebraicForm: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2) * SUM ( Xbi ** 2) )
BinaryForm: Nc / SQRT ( Na * Nb)
SetTheoreticForm: | SetIntersectionXaXb | / SQRT ( |Xa| * |Xb| ) = SUM ( MIN ( Xai, Xbi ) ) / SQRT ( SUM ( Xai ) * SUM ( Xbi ) )
CzekanowskiSimilarity: ( same as DiceSimilarity and SorensonSimilarity)
AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) + SUM ( Xbi **2 ) )
BinaryForm: 2 * Nc / ( Na + Nb )
SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
DiceSimilarity: ( same as CzekanowskiSimilarity and SorensonSimilarity)
AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) + SUM ( Xbi **2 ) )
BinaryForm: 2 * Nc / ( Na + Nb )
SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
EuclideanDistance:
AlgebraicForm: SQRT ( SUM ( ( ( Xai - Xbi ) ** 2 ) ) )
BinaryForm: SQRT ( ( Na - Nc ) + ( Nb - Nc ) ) = SQRT ( Na + Nb - 2 * Nc )
SetTheoreticForm: SQRT ( | SetDifferenceXaXb | - | SetIntersectionXaXb | ) = SQRT ( SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) )
HammingDistance: ( same as CityBlockDistance and ManhattanDistance)
AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )
BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
JaccardSimilarity: ( same as TanimotoSimilarity)
AlgebraicForm: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) - SUM ( Xai * Xbi ) )
BinaryForm: Nc / ( ( Na - Nc ) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc )
SetTheoreticForm: | SetIntersectionXaXb | / | SetDifferenceXaXb | = SUM ( MIN ( Xai, Xbi ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )
ManhattanDistance: ( same as CityBlockDistance and HammingDistance)
AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )
BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
OchiaiSimilarity: ( same as CosineSimilarity)
AlgebraicForm: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2 ) * SUM ( Xbi ** 2 ) )
BinaryForm: Nc / SQRT ( Na * Nb )
SetTheoreticForm: | SetIntersectionXaXb | / SQRT ( |Xa| * |Xb| ) = SUM ( MIN ( Xai, Xbi ) ) / SQRT ( SUM ( Xai ) * SUM ( Xbi ) )
SorensonSimilarity: ( same as CzekanowskiSimilarity and DiceSimilarity)
AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) )
BinaryForm: 2 * Nc / ( Na + Nb )
SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
SoergelDistance:
AlgebraicForm: SUM ( ABS ( Xai - Xbi ) ) / SUM ( MAX ( Xai, Xbi ) )
BinaryForm: 1 - Nc / ( Na + Nb - Nc ) = ( Na + Nb - 2 * Nc ) / ( Na + Nb - Nc )
SetTheoreticForm: ( | SetDifferenceXaXb | - | SetIntersectionXaXb | ) / | SetDifferenceXaXb | = ( SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )
TanimotoSimilarity: ( same as JaccardSimilarity)
AlgebraicForm: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) - SUM ( Xai * Xbi ) )
BinaryForm: Nc / ( ( Na - Nc ) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc )
SetTheoreticForm: | SetIntersectionXaXb | / | SetDifferenceXaXb | = SUM ( MIN ( Xai, Xbi ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )
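As a quick numeric check of the definitions above, the following Python sketch evaluates a few of the coefficients for two small count vectors. It is an illustration only and not the script's implementation: the function names and the sample vectors are made up for this example, and the vectors are assumed to be non-negative, of equal length, and not all zero.

```python
import math


def city_block_distance(xa, xb):
    # AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )
    return sum(abs(a - b) for a, b in zip(xa, xb))


def euclidean_distance(xa, xb):
    # AlgebraicForm: SQRT ( SUM ( ( Xai - Xbi ) ** 2 ) )
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xa, xb)))


def cosine_similarity(xa, xb):
    # AlgebraicForm: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2 ) * SUM ( Xbi ** 2 ) )
    dot = sum(a * b for a, b in zip(xa, xb))
    return dot / math.sqrt(sum(a * a for a in xa) * sum(b * b for b in xb))


def dice_similarity(xa, xb):
    # AlgebraicForm: 2 * SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) )
    dot = sum(a * b for a, b in zip(xa, xb))
    return 2 * dot / (sum(a * a for a in xa) + sum(b * b for b in xb))


def tanimoto_similarity(xa, xb):
    # AlgebraicForm: dot / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) - dot )
    dot = sum(a * b for a, b in zip(xa, xb))
    return dot / (sum(a * a for a in xa) + sum(b * b for b in xb) - dot)


def soergel_distance(xa, xb):
    # AlgebraicForm: SUM ( ABS ( Xai - Xbi ) ) / SUM ( MAX ( Xai, Xbi ) )
    return city_block_distance(xa, xb) / sum(max(a, b) for a, b in zip(xa, xb))


def tanimoto_similarity_set_theoretic(xa, xb):
    # SetTheoreticForm: SUM ( MIN ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ) )
    common = sum(min(a, b) for a, b in zip(xa, xb))
    return common / (sum(xa) + sum(xb) - common)


# Hypothetical count vectors, chosen only to keep the arithmetic easy to follow.
xa = [1, 2, 0, 3]
xb = [1, 0, 2, 3]

results = {
    "CityBlockDistance": city_block_distance(xa, xb),
    "EuclideanDistance": euclidean_distance(xa, xb),
    "CosineSimilarity": cosine_similarity(xa, xb),
    "DiceSimilarity": dice_similarity(xa, xb),
    "TanimotoSimilarity (AlgebraicForm)": tanimoto_similarity(xa, xb),
    "TanimotoSimilarity (SetTheoreticForm)": tanimoto_similarity_set_theoretic(xa, xb),
    "SoergelDistance": soergel_distance(xa, xb),
}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

For these vectors the AlgebraicForm Tanimoto value is 10 / 18 while the SetTheoreticForm value is 4 / 8, so the two forms need not agree on count data. The Soergel distance comes out as 0.5, which equals 1 minus the SetTheoreticForm Tanimoto value, as the SetTheoreticForm expression for SoergelDistance suggests.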
Specify the fingerprints vector comparison formalism to use for calculating similarity and distance coefficients with -v, --VectorComparisonMode. Possible values: AlgebraicForm | BinaryForm | SetTheoreticForm. Default value: AlgebraicForm.
For fingerprints vector strings containing AlphaNumericalValues data values (ExtendedConnectivityFingerprints, AtomNeighborhoodsFingerprints and so on), all three formalisms yield the same value in similarity and distance calculations.
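To make the choice of form concrete, here is a small Python sketch of TanimotoSimilarity in all three forms. It is not taken from the script: the function names are hypothetical, Na, Nb and Nc are assumed to be the numbers of bits set in A, in B, and in both, and the sketch does not model AlphaNumericalValues vectors. It only shows that the three forms coincide for 0/1 vectors and can differ for count vectors.

```python
def tanimoto_algebraic(xa, xb):
    # AlgebraicForm: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) - SUM ( Xai * Xbi ) )
    dot = sum(a * b for a, b in zip(xa, xb))
    return dot / (sum(a * a for a in xa) + sum(b * b for b in xb) - dot)


def tanimoto_binary(xa, xb):
    # BinaryForm: Nc / ( Na + Nb - Nc ); only meaningful for 0/1 vectors.
    # Assumption: Na, Nb = bits set in A, B; Nc = bits set in both.
    na = sum(1 for a in xa if a)
    nb = sum(1 for b in xb if b)
    nc = sum(1 for a, b in zip(xa, xb) if a and b)
    return nc / (na + nb - nc)


def tanimoto_set_theoretic(xa, xb):
    # SetTheoreticForm: SUM ( MIN ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ) )
    common = sum(min(a, b) for a, b in zip(xa, xb))
    return common / (sum(xa) + sum(xb) - common)


# Hypothetical sample data: one pair of 0/1 vectors, one pair of count vectors.
bits_a, bits_b = [1, 1, 0, 1, 0], [1, 0, 0, 1, 1]
counts_a, counts_b = [1, 2, 0, 3], [1, 0, 2, 3]

binary_values = (
    tanimoto_algebraic(bits_a, bits_b),
    tanimoto_binary(bits_a, bits_b),
    tanimoto_set_theoretic(bits_a, bits_b),
)
count_values = (
    tanimoto_algebraic(counts_a, counts_b),
    tanimoto_set_theoretic(counts_a, counts_b),
)

print("0/1 vectors (algebraic, binary, set-theoretic):", binary_values)
print("count vectors (algebraic, set-theoretic):", count_values)
```

For 0/1 vectors, Xai * Xbi equals MIN ( Xai, Xbi ) and Xai ** 2 equals Xai, which is why the three expressions reduce to the same number there. Once elements can exceed 1, the products and minima diverge and the forms give different values.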
Location of working directory. Default: current directory.
To perform a similarity search using the Tanimoto coefficient, treating all reference molecules as a set, to find the 10 most similar database molecules with application of the Max group fusion rule and a similarity cutoff, using supported fingerprints strings data in SD fingerprints files present in data fields with the Fingerprint substring in their labels, and create a ReferenceFPHexSimilaritySearching.csv file containing sequentially generated database compound IDs with the Cmpd prefix, type:
To perform a similarity search using the Tanimoto coefficient, treating all reference molecules as a set, to find the 10 most similar database molecules with application of the Max group fusion rule and a similarity cutoff, using supported fingerprints strings data in FP fingerprints files, and create a SimilaritySearchResults.csv file containing database compound IDs retrieved from the FP file, type:
To perform a similarity search using the Tanimoto coefficient, treating all reference molecules as a set, to find the 10 most similar database molecules with application of the Max group fusion rule and a similarity cutoff, using supported fingerprints strings data in text fingerprints files present in columns with the Fingerprint substring in their names, and create a ReferenceFPHexSimilaritySearching.csv file containing database compound IDs retrieved from the column whose name contains the CompoundID substring, or sequentially generated compound IDs, type:
To perform a similarity search using the Tanimoto coefficient, treating reference molecules as individual molecules, to find the 10 most similar database molecules for each reference molecule with application of a similarity cutoff, using supported fingerprints strings data in SD fingerprints files present in data fields with the Fingerprint substring in their labels, and create a ReferenceFPHexSimilaritySearching.csv file containing sequentially generated reference and database compound IDs with the Cmpd prefix, type:
To perform a similarity search using the Tanimoto coefficient, treating reference molecules as individual molecules, to find the 10 most similar database molecules for each reference molecule with application of a similarity cutoff, using supported fingerprints strings data in FP fingerprints files, and create a ReferenceFPHexSimilaritySearching.csv file containing reference and database compound IDs retrieved from the FP files, type:
To perform a similarity search using the Tanimoto coefficient, treating reference molecules as individual molecules, to find the 10 most similar database molecules for each reference molecule with application of a similarity cutoff, using supported fingerprints strings data in text fingerprints files present in columns with the Fingerprint substring in their names, and create a ReferenceFPHexSimilaritySearching.csv file containing reference and database compound IDs retrieved from the column whose name contains the CompoundID substring, or sequentially generated compound IDs, type:
To perform a dissimilarity search using the Tanimoto coefficient, treating all reference molecules as a set, to find the 10 most dissimilar database molecules with application of the Max group fusion rule and a similarity cutoff, using supported fingerprints strings data in SD fingerprints files present in data fields with the Fingerprint substring in their labels, and create a ReferenceFPHexSimilaritySearching.csv file containing sequentially generated database compound IDs with the Cmpd prefix, type:
To perform a similarity search using the CityBlock distance, treating reference molecules as individual molecules, to find the 10 most similar database molecules for each reference molecule with application of a distance cutoff, using supported vector fingerprints strings data in SD fingerprints files present in data fields with the Fingerprint substring in their labels, and create a ReferenceFPHexSimilaritySearching.csv file containing sequentially generated reference and database compound IDs with the Cmpd prefix, type:
To perform a similarity search using the Tanimoto coefficient, treating all reference molecules as a set, to find the 100 most similar database molecules with application of the Mean group fusion rule to the top 10 reference molecules within a similarity cutoff of 0.75, using supported fingerprints strings data in FP fingerprints files, and create a ReferenceFPHexSimilaritySearching.csv file containing database compound IDs retrieved from the FP file, type:
To perform a similarity search using the Tanimoto coefficient, treating reference molecules as individual molecules, to find the 2 percent most similar database molecules for each reference molecule with application of a similarity cutoff of 0.85, using supported fingerprints strings data in text fingerprints files present in specific columns, and create a ReferenceFPHexSimilaritySearching.csv file containing reference and database compound IDs retrieved from specific columns, type:
To perform a similarity search using the Tanimoto coefficient, treating reference molecules as individual molecules, to find the 50 most similar database molecules for each reference molecule with application of a similarity cutoff of 0.85, using supported fingerprints strings data in SD fingerprints files present in specific data fields, and create both ReferenceFPHexSimilaritySearching.csv and ReferenceFPHexSimilaritySearching.sdf files containing reference and database compound IDs retrieved from specific data fields, type:
To perform a similarity search using the Tanimoto coefficient, treating reference molecules as individual molecules, to find the 1 percent most similar database molecules for each reference molecule with application of a similarity cutoff, using supported fingerprints strings data in SD fingerprints files present in specific data field labels, and create both ReferenceFPHexSimilaritySearching.csv and ReferenceFPHexSimilaritySearching.sdf files containing reference and database compound IDs retrieved from specific data field labels, along with other specific data for database molecules, type:
InfoFingerprintsFiles.pl, SimilarityMatricesFingerprints.pl, AtomNeighborhoodsFingerprints.pl, ExtendedConnectivityFingerprints.pl, MACCSKeysFingerprints.pl, PathLengthFingerprints.pl, TopologicalAtomPairsFingerprints.pl, TopologicalAtomTorsionsFingerprints.pl, TopologicalPharmacophoreAtomPairsFingerprints.pl, TopologicalPharmacophoreAtomTripletsFingerprints.pl
Copyright (C) 2024 Manish Sud. All rights reserved.
This file is part of MayaChemTools.
MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.