SimilarityMatricesFingerprints.pl - Calculate similarity matrices using fingerprints strings data in SD, FP and CSV/TSV text file(s)
SimilarityMatricesFingerprints.pl SDFile(s) FPFile(s) TextFile(s)...
SimilarityMatricesFingerprints.pl [--alpha number] [--beta number] [-b, --BitVectorComparisonMode All | "TanimotoSimilarity,[ TverskySimilarity, ... ]"] [-c, --ColMode ColNum | ColLabel] [--CompoundIDCol col number | col name] [--CompoundIDPrefix text] [--CompoundIDField DataFieldName] [--CompoundIDMode DataField | MolName | LabelPrefix | MolNameOrLabelPrefix] [-d, --detail InfoLevel] [-f, --fast] [--FingerprintsCol col number | col name] [--FingerprintsField FieldLabel] [-h, --help] [--InDelim comma | semicolon] [--InputDataMode LoadInMemory | ScanFile] [-m, --mode AutoDetect | FingerprintsBitVectorString | FingerprintsVectorString] [--OutDelim comma | tab | semicolon] [--OutMatrixFormat RowsAndColumns | IDPairsAndValue] [--OutMatrixType FullMatrix | UpperTriangularMatrix | LowerTriangularMatrix] [-o, --overwrite] [-p, --precision number] [-q, --quote Yes | No] [-r, --root RootName] [-v, --VectorComparisonMode All | "TanimotoSimilarity, [ ManhattanDistance, ...]"] [--VectorComparisonFormulism All | "AlgebraicForm, [BinaryForm, SetTheoreticForm]"] [-w, --WorkingDir dirname] SDFile(s) FPFile(s) TextFile(s)...
Calculate similarity matrices using fingerprint bit-vector or vector strings data in SD, FP and CSV/TSV text file(s) and generate CSV/TSV text file(s) containing values for specified similarity and distance coefficients.
The scripts SimilarityMatrixSDFiles.pl and SimilarityMatrixTextFiles.pl have been removed from the current release of MayaChemTools and their functionality merged with this script.
The valid SDFile extensions are .sdf and .sd. All SD files in the current directory can be specified either by *.sdf or the current directory name.
The valid FPFile extensions are .fpf and .fp. All FP files in the current directory can be specified either by *.fpf or the current directory name.
The valid TextFile extensions are .csv and .tsv for comma/semicolon and tab delimited text files respectively. All other file names are ignored. All text files in the current directory can be specified by *.csv, *.tsv, or the current directory name. The --InDelim option determines the format of TextFile(s). Any file which doesn't correspond to the format indicated by the --InDelim option is ignored.
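The extension rules above can be summarized in a short Python sketch. The function name classify_input_file and the case-insensitive comparison are assumptions made for illustration; they are not taken from the script itself, whose handling may differ.

```python
from pathlib import Path

# Extension groups as listed in the text above.
SD_EXTENSIONS = {".sdf", ".sd"}
FP_EXTENSIONS = {".fpf", ".fp"}
TEXT_EXTENSIONS = {".csv", ".tsv"}


def classify_input_file(name):
    """Return 'SD', 'FP', 'Text', or None for a file name that would be ignored."""
    # Assumption: extensions are compared case-insensitively.
    ext = Path(name).suffix.lower()
    if ext in SD_EXTENSIONS:
        return "SD"
    if ext in FP_EXTENSIONS:
        return "FP"
    if ext in TEXT_EXTENSIONS:
        return "Text"
    return None
```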
Example of FP file containing fingerprints bit-vector string data:
Example of FP file containing fingerprints vector string data:
Example of SD file containing fingerprints bit-vector string data:
Example of CSV Text file containing fingerprints bit-vector string data:
The current release of MayaChemTools supports the following types of fingerprint bit-vector and vector strings:
Value of alpha parameter for calculating Tversky similarity coefficient specified for -b, --BitVectorComparisonMode option. It corresponds to weights assigned for bits set to "1" in a pair of fingerprint bit-vectors during the calculation of similarity coefficient. Possible values: 0 to 1. Default value: <0.5>.
Value of beta parameter for calculating WeightedTanimoto and WeightedTversky similarity coefficients specified for -b, --BitVectorComparisonMode option. It weights the contribution of bits set to "1" by beta and the contribution of bits set to "0" by 1 - beta during the calculation of similarity coefficients. Possible values: 0 to 1. Default value of <1> makes WeightedTanimoto and WeightedTversky equivalent to Tanimoto and Tversky.
Specify what similarity coefficients to use for calculating similarity matrices for fingerprints bit-vector strings data values in TextFile(s): calculate similarity matrices for all supported similarity coefficients or specify a comma delimited list of similarity coefficients. Possible values: All | "TanimotoSimilarity,[TverskySimilarity,...]". Default: TanimotoSimilarity.
All uses complete list of supported similarity coefficients: BaroniUrbaniSimilarity, BuserSimilarity, CosineSimilarity, DiceSimilarity, DennisSimilarity, ForbesSimilarity, FossumSimilarity, HamannSimilarity, JaccardSimilarity, Kulczynski1Similarity, Kulczynski2Similarity, MatchingSimilarity, McConnaugheySimilarity, OchiaiSimilarity, PearsonSimilarity, RogersTanimotoSimilarity, RussellRaoSimilarity, SimpsonSimilarity, SkoalSneath1Similarity, SkoalSneath2Similarity, SkoalSneath3Similarity, TanimotoSimilarity, TverskySimilarity, YuleSimilarity, WeightedTanimotoSimilarity, WeightedTverskySimilarity. These similarity coefficients are described below.
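For two fingerprint bit-vectors A and B of same size, let:
Na = Number of bits set to "1" in A
Nb = Number of bits set to "1" in B
Nc = Number of bits set to "1" in both A and B
Nd = Number of bits set to "0" in both A and B
Nt = Number of bits set to "1" or "0" in A or B (size of A or B) = Na + Nb - Nc + Nd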
Then, various similarity coefficients [ Ref. 40 - 42 ] for a pair of bit-vectors A and B are defined as follows:
BaroniUrbaniSimilarity: ( SQRT( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as Buser )
BuserSimilarity: ( SQRT ( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as BaroniUrbani )
CosineSimilarity: Nc / SQRT ( Na * Nb ) (same as Ochiai)
DiceSimilarity: (2 * Nc) / ( Na + Nb )
DennisSimilarity: ( Nc * Nd - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Nt * Na * Nb)
ForbesSimilarity: ( Nt * Nc ) / ( Na * Nb )
FossumSimilarity: ( Nt * ( ( Nc - 1/2 ) ** 2 ) ) / ( Na * Nb )
HamannSimilarity: ( ( Nc + Nd ) - ( Na - Nc ) - ( Nb - Nc ) ) / Nt
JaccardSimilarity: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Tanimoto)
Kulczynski1Similarity: Nc / ( ( Na - Nc ) + ( Nb - Nc) ) = Nc / ( Na + Nb - 2Nc )
Kulczynski2Similarity: ( ( Nc / 2 ) * ( 2 * Nc + ( Na - Nc ) + ( Nb - Nc) ) ) / ( ( Nc + ( Na - Nc ) ) * ( Nc + ( Nb - Nc ) ) ) = 0.5 * ( Nc / Na + Nc / Nb )
MatchingSimilarity: ( Nc + Nd ) / Nt
McConnaugheySimilarity: ( Nc ** 2 - ( Na - Nc ) * ( Nb - Nc) ) / ( Na * Nb )
OchiaiSimilarity: Nc / SQRT ( Na * Nb ) (same as Cosine)
PearsonSimilarity: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Na * Nb * ( Na - Nc + Nd ) * ( Nb - Nc + Nd ) )
RogersTanimotoSimilarity: ( Nc + Nd ) / ( ( Na - Nc) + ( Nb - Nc) + Nt) = ( Nc + Nd ) / ( Na + Nb - 2Nc + Nt)
RussellRaoSimilarity: Nc / Nt
SimpsonSimilarity: Nc / MIN ( Na, Nb)
SkoalSneath1Similarity: Nc / ( Nc + 2 * ( Na - Nc) + 2 * ( Nb - Nc) ) = Nc / ( 2 * Na + 2 * Nb - 3 * Nc )
SkoalSneath2Similarity: ( 2 * Nc + 2 * Nd ) / ( Nc + Nd + Nt )
SkoalSneath3Similarity: ( Nc + Nd ) / ( ( Na - Nc ) + ( Nb - Nc ) ) = ( Nc + Nd ) / ( Na + Nb - 2 * Nc )
TanimotoSimilarity: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Jaccard)
TverskySimilarity: Nc / ( alpha * ( Na - Nc ) + ( 1 - alpha) * ( Nb - Nc) + Nc ) = Nc / ( alpha * ( Na - Nb ) + Nb)
YuleSimilarity: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / ( ( Nc * Nd ) + ( ( Na - Nc ) * ( Nb - Nc ) ) )
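As an illustration of how the counts feed into the formulas above, here is a minimal Python sketch for a few of the coefficients. It assumes Na and Nb count the bits set to "1" in A and B, Nc the bits set to "1" in both, Nd the bits set to "0" in both, and Nt the vector size. The function names and the use of "0"/"1" character strings are choices made for this example, not the script's own implementation, and degenerate cases such as all-zero vectors are not handled.

```python
from math import sqrt


def bit_counts(a, b):
    """Return (Na, Nb, Nc, Nd, Nt) for two equal-length strings of '0' and '1'."""
    if len(a) != len(b):
        raise ValueError("bit-vectors must have the same size")
    na = a.count("1")
    nb = b.count("1")
    nc = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    nd = sum(1 for x, y in zip(a, b) if x == "0" and y == "0")
    return na, nb, nc, nd, len(a)


def tanimoto(na, nb, nc):
    return nc / (na + nb - nc)


def dice(na, nb, nc):
    return 2 * nc / (na + nb)


def cosine(na, nb, nc):
    return nc / sqrt(na * nb)


def matching(nc, nd, nt):
    return (nc + nd) / nt


def tversky(na, nb, nc, alpha=0.5):
    return nc / (alpha * (na - nc) + (1 - alpha) * (nb - nc) + nc)


A = "11010010"
B = "10011010"
na, nb, nc, nd, nt = bit_counts(A, B)   # (4, 4, 3, 3, 8)
print(tanimoto(na, nb, nc))             # 0.6
print(dice(na, nb, nc))                 # 0.75
print(matching(nc, nd, nt))             # 0.75
```

For this pair the identity Nt = Na + Nb - Nc + Nd holds (8 = 4 + 4 - 3 + 3), which is a quick sanity check on the counts.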
Values of Tanimoto/Jaccard and Tversky coefficients depend only on bits set to "1" in A and B. In order to take into account all bit positions, modified versions of Tanimoto [ Ref. 42 ] and Tversky [ Ref. 43 ] have been developed.
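Let:
Na' = Number of bits set to "0" in A
Nb' = Number of bits set to "0" in B
Nc' = Number of bits set to "0" in both A and B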
Tanimoto': Nc' / ( ( Na' - Nc') + ( Nb' - Nc' ) + Nc' ) = Nc' / ( Na' + Nb' - Nc' )
Tversky': Nc' / ( alpha * ( Na' - Nc' ) + ( 1 - alpha) * ( Nb' - Nc' ) + Nc' ) = Nc' / ( alpha * ( Na' - Nb' ) + Nb')
Then:
WeightedTanimotoSimilarity = beta * Tanimoto + (1 - beta) * Tanimoto'
WeightedTverskySimilarity = beta * Tversky + (1 - beta) * Tversky'
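The weighted form can be sketched in Python as follows. The sketch assumes that the primed counts Na', Nb' and Nc' are the analogues of Na, Nb and Nc computed over bits set to "0" rather than "1"; the function names are illustrative only, and vectors for which a denominator becomes zero are not handled.

```python
def tanimoto_from_counts(na, nb, nc):
    return nc / (na + nb - nc)


def weighted_tanimoto(a, b, beta=1.0):
    """beta * Tanimoto + (1 - beta) * Tanimoto' for two equal-length bit strings."""
    # Counts over bits set to "1".
    na, nb = a.count("1"), b.count("1")
    nc = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    # Assumed primed counts: the same quantities over bits set to "0".
    na0, nb0 = a.count("0"), b.count("0")
    nc0 = sum(1 for x, y in zip(a, b) if x == "0" and y == "0")
    return (beta * tanimoto_from_counts(na, nb, nc)
            + (1 - beta) * tanimoto_from_counts(na0, nb0, nc0))


A = "11110000"
B = "11000000"
print(weighted_tanimoto(A, B, beta=1.0))   # 0.5, identical to plain Tanimoto
print(weighted_tanimoto(A, B, beta=0.0))   # Tanimoto' only: 4 / 6
```

With beta = 1 only the "1" bits contribute, which is why the default reproduces the unweighted coefficient.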
Specify how columns are identified in TextFile(s): using column number or column label. Possible values: ColNum or ColLabel. Default value: ColNum.
This value is -c, --ColMode mode specific. It specifies input TextFile(s) column to use for generating compound ID for similarity matrices in output TextFile(s). Possible values: col number or col label. Default value: first column containing the word compoundID in its column label or sequentially generated IDs.
Specify compound ID prefix to use during sequential generation of compound IDs for input SDFile(s) and TextFile(s). Default value: Cmpd. The default value generates compound IDs which look like Cmpd<Number>.
For input SDFile(s), this value is only used during LabelPrefix | MolNameOrLabelPrefix values of --CompoundIDMode option; otherwise, it's ignored.
Examples for LabelPrefix or MolNameOrLabelPrefix value of --CompoundIDMode:
The values specified above generate compound IDs which correspond to Compound<Number> instead of default value of Cmpd<Number>.
Specify input SDFile(s) datafield label for generating compound IDs. This value is only used during DataField value of --CompoundIDMode option.
Examples for DataField value of --CompoundIDMode:
Specify how to generate compound IDs from input SDFile(s) for similarity matrix CSV/TSV text file(s): use a SDFile(s) datafield value; use molname line from SDFile(s); generate a sequential ID with specific prefix; use combination of both MolName and LabelPrefix with usage of LabelPrefix values for empty molname lines.
Possible values: DataField | MolName | LabelPrefix | MolNameOrLabelPrefix. Default: LabelPrefix.
For MolNameOrLabelPrefix value of --CompoundIDMode, molname line in SDFile(s) takes precedence over sequential compound IDs generated using LabelPrefix and only empty molname values are replaced with sequential compound IDs.
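The four --CompoundIDMode values can be pictured with the following Python sketch. The function compound_id and its parameters are hypothetical names introduced for illustration. How the script treats a missing data field or an empty molname line in MolName mode is not described here, so the sketch simply returns an empty string in those cases.

```python
def compound_id(mode, number, mol_name="", data_fields=None,
                field_name=None, prefix="Cmpd"):
    """Return a compound ID for one SD record under a given --CompoundIDMode value."""
    if mode == "DataField":
        # Value of the data field named by --CompoundIDField (empty if absent).
        return (data_fields or {}).get(field_name, "")
    if mode == "MolName":
        return mol_name
    if mode == "LabelPrefix":
        # Sequential ID such as Cmpd1, Cmpd2, ...
        return f"{prefix}{number}"
    if mode == "MolNameOrLabelPrefix":
        # Molname wins; only empty molname lines fall back to the sequential ID.
        return mol_name if mol_name.strip() else f"{prefix}{number}"
    raise ValueError(f"unknown CompoundIDMode: {mode}")
```

For example, compound_id("MolNameOrLabelPrefix", 3, mol_name="") yields Cmpd3, while the same call with mol_name="Aspirin" yields Aspirin.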
Level of information to print about lines being ignored. Default: 1. Possible values: 1, 2 or 3.
In -f, --fast mode, fingerprints columns specified using --FingerprintsCol for TextFile(s) and --FingerprintsField for SDFile(s) are assumed to contain valid fingerprints data and no checking is performed before calculating similarity matrices. By default, fingerprints data is validated before computing pairwise similarity and distance coefficients.
This value is -c, --ColMode specific. It specifies fingerprints column to use during calculation of similarity matrices for TextFile(s). Possible values: col number or col label. Default value: first column containing the word Fingerprints in its column label.
Fingerprints field label to use during calculation of similarity matrices for SDFile(s). Default value: first data field label containing the word Fingerprints in its label.
Print this help message.
Input delimiter for CSV TextFile(s). Possible values: comma or semicolon. Default value: comma. For TSV files, this option is ignored and tab is used as a delimiter.
Specify how fingerprints bit-vector or vector strings data from SD, FP and CSV/TSV fingerprint file(s) is processed: Retrieve, process and load all available fingerprints data in memory; Retrieve and process data for fingerprints one at a time. Possible values: LoadInMemory | ScanFile. Default: LoadInMemory.
During LoadInMemory value of --InputDataMode, fingerprints bit-vector or vector strings data from input file is retrieved, processed, and loaded into memory all at once as fingerprints objects for generation of similarity matrices.
During ScanFile value of --InputDataMode, multiple passes over the input fingerprints file are performed to retrieve and process fingerprints bit-vector or vector strings data one at a time to generate fingerprints objects used during generation of similarity matrices. A temporary copy of the input fingerprints file is made at the start and deleted after generating the matrices.
ScanFile value of --InputDataMode allows processing of arbitrarily large fingerprints files without any additional memory requirement.
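The trade-off between the two modes can be illustrated with a small Python sketch. It assumes a hypothetical two-column CSV file without a header (compound ID, then a string of "0" and "1" characters) and uses plain Tanimoto similarity; none of the function names come from the script, and the temporary file copy mentioned above is left out.

```python
import csv


def read_fingerprints(path):
    """Yield (compound_id, bit_string) pairs from a two-column CSV file."""
    with open(path, newline="") as handle:
        for compound_id, bits in csv.reader(handle):
            yield compound_id, bits


def tanimoto(a, b):
    na, nb = a.count("1"), b.count("1")
    nc = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    return nc / (na + nb - nc)


def pairs_load_in_memory(path):
    # All records are read once and kept in memory.
    records = list(read_fingerprints(path))
    for id1, fp1 in records:
        for id2, fp2 in records:
            yield id1, id2, tanimoto(fp1, fp2)


def pairs_scan_file(path):
    # Only two records are held at a time; the file is re-read once per row.
    for id1, fp1 in read_fingerprints(path):
        for id2, fp2 in read_fingerprints(path):
            yield id1, id2, tanimoto(fp1, fp2)
```

Both generators produce the same values in the same order; the scanning variant trades repeated file reads for a memory footprint that does not grow with the number of fingerprints.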
Format of fingerprint strings data in TextFile(s): automatically detect format of fingerprints string created by MayaChemTools fingerprints generation scripts or explicitly specify its format. Possible values: AutoDetect | FingerprintsBitVectorString | FingerprintsVectorString. Default value: AutoDetect.
Delimiter for output CSV/TSV text file(s). Possible values: comma, tab, or semicolon. Default value: comma.
Specify how similarity or distance values calculated for fingerprints vector and bit-vector strings are written to the output CSV/TSV text file(s): Generate text files containing rows and columns with their labels corresponding to compound IDs and each matrix element value corresponding to similarity or distance between corresponding compounds; Generate text files containing rows containing compoundIDs for two compounds followed by similarity or distance value between these compounds.
Possible values: RowsAndColumns, or IDPairsAndValue. Default value: RowsAndColumns.
The value of --OutMatrixFormat in conjunction with --OutMatrixType determines type of data written to output files and allows generation of up to 6 different output data formats:
Example of data in output file for RowsAndColumns --OutMatrixFormat value for FullMatrix value of --OutMatrixType:
Example of data in output file for RowsAndColumns --OutMatrixFormat value for UpperTriangularMatrix value of --OutMatrixType:
Example of data in output file for RowsAndColumns --OutMatrixFormat value for LowerTriangularMatrix value of --OutMatrixType:
Example of data in output file for IDPairsAndValue --OutMatrixFormat value for FullMatrix value of --OutMatrixType:
Example of data in output file for IDPairsAndValue --OutMatrixFormat value for UpperTriangularMatrix value of --OutMatrixType:
Example of data in output file for IDPairsAndValue --OutMatrixFormat value for LowerTriangularMatrix value of --OutMatrixType:
Type of similarity or distance matrix to calculate for fingerprints vector and bit-vector strings: Calculate full matrix; Calculate lower triangular matrix including diagonal; Calculate upper triangular matrix including diagonal.
Possible values: FullMatrix, UpperTriangularMatrix, or LowerTriangularMatrix. Default value: FullMatrix.
The value of --OutMatrixType in conjunction with --OutMatrixFormat determines type of data written to output files.
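The six combinations of --OutMatrixFormat and --OutMatrixType can be sketched in Python as shown below. The function names, the header row with an empty first cell, and the choice to omit excluded cells rather than pad them are assumptions for this illustration; the exact layout written by the script may differ.

```python
def keep_cell(i, j, matrix_type):
    """Decide whether element (row i, column j) belongs to the requested matrix type."""
    if matrix_type == "FullMatrix":
        return True
    if matrix_type == "UpperTriangularMatrix":
        return j >= i          # diagonal included
    if matrix_type == "LowerTriangularMatrix":
        return j <= i          # diagonal included
    raise ValueError(f"unknown OutMatrixType: {matrix_type}")


def rows_and_columns(ids, matrix, matrix_type="FullMatrix", precision=2):
    """One header row of compound IDs, then one row of values per compound."""
    rows = [[""] + list(ids)]
    for i, row_id in enumerate(ids):
        cells = [round(matrix[i][j], precision)
                 for j in range(len(ids)) if keep_cell(i, j, matrix_type)]
        rows.append([row_id] + cells)
    return rows


def id_pairs_and_value(ids, matrix, matrix_type="FullMatrix", precision=2):
    """One (ID1, ID2, value) triple per retained matrix element."""
    return [(ids[i], ids[j], round(matrix[i][j], precision))
            for i in range(len(ids))
            for j in range(len(ids))
            if keep_cell(i, j, matrix_type)]


ids = ["Cmpd1", "Cmpd2", "Cmpd3"]
sim = [[1.0, 0.25, 0.5],
       [0.25, 1.0, 0.75],
       [0.5, 0.75, 1.0]]
for triple in id_pairs_and_value(ids, sim, "UpperTriangularMatrix"):
    print(triple)
```

For N compounds, the full matrix holds N * N values while either triangular form holds N * (N + 1) / 2, since the diagonal is included.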
Overwrite existing files.
Precision of calculated values in the output file. Default: up to 2 decimal places. Valid values: positive integers.
Put quotes around column values in output CSV/TSV text file(s). Possible values: Yes or No. Default value: Yes.
New file name is generated using the root: <Root><BitVectorComparisonMode>.<Ext> or <Root><VectorComparisonMode><VectorComparisonFormulism>.<Ext>. The csv, and tsv <Ext> values are used for comma/semicolon, and tab delimited text files respectively. This option is ignored for multiple input files.
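The naming pattern can be expressed as a short Python sketch. The function output_file_name is a hypothetical helper written for this example; it only covers the case where a root name is given explicitly.

```python
def output_file_name(root, comparison_mode, formulism="", out_delim="comma"):
    """Build <Root><ComparisonMode>[<Formulism>].<Ext> as described above."""
    # csv is used for comma and semicolon delimited output, tsv for tab.
    ext = "tsv" if out_delim == "tab" else "csv"
    return f"{root}{comparison_mode}{formulism}.{ext}"


print(output_file_name("SampleFPHex", "TanimotoSimilarity"))
# SampleFPHexTanimotoSimilarity.csv
print(output_file_name("SampleFPCount", "TanimotoSimilarity", "AlgebraicForm"))
# SampleFPCountTanimotoSimilarityAlgebraicForm.csv
```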
Specify what similarity or distance coefficients to use for calculating similarity matrices for fingerprint vector strings data values in TextFile(s): calculate similarity matrices for all supported similarity and distance coefficients or specify a comma delimited list of similarity and distance coefficients. Possible values: All | "TanimotoSimilarity,[ManhattanDistance,..]". Default: TanimotoSimilarity.
The value of -v, --VectorComparisonMode, in conjunction with --VectorComparisonFormulism, decides which type of similarity and distance coefficient formulism gets used.
All uses complete list of supported similarity and distance coefficients: CosineSimilarity, CzekanowskiSimilarity, DiceSimilarity, OchiaiSimilarity, JaccardSimilarity, SorensonSimilarity, TanimotoSimilarity, CityBlockDistance, EuclideanDistance, HammingDistance, ManhattanDistance, SoergelDistance. These similarity and distance coefficients are described below.
FingerprintsVector.pm module, used to calculate similarity and distance coefficients, provides support to perform comparison between vectors containing three different types of values:
Type I: OrderedNumericalValues
Type II: UnorderedNumericalValues
Type III: AlphaNumericalValues
Before performing similarity or distance calculations between vectors containing UnorderedNumericalValues or AlphaNumericalValues, the vectors are transformed into vectors containing unique OrderedNumericalValues using value IDs for UnorderedNumericalValues and the values themselves for AlphaNumericalValues.
Three forms of similarity and distance calculation between two vectors, specified using --VectorComparisonFormulism option, are supported: AlgebraicForm, BinaryForm or SetTheoreticForm.
For BinaryForm, the ordered list of processed final vector values containing the value or count of each unique value type is simply converted into a binary vector containing 1s and 0s corresponding to presence or absence of values before calculating similarity or distance between two vectors.
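One way to picture the transformation and the BinaryForm conversion is the Python sketch below. It assumes that each unique value (or value ID) becomes one position in a shared ordered vector holding its count; this is an interpretation of the description above rather than a copy of what FingerprintsVector.pm does, and the sample values are made up.

```python
from collections import Counter


def to_ordered_counts(values_a, values_b):
    """Align two lists of values on their sorted unique keys and count occurrences."""
    counts_a, counts_b = Counter(values_a), Counter(values_b)
    keys = sorted(set(counts_a) | set(counts_b))
    xa = [counts_a[k] for k in keys]   # a missing key counts as 0
    xb = [counts_b[k] for k in keys]
    return keys, xa, xb


def to_binary_form(x):
    """Replace each count by 1 (value present) or 0 (value absent)."""
    return [1 if v else 0 for v in x]


keys, xa, xb = to_ordered_counts(["C.X3", "C.X3", "N.X1"], ["C.X3", "O.X1"])
print(keys)                # ['C.X3', 'N.X1', 'O.X1']
print(xa, xb)              # [2, 1, 0] [1, 0, 1]
print(to_binary_form(xa))  # [1, 1, 0]
```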
For two fingerprint vectors A and B of same size containing OrderedNumericalValues, let:
For SetTheoreticForm of calculation between two vectors, let:
For BinaryForm of calculation between two vectors, let:
Additionally, for BinaryForm various values also correspond to:
Various similarity and distance coefficients [ Ref 40, Ref 62, Ref 64 ] for a pair of vectors A and B in AlgebraicForm, BinaryForm and SetTheoreticForm are defined as follows:
CityBlockDistance: ( same as HammingDistance and ManhattanDistance)
AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )
BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
CosineSimilarity: ( same as OchiaiSimilarity)
AlgebraicForm: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2) * SUM ( Xbi ** 2) )
BinaryForm: Nc / SQRT ( Na * Nb)
SetTheoreticForm: | SetIntersectionXaXb | / SQRT ( |Xa| * |Xb| ) = SUM ( MIN ( Xai, Xbi ) ) / SQRT ( SUM ( Xai ) * SUM ( Xbi ) )
CzekanowskiSimilarity: ( same as DiceSimilarity and SorensonSimilarity)
AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) + SUM ( Xbi **2 ) )
BinaryForm: 2 * Nc / ( Na + Nb )
SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
DiceSimilarity: ( same as CzekanowskiSimilarity and SorensonSimilarity)
AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) + SUM ( Xbi **2 ) )
BinaryForm: 2 * Nc / ( Na + Nb )
SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
EuclideanDistance:
AlgebraicForm: SQRT ( SUM ( ( ( Xai - Xbi ) ** 2 ) ) )
BinaryForm: SQRT ( ( Na - Nc ) + ( Nb - Nc ) ) = SQRT ( Na + Nb - 2 * Nc )
SetTheoreticForm: SQRT ( | SetDifferenceXaXb | - | SetIntersectionXaXb | ) = SQRT ( SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) )
HammingDistance: ( same as CityBlockDistance and ManhattanDistance)
AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )
BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
JaccardSimilarity: ( same as TanimotoSimilarity)
AlgebraicForm: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) - SUM ( Xai * Xbi ) )
BinaryForm: Nc / ( ( Na - Nc ) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc )
SetTheoreticForm: | SetIntersectionXaXb | / | SetDifferenceXaXb | = SUM ( MIN ( Xai, Xbi ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )
ManhattanDistance: ( same as CityBlockDistance and HammingDistance)
AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )
BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
OchiaiSimilarity: ( same as CosineSimilarity)
AlgebraicForm: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2) * SUM ( Xbi ** 2) )
BinaryForm: Nc / SQRT ( Na * Nb)
SetTheoreticForm: | SetIntersectionXaXb | / SQRT ( |Xa| * |Xb| ) = SUM ( MIN ( Xai, Xbi ) ) / SQRT ( SUM ( Xai ) * SUM ( Xbi ) )
SorensonSimilarity: ( same as CzekanowskiSimilarity and DiceSimilarity)
AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) + SUM ( Xbi **2 ) )
BinaryForm: 2 * Nc / ( Na + Nb )
SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
SoergelDistance:
AlgebraicForm: SUM ( ABS ( Xai - Xbi ) ) / SUM ( MAX ( Xai, Xbi ) )
BinaryForm: 1 - Nc / ( Na + Nb - Nc ) = ( Na + Nb - 2 * Nc ) / ( Na + Nb - Nc )
SetTheoreticForm: ( | SetDifferenceXaXb | - | SetIntersectionXaXb | ) / | SetDifferenceXaXb | = ( SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )
TanimotoSimilarity: ( same as JaccardSimilarity)
AlgebraicForm: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) - SUM ( Xai * Xbi ) )
BinaryForm: Nc / ( ( Na - Nc ) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc )
SetTheoreticForm: | SetIntersectionXaXb | / | SetDifferenceXaXb | = SUM ( MIN ( Xai, Xbi ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )
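To make the three formulisms concrete, the following Python sketch evaluates TanimotoSimilarity, CityBlockDistance and SoergelDistance for a pair of small vectors. It assumes Xa and Xb are equal-length lists of non-negative counts and that, in BinaryForm, Na, Nb and Nc count non-zero positions in A, in B, and in both. The function names are illustrative and zero denominators are not handled.

```python
def tanimoto_algebraic(xa, xb):
    ab = sum(a * b for a, b in zip(xa, xb))
    return ab / (sum(a * a for a in xa) + sum(b * b for b in xb) - ab)


def tanimoto_binary(xa, xb):
    na = sum(1 for a in xa if a)
    nb = sum(1 for b in xb if b)
    nc = sum(1 for a, b in zip(xa, xb) if a and b)
    return nc / (na + nb - nc)


def tanimoto_set_theoretic(xa, xb):
    intersection = sum(min(a, b) for a, b in zip(xa, xb))
    return intersection / (sum(xa) + sum(xb) - intersection)


def city_block_algebraic(xa, xb):
    return sum(abs(a - b) for a, b in zip(xa, xb))


def city_block_set_theoretic(xa, xb):
    return sum(xa) + sum(xb) - 2 * sum(min(a, b) for a, b in zip(xa, xb))


def soergel_algebraic(xa, xb):
    return city_block_algebraic(xa, xb) / sum(max(a, b) for a, b in zip(xa, xb))


xa, xb = [2, 1, 0], [1, 0, 1]
print(tanimoto_algebraic(xa, xb))       # 0.4
print(tanimoto_binary(xa, xb))          # 1 / 3
print(tanimoto_set_theoretic(xa, xb))   # 0.25
print(city_block_algebraic(xa, xb))     # 3
```

For count vectors the three Tanimoto forms generally differ, as the example shows. For vectors holding only 0s and 1s they coincide, because products, minima and presence flags are then the same quantity. For non-negative values, the algebraic and set-theoretic CityBlock forms always agree, since ABS ( Xai - Xbi ) = Xai + Xbi - 2 * MIN ( Xai, Xbi ).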
Specify fingerprints vector comparison formulism to use for calculating similarity and distance coefficients during -v, --VectorComparisonMode: use all supported comparison formulisms or specify a comma delimited list of formulisms. Possible values: All | "AlgebraicForm,[BinaryForm,SetTheoreticForm]". Default value: AlgebraicForm.
All uses all three forms of supported vector comparison formulism for values of -v, --VectorComparisonMode option.
For fingerprint vector strings containing AlphaNumericalValues data values - ExtendedConnectivityFingerprints, AtomNeighborhoodsFingerprints and so on - all three formulisms result in the same value during similarity and distance calculations.
Location of working directory. Default: current directory.
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in text file present in a column name containing Fingerprint substring by loading all fingerprints data into memory and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in SD File present in a data field with Fingerprint substring in its label by loading all fingerprints data into memory and create a SampleFPHexTanimotoSimilarity.csv file containing sequentially generated compound IDs with Cmpd prefix, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in FP file by loading all fingerprints data into memory and create a SampleFPHexTanimotoSimilarity.csv file along with compound IDs retrieved from FP file, type:
To generate a lower triangular similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in text file present in a column name containing Fingerprint substring by loading all fingerprints data into memory and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate an upper triangular similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in text file present in a column name containing Fingerprint substring by loading all fingerprints data into memory and create a SampleFPHexTanimotoSimilarity.csv file in IDPairsAndValue format containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate a full similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in text file present in a column name containing Fingerprint substring by scanning file without loading all fingerprints data into memory and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate a lower triangular similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in text file present in a column name containing Fingerprint substring by scanning file without loading all fingerprints data into memory and create a SampleFPHexTanimotoSimilarity.csv file in IDPairsAndValue format containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient using algebraic formulism for fingerprints vector strings data corresponding to supported fingerprints in text file present in a column name containing Fingerprint substring and create a SampleFPCountTanimotoSimilarityAlgebraicForm.csv file containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient using algebraic formulism for fingerprints vector strings data corresponding to supported fingerprints in SD file present in a data field with Fingerprint substring in its label and create a SampleFPCountTanimotoSimilarityAlgebraicForm.csv file containing sequentially generated compound IDs with Cmpd prefix, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient using algebraic formulism for fingerprints vector strings data corresponding to supported fingerprints in FP file and create a SampleFPCountTanimotoSimilarityAlgebraicForm.csv file along with compound IDs retrieved from FP file, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in text file present in a column name containing Fingerprint substring and create a SampleFPHexTanimotoSimilarity.csv file in IDPairsAndValue format containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in SD file present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.csv file in IDPairsAndValue format containing sequentially generated compound IDs with Cmpd prefix, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in FP file and create a SampleFPHexTanimotoSimilarity.csv file in IDPairsAndValue format along with compound IDs retrieved from FP file, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints in SD file present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs from mol name line, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs from data field name Mol_ID, type:
To generate similarity matrices corresponding to Buser, Dice and Tanimoto similarity coefficients for fingerprints bit-vector strings data corresponding to supported fingerprints present in a column name containing Fingerprint substring and create SampleFPBin[CoefficientName]Similarity.csv files containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate similarity matrices corresponding to Buser, Dice and Tanimoto similarity coefficients for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPBin[CoefficientName]Similarity.csv files containing sequentially generated compound IDs with Cmpd prefix, type:
To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using algebraic formulism for fingerprints vector strings data corresponding to supported fingerprints present in a column name containing Fingerprint substring and create SampleFPCount[CoefficientName]AlgebraicForm.csv files containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using algebraic formulism for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPCount[CoefficientName]AlgebraicForm.csv files containing sequentially generated compound IDs with Cmpd prefix, type:
To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using binary formulism for fingerprints vector strings data corresponding to supported fingerprints present in a column name containing Fingerprint substring and create SampleFPCount[CoefficientName]Binary.csv files containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using binary formulism for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPCount[CoefficientName]Binary.csv files containing sequentially generated compound IDs with Cmpd prefix, type:
To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using all supported comparison formulisms for fingerprints vector strings data corresponding to supported fingerprints present in a column name containing Fingerprint substring and create SampleFPCount[CoefficientName][FormulismName].csv files containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using all supported comparison formulisms for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPCount[CoefficientName][FormulismName].csv files containing sequentially generated compound IDs with Cmpd prefix, type:
To generate similarity matrices corresponding to all available similarity coefficients for fingerprints bit-vector strings data corresponding to supported fingerprints present in a column name containing Fingerprint substring and create SampleFPHex[CoefficientName].csv files containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate similarity matrices corresponding to all available similarity coefficients for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPHex[CoefficientName].csv files containing sequentially generated compound IDs with Cmpd prefix, type:
To generate similarity matrices corresponding to all available similarity and distance coefficients using all comparison formulisms for fingerprints vector strings data corresponding to supported fingerprints present in a column name containing Fingerprint substring and create SampleFPCount[CoefficientName][FormulismName].csv files containing compound IDs retrieved from column name containing CompoundID substring, type:
To generate similarity matrices corresponding to all available similarity and distance coefficients using all comparison formulisms for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPCount[CoefficientName][FormulismName].csv files containing sequentially generated compound IDs with Cmpd prefix, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in column number 2 and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs retrieved from column number 1, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in the data field named Fingerprints and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs present in the data field named Mol_ID, type:
To generate a similarity matrix corresponding to Tversky similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a column named Fingerprints and create a SampleFPHexTverskySimilarity.tsv file containing compound IDs retrieved from column named CompoundID, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs from molname line or sequentially generated compound IDs with Mol prefix, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.tsv file containing sequentially generated compound IDs with Cmpd prefix, type:
InfoFingerprintsFiles.pl, SimilaritySearchingFingerprints.pl, AtomNeighborhoodsFingerprints.pl, ExtendedConnectivityFingerprints.pl, MACCSKeysFingerprints.pl, PathLengthFingerprints.pl, TopologicalAtomPairsFingerprints.pl, TopologicalAtomTorsionsFingerprints.pl, TopologicalPharmacophoreAtomPairsFingerprints.pl, TopologicalPharmacophoreAtomTripletsFingerprints.pl
Copyright (C) 2024 Manish Sud. All rights reserved.
This file is part of MayaChemTools.
MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.