VisualizeChemspaceUsingTMAP.py - Visualize chemspace
VisualizeChemspaceUsingTMAP.py [--categoricalDataCols <collabel1,... or colnum1,...>] [--categoricalDataColormaps <Colormap1, Colormap2,...>] [--categoricalDataMaxDisplay <number>] [--colmode <collabel or colnum>] [--colSMILES <text or number>] [--faerunConfigParams <Name,Value,...>] [--faerunScatterPlotParams <Name,Value,...>] [--infileDelimiter <comma, tab, or space>] [--lshForestFileWrite <yes or no>] [--lshForestFileRestore <yes or no>] [--lshForestParams <Name,Value,...>] [--lshLayoutConfigParams <Name,Value,...>] [--mergeHTMLandJSFiles <yes or no>] [--minHashFPParams <Name,Value,...>] [--mp <yes or no>] [--mpParams <Name,Value,...>] [--numericalDataCols <collabel1,... or colnum1,...>] [--numericalDataColormaps <Colormap1, Colormap2,...>] [--overwrite] [--quiet <yes or no>] [--structureDisplayDataCols <collabel1,... or colnum1,...> ] [--tmapDisplayMsg <text>] [-w <dir>] -i <infile> -o <outfile>
VisualizeChemspaceUsingTMAP.py -h | --help | -e | --examples
Generate an interactive TreeMAP (TMAP) [Ref 171, 172] visualization for molecules in a text input file. The text input file must have a column containing SMILES strings. In addition, it must contain at least one column corresponding to categorical or numerical data for coloring TMAP nodes. You may optionally map multiple categorical and numerical data columns on to a TMAP visualization. A HTML file is generated for interactive visualization of chemspace in a browser.
The TMAP methodology is able to generate a reasonably interactive visualization for relatively large data sets. A brief description of the methodology is as follows. A set of MinHash Fingerprints (MHFPs) is calculated for the molecules in the input file, followed by the generation of a Locality Sensitive Hashing (LSH) forest from the MHFPs. A c-approximate k-Nearest Neighbor Graph (c-k-NNG) is constructed from the LSH forest and is used to construct a Minimum Spanning Tree (MST) or Forest (MSF). The final TMAP visualization is generated by laying out the MST or MSF on a plane using an algorithm provided by the Open Graph Drawing Framework (OGDF). The OGDF provides flexibility to adjust the graph layout methodology in terms of not only aesthetics but also computational time.
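The graph-to-tree step described above can be illustrated with a small, self-contained sketch. It builds a minimum spanning forest from a weighted edge list using Kruskal's algorithm with a union-find structure. This is an illustration only: the function name, the toy edge list, and the choice of Kruskal's algorithm are assumptions made here, and the implementation used by TMAP may differ.

```python
def minimum_spanning_forest(num_nodes, edges):
    """Return the edges of a minimum spanning forest.

    edges: iterable of (weight, u, v) tuples for an undirected graph,
    e.g. distances taken from a k-nearest-neighbor graph. This is a
    hypothetical toy input, not TMAP's internal data structure.
    """
    parent = list(range(num_nodes))

    def find(node):
        # Follow parent links to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    forest = []
    # Kruskal: visit edges by increasing weight, keep those joining two trees.
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            forest.append((weight, u, v))
    return forest


# Nodes 0-2 form one component and nodes 3-4 another, so the result
# is a forest of two trees rather than a single spanning tree.
knn_edges = [(0.2, 0, 1), (0.9, 0, 2), (0.4, 1, 2), (0.1, 3, 4)]
msf = minimum_spanning_forest(5, knn_edges)
print(msf)
```

When the nearest-neighbor graph is disconnected, as in this toy input, the result is a forest (MSF) instead of a single tree (MST).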
The supported input file formats are: CSV (.csv), TSV (.txt or .tsv), SMILES (.smi)
The supported output file format is: HTML (.html).
A comma delimited list of column labels or numbers corresponding to categorical data to map on a TMAP visualization.
A comma delimited list of color map names corresponding to categorical data. The default is to use 'tab10' color map name for mapping categorical data on a TMAP. The number of specified color maps must match the number of categorical data columns. You must specify valid color map names supported by Matplotlib. No validation is performed. Example color map names for categorical data: Pastel1, Pastel2, Paired, Accent, Dark2, Set1, Set2, Set3, tab10, tab20, tab20b, tab20c.
Maximum number of categories in a category column to display on a TMAP visualization. The rest of the categories are aggregated under a new category named 'Other' before mapping on to a TMAP visualization.
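The aggregation described above can be sketched as follows. The sketch assumes that the most frequent categories are the ones kept; the text does not state how the retained categories are selected, and the function and variable names are hypothetical.

```python
from collections import Counter


def aggregate_categories(values, max_display, other_label="Other"):
    """Keep at most max_display categories; relabel the rest.

    Assumption: the categories retained are the most frequent ones.
    """
    counts = Counter(values)
    keep = {category for category, _ in counts.most_common(max_display)}
    return [value if value in keep else other_label for value in values]


labels = ["A", "B", "A", "C", "D", "A", "B"]
aggregated = aggregate_categories(labels, 2)
print(aggregated)
```

With a maximum of two categories, 'A' and 'B' are retained while 'C' and 'D' are relabeled as 'Other'.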
Use column number or name for the specification of columns in input text file containing SMILES strings and molecule names along with any categorical or numerical data.
Column name or number corresponding to SMILES strings. The default value is automatically set based on the value of '-c, --colmode': 'SMILES' for 'collabel'; SMILES string column number for 'colnum'. SMILES strings must be present in input file.
Print examples.
A comma delimited list of parameter name and value pairs for configuring faerun (Ref 172) to generate a TMAP visualization.
The supported parameter names along with their default and possible values are shown below:
A brief description of parameters, as available in the code for faerun, is provided below:
A comma delimited list of parameter name and value pairs for generating scatter plot representing a TMAP using faerun (Ref 172).
The supported parameter names along with their default and possible values are shown below:
A brief description of parameters is provided below:
Print this help message.
Input file name. The SMILES strings must be present in the input file. Supported formats: CSV (.csv), TSV (.txt or .tsv), or SMILES (.smi)
Input file delimiter for processing data. The default value is automatically set based on the type of input file: comma - CSV (.csv); tab - TSV (.txt or .tsv); space - SMILES (.smi)
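The default delimiter selection described above amounts to a lookup on the file extension. A minimal sketch is shown below; the function name and the case-insensitive handling of extensions are assumptions, not a description of the script's code.

```python
import os

# Extension-to-delimiter defaults as listed in the option description.
DEFAULT_DELIMITERS = {
    ".csv": ",",
    ".tsv": "\t",
    ".txt": "\t",
    ".smi": " ",
}


def default_delimiter(infile):
    """Return the default delimiter for a supported input file name."""
    extension = os.path.splitext(infile)[1].lower()
    if extension not in DEFAULT_DELIMITERS:
        raise ValueError(f"Unsupported input file extension: {extension!r}")
    return DEFAULT_DELIMITERS[extension]


print(repr(default_delimiter("Sample.csv")))
```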
Write LSH forest data to a file for subsequent generation of a TMAP visualization. Default file name: <OutfileRoot>_LSHForest.dat. The LSH forest data is generated using MinHash fingerprints. You may restore LSH forest data using '--lshForestFileRestore' option to skip the generation of fingerprints.
Check and restore LSH forest data from a file for generating a TMAP visualization and skip the generation of MinHash fingerprints. Default file name: <OutfileRoot>_LSHForest.dat
A comma delimited list of parameter name and value pairs for generating LSH (Locality Sensitive Hashing) forest from MinHash fingerprints.
The supported parameter names along with their default and possible values are shown below:
A brief description of parameters, as available in the code for LSH, is provided below:
A comma delimited list of parameter name and value pairs for configuring LSH (Locality Sensitive Hashing) layout.
The supported parameter names along with their default and possible values are shown below:
A brief description of parameters, as available in the code for LSH, is provided below:
Merge TMAP JS data file into HTML file and delete JS data file. Default file names: <OutfileRoot>.html, <OutfileRoot>.js.
A comma delimited list of parameter name and value pairs for generating MinHash Fingerprints (MHFP).
The supported parameter names along with their default and possible values are shown below:
A brief description of parameters, as available in the code for MHFP, is provided below:
Use multiprocessing for the generation of fingerprints.
By default, input data is retrieved in a lazy manner via mp.Pool.imap() function employing lazy RDKit data iterable. This allows processing of arbitrarily large data sets without any additional memory requirements.
All input data may be optionally loaded into memory by mp.Pool.map() before starting worker processes in a process pool by setting the value of 'inputDataMode' to 'InMemory' in '--mpParams' option.
A word to the wise: The default 'chunkSize' value of 1 during 'Lazy' input data mode may adversely impact the performance. The '--mpParams' section provides additional information to tune the value of 'chunkSize'.
A comma delimited list of parameter name and value pairs to configure multiprocessing during the generation of fingerprints.
The supported parameter names along with their default and possible values are shown below:
These parameters are used by the following functions to configure and control the behavior of multiprocessing: mp.Pool(), mp.Pool.map(), and mp.Pool.imap().
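The 'Name,Value,...' format used by this and the other parameter options can be parsed along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, values are left as strings, and the script's own parsing and validation may differ. The example uses only the 'inputDataMode' and 'chunkSize' parameter names mentioned in this document.

```python
def parse_name_value_pairs(spec):
    """Split 'Name1,Value1,Name2,Value2,...' into a dictionary.

    Values are returned as strings; converting them to numbers or
    validating names is left to the caller.
    """
    tokens = [token.strip() for token in spec.split(",")]
    if len(tokens) % 2 != 0:
        raise ValueError(
            f"Expected an even number of comma delimited items, got {len(tokens)}"
        )
    return dict(zip(tokens[0::2], tokens[1::2]))


params = parse_name_value_pairs("inputDataMode,Lazy,chunkSize,8")
print(params)
```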
The chunkSize determines chunks of input data passed to each worker process in a process pool by mp.Pool.map() and mp.Pool.imap() functions. The default value of chunkSize is dependent on the value of 'inputDataMode'.
The mp.Pool.map() function, invoked during 'InMemory' input data mode, automatically converts RDKit data iterable into a list, loads all data into memory, and calculates the default chunkSize using the following method as shown in its code:
For example, the default chunkSize will be 7 for a pool of 4 worker processes and 100 data items.
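The default chunkSize calculation described above can be sketched as follows. It mirrors the calculation performed in CPython's Pool implementation at the time of writing; the function and argument names are chosen here for illustration.

```python
def default_map_chunksize(num_items, num_workers):
    """Default chunk size used by map() when none is specified.

    The data is split into roughly four chunks per worker process,
    rounding up when the division is not exact.
    """
    chunksize, extra = divmod(num_items, num_workers * 4)
    if extra:
        chunksize += 1
    return chunksize


# 100 items and 4 workers: divmod(100, 16) gives (6, 4), so the result is 7.
print(default_map_chunksize(100, 4))
```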
The mp.Pool.imap() function, invoked during 'Lazy' input data mode, employs 'lazy' RDKit data iterable to retrieve data as needed, without loading all the data into memory. Consequently, the size of input data is not known a priori. It's not possible to estimate an optimal value for the chunkSize. The default chunkSize is set to 1.
The default value for the chunkSize during 'Lazy' data mode may adversely impact the performance due to the overhead associated with exchanging small chunks of data. It is generally a good idea to explicitly set chunkSize to a larger value during 'Lazy' input data mode, based on the size of your input data and number of processes in the process pool.
The mp.Pool.map() function waits for all worker processes to process all the data and return the results. The mp.Pool.imap() function, however, returns the results obtained from worker processes as soon as the results become available for specified chunks of data.
The order of data in the results returned by both mp.Pool.map() and mp.Pool.imap() functions always corresponds to the input data.
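The difference between the two functions, and the preservation of input order, can be seen in a small sketch. To keep the snippet runnable without a main-module guard, it uses multiprocessing.pool.ThreadPool, which exposes the same map() and imap() interface as mp.Pool but runs workers as threads; this substitution is an assumption made for illustration, and the worker function and chunk size are arbitrary.

```python
from multiprocessing.pool import ThreadPool


def square(value):
    """Stand-in for a per-molecule calculation."""
    return value * value


data = range(10)

with ThreadPool(processes=4) as pool:
    # map() blocks until every item is processed and returns a list.
    eager_results = pool.map(square, data)

    # imap() returns an iterator right away; results are consumed
    # as chunks complete, using an explicit chunk size of 3 items.
    lazy_results = list(pool.imap(square, data, chunksize=3))

print(eager_results)
print(lazy_results)
```

Both calls yield results in the same order as the input data, regardless of which worker finishes first.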
A comma delimited list of column labels or numbers corresponding to numerical data to map on a TMAP visualization.
A comma delimited list of color map names corresponding to numerical data. The default is to use 'viridis' color map name for mapping numerical data on a TMAP. The number of specified color maps must match the number of numerical data columns. You must specify valid color map names supported by Matplotlib. No validation is performed. Example color map names for numerical data: viridis, plasma, inferno, magma, cividis.
Output HTML file name for writing out a TMAP visualization.
Overwrite existing files.
Use quiet mode. The warning and information messages will not be printed.
A comma delimited list of column labels or numbers corresponding to data to display under a thumbnail image of a structure in a TMAP visualization. The default column is set to 'Name' and it is automatically shown. In addition, the SMILES string column is always used to display SMILES under the structures.
A brief message to display at the top left in HTML page containing a TMAP visualization. You must specify a valid HTML string. No validation is performed. Default message: TMAP chemspace visualization<br/> Input file: <InfileName><br/>Number of molecules: <Count>
Location of working directory which defaults to the current directory.
To visualize chemspace for SMILES strings present in a column named SMILES in input file, mapping a categorical data column on TMAP, writing out LSH forest for subsequent use to skip the generation of fingerprints, merging TMAP JS file into HTML file, and writing out a HTML file containing TMAP visualization, type:
To run the first example for SMILES strings in a column named SMILES in input file and write out a HTML file containing TMAP visualization, type:
To run the first example for mapping categorical data in column number 4 in input file and write out a HTML file containing TMAP visualization, type:
To run the first example for mapping both categorical and numerical data columns and write out a HTML file containing TMAP visualization, type:
To run the first example for mapping both categorical and numerical data columns along with specified colormaps and write out a HTML file containing TMAP visualization, type:
To run the first example for mapping both categorical and numerical data columns along with displaying specific data under the structure display and write out a HTML file containing TMAP visualization, type:
To run the first example for restoring LSH forest data from a file to skip the generation of fingerprints and write out a HTML file containing TMAP visualization, type:
To run the first example in multiprocessing mode on all available CPUs without loading all data into memory and write out a HTML file containing TMAP visualization, type:
To run the first example in multiprocessing mode on all available CPUs by loading all data into memory and write out a HTML file containing TMAP visualization, type:
To run the first example in multiprocessing mode on a specific number of CPUs and chunk size without loading all data into memory and write out a HTML file containing TMAP visualization, type:
To run the first example using a set of specified parameters to generate fingerprints and LSH forest, configure faerun and scatter plot layout, and write out a HTML file containing TMAP visualization, type:
RDKitConvertFileFormat.py, RDKitCalculateMolecularDescriptors.py, RDKitStandardizeMolecules.py
Copyright (C) 2024 Manish Sud. All rights reserved.
The functionality available in this script is implemented using TMAP and Faerun, open source software packages for visualizing chemspace, and RDKit, an open source toolkit for cheminformatics developed by Greg Landrum.
This file is part of MayaChemTools.
MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.