Title: | A Comprehensive Interface for Accessing the Protein Data Bank |
---|---|
Description: | Streamlines the interaction with the 'RCSB' Protein Data Bank ('PDB') <https://www.rcsb.org/>. This interface offers an intuitive and powerful tool for searching and retrieving a diverse range of data types from the 'PDB'. It includes advanced functionalities like BLAST and sequence motif queries. Built upon the existing XML-based API of the 'PDB', it simplifies the creation of custom requests, thereby enhancing usability and flexibility for researchers. |
Authors: | Selcuk Korkmaz [aut, cre] , Bilge Eren Yamasan [aut] |
Maintainer: | Selcuk Korkmaz <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.1.1 |
Built: | 2024-11-20 06:23:14 UTC |
Source: | CRAN |
This function facilitates the management of properties and subproperties required for data retrieval from the Protein Data Bank (PDB). It accepts a list of properties where each key represents a property category (e.g., 'cell', 'exptl'), and the corresponding value is a character vector of subproperties (e.g., 'volume', 'method'). The function ensures that if a property already exists, its subproperties are merged without duplication, guaranteeing that each subproperty remains unique.
add_property(property)
property |
A list where each element corresponds to a property category. The names of the list elements are the properties, and their values are character vectors containing the subproperties. The full list of available properties and their descriptions can be found at https://data.rcsb.org/#data-schema. For example, a 'property' list might look like: list(cell = c("length_a", "length_b", "length_c"), exptl = c("method")) |
The 'add_property' function is particularly useful when users need to dynamically build or update a list of properties required for complex queries in the PDB. By automatically handling duplicate entries, this function streamlines the process of constructing property lists, which can then be used in subsequent data retrieval operations.
The function operates as follows:
Checks if the input 'property' is a list. If not, it throws an error.
Iterates through each property in the list, ensuring that subproperties are unique and in character vector format.
If a property already exists in the list, it merges the subproperties while eliminating duplicates.
A modified list that consolidates the input properties. If a property already exists in the input list, its subproperties are merged, removing any duplicates.
It is important to ensure that the subproperties are correctly formatted as character vectors. The function does not modify the format of the subproperties.
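The merging behavior described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name `merge_properties` is hypothetical, and the sketch assumes that a repeated category name in the input list is what triggers a merge.

```r
# Hypothetical helper (not part of the package): consolidate a named list of
# property -> subproperty vectors so that each subproperty appears only once.
merge_properties <- function(property) {
  if (!is.list(property)) {
    stop("'property' must be a list.")
  }
  if (is.null(names(property))) {
    stop("'property' must be a named list.")
  }
  merged <- list()
  for (i in seq_along(property)) {
    name <- names(property)[i]
    subprops <- as.character(unlist(property[[i]]))
    # Merge with anything already stored under this name, dropping duplicates
    merged[[name]] <- unique(c(merged[[name]], subprops))
  }
  merged
}

props <- list(
  cell  = c("length_a", "length_b"),
  exptl = c("method"),
  cell  = c("length_b", "volume")  # repeated category with one overlapping entry
)
result <- merge_properties(props)
print(result)
```

Because `unique()` keeps the first occurrence of each value, the merged vector preserves the order in which subproperties were first supplied.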
fetch_data, query_search for related functions that utilize properties in querying the PDB.
# Example usage:
properties <- list(cell = c("length_a", "length_b", "length_c"), exptl = c("method"))

# Add new properties or merge existing ones
updated_properties <- add_property(properties)
print(updated_properties)
The 'autoresolve_sequence_type' function analyzes the characters in a given sequence to determine whether it is a DNA, RNA, or protein sequence. The function uses standard IUPAC nucleotide and amino acid codes to classify the sequence based on its composition.
autoresolve_sequence_type(sequence)
sequence |
A string representing the nucleotide or protein sequence to be analyzed. The sequence should be composed of characters corresponding to standard IUPAC codes. |
A string indicating the resolved sequence type: 'DNA', 'RNA', or 'PROTEIN'. If the sequence contains ambiguous characters or does not fit clearly into one of these categories, an error is thrown.
# Example of determining the sequence type for a DNA sequence
seq_type_dna <- autoresolve_sequence_type("ATGCGTACGTAGC")
print(seq_type_dna) # Should return "DNA"

# Example of determining the sequence type for a protein sequence
seq_type_protein <- autoresolve_sequence_type("MVLSPADKTNVKAAW")
print(seq_type_protein) # Should return "PROTEIN"

# Example of an ambiguous sequence that causes an error
# autoresolve_sequence_type("ATGB") # Should throw an error due to ambiguity
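To make the idea of classifying a sequence by its composition concrete, here is a minimal base R sketch. The helper `classify_sequence` is hypothetical and its rules are assumptions: it tests the DNA alphabet first, then RNA, then the 20 standard amino acid letters, so a string made only of A, C, and G is reported as DNA here. The package's own function may resolve such cases differently.

```r
# Hypothetical classifier (not the package's implementation): decide the
# sequence type from the set of distinct letters it contains.
classify_sequence <- function(sequence) {
  if (!is.character(sequence) || length(sequence) != 1 || !nzchar(sequence)) {
    stop("'sequence' must be a single non-empty string.")
  }
  letters_found <- unique(strsplit(toupper(sequence), "")[[1]])
  dna     <- c("A", "C", "G", "T")
  rna     <- c("A", "C", "G", "U")
  protein <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]  # 20 standard residues

  if (all(letters_found %in% dna)) return("DNA")
  if (all(letters_found %in% rna)) return("RNA")
  if (all(letters_found %in% protein)) return("PROTEIN")
  stop("Could not resolve the sequence type of: ", sequence)
}

print(classify_sequence("ATGCGTACGTAGC"))
print(classify_sequence("AUGCGUACG"))
print(classify_sequence("MVLSPADKTNVKAAW"))
```

In this sketch "ATGB" raises an error because B belongs to none of the three alphabets.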
The 'ChemicalOperator' function constructs an operator object used for chemical searches within the RCSB Protein Data Bank (PDB). This function is particularly useful for querying the PDB database using chemical structure descriptors, such as SMILES (Simplified Molecular Input Line Entry System) or InChI (International Chemical Identifier) strings. The function supports various matching criteria to tailor the search results according to specific needs.
ChemicalOperator(descriptor, matching_criterion = "graph-strict")
descriptor |
A string representing the chemical structure in either SMILES or InChI format. The function automatically detects the format based on the input string. If the descriptor starts with "InChI=", it is treated as an InChI string; otherwise, it is assumed to be a SMILES string. |
matching_criterion |
A string specifying the criterion for matching the chemical structure. The matching criterion determines how closely the input descriptor should match the structures in the PDB database. The possible values are predefined in the 'DescriptorMatchingCriterion' list, with "graph-strict" being the default. Other options may include "graph-exact," "graph-relaxed," and "fingerprint-similarity," among others. |
The 'ChemicalOperator' function is designed for advanced users who need to search for chemical structures in the PDB using specific descriptors. The function allows flexibility in defining the level of matching precision, making it suitable for both exact and fuzzy searches.
The matching criteria provided by the 'matching_criterion' argument allow users to control the strictness of the search. For example:
"graph-strict": Matches chemical structures based on atom type, bond order, and chirality, with strict graph matching.
"graph-relaxed": Allows for a more relaxed matching by ignoring certain structural details.
"fingerprint-similarity": Uses molecular fingerprints to find similar structures based on a similarity threshold.
The function returns a list structured as a 'ChemicalOperator' object. This object contains the input descriptor, the type of descriptor (SMILES or InChI), and the specified matching criterion. The resulting 'ChemicalOperator' object can be used in subsequent functions that perform chemical searches in the PDB database.
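The descriptor format detection described for the 'descriptor' argument can be illustrated with a short base R sketch. The constructor name `make_chemical_operator` and the field names of the returned list are assumptions made for illustration; only the "InChI=" prefix rule comes from the text above.

```r
# Hypothetical constructor (not the package's implementation). The field
# names 'value', 'descriptor_type' and 'match_type' are illustrative only.
make_chemical_operator <- function(descriptor, matching_criterion = "graph-strict") {
  stopifnot(is.character(descriptor), length(descriptor) == 1)
  # Strings starting with "InChI=" are treated as InChI; everything else as SMILES
  descriptor_type <- if (startsWith(descriptor, "InChI=")) "InChI" else "SMILES"
  structure(
    list(
      value = descriptor,
      descriptor_type = descriptor_type,
      match_type = matching_criterion
    ),
    class = c("ChemicalOperator", "list")
  )
}

smiles_op <- make_chemical_operator("C1=CC=CC=C1")
inchi_op <- make_chemical_operator(
  "InChI=1S/C7H8O/c1-6-2-4-7(9)5-3-6/h2-5,9H,1H3",
  matching_criterion = "graph-relaxed"
)
str(unclass(smiles_op))
str(unclass(inchi_op))
```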
'perform_search' for executing a search using the created 'ChemicalOperator'.
# Example 1: Search for a chemical using a SMILES string
smiles_operator <- ChemicalOperator(descriptor = "C1=CC=CC=C1", matching_criterion = "graph-strict")
smiles_operator

# Example 2: Search using an InChI string with a relaxed matching criterion
inchi_operator <- ChemicalOperator(descriptor = "InChI=1S/C7H8O/c1-6-2-4-7(9)5-3-6/h2-5,9H,1H3", matching_criterion = "graph-relaxed")
inchi_operator
Constructs a 'ComparisonOperator' object for search operations that perform comparison checks on attribute values. This operator allows for evaluating attributes using comparison operators such as 'equal', 'greater_than', or 'less_than', making it suitable for numerical and date-based searches.
ComparisonOperator(attribute, value, comparison_type)
attribute |
The attribute to be compared. This should be the field within the RCSB PDB that you want to evaluate. |
value |
The value to compare against. This is the reference value for the comparison. |
comparison_type |
A string specifying the type of comparison. Supported comparison types include 'equal', 'not_equal', 'greater_than', and 'less_than'. |
An object of class 'ComparisonOperator' that can be used in search queries to retrieve entries where the attribute meets the specified comparison criteria.
# Search for entries where an attribute equals a specific value
operator <- ComparisonOperator(attribute = "rcsb_entry_info.resolution_combined", value = 2.0, comparison_type = "EQUAL")
operator
Constructs a 'ContainsPhraseOperator' object for search operations that look for attributes containing a specific phrase. This operator is ideal for scenarios where the search needs to be more precise than just individual words, such as finding an exact phrase within a text attribute.
ContainsPhraseOperator(attribute, value)
attribute |
The attribute to be evaluated. This should be the text field within the RCSB PDB that you want to search against. |
value |
The phrase to search for in the attribute. The search will look for this exact sequence of words within the specified attribute. |
An object of class 'ContainsPhraseOperator' that can be used in search queries to retrieve entries where the attribute contains the specified phrase.
# Search for entries containing a specific phrase in an attribute
operator <- ContainsPhraseOperator(attribute = "rcsb_primary_citation.title", value = "molecular dynamics")
print(operator)
Constructs a 'ContainsWordsOperator' object for search operations that look for attributes containing specific words. This operator is particularly useful for text-based searches where the goal is to find entries that include particular keywords or phrases within a specified attribute.
ContainsWordsOperator(attribute, value)
attribute |
The attribute to be evaluated. This should be the text field within the RCSB PDB that you want to search against. |
value |
The words to search for in the attribute. This can be a single word or a set of words, and the search will return entries containing any of these words in the specified attribute. |
An object of class 'ContainsWordsOperator' that can be used in search queries to retrieve entries where the attribute contains the specified words.
# Search for entries containing specific words in an attribute
operator <- ContainsWordsOperator(attribute = "rcsb_primary_citation.title", value = "crystal structure")
print(operator)
The 'data_fetcher' function provides a flexible way to access data from the RCSB Protein Data Bank (PDB). By specifying an identifier, data type, and a set of properties, users can tailor the data retrieval process to meet their specific research needs. The function integrates several steps, including validating IDs, generating a JSON query, fetching the data, and formatting the response.
data_fetcher( id = NULL, data_type = "ENTRY", properties = NULL, return_as_dataframe = TRUE, verbosity = FALSE )
id |
A single identifier or a list of identifiers for the data to be fetched. These IDs correspond to the entries, assemblies, polymer entities, or other entities within the RCSB PDB. The ID must match the data type you are querying (e.g., PDB ID for entries, assembly ID for assemblies). |
data_type |
A string specifying the type of data to fetch. The available options for
Each |
properties |
A list or dictionary of properties to be included in the data fetching process. The properties should match the data type you are querying. For example, if you are fetching |
return_as_dataframe |
A boolean indicating whether to return the response as a dataframe. If |
verbosity |
A boolean flag indicating whether to print status messages during the function execution. When set to |
The 'data_fetcher' function is particularly useful for researchers who need to access and analyze specific subsets of PDB data. By providing a list of IDs and the corresponding data type, users can retrieve only the information relevant to their study, reducing the need to manually filter or process large datasets. The function also supports fetching multiple properties simultaneously, allowing for a more comprehensive data retrieval process.
Depending on the value of return_as_dataframe, this function returns either a dataframe or the raw data in its original format. The dataframe format is particularly useful for further data analysis and visualization within R, while the raw format may be preferred for more complex or custom data processing tasks.
# Example 1: Fetching basic entry information
properties <- list(cell = c("length_a", "length_b", "length_c"), exptl = c("method"))
data_fetcher(
  id = c("4HHB"),
  data_type = "ENTRY",
  properties = properties,
  return_as_dataframe = TRUE
)

# Example 2: Fetching polymer entity data
properties <- list(
  rcsb_entity_source_organism = c("ncbi_taxonomy_id", "ncbi_scientific_name"),
  rcsb_cluster_membership = c("cluster_id", "identity")
)
data_fetcher(
  id = c("4HHB_1", "12CA_1"),
  data_type = "POLYMER_ENTITY",
  properties = properties,
  return_as_dataframe = TRUE
)

# Example 3: Fetching non-polymer entity data
properties <- list(
  rcsb_nonpolymer_entity = c("details", "formula_weight", "pdbx_description"),
  rcsb_nonpolymer_entity_container_identifiers = c("chem_ref_def_id")
)
data_fetcher(
  id = c("3PQR_5", "3PQR_6"),
  data_type = "NONPOLYMER_ENTITY",
  properties = properties,
  return_as_dataframe = TRUE
)

# Example 4: Fetching chemical component data
properties <- list(
  rcsb_id = list(),
  chem_comp = list("type", "formula_weight", "name", "formula"),
  rcsb_chem_comp_info = list("initial_release_date")
)
data_fetcher(
  id = c("NAG", "EBW"),
  data_type = "CHEMICAL_COMPONENT",
  properties = properties,
  return_as_dataframe = TRUE
)
Constructs a 'DefaultOperator' object for use in general search operations within the RCSB PDB. This operator is used when a simple, non-specific search operation is needed, based on a single value. The 'DefaultOperator' can be employed in scenarios where the search criteria do not require complex conditions or additional logic.
DefaultOperator(value)
value |
The value to be used in the search operation. This is the key criterion for the search and should be a valid string or numeric value that the search will use as the matching term. |
An object of class 'DefaultOperator' representing the default search operator. This object can be used directly in query formulations or further manipulated within complex search logic.
# Create a basic search operator with a specific value
operator <- DefaultOperator("4HHB")
print(operator)
Retrieves detailed information about a chemical compound from the RCSB Protein Data Bank (PDB) based on its chemical ID.
describe_chemical(chem_id, url_root = URL_ROOT)
chem_id |
A string representing the 3-character chemical ID. This ID is typically an alphanumeric string used to uniquely identify ligands, cofactors, or other small molecules within macromolecular structures. The string must not exceed 3 characters. |
url_root |
A string representing the URL for retrieving information about chemical compounds. By default, this is set to the global constant |
A list containing detailed information about the chemical compound. This list includes various fields such as:
A sublist containing chemical descriptors like SMILES, InChI strings, molecular weight, and other chemical properties.
Information regarding the compound’s classification, formula, and additional relevant data.
Other fields may also be included, depending on the specific compound and the data available from the RCSB PDB.
## Not run:
# Retrieve chemical information for N-Acetyl-D-Glucosamine (NAG)
chem_desc <- describe_chemical('NAG')
chem_desc

# Access the SMILES string of the compound
smiles_string <- chem_desc$rcsb_chem_comp_descriptor$smiles
smiles_string
## End(Not run)
Constructs an 'ExactMatchOperator' object for precise search operations within the RCSB PDB. This operator is designed to match an exact attribute value, making it ideal for searches where specificity is required. For example, if you need to find all entries that exactly match a certain attribute value, this operator will ensure only those precise matches are returned.
ExactMatchOperator(attribute, value)
attribute |
The attribute to match. This should be the name of the field within the RCSB PDB that you want to search against. |
value |
The exact value to search for. This is the specific value of the attribute you are interested in. The search will return only those records where the attribute exactly matches this value. |
An object of class 'ExactMatchOperator'. This object can be used in search queries to retrieve precise matches within the RCSB PDB database.
# Search for entries with an exact match to a given attribute
operator <- ExactMatchOperator(attribute = "rcsb_entry_info.resolution_combined", value = "2.0")
print(operator)
Constructs an 'ExistsOperator' object for search operations to check the existence of an attribute. This operator is useful in queries where you need to ensure that a certain attribute is present within the entries being searched, regardless of its value.
ExistsOperator(attribute)
attribute |
The attribute whose existence is to be checked. This should be the field within the RCSB PDB that you want to ensure is present in the search results. |
An object of class 'ExistsOperator' that can be used in search queries to retrieve entries where the specified attribute exists.
# Search for entries where a specific attribute exists
operator <- ExistsOperator(attribute = "rcsb_primary_citation.doi")
print(operator)
This function sends a GraphQL JSON query to the RCSB Protein Data Bank (PDB) to fetch data corresponding to a specified set of IDs. It is designed to handle the complexities of interacting with the PDB's GraphQL API, ensuring that errors in the query process are handled gracefully and that users are informed of any discrepancies in the expected and returned data.
fetch_data(json_query, data_type, ids)
json_query |
A JSON string representing the query to be sent to the PDB. This query must be well-formed and adhere to the GraphQL query structure required by the PDB's API. It typically includes details such as the fields to be retrieved and the conditions for data selection. |
data_type |
A string indicating the type of data to be fetched. While this parameter is not directly used in the function, it can provide context for the data being retrieved, such as "ENTRY", "ASSEMBLY", "POLYMER_ENTITY", etc. |
ids |
A character vector of identifiers for which data is being requested. These IDs should correspond to valid entries in the PDB and should match the data structure expected by the PDB's API. |
The function performs several checks and operations:
* It validates the 'json_query' to ensure that it is neither 'NULL' nor empty.
* It attempts to send the query to the PDB's GraphQL endpoint using a helper function (assumed to be 'search_graphql').
* It checks the server's response to determine if the query was successful or if any errors were encountered.
* If the data returned does not match the expected IDs, the function issues warnings and stops execution, providing details on the missing IDs.
* The function ensures that the returned data is correctly named according to the provided IDs.
The function is particularly useful for developers and researchers who need to programmatically access and manipulate large datasets from the PDB. It abstracts away some of the complexity of directly interacting with the PDB's API, providing a more user-friendly interface for data retrieval.
A list containing the data fetched from the PDB, with the names of the list elements set to the corresponding IDs. If any issues occur during the fetching process (e.g., if some IDs are not found), the function will return 'NULL' and provide informative error messages to help diagnose the problem.
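The ID check and naming step can be sketched offline in base R. Everything here is a sketch under stated assumptions: `name_results_by_id` is a hypothetical helper, the input stands in for an already parsed server response, and each record is assumed to carry an `rcsb_id` field. Following the return-value description, the sketch warns and returns 'NULL' when an ID is missing.

```r
# Hypothetical post-processing step (not the package's implementation).
# 'returned' is a list of records, each assumed to contain an 'rcsb_id' field.
name_results_by_id <- function(returned, ids) {
  returned_ids <- vapply(returned, function(rec) rec$rcsb_id, character(1))
  missing_ids <- setdiff(ids, returned_ids)
  if (length(missing_ids) > 0) {
    warning("No data returned for: ", paste(missing_ids, collapse = ", "))
    return(NULL)
  }
  # Reorder to follow the requested IDs, then name each element by its ID
  ordered <- returned[match(ids, returned_ids)]
  names(ordered) <- ids
  ordered
}

# Mock response, used in place of a real network call
mock <- list(
  list(rcsb_id = "1XYZ", value = 1),
  list(rcsb_id = "4HHB", value = 2)
)
named <- name_results_by_id(mock, c("4HHB", "1XYZ"))
incomplete <- suppressWarnings(name_results_by_id(mock, c("4HHB", "9ZZZ")))
print(names(named))
```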
This function searches the Protein Data Bank (PDB) for scholarly articles related to a specified search term. It retrieves the titles of up to a specified maximum number of papers associated with PDB entries. The function relies on 'query_search' to perform the initial search and 'get_info' to fetch detailed information for each PDB entry, including the citation titles.
find_papers(search_term, max_results = 10)
search_term |
A string specifying the term to search for in the PDB. This term can relate to any aspect of the PDB entries, such as keywords, molecular functions, or specific proteins (e.g., "CRISPR"). |
max_results |
An integer indicating the maximum number of paper titles to retrieve. Defaults to 10. The function will retrieve the titles for the first 'max_results' PDB entries returned by the search. |
This function is useful for researchers who want to quickly find relevant literature associated with specific PDB entries. The process involves two main steps:
**Search Query**: The function uses 'query_search' to find PDB entries matching the search term.
**Fetching Paper Titles**: For each PDB ID returned by the search, 'get_info' is used to retrieve detailed information, including the titles of any associated citations.
The function includes robust error handling to manage cases where the search term does not return any results, or where there are issues retrieving details for specific PDB entries. Warnings are provided if no citations are found for a given PDB ID or if other issues are encountered.
A named list where each element's name is a PDB ID and its value is the title of the corresponding paper. If no papers are found or an error occurs, the function returns an empty list with warnings or error messages to help diagnose the issue.
# Find papers related to CRISPR and retrieve up to 5 paper titles
crispr_papers <- find_papers("CRISPR", max_results = 5)
print(crispr_papers)
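The two-step search-then-fetch pattern with per-entry error handling can be sketched without network access by passing the search and lookup steps in as functions. The helper `collect_titles` and the stub functions are hypothetical, and the assumed shape of the entry information (a `citation` list whose first element has a `title`) is an illustration, not a documented structure.

```r
# Hypothetical sketch (not the package's implementation): 'search_fn' plays
# the role of the search step and 'info_fn' the per-entry lookup step.
collect_titles <- function(search_term, search_fn, info_fn, max_results = 10) {
  ids <- tryCatch(search_fn(search_term), error = function(e) {
    warning("Search failed: ", conditionMessage(e))
    character(0)
  })
  if (length(ids) == 0) {
    warning("No entries found for: ", search_term)
    return(list())
  }
  titles <- list()
  for (id in head(ids, max_results)) {
    # A failed lookup for one entry should not abort the whole loop
    info <- tryCatch(info_fn(id), error = function(e) NULL)
    citations <- if (is.null(info)) NULL else info$citation
    title <- if (length(citations) > 0) citations[[1]]$title else NULL
    if (is.null(title)) {
      warning("No citation title found for ", id)
    } else {
      titles[[id]] <- title
    }
  }
  titles
}

# Stubs standing in for the real search and lookup calls
fake_search <- function(term) c("1AAA", "2BBB", "3CCC")
fake_info <- function(id) {
  if (id == "2BBB") stop("entry unavailable")
  list(citation = list(list(title = paste("Structure of", id))))
}

papers <- suppressWarnings(
  collect_titles("CRISPR", fake_search, fake_info, max_results = 2)
)
print(papers)
```

Only the first 'max_results' IDs are looked up, so the third stub ID is never requested, and the failing second ID is skipped with a warning.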
This function searches the Protein Data Bank (PDB) for entries related to a specified search term and retrieves specific information from those entries. It is useful for extracting targeted data from search results, such as citations, experimental methods, or structural details. The function leverages 'query_search' to perform the initial search and 'get_info' to fetch detailed data for each PDB entry.
find_results(search_term, field = "citation")
search_term |
A string specifying the term to search for in the PDB. This term can relate to various aspects of the PDB entries, such as keywords, molecular functions, protein names, or specific research areas. |
field |
A string indicating the specific field to retrieve for each search result. The default is "citation". The field should correspond to one of the following valid options:
|
This function is ideal for researchers who need to extract specific data fields from multiple PDB entries efficiently. The process involves two main steps:
**Search Query**: The function uses 'query_search' to find PDB entries that match the provided search term.
**Field Retrieval**: For each PDB ID returned by the search, 'get_info' is used to retrieve the specified field.
Error handling is robust, with informative messages provided when the search term yields no results, when an individual PDB entry cannot be retrieved, or when the specified field is not found in the retrieved data.
A named list where each element's name is a PDB ID and its value is the information for the specified field from the corresponding search result. If no results are found, or if an error occurs during data retrieval, the function returns an empty list with appropriate warnings or error messages.
# Retrieve citation information for PDB entries related to CRISPR
crispr_citations <- find_results("CRISPR", field = "citation")
crispr_citations
This function constructs a JSON query string tailored for retrieving data from the RCSB Protein Data Bank (PDB). It is designed to accommodate a variety of data types, such as entries, polymer entities, assemblies, and chemical components. The function is particularly useful when specific properties need to be queried across multiple PDB identifiers.
generate_json_query(ids, data_type, properties)
ids |
A character vector of identifiers corresponding to the data you wish to retrieve. These identifiers should match the type of data specified in the 'data_type' argument. For example, if 'data_type' is 'ENTRY', the identifiers should be PDB entry IDs like "1XYZ" or "2XYZ". |
data_type |
A string indicating the type of data to query. The function supports the following types:
|
properties |
A named list where each element represents a property to be included in the query for the corresponding 'data_type'. Each element of the list should be a character vector containing the specific properties you wish to retrieve. For example, for 'data_type = "ENTRY"', properties might include 'cell = c("volume", "angle_beta")' to retrieve cell volume and angle beta of the unit cell. |
The function is designed to be flexible and extensible, allowing users to construct complex queries with multiple properties and data types. The function internally maps the 'data_type' to the appropriate query format and ensures that the JSON string is properly constructed. The 'properties' argument should align with the data type; otherwise, the query might not return the desired results.
The 'generate_json_query' function is particularly useful for researchers and bioinformaticians who need to retrieve specific datasets from the PDB, especially when dealing with large-scale structural biology data. The generated JSON query can be used in subsequent API calls to fetch the required data in a structured and efficient manner.
A string representing the generated JSON query. This string is formatted according to the requirements of the RCSB PDB API and can be used directly in a GraphQL query to retrieve the specified data.
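As a rough illustration of how a property list can be turned into a GraphQL selection, the sketch below builds a query string for entries only. The helper `build_entry_query` is hypothetical, and the `entries(entry_ids: [...])` shape is an assumption about the query layout; the string produced by 'generate_json_query' itself may be wrapped or formatted differently.

```r
# Hypothetical builder (not the package's implementation), limited to the
# ENTRY data type for brevity.
build_entry_query <- function(ids, properties) {
  # Quote and comma-separate the identifiers
  id_list <- paste0('"', ids, '"', collapse = ", ")
  # Turn each property into: name { sub1 sub2 }
  fields <- vapply(names(properties), function(name) {
    paste0(name, " { ", paste(properties[[name]], collapse = " "), " }")
  }, character(1))
  paste0(
    "{ entries(entry_ids: [", id_list, "]) { ",
    paste(fields, collapse = " "),
    " } }"
  )
}

query <- build_entry_query(
  ids = c("1XYZ", "2XYZ"),
  properties = list(cell = c("volume", "angle_beta"), exptl = c("method"))
)
cat(query, "\n")
```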
# Example 1: Generate a query for PDB entries with specific properties
ids <- c("1XYZ", "2XYZ")
properties <- list(cell = c("volume", "angle_beta"), exptl = c("method"))
json_query <- generate_json_query(ids, "ENTRY", properties)
print(json_query)

# Example 2: Generate a query for chemical components with specified properties
ids <- c("ATP", "NAD")
properties <- list(chem_comp = c("formula_weight", "type"))
json_query <- generate_json_query(ids, "CHEMICAL_COMPONENT", properties)
print(json_query)

# Example 3: Generate a query for polymer entities within a PDB entry
ids <- c("1XYZ_1", "2XYZ_2")
properties <- list(entity_src_gen = c("organism_scientific", "gene_src_common"))
json_query <- generate_json_query(ids, "POLYMER_ENTITY", properties)
print(json_query)
This function retrieves FASTA sequences from the RCSB Protein Data Bank (PDB) for a specified entry ID (rcsb_id). It can return either the full set of sequences associated with the entry or, if specified, the sequence corresponding to a particular chain within that entry. This flexibility makes it a useful tool for bioinformaticians and structural biologists needing access to protein or nucleic acid sequences.
get_fasta_from_rcsb_entry( rcsb_id, chain_id = NULL, verbosity = TRUE, fasta_base_url = FASTA_BASE_URL )
rcsb_id |
A string representing the PDB ID for which the FASTA sequence is to be retrieved. This is the primary identifier of the entry in the PDB database. |
chain_id |
A string representing the specific chain ID within the PDB entry for which the FASTA sequence is to be retrieved. If |
verbosity |
A boolean flag indicating whether to print status messages during the function execution. When set to |
fasta_base_url |
A string representing the base URL for the FASTA retrieval. By default, this is set to the global constant |
The function queries the RCSB PDB database using the provided entry ID (rcsb_id) and optionally a chain ID (chain_id). It sends an HTTP GET request to retrieve the corresponding FASTA file. The response is then parsed into a list of sequences. If a chain ID is provided, the function returns only the sequence corresponding to that chain. If no chain ID is provided, all sequences are returned.
If a request fails, the function provides informative error messages. In the case of a network failure, the function stops execution with a clear error message. Likewise, if the chain ID does not exist within the entry, the function stops with an error message indicating that the chain was not found.
The function also supports passing a custom base URL for the FASTA file retrieval, providing flexibility for users working with different PDB mirrors or services.
* If chain_id is NULL, the function returns a list of FASTA sequences associated with the provided rcsb_id, where organism names or chain descriptions are used as keys.
* If chain_id is specified, the function returns a character string representing the FASTA sequence for that specific chain.
* If the specified chain_id is not found in the PDB entry, the function will stop execution with an informative error message.
# Example 1: Retrieve all FASTA sequences for the entry 4HHB
all_sequences <- get_fasta_from_rcsb_entry("4HHB", verbosity = TRUE)
print(all_sequences)

# Example 2: Retrieve the FASTA sequence for chain A of entry 4HHB
chain_a_sequence <- get_fasta_from_rcsb_entry("4HHB", chain_id = "A", verbosity = TRUE)
print(chain_a_sequence)
This function retrieves comprehensive information for a specified PDB (Protein Data Bank) entry by querying the RCSB PDB RESTful API. The function handles HTTP requests, processes JSON responses, and can manage legacy PDB identifiers. It is particularly useful for obtaining all available data related to a specific PDB entry, which can include metadata, structural details, experimental methods, and more.
get_info(pdb_id)
pdb_id |
A string specifying the PDB entry of interest. The 'pdb_id' should be a 4-character alphanumeric code, representing the unique identifier of a PDB entry (e.g., "1XYZ"). If a legacy PDB identifier is provided in the format 'PDB_ID:CHAIN_ID', it will be automatically converted to the new format for querying. |
The 'get_info' function is versatile and designed for researchers who need to extract detailed structural and experimental information from the RCSB PDB. The function is robust, providing error handling for various scenarios, including network failures, incorrect PDB IDs, and API errors. It automatically manages legacy PDB IDs, ensuring compatibility with the latest API standards.
The output is a structured list that can be easily parsed or manipulated for further analysis, making it an essential tool for bioinformaticians and structural biologists working with PDB data.
The function also offers flexibility in querying different parts of the RCSB PDB API by adjusting the 'url_root' parameter, allowing users to target specific datasets within the PDB.
A named list containing detailed information about the specified PDB entry. The returned list includes various data fields, depending on the content available for the entry. For example, it may contain information about the structure's authors, resolution, experiment type, macromolecules, ligands, etc. If the data retrieval fails at any stage (e.g., network issues, invalid PDB ID, API downtime), the function will return 'NULL' and provide an informative error message.
pdb_info <- get_info(pdb_id = "1XYZ")
print(pdb_info)
This function constructs a full PDB API URL by concatenating the base URL, an API endpoint, and an identifier.
get_pdb_api_url(endpoint, id, base_url = BASE_URL)
endpoint |
A character string representing the specific API endpoint to be accessed (e.g., "/pdb/entry/"). |
id |
A character string representing the identifier for the resource (e.g., a PDB ID or other relevant ID). |
base_url |
A string representing the base URL to generate PDB API url. By default, this is set to the global constant |
A character string containing the full URL for accessing the PDB API resource.
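As a rough illustration of the concatenation described above, the following base-R sketch joins the three parts with `paste0()`. The helper name `sketch_pdb_api_url` and the default base URL are assumptions made for this example; the value of the package's `BASE_URL` constant is not shown in this documentation.

```r
# Hypothetical helper (not the package's implementation): join base URL,
# endpoint, and identifier into a single string.
sketch_pdb_api_url <- function(endpoint, id,
                               base_url = "https://data.rcsb.org/rest/v1/core") {
  # The default base URL above is an assumed value used for illustration only.
  paste0(base_url, endpoint, id)
}

url <- sketch_pdb_api_url("/entry/", "4HHB")
print(url)
```

Because the parts are joined verbatim, the endpoint is expected to carry its own leading and trailing slashes, as in the "/pdb/entry/" example given for the `endpoint` argument.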
The 'get_pdb_file' function is a versatile tool designed to download Protein Data Bank (PDB) files from the RCSB database. It supports various file formats such as 'pdb', 'cif', 'xml', and 'structfact', with options for file compression and handling alternate locations (ALT) and insertion codes (INSERT) in PDB files. This function also provides the flexibility to save the downloaded files to a specified directory or to a temporary directory for immediate use.
get_pdb_file( pdb_id, filetype = "cif", rm.insert = FALSE, rm.alt = TRUE, compression = TRUE, save = FALSE, path = NULL, verbosity = TRUE, download_base_url = DOWNLOAD_BASE_URL )
pdb_id |
A 4-character string specifying the PDB entry of interest (e.g., "1XYZ"). This identifier uniquely represents a macromolecular structure within the PDB database. |
filetype |
A string specifying the format of the file to be downloaded. The default is 'cif'. Supported file types include:
|
rm.insert |
Logical flag indicating whether to ignore PDB insertion codes. Default is FALSE. If TRUE, records with insertion codes will be removed from the final data. |
rm.alt |
Logical flag indicating whether to ignore alternate location indicators (ALT) in PDB files. Default is TRUE. If TRUE, only the first alternate location is kept, and others are removed. |
compression |
Logical flag indicating whether to download the file in a compressed format (e.g., .gz). Default is TRUE, which is recommended for faster downloads, especially for CIF files. |
save |
Logical flag indicating whether to save the downloaded file to a specified directory. Default is FALSE, in which case the file is processed but not retained after processing. |
path |
A string specifying the directory where the downloaded file should be saved. If NULL, the file is saved in a temporary directory. If 'save' is TRUE, this path is required. |
verbosity |
A boolean flag indicating whether to print status messages during the function execution. |
download_base_url |
A string representing the base URL for the PDB file retrieval. By default, this is set to the global constant |
The 'get_pdb_file' function is an essential tool for structural biologists and bioinformaticians who need to download and process PDB files for further analysis. By providing options to handle alternate locations and insertion codes, this function ensures that the data is clean and ready for downstream applications. Additionally, the ability to save files locally or work with them in a temporary directory provides flexibility for various workflows. Error handling and informative messages are included to guide the user in case of issues with file retrieval or processing.
A list of class "pdb" containing the following components:
atom
A data frame containing atomic coordinate data (ATOM and HETATM records). Each row corresponds to an atom, and each column to a specific field (e.g., element, residue, chain).
xyz
A numeric matrix of class "xyz" containing the atomic coordinates from the ATOM and HETATM records.
calpha
A logical vector indicating whether each atom is a C-alpha atom (TRUE) or not (FALSE).
call
The matched call, storing the function call for reference.
path
The file path where the file was saved, if 'save' was TRUE.
The function handles errors and warnings for various edge cases, such as unsupported file types, failed downloads, or issues with reading the file.
# Download a CIF file and process it without saving
pdb_file <- get_pdb_file(pdb_id = "4HHB", filetype = "cif")

# Download a PDB file, save it, and remove alternate location records
pdb_file <- get_pdb_file(pdb_id = "4HHB", filetype = "pdb", save = TRUE, path = tempdir())

# Understanding the tertiary structure of proteins is crucial for elucidating
# their functional mechanisms, especially in the context of ligand binding,
# enzyme catalysis, and protein-protein interactions.
# The tertiary structure refers to the three-dimensional arrangement of all
# atoms within a protein, including its secondary structure elements like
# alpha helices and beta sheets, and how these elements are organized in space.
# Using the get_pdb_file function to retrieve the PDB file and the r3dmol
# package for visualization, researchers can gain insights into the overall
# 3D structure of a protein.
# The following example demonstrates how to visualize the tertiary structure
# of a protein using the PDB entry 1XYZ:

library(r3dmol)

# Retrieve and parse a PDB structure
pdb_path <- get_pdb_file("1XYZ", filetype = "pdb", save = TRUE)

# Visualize the tertiary structure using r3dmol
viewer <- r3dmol() %>%
  m_add_model(pdb_path$path, format = "pdb") %>%  # Load the PDB file
  m_set_style(style = m_style_cartoon()) %>%      # Cartoon representation
  m_zoom_to()

# Display the molecular viewer
viewer

# In this example, the protein structure is represented in a cartoon style,
# which is particularly effective for visualizing the overall fold of the
# protein, including the orientation and interaction of its secondary
# structure elements.
# To further enhance the analysis, it is often important to highlight specific
# regions of interest, such as potential ligand-binding sites. These sites can
# be identified based on prior knowledge, experimental data, or computational
# predictions.
# The following code snippet demonstrates how to highlight potential
# ligand-binding sites in the protein structure:

# Highlight potential ligand-binding sites
# Note: Manually define residues of interest based on prior knowledge
# or external analysis
binding_sites <- c(45, 60, 85)  # Example residue numbers

viewer <- viewer %>%
  m_set_style(
    sel = m_sel(resi = binding_sites),
    style = m_style_sphere(color = "red", radius = 1.5)
  )

# Display the updated viewer with highlighted binding sites
viewer

# In this step, specific residues that are hypothesized to participate in
# ligand binding are highlighted using a spherical representation.
# The residues are selected manually based on either experimental data or
# computational predictions.
# By highlighting these sites, researchers can visually inspect the spatial
# relationship between these residues and other parts of the protein, which
# may provide insights into the protein's functional mechanisms.
# This visualization approach offers a powerful way to explore and communicate
# the 3D structure of proteins, making it easier to hypothesize about their
# function and interaction with other molecules.
This function checks for errors in the HTTP response and stops execution if the request was not successful.
handle_api_errors(response, url = "")
response |
An HTTP response object. |
url |
A string representing the requested URL (for more informative error messages). |
None. It stops execution if an error is detected.
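The check described here can be sketched in base R without any HTTP library. In the example below, a plain list with a `status_code` field stands in for a real HTTP response object, and the helper name `sketch_handle_errors` and the wording of the error message are hypothetical; the package's own function may inspect the response differently.

```r
# Hypothetical stand-in for handle_api_errors(): the "response" here is a
# plain list, not a real HTTP response object.
sketch_handle_errors <- function(response, url = "") {
  status <- response$status_code
  if (is.null(status) || status >= 400) {
    shown <- if (is.null(status)) "unknown" else status
    stop(sprintf("Request failed (status %s) for URL: %s", shown, url),
         call. = FALSE)
  }
  invisible(NULL)  # nothing to return on success
}

# A successful response passes silently.
ok <- sketch_handle_errors(list(status_code = 200L), url = "https://example.org")

# A failing response stops execution; tryCatch() captures the message here.
msg <- tryCatch(
  sketch_handle_errors(list(status_code = 404L), url = "https://example.org"),
  error = function(e) conditionMessage(e)
)
print(msg)
```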
The 'infer_search_service' function determines the appropriate search service for a given search operator. This function is essential for ensuring that queries are directed to the correct search service, such as basic search, text search, sequence search, etc.
infer_search_service(search_operator)
search_operator |
A query operator object that specifies the type of search being performed. |
A string representing the inferred search service, which is necessary for constructing a valid query.
Constructs an 'InOperator' object for search operations where the attribute value must be within a specified set. This operator is useful when the search criteria require the attribute to match one of several possible values. It can handle multiple potential matches and is ideal for scenarios where multiple values are acceptable.
InOperator(attribute, value)
attribute |
The attribute to be evaluated. This should be the field within the RCSB PDB that you want to search against. |
value |
The set of values to include in the search. This should be a vector of possible values that the attribute can match. |
An object of class 'InOperator' that can be used in search queries to retrieve entries where the attribute matches any of the specified values.
# Search for entries where the attribute matches one of several values
operator <- InOperator(
  attribute = "rcsb_entity_source_organism.taxonomy_lineage.name",
  value = c("Homo sapiens", "Mus musculus")
)
print(operator)
This function parses a FASTA-formatted text into a list of sequences, where each sequence is keyed by the header (which includes organism name and chain information).
parse_fasta_text_to_list(fasta_text)
fasta_text |
A string containing FASTA-formatted text. |
A list where each element is a FASTA sequence keyed by the header.
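To make the header-keyed structure concrete, here is a minimal base-R sketch of this kind of parsing. The function name `sketch_parse_fasta` and the sample records are invented for illustration, and the package's parser may treat headers differently (for example, by shortening them).

```r
# Hypothetical FASTA parser in base R (not the package's code).
sketch_parse_fasta <- function(fasta_text) {
  lines <- strsplit(fasta_text, "\r?\n")[[1]]
  lines <- lines[nzchar(lines)]                 # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                   # record index for every line
  headers <- sub("^>", "", lines[is_header])
  # Group sequence lines by record; the factor keeps records without sequence.
  groups <- factor(record[!is_header], levels = seq_along(headers))
  seq_lines <- split(lines[!is_header], groups)
  sequences <- vapply(seq_lines, paste, character(1), collapse = "")
  as.list(setNames(unname(sequences), headers))
}

# Made-up FASTA text with two records; the first sequence spans two lines.
fasta_text <- paste(
  ">EXAMPLE_1|Chains A, C|hypothetical protein|Homo sapiens",
  "MVLSPADKTN",
  "VKAAWGKVGA",
  ">EXAMPLE_2|Chains B, D|hypothetical protein|Homo sapiens",
  "VHLTPEEKSA",
  sep = "\n"
)
parsed <- sketch_parse_fasta(fasta_text)
str(parsed)
```

Multi-line sequences are concatenated, so each list element holds one complete sequence string.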
This function parses the content of an HTTP response based on the specified format. It supports JSON and plain text formats.
parse_response(response, format = "json")
response |
An HTTP response object. |
format |
A string indicating the expected response format ("json" or "text"). |
Parsed content from the response.
This function allows users to perform highly customizable searches in the RCSB Protein Data Bank (PDB) by specifying detailed search criteria. It interfaces directly with the RCSB PDB's RESTful API, enabling complex queries to retrieve specific data, such as PDB entries, assemblies, polymer entities, non-polymer entities, and more.
perform_search( search_operator, return_type = "ENTRY", request_options = NULL, return_with_scores = FALSE, return_raw_json_dict = FALSE, verbosity = TRUE )
search_operator |
An object that specifies the search criteria. This object can be constructed using various operator functions:
Please see the Details section. |
return_type |
A string specifying the type of data to return. The available options for
|
request_options |
A list of additional options to further customize the search request. These options can include:
|
return_with_scores |
Logical; if |
return_raw_json_dict |
Logical; if |
verbosity |
Logical; if |
The operators allow you to build complex search queries tailored to your specific needs. Detailed documentation for each search operator can be found in the RCSB PDB Search Operators. The searchable attributes include annotations from the mmCIF dictionary, external resources, and those added by RCSB PDB. Both internal additions to the mmCIF dictionary and external resource annotations are prefixed with 'rcsb_'. For a complete list of available attributes for text searches, refer to the Structure Attributes Search and Chemical Attributes Search pages.
The function returns search results based on the specified return_type:
ENTRY
A vector of PDB IDs that match the search criteria.
ASSEMBLY
A list of PDB IDs with appended assembly IDs, formatted as "PDB_ID-ASSEMBLY_ID".
POLYMER_ENTITY
A list of PDB IDs with appended entity IDs for polymeric chains.
NON_POLYMER_ENTITY
A list of PDB IDs with appended entity IDs for non-polymeric components.
POLYMER_INSTANCE
A list of PDB IDs with appended asym IDs for specific polymer instances.
CHEMICAL_COMPONENT
A list of chemical component identifiers.
# Example 1: Search for Polymer Entities from Mus musculus and Homo sapiens
search_operator <- InOperator(
  attribute = "rcsb_entity_source_organism.taxonomy_lineage.name",
  value = c("Mus musculus", "Homo sapiens")
)
results <- perform_search(
  search_operator = search_operator,
  return_type = "POLYMER_ENTITY"
)
results

# Example 2: Search for Entries Released After a Specific Date
operator_date <- ComparisonOperator(
  attribute = "rcsb_accession_info.initial_release_date",
  value = "2019-08-20",
  comparison_type = "GREATER"
)
request_options <- list(
  facets = list(
    list(
      name = "Methods",
      aggregation_type = "terms",
      attribute = "exptl.method"
    )
  )
)
results <- perform_search(
  search_operator = operator_date,
  return_type = "ENTRY",
  request_options = request_options
)
results

# Example 3: Search for Symmetric Dimers with DNA-Binding Domain
operator_symbol <- ExactMatchOperator(
  attribute = "rcsb_struct_symmetry.symbol",
  value = "C2"
)
operator_kind <- ExactMatchOperator(
  attribute = "rcsb_struct_symmetry.kind",
  value = "Global Symmetry"
)
operator_full_text <- DefaultOperator(
  value = "\"heat-shock transcription factor\""
)
operator_dna_count <- ComparisonOperator(
  attribute = "rcsb_entry_info.polymer_entity_count_DNA",
  value = 1,
  comparison_type = "GREATER_OR_EQUAL"
)
query_group <- list(
  type = "group",
  logical_operator = "and",
  nodes = list(
    list(type = "terminal", service = "text", parameters = operator_symbol),
    list(type = "terminal", service = "text", parameters = operator_kind),
    list(type = "terminal", service = "full_text", parameters = operator_full_text),
    list(type = "terminal", service = "text", parameters = operator_dna_count)
  )
)
results <- perform_search(
  search_operator = query_group,
  return_type = "ASSEMBLY"
)
results
This function performs a search query against the RCSB Protein Data Bank using their REST API. It allows for various types of searches based on the provided parameters.
query_search( search_term, query_type = "full_text", return_type = "entry", scan_params = NULL, num_attempts = 1, sleep_time = 0.5 )
search_term |
A string specifying the term to search in the database. |
query_type |
A string specifying the type of query to perform. Supported values include "full_text", "PubmedIdQuery", "TreeEntityQuery", "ExpTypeQuery", "AdvancedAuthorQuery", "OrganismQuery", "pfam", and "uniprot". Default is "full_text". |
return_type |
A string specifying the type of search result to return. Possible values are "entry" (default) and "polymer_entity". |
scan_params |
Additional parameters for the scan, provided as a list. This is 'NULL' by default and typically only used for advanced queries. |
num_attempts |
An integer specifying the number of attempts to try the query in case of failure. |
sleep_time |
A numeric value specifying the time in seconds to wait between attempts. |
Depending on the return_type, it either returns a list of PDB IDs (if "entry") or the full response from the API.
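The `num_attempts` and `sleep_time` arguments describe a retry loop. The sketch below shows one way such a loop could behave, using a stand-in request function that fails twice before succeeding. The names `sketch_retry` and `flaky_request` are hypothetical, and whether the package retries in exactly this way is an assumption.

```r
# Hypothetical retry wrapper illustrating num_attempts and sleep_time.
sketch_retry <- function(request_fun, num_attempts = 1, sleep_time = 0.5) {
  for (attempt in seq_len(num_attempts)) {
    result <- tryCatch(request_fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)              # success: stop retrying
    }
    if (attempt < num_attempts) {
      Sys.sleep(sleep_time)       # pause before the next attempt
    }
  }
  stop("All ", num_attempts, " attempts failed: ", conditionMessage(result),
       call. = FALSE)
}

# A stand-in request that fails twice and then succeeds.
calls <- 0
flaky_request <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  c("1ABC", "2DEF")  # placeholder identifiers, not real search results
}

ids <- sketch_retry(flaky_request, num_attempts = 5, sleep_time = 0.01)
print(ids)
```

With the default `num_attempts = 1`, a loop of this shape makes a single request and reports the failure immediately.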
# Get a list of PDBs for a specific search term
# Search Functions by Specific Terms
pdbs <- query_search("ribosome")
head(pdbs)

# Search by PubMed ID Number
pdbs_by_pubmedid <- query_search(search_term = 27499440, query_type = "PubmedIdQuery")
head(pdbs_by_pubmedid)

# Search by source organism using NCBI TaxId
pdbs_by_ncbi_taxid <- query_search(search_term = "6239", query_type = "TreeEntityQuery")
head(pdbs_by_ncbi_taxid)

# Search by Experimental Method
pdbs <- query_search(search_term = "SOLID-STATE NMR", query_type = "ExpTypeQuery")
head(pdbs)

pdbs <- query_search(search_term = "4HHB", query_type = "structure")
head(pdbs)

## Advanced Searches
# Search by Author
pdbs <- query_search(search_term = "Rzechorzek, N.J.", query_type = "AdvancedAuthorQuery")
head(pdbs)

# Search by Organism
pdbs <- query_search(search_term = "Escherichia coli", query_type = "OrganismQuery")
head(pdbs)

# Search by Uniprot ID (Escherichia coli beta-lactamase)
pdbs <- query_search(search_term = "P0A877", query_type = "uniprot")
head(pdbs)

# Search by PFAM number (protein kinase domain)
pdbs <- query_search(search_term = "PF00069", query_type = "pfam")
head(pdbs)
The 'QueryGroup' function constructs a grouped query object that allows users to perform complex searches in the RCSB Protein Data Bank (PDB). This function is particularly useful when multiple query objects need to be combined using logical operators like 'AND' or 'OR'. The resulting grouped query can be used in advanced search operations to filter or combine results based on multiple criteria.
QueryGroup(queries, logical_operator)
queries |
A list of query objects to be grouped together. Each query object can be either a simple query or another grouped query. |
logical_operator |
A string specifying the logical operator to combine the queries. Common values are 'AND' and 'OR', but other logical operators may also be supported. |
A list representing the grouped query object, which can be passed to search functions for execution.
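The grouped structure can be pictured as a nested list. The sketch below mirrors the `type` / `logical_operator` / `nodes` layout used in the perform_search example of this documentation. The helper `sketch_query_group` is hypothetical, and the `operator = "exact_match"` field inside the terminal node is an assumed name; the objects built by the package may differ in detail.

```r
# Hypothetical helper that nests query nodes under a logical operator.
# It is not the package's QueryGroup implementation.
sketch_query_group <- function(queries, logical_operator = "and") {
  logical_operator <- tolower(logical_operator)
  stopifnot(logical_operator %in% c("and", "or"))
  list(type = "group", logical_operator = logical_operator, nodes = queries)
}

# Two terminal nodes written as plain lists (field names are assumptions).
terminal_a <- list(
  type = "terminal", service = "text",
  parameters = list(attribute = "rcsb_struct_symmetry.symbol",
                    operator = "exact_match", value = "C2")
)
terminal_b <- list(
  type = "terminal", service = "full_text",
  parameters = list(value = "heat-shock transcription factor")
)

# Groups can contain other groups, which is how nested criteria are expressed.
inner <- sketch_query_group(list(terminal_a, terminal_b), "AND")
outer <- sketch_query_group(list(inner, terminal_b), "OR")
str(outer, max.level = 2)
```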
The 'QueryNode' function constructs a query node, which can be either a terminal node (for a simple query) or a grouped node (for complex queries). This function is crucial for structuring queries that will be sent to the RCSB PDB search system.
QueryNode(search_operator, logical_operator = NULL)
search_operator |
A search operator or group object. This defines the criteria for the search. |
logical_operator |
A string specifying the logical operator to combine multiple queries. Default is 'NULL'. This is used only if the search_operator is a group. |
A list representing the query node, ready to be included in a larger query structure.
node <- QueryNode(search_operator = DefaultOperator("some_value"))
Constructs a 'RangeOperator' object for search operations that specify a range for attribute values. This operator is particularly useful for filtering results based on numeric or date ranges, such as finding entries with resolution between specific values or dates within a certain range.
RangeOperator( attribute, from_value, to_value, include_lower = TRUE, include_upper = TRUE, negation = FALSE )
attribute |
The attribute to be evaluated within a range. This should be the numeric or date field within the RCSB PDB that you want to search against. |
from_value |
The starting value of the range. This is the lower bound of the range. |
to_value |
The ending value of the range. This is the upper bound of the range. |
include_lower |
Boolean indicating whether to include the lower bound in the range. Default is TRUE. |
include_upper |
Boolean indicating whether to include the upper bound in the range. Default is TRUE. |
negation |
Boolean indicating whether to negate the range condition. Default is FALSE. |
An object of class 'RangeOperator' that can be used in search queries to retrieve entries where the attribute falls within the specified range.
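The effect of `include_lower`, `include_upper`, and `negation` can be seen by applying the same logic to a local numeric vector. This is only a sketch of the range semantics: the real operator is evaluated by the RCSB search service, and the helper name `sketch_in_range` and the sample values are invented for illustration.

```r
# Hypothetical local filter that mimics the range semantics described above.
sketch_in_range <- function(x, from_value, to_value,
                            include_lower = TRUE, include_upper = TRUE,
                            negation = FALSE) {
  lower_ok <- if (include_lower) x >= from_value else x > from_value
  upper_ok <- if (include_upper) x <= to_value else x < to_value
  inside <- lower_ok & upper_ok
  if (negation) !inside else inside
}

resolution <- c(1.2, 1.5, 2.0, 2.5, 3.1)  # made-up resolution values

closed   <- sketch_in_range(resolution, 1.5, 2.5)
open_low <- sketch_in_range(resolution, 1.5, 2.5, include_lower = FALSE)
negated  <- sketch_in_range(resolution, 1.5, 2.5, negation = TRUE)

print(resolution[closed])
print(resolution[open_low])
print(resolution[negated])
```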
# Search for entries within a specific range of resolution
operator <- RangeOperator(
  attribute = "rcsb_entry_info.resolution_combined",
  from_value = 1.5,
  to_value = 2.5
)
print(operator)
The 'RequestOptions' function sets various options for search requests to the RCSB PDB, such as pagination and sorting preferences. These options help control the volume of search results returned and the order in which they are presented.
RequestOptions( result_start_index = NULL, num_results = NULL, sort_by = "score", desc = TRUE )
result_start_index |
An integer specifying the starting index for result pagination. If 'NULL', pagination is not applied. |
num_results |
An integer specifying the number of results to return. If 'NULL', the default number of results is returned. |
sort_by |
A string indicating the attribute to sort the results by. The default value is 'score', which ranks results based on relevance. |
desc |
A boolean indicating whether the sorting should be in descending order. Default is 'TRUE'. |
A list of request options that can be included in a search query to control the results.
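To show what pagination and sorting mean for a result set, the following sketch applies the same four options to a small local vector. It assumes a zero-based `result_start_index`, consistent with the `result_start_index = 0` call in this documentation; the helper `sketch_paginate` and the identifiers and scores are hypothetical, and the actual paging is performed by the search service.

```r
# Hypothetical local illustration of sorting and pagination options.
sketch_paginate <- function(ids, scores, result_start_index = NULL,
                            num_results = NULL, desc = TRUE) {
  ordered <- ids[order(scores, decreasing = desc)]
  # Assumed: the start index is zero-based, so 0 means "from the first result".
  start <- if (is.null(result_start_index)) 0 else result_start_index
  remaining <- max(length(ordered) - start, 0)
  n <- if (is.null(num_results)) remaining else min(num_results, remaining)
  ordered[start + seq_len(n)]
}

ids <- c("ID_A", "ID_B", "ID_C", "ID_D")   # placeholder identifiers
scores <- c(0.2, 0.9, 0.5, 0.7)            # made-up relevance scores

page1 <- sketch_paginate(ids, scores, result_start_index = 0, num_results = 2)
page2 <- sketch_paginate(ids, scores, result_start_index = 2, num_results = 2)
print(page1)
print(page2)
```

Requesting successive pages amounts to advancing `result_start_index` by `num_results` each time.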
options <- RequestOptions(result_start_index = 0, num_results = 100, sort_by = "score", desc = TRUE)
The 'return_data_as_dataframe' function transforms the response data obtained from a query to the RCSB Protein Data Bank (PDB) into a structured dataframe. This function handles various scenarios, including responses with duplicate names, null or empty responses, and nested data structures. It ensures that the resulting dataframe is consistently formatted and ready for downstream analysis.
return_data_as_dataframe(response, data_type, ids)
response |
A list containing the response data from a PDB query. This list is expected to be structured according to the RCSB PDB GraphQL or REST API specifications. |
data_type |
A string indicating the type of data contained in the response (e.g., "ENTRY", "POLYMER_ENTITY"). This parameter is primarily used for contextual information and does not directly influence the function's operations. |
ids |
A vector of identifiers corresponding to the response data. These IDs are used to label the resulting dataframe, ensuring that each row corresponds to a specific query identifier. |
The 'return_data_as_dataframe' function is designed to provide a flexible and robust mechanism for converting PDB query responses into dataframes. It addresses several common challenges in handling API responses, such as:
If the response is null or contains no data, the function immediately returns 'NULL', avoiding unnecessary processing.
The function detects and manages scenarios where the response contains duplicated names. It simplifies such lists by keeping only the first occurrence of each duplicated element, ensuring that the final dataframe has unique column names.
The function flattens nested lists within the response, ensuring that all relevant data is captured in a single-level dataframe structure. This is particularly useful for complex responses that contain deeply nested data elements.
After processing the data, the function ensures that column names are consistent and do not retain unnecessary prefixes. This makes the resulting dataframe easier to interpret and work with in subsequent analyses.
A dataframe constructed from the response data, where each row corresponds to an identifier from the 'ids' vector and each column represents a data field from the response. If the response is null or empty, the function returns 'NULL'.
The function is equipped to handle responses with varying degrees of complexity. It is recommended to provide valid 'ids' corresponding to the query to ensure that the dataframe rows are correctly labeled.
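The handling steps described above (empty input, duplicated names, nested lists) can be illustrated with a small base-R sketch. This is not the package's implementation: 'flatten_record' is a hypothetical helper, and the record uses made-up values with a placeholder identifier.

```r
# Hypothetical helper (not the package's code): flatten one nested record
# into a single-row data frame.
flatten_record <- function(record) {
  # Null or empty input: return NULL straight away.
  if (is.null(record) || length(record) == 0) return(NULL)
  # Keep only the first occurrence of each duplicated top-level name.
  record <- record[!duplicated(names(record))]
  # unlist() collapses nested lists; nested names are joined with ".".
  # Note: mixed types are coerced to a common type (here, character).
  flat <- unlist(record)
  as.data.frame(as.list(flat), stringsAsFactors = FALSE)
}

# Made-up record with a duplicated top-level name.
record <- list(
  exptl = list(method = "X-RAY DIFFRACTION"),
  cell  = list(volume = 1000, length_a = 10),
  exptl = list(method = "duplicate, dropped")
)

df <- flatten_record(record)
rownames(df) <- "1XYZ"  # label the row with the (placeholder) query identifier
print(df)
```

The sketch keeps the dotted prefixes produced by 'unlist'; the documented function goes further and tidies the column names.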
The 'ScoredResult' function constructs a scored result object, typically used in search results to associate an entity ID with a numerical score. This is useful in ranking search results or displaying relevance scores alongside the results.
ScoredResult(entity_id, score)
entity_id |
A string representing the entity ID. This could be a PDB ID or any identifier relevant to the search. |
score |
A numeric value representing the score associated with the entity. The score often indicates the relevance or quality of the match. |
A list representing the scored result, which can be included in the search results or used for further processing.
result <- ScoredResult(entity_id = "1XYZ", score = 95.6)
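To show how scored results might be ranked, the sketch below uses plain lists as stand-ins for scored results. It assumes that each result exposes 'entity_id' and 'score' fields, which is an assumption about the returned structure; the IDs and scores are placeholders.

```r
# Stand-ins for scored results: plain lists with the two documented fields.
# The field layout is an assumption; IDs and scores are placeholders.
results <- list(
  list(entity_id = "1XYZ", score = 95.6),
  list(entity_id = "1ABC", score = 72.1),
  list(entity_id = "2DEF", score = 88.3)
)

# Extract the scores and reorder the results from highest to lowest.
scores <- vapply(results, function(r) r$score, numeric(1))
ranked <- results[order(scores, decreasing = TRUE)]
ranked_ids <- vapply(ranked, function(r) r$entity_id, character(1))

print(ranked_ids)
```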
The 'search_graphql' function sends a GraphQL query to the RCSB Protein Data Bank (PDB) using the provided JSON query format. This function handles the HTTP request, sends the query, and processes the response, including error handling to ensure that the query executes successfully.
search_graphql(graphql_json_query, graphql_url = GRAPHQL_URL)
graphql_json_query |
A list containing the GraphQL query formatted as JSON. This list should include the 'query' key with a value that represents the GraphQL query string. The query string can specify various elements to retrieve, such as entry IDs, experimental methods, cell dimensions, etc. |
graphql_url |
A string representing the base URL used to perform the GraphQL query. By default, this is set to the global constant GRAPHQL_URL. |
A parsed list containing the content of the response from the RCSB PDB, formatted as an R object. If the request fails, the function stops with an error message.
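As an illustration of the expected input, the sketch below builds a query list with a single 'query' key. The query string follows the general shape of an RCSB Data API entry query and requests the 'exptl' method and 'cell' volume; the entry ID "4HHB" is only an example. The network call itself is left as a comment so that the sketch runs offline.

```r
# Build the GraphQL query string for one entry (the entry ID is an example).
entry_id <- "4HHB"
query_string <- sprintf(
  '{ entry(entry_id: "%s") { exptl { method } cell { volume } } }',
  entry_id
)

# search_graphql() expects a list with a 'query' key.
graphql_json_query <- list(query = query_string)
str(graphql_json_query)

# With the package loaded and network access available, the call would be:
# result <- search_graphql(graphql_json_query = graphql_json_query)
```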
This function sends an HTTP request (GET or POST) to the specified URL. It supports optional request bodies for POST requests, customizable encoding, and content type for API interactions. The function is designed to be a general-purpose API handler for use in querying external APIs.
send_api_request(
  url,
  method = "GET",
  body = NULL,
  encode = "json",
  content_type = "application/json",
  verbosity = TRUE
)
url |
A string representing the target URL for the API request. |
method |
A string specifying the HTTP method to use. The default is "GET", but "POST" can also be used. |
body |
Optional: The body of the request, typically required for POST requests. Default is NULL. |
encode |
A string representing the encoding type of the body for POST requests. Default is "json". |
content_type |
A string specifying the content type for POST requests. Default is "application/json". |
verbosity |
Logical flag indicating whether to print status messages during the function execution. Default is TRUE. |
The 'send_api_request' function is a flexible tool for handling API interactions. It supports both GET and POST methods and provides optional parameters for encoding and content type, making it suitable for a wide range of API requests.
If a network error occurs during the request, the function will throw an error with a detailed message about the failure.
A response object from the 'httr' package representing the server's response to the API request.
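The error behaviour described above can be sketched in base R with 'tryCatch'. This is not the package's code: 'with_request_error' is a hypothetical wrapper, 'request_fun' stands in for any function that performs the HTTP call, and the URL is a placeholder. A deliberately failing stand-in is used so that the example runs offline.

```r
# Hypothetical wrapper: turn any failure of `request_fun` into an error
# whose message names the URL and the underlying cause.
with_request_error <- function(request_fun, url) {
  tryCatch(
    request_fun(url),
    error = function(e) {
      stop(sprintf("Request to %s failed: %s", url, conditionMessage(e)),
           call. = FALSE)
    }
  )
}

# Stand-in that always fails, so the example needs no network access.
failing_request <- function(url) stop("could not resolve host")

# Capture the resulting error message instead of aborting the script.
msg <- tryCatch(
  with_request_error(failing_request, "https://example.org/api"),
  error = function(e) conditionMessage(e)
)
print(msg)
```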
The 'SeqMotifOperator' function constructs an operator for searching sequence motifs within the RCSB Protein Data Bank (PDB). This operator is used to specify a search pattern, the type of biological sequence, and the pattern-matching method to be applied in the search.
SeqMotifOperator(pattern, sequence_type, pattern_type)
pattern |
A string representing the motif pattern to search for. This can be a simple string or a more complex pattern, depending on the 'pattern_type'. |
sequence_type |
A string indicating the type of sequence being searched. Accepted values are 'DNA', 'RNA', or 'PROTEIN'. |
pattern_type |
A string indicating the pattern matching method to use. Options include 'SIMPLE' for basic patterns, 'PROSITE' for PROSITE-style patterns, and 'REGEX' for regular expressions. |
An object of class 'SeqMotifOperator' that encapsulates the specified search criteria. This object can be used as part of a search query within the RCSB PDB system.
# Example of creating a sequence motif operator to search for a DNA motif using a regular expression
seq_motif_operator <- SeqMotifOperator(
  pattern = "A[TU]G",
  sequence_type = "DNA",
  pattern_type = "REGEX"
)
print(seq_motif_operator)
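To get a feel for what a 'REGEX' pattern such as "A[TU]G" matches, it can be tried locally with base R before it is sent as a query. The sketch below uses R's own regular-expression engine on made-up sequences; the search service may differ in details, so treat it as an approximation only.

```r
# Made-up test sequences: one with ATG, one with AUG, one with neither.
sequences <- c("CCATGCC", "CCAUGCC", "CCACGCC")

# Which sequences contain the motif?
hits <- grepl("A[TU]G", sequences)

# Where does the first match start in each sequence (-1 means no match)?
positions <- as.integer(regexpr("A[TU]G", sequences))

print(data.frame(sequence = sequences, hit = hits, start = positions))
```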
The 'SequenceOperator' function constructs an operator for performing sequence-based searches within the RCSB Protein Data Bank (PDB). This operator allows users to specify a nucleotide or protein sequence, define the type of sequence, and set thresholds for e-value and identity in the search process.
SequenceOperator(
  sequence,
  sequence_type = NULL,
  evalue_cutoff = 100,
  identity_cutoff = 0.95
)
sequence |
A string representing the nucleotide or protein sequence to search for. The sequence should be provided in standard IUPAC format. |
sequence_type |
Optional: A string indicating the type of sequence. Accepted values are 'DNA', 'RNA', or 'PROTEIN'. If not provided, the sequence type is automatically determined based on the characters present in the sequence using the 'autoresolve_sequence_type' function. |
evalue_cutoff |
A numeric value for the e-value cutoff in the search. This defines the threshold for statistical significance of the search results. Default is 100. |
identity_cutoff |
A numeric value for the identity cutoff in the search. This sets the minimum sequence identity required for a match to be considered, expressed as a fraction between 0 and 1. Default is 0.95. |
An object of class 'SequenceOperator' that encapsulates the search criteria for sequence-based queries within the RCSB PDB.
# Example of creating a sequence operator for a protein sequence with specific cutoffs
seq_operator <- SequenceOperator(
  sequence = "MVLSPADKTNVKAAW",
  sequence_type = "PROTEIN",
  evalue_cutoff = 10,
  identity_cutoff = 0.90
)
print(seq_operator)

# Example of creating a sequence operator with automatic sequence type detection
seq_operator_auto <- SequenceOperator(
  sequence = "ATGCGTACGTAGC",
  evalue_cutoff = 50,
  identity_cutoff = 0.85
)
print(seq_operator_auto)
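Automatic type detection based on the characters in a sequence can be sketched as follows. This is a simplified, hypothetical heuristic ('guess_sequence_type' is not a package function) and may not match what 'autoresolve_sequence_type' actually does, particularly for short or ambiguous sequences.

```r
# Hypothetical heuristic: classify a sequence by the letters it contains.
guess_sequence_type <- function(sequence) {
  letters_used <- unique(strsplit(toupper(sequence), "")[[1]])
  if (all(letters_used %in% c("A", "C", "G", "T"))) {
    "DNA"
  } else if (all(letters_used %in% c("A", "C", "G", "U"))) {
    "RNA"
  } else {
    "PROTEIN"
  }
}

print(guess_sequence_type("ATGCGTACGTAGC"))
print(guess_sequence_type("AUGCGUACG"))
print(guess_sequence_type("MVLSPADKTNVKAAW"))
```

A sequence made up only of A, C and G is ambiguous under this heuristic and falls into the first branch, which is one reason to pass 'sequence_type' explicitly when the type is known.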
The 'StructureOperator' function constructs an operator object for conducting structure-based searches within the RCSB Protein Data Bank (PDB). This operator allows users to specify a PDB entry ID, an assembly ID, and the mode of search to be used, facilitating precise structural queries.
StructureOperator(
  pdb_entry_id,
  assembly_id = 1,
  search_mode = "STRICT_SHAPE_MATCH"
)
pdb_entry_id |
A string representing the PDB entry ID to search for. The PDB entry ID is a unique identifier for each structure in the PDB. |
assembly_id |
An integer representing the assembly ID within the PDB entry. The assembly ID identifies the specific biological assembly or model within the PDB entry. By default, this is set to 1, which typically corresponds to the first biological assembly or model. |
search_mode |
A string indicating the search mode to be applied during the structure-based search. Accepted values include 'STRICT_SHAPE_MATCH', 'RELAXED_SHAPE_MATCH', etc. The default is 'STRICT_SHAPE_MATCH', which ensures a precise comparison based on the structural shape of the molecules. |
An object of class 'StructureOperator' that encapsulates the criteria for performing structure-based searches in the RCSB PDB.
# Example of creating a structure operator for a specific PDB entry and assembly
struct_operator <- StructureOperator(
  pdb_entry_id = "1XYZ",
  assembly_id = 1,
  search_mode = "STRICT_SHAPE_MATCH"
)
print(struct_operator)

# Example of creating a structure operator with a relaxed search mode
struct_operator_relaxed <- StructureOperator(
  pdb_entry_id = "1ABC",
  assembly_id = 2,
  search_mode = "RELAXED_SHAPE_MATCH"
)
print(struct_operator_relaxed)
This function performs a recursive search through a nested dictionary-like structure in R, looking for a specific term and collecting its values. It's useful for extracting specific pieces of data from complex, deeply nested results.
walk_nested_dict(my_result, term, outputs = list(), depth = 0, maxdepth = 25)
my_result |
The nested dictionary-like structure to search through. |
term |
The term to search for within the nested dictionary. |
outputs |
An initially empty list to store the results of the search, default is an empty list. |
depth |
The current depth of the recursion, default is 0. |
maxdepth |
The maximum depth to recurse, default is 25. If exceeded, the function issues a warning and returns NULL. |
A list of values associated with the term found in the nested dictionary. Returns NULL if the term is not found or if maximum recursion depth is exceeded.
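The recursive search can be sketched in base R as shown below. This is an illustrative reimplementation under simplifying assumptions, not the package's code: 'find_term' is a hypothetical name, and the nested list uses placeholder values.

```r
# Hypothetical sketch: collect every value stored under the name `term`
# anywhere in a nested list.
find_term <- function(x, term, depth = 0, maxdepth = 25) {
  if (depth > maxdepth) {
    warning("Maximum recursion depth exceeded")
    return(NULL)
  }
  found <- list()
  if (is.list(x)) {
    for (i in seq_along(x)) {
      key <- names(x)[i]
      # Collect the value when the element's name matches the term.
      if (!is.null(key) && !is.na(key) && identical(key, term)) {
        found <- c(found, list(x[[i]]))
      }
      # Recurse into the element to look for deeper matches.
      found <- c(found, find_term(x[[i]], term, depth + 1, maxdepth))
    }
  }
  if (length(found) == 0) NULL else found
}

# Placeholder structure with the term at several depths.
nested <- list(
  entry = list(id = "1XYZ", cell = list(volume = 1000)),
  polymer = list(list(id = "1XYZ_1"), list(id = "1XYZ_2"))
)

ids <- find_term(nested, "id")
print(unlist(ids))
```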