---
title: "Supported Queries"
author: "Steven M. Mortimer"
date: "2020-07-11"
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 4
keep_md: true
vignette: >
%\VignetteIndexEntry{Supported Queries}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE}
NOT_CRAN <- identical(tolower(Sys.getenv("NOT_CRAN")), "true")
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
purl = NOT_CRAN,
eval = NOT_CRAN
)
options(tibble.print_min = 5L, tibble.print_max = 5L)
```
## Overview
The following vignette outlines the different types of queries that have been
documented and tested. These are the "supported" query types that the {salesforcer}
package currently handles. If you run into an issue, please submit the issue
[HERE](https://github.com/StevenMMortimer/salesforcer/issues/new?assignees=StevenMMortimer&labels=&template=query-issue-template.md&title=)
in the GitHub repository so that we can fix or add your query type to this list.
Thank you!
**Note**: Salesforce's proprietary form of SQL called SOQL (Salesforce Object
Query Language) is a powerful tool that allows you to return the fields of
records in almost any object in Salesforce. This includes standard objects like
Accounts, Contacts, and Tasks along with any custom objects and custom fields
created in your Org. You are encouraged to use Bulk APIs when:
- You anticipate returning 10,000 records or more
- Your query does not involve a parent-to-child nested relationship query
- You would like to reduce the overall number of API calls to your Org
If you are not familiar with SOQL, then please consider reading the following
resources:
- Introduction to SOQL and SOSL
- SOQL SELECT Syntax
- Relationship Queries
- Async SOQL
- Overview of queries in Bulk 1.0 API
- Overview of queries in Bulk 2.0 API
----
## Authentication
First, load the {salesforcer} package and login. There are two ways to authenticate:
1) OAuth 2.0 (SSO) and 2) Basic Username-Password. It is recommended to use OAuth
2.0 so that passwords do not have to be embedded in scripts or environment variables.
By default, OAuth 2.0 stores the user's credentials in a locally cached file
entitled ".httr-oauth-salesforcer" in the current working directory and will be
refreshed automatically when the session expires.
```{r auth, include = FALSE}
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(here)))
library(salesforcer)
token_path <- Sys.getenv("SALESFORCER_TOKEN_PATH")
sf_auth(token = paste0(token_path, "salesforcer_token.rds"))
```
```{r load-package, eval=FALSE}
library(dplyr, warn.conflicts = FALSE)
library(salesforcer)
sf_auth()
```
## Default query behavior
The default API for the `sf_query()` function is the REST API because it is both
fast and flexible. Every effort has been made so that the format of the results
from the REST and SOAP APIs is exactly the same. The only difference will be
speed. The REST API uses JSON, which can generally be processed more quickly
than XML used in the SOAP API.
```{r query-records}
soql <- "SELECT Id,
FirstName,
LastName
FROM Contact
LIMIT 10"
queried_records <- sf_query(soql) # REST API is the default api_type
queried_records
queried_records <- sf_query(soql, api_type = "SOAP")
queried_records
```
## REST vs. SOAP API query performance test
Below is a small example to roughly demonstrate the magnitude of the performance
difference between the REST and SOAP APIs when querying 1,000 records.
**Setup performance test**
```{r setup-performance-test}
# create a new account
# (if replicating, you may or may not have an external id field in your Org)
prefix <- paste0("APerfTest-", as.integer(runif(1,1,99999)))
new_account <- sf_create(
tibble(
Name = "Test Account For Performance Test",
My_External_Id__c = prefix,
Description = paste0("This is a test account with 1,000 records for ",
"testing the performance differences between the ",
"SOAP and REST APIs.")
),
object_name = "Account"
)
# create and associate a thousand new contacts with that account
# (again, you may or may not have an external id field in your Org)
n <- 1000
prefix <- paste0("CPerfTest-", as.integer(runif(1,1,99999)), "-")
new_contacts <- tibble(FirstName = rep("Test", n),
LastName = paste0("Query-Vignette", 1:n),
test_number__c = 999.9,
AccountId = rep(new_account$id, n),
My_External_Id__c=paste0(prefix, 1:n))
new_contacts_res <- sf_create(new_contacts, "Contact", api_type = "Bulk 2.0")
```
**Performance test**
```{r run-performance-test, message=FALSE}
#| fig.alt: >
#| A violin plot showing the distribution of latency times for REST API and SOAP API
qry <- function(api_type){
sf_query(
sprintf("SELECT Id, Name, Owner.Id,
(SELECT Id, LastName, Owner.Id FROM Contacts)
FROM Account
WHERE Id = '%s'",
new_account$id),
api_type = api_type
)
}
res <- microbenchmark::microbenchmark(
qry("REST"),
qry("SOAP"),
times = 5,
unit = "s"
)
res
suppressWarnings(suppressMessages(
ggplot2::autoplot(res) +
ggplot2::scale_y_continuous(name="Time [seconds]", n.breaks=6)
))
```
As seen in the limited test above, the REST API can be anywhere from **4-6x**
faster than the SOAP API for a query on 1,000 contact records associated with a
single Account. Breaking up the number of records returned into smaller batches by setting
`QueryOptions = list(batchSize = 200)` typically does not affect this result very
much but it also depends on the number of fields in the query. For the REST API the
default is 2,000 records per batch with a minimum of 200 and maximum of 2,000.
For the SOAP API the default is 500 records per batch. For both APIs it is important
to note that there is no guarantee that the requested batch size is the actual
batch size. Changes are made as necessary to maximize performance. For example,
the SOAP API states "batch size will be no more than 200 if the SOQL statement
selects two or more custom fields of type long text". The REST API mentions that
the limit imposed by Salesforce's app servers is around 20,000 characters which
can cause batches to be smaller. In short, it's generally okay to use the default
batch sizes since Salesforce may optimize over your specified batch size anyways.
## When to use the Bulk APIs for queries
A general rule of thumb for using the Bulk APIs (Bulk 1.0 and Bulk 2.0) for
queries is anytime you need to retrieve more than 10,000 records. The main reasons
to not use the Bulk APIs are twofold. First, they do not support complex
relationship queries or aggregate queries. If you need to write a nested relationship
or aggregate query involving a large number of records you may be tempted to use
the REST API. However, it is recommended to perform two or more separate bulk
queries that retrieve the records you need and then join or aggregate the results
in R.
```{r unsupported-bulk-queries}
# nested relationship query
# (supposed to return the id and first name of all contacts on each account)
try(
sf_query(
"SELECT Id, Name,
(SELECT Id, FirstName FROM Contacts)
FROM Account",
api_type = "Bulk 2.0"
)
)
# aggregate query
# (supposed to return the count of contacts per account)
try(
sf_query(
"SELECT Account.Id, Count(Name) contacts_n
FROM Contact
GROUP BY Account.Id",
api_type = "Bulk 2.0"
)
)
```
The two queries above were trying to pull all the contacts for each account and
then get a count of how many contacts there are per account. If you have a lot
of records, using the REST API to return these results may not be feasible. Even
though the Bulk APIs cannot handle the same query, they can pull down massive
amounts of data quickly. In this case you can pull down all of the Contact records
and all of the Account records and then perform the calculation using **dplyr**,
like so:
```{r}
contacts <- sf_query("SELECT Id, FirstName, Account.Id
FROM Contact",
api_type = "Bulk 2.0")
accounts <- sf_query("SELECT Id, Name
FROM Account",
api_type = "Bulk 2.0")
nested_query_recs <- accounts %>%
left_join(contacts %>%
rename(`Contact.Id` = Id,
`Contact.FirstName` = FirstName),
by = c("Id" = "Account.Id"))
nested_query_recs
aggregate_query_recs <- nested_query_recs %>%
group_by(Id) %>%
summarize(.groups = 'drop',
contacts_n = sum(!is.na(Contact.Id)))
aggregate_query_recs
```
The second reason to not use the Bulk APIs is that there is a performance overhead
associated with every bulk (asynchronous) job that involves checking the status
of the job until it succeeds or fails before retrieving the results.
The example below is provided so that you can take this code as an example to
run your own performance test of queries that return 10K, 100K, 1M+ records to
see where the Bulk APIs outperform the REST API.
```{r run-performance-test2, eval=FALSE}
qry_compare <- function(api_type){
soql <- sprintf("SELECT Id, LastName, Account.Id, Account.Name, Owner.Id
FROM Contact
WHERE Account.Id = '%s'",
new_account$id)
sf_query(soql, api_type = api_type)
}
res <- microbenchmark::microbenchmark(
qry_compare("REST"),
qry_compare("Bulk 1.0"),
qry_compare("Bulk 2.0"),
times = 5,
unit = "s"
)
```
Note that the Bulk 1.0 API requires users to specify the target object along with
their submitted SOQL. This is because it is needed when creating the bulk job that
will manage and execute the query.
```{r}
queried_records <- sf_query(soql, api_type = "Bulk 1.0")
```
As you can see above the {salesforcer} package will try to infer the object in
the query if not explicitly provided. If it does not guess correctly, then please
specify.
**Cleanup after performance tests**
By keeping track of the account ids used in our tests, it is fairly easy to find
and delete these test records from our Org to save space.
```{r cleanup-performance-test}
# cleanup performance test Contact records ...
contacts_to_delete <- sf_query(
sprintf("SELECT Id
FROM Contact
WHERE Account.Id = '%s'",
new_account$id)
)
sf_delete(contacts_to_delete$Id, "Contact", api_type="Bulk 2.0")
# ... and finally delete the account
sf_delete(new_account$id)
```
## Relationship queries
Salesforce supports retrieving fields from related objects when querying another
object. This is similar to performing a JOIN in SQL, but without having to specify
the join keys because Salesforce already knows the relationship between the two
objects. There are two types of relationship queries (1. child-to-parent lookups
and 2. parent-to-child nested queries) detailed in the sections below.
### child-to-parent "lookup" queries
The first type of relationship query and the most common is child to parent. For
example, the Contact object (child) to their parent, the Account object. In order
to pull down parent object fields with your child record query, you just need to
prefix any fields from the related object by concatenating the name of the
object with the field name separated by a period. In the example below we are
retrieving all Contact object records that have a relationship to an Account.
```{r}
# child-to-parent relationship (e.g. Account.Name from Contact record)
sf_query(
"SELECT Id, FirstName, Account.Name
FROM Contact
WHERE Account.Id != null"
)
```
Sometimes you may notice that the requested relationship fields do not appear in
the query results. This is because the SOAP and REST APIs do not return any
related object information if it does not exist on the record and there is no
reliable way to extract and rebuild the empty columns based on the query string.
In the example below, if there were Account information an additional column
titled `"Account.Name"` would appear in the results.
```{r}
# child-to-parent relationship (e.g. Account.Name from Contact record)
sf_query(
"SELECT Id, FirstName, Account.Name
FROM Contact
WHERE Account.Id = null"
)
```
Note, that the Bulk 1.0 and Bulk 2.0 APIs will return `"Account.Name"` as a
column of all `NA` values for this query because they return results
differently.
Finally, one aspect to note is that the Bulk 2.0 API does not support
child-to-parent-grandparent relationships as seen in the example below:
```{r}
try(
sf_query("SELECT Id, FirstName, Account.Owner.Id
FROM Contact",
api_type = "Bulk 2.0")
)
```
### parent-to-child "nested" queries
Instead of "looking up" a related field, users can write queries that retrieve the
individual records related to a parent. For example, if you would like all of the
Accounts and their Contacts you can write the query like so:
```{r}
sf_query(
"SELECT Id, Name,
(SELECT Id, FirstName FROM Contacts)
FROM Account"
)
```
At first glance this query may appear the same as a lookup query on the Contact object
that includes the account id and name. However, the small difference is that every
Account is included, regardless of whether or not they have a Contact. This can be
helpful when you want to ensure a query contains all of the parent records and their
child records, if they exist. Also, note that the plural object name is used inside
the nested query ("Contacts" instead of "Contact").
Finally, a parent-to-child nested query can also contain a child-to-parent lookup
relationship within it. Below is an example where the Owner Id on the Contact is
included so you can know who is responsible for the Contacts under each Account.
```{r}
sf_query(
"SELECT Name, Owner.Id,
(SELECT Id, FirstName, Owner.Id FROM Contacts)
FROM Account"
)
```
## Troubleshooting
If you are having an issue with a query please submit in the {salesforcer} GitHub
repository at: https://github.com/StevenMMortimer/salesforcer/issues. As a maintainer, queries are tough to debug because every Salesforce
Org is unique. Custom objects or relationships created in your Salesforce Org may
be different or even impossible to test in another Org. When filing your issue please
make an attempt to understand the query and debug a little bit on your own. Here
are a few suggestions:
1. Slightly modify your function call to `sf_query()` to observe the results. Here
are a few prompting questions that may assist you:
- What do you see when you set `verbose=TRUE` argument?
- What happens if you change the `control` argument, specifically the batch size?
- What happens if you try using a different API (e.g. "SOAP" vs "REST" or "Bulk 1.0" vs "Bulk 2.0")?
- What happens if you change your query slightly?
- Do you need a parent-to-child nested relationship query or will a child-to-parent lookup suffice?
2. Check out Salesforce's Workbench tool to see how it constructs specific
queries that you are debugging. The tool is available at
`https://workbench.developerforce.com` and requires a Salesforce login
(the same credentials as you normally would use).
3. Double check Salesforce's SOQL reference guide to see whether your query is supported or limited in some way.
4. Review query unit tests at: https://github.com/StevenMMortimer/salesforcer/blob/main/tests/testthat/test-query.R.
These unit tests were written to cover a variety of use cases and to track any
changes made between newly released versions of the Salesforce API (typically
4 each year). These tests are an excellent source of examples that may be
helpful in troubleshooting your own query.
5. Roll up your sleeves and dive into the source code for the {salesforcer}
package. The main scripts to review are:
- https://github.com/StevenMMortimer/salesforcer/blob/main/R/query.R
- https://github.com/StevenMMortimer/salesforcer/blob/main/R/utils-query.R