Package 'sparkwarc' reference manual

Title:	Load WARC Files into Apache Spark
Description:	Load WARC (Web ARChive) files into Apache Spark using 'sparklyr'. This allows reading files from the Common Crawl project <http://commoncrawl.org/>.
Authors:	Javier Luraschi [aut], Yitao Li [aut], Edgar Ruiz [aut, cre]
Maintainer:	Edgar Ruiz <[email protected]>
License:	Apache License 2.0
Version:	0.1.6
Built:	2025-03-05 06:54:46 UTC
Source:	CRAN

Provides WARC paths for commoncrawl.org

Description

Provides WARC paths for commoncrawl.org. To be used with spark_read_warc.

Usage

cc_warc(start, end = start)

Arguments

`start`	The first path to retrieve.
`end`	The last path to retrieve.

Examples


cc_warc(1)
cc_warc(2, 3)
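Because the description says these paths are meant to be used with spark_read_warc, the two calls might be combined as in the sketch below. It assumes a Spark cluster that can reach the Common Crawl data (a plain local session usually cannot). The master URL and the table name "cc_sample" are placeholders chosen for illustration.

```r
## Not run: 
library(sparklyr)
library(sparkwarc)

# Assumption: the cluster behind `sc` has access to the Common Crawl storage;
# the master URL is only a placeholder.
sc <- spark_connect(master = "yarn-client")

# Retrieve the first WARC path and hand it to spark_read_warc().
paths <- cc_warc(1)
cc_tbl <- spark_read_warc(sc, name = "cc_sample", path = paths)

spark_disconnect(sc)
## End(Not run)
```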

Loads the sample warc file in Rcpp

Description

Loads the sample warc file in Rcpp

Usage

rcpp_read_warc_sample(filter = "", include = "")

Arguments

`filter`	A regular expression used to filter each WARC entry efficiently by running native code using `Rcpp`.
`include`	A regular expression used to keep only matching lines efficiently by running native code using `Rcpp`.
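A minimal sketch of calling this function, assuming the 'sparkwarc' package is installed. The pattern "<html" is an illustrative choice, not taken from the package documentation, and the returned object is inspected rather than assumed to have a particular structure.

```r
library(sparkwarc)

# Read the bundled sample file with the default (empty) patterns.
all_entries <- rcpp_read_warc_sample()

# Keep only lines matching a regular expression; "<html" is a hypothetical pattern.
html_lines <- rcpp_read_warc_sample(filter = "", include = "<html")

# Inspect the result without assuming its exact structure.
str(html_lines)
```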

Reads a WARC File using Rcpp

Description

Reads a WARC (Web ARChive) file using Rcpp.

Usage

spark_rcpp_read_warc(path, match_warc, match_line)

Arguments

`path`	The path to the file. Needs to be accessible from the cluster. Supports the `"hdfs://"`, `"s3n://"` and `"file://"` protocols.
`match_warc`	Include only WARC files matching this character string.
`match_line`	Include only lines matching this character string.
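The sketch below shows one way this function might be called directly on a local file. It assumes the sample file returned by spark_warc_sample_path() is a valid input, and that empty strings mean "no filtering", mirroring the "" defaults of spark_read_warc. Neither assumption is confirmed by this page.

```r
library(sparkwarc)

# Use the sample file shipped with the package as a local input path.
sample_path <- spark_warc_sample_path()

# Assumption: empty strings disable both filters.
entries <- spark_rcpp_read_warc(sample_path, match_warc = "", match_line = "")

# Inspect the result without assuming its exact structure.
str(entries)
```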

Reads a WARC File into Apache Spark

Description

Reads a WARC (Web ARChive) file into Apache Spark using sparklyr.

Usage

spark_read_warc(
  sc,
  name,
  path,
  repartition = 0L,
  memory = TRUE,
  overwrite = TRUE,
  match_warc = "",
  match_line = "",
  parser = c("r", "scala"),
  ...
)

Arguments

`sc`	An active `spark_connection`.
`name`	The name to assign to the newly generated table.
`path`	The path to the file. Needs to be accessible from the cluster. Supports the `"hdfs://"`, `"s3n://"` and `"file://"` protocols.
`repartition`	The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
`memory`	Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
`overwrite`	Boolean; overwrite the table with the given name if it already exists?
`match_warc`	Include only WARC files matching this character string.
`match_line`	Include only lines matching this character string.
`parser`	Which parser implementation to use. Options are `"scala"` or `"r"` (default).
`...`	Additional arguments reserved for future use.

Examples


## Not run: 
library(sparklyr)
library(sparkwarc)
sc <- spark_connect(master = "local")
sdf <- spark_read_warc(
  sc,
  name = "sample_warc",
  path = system.file(file.path("samples", "sample.warc"), package = "sparkwarc"),
  memory = FALSE,
  overwrite = FALSE
)

spark_disconnect(sc)

## End(Not run)
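As a further sketch, the filtering and partitioning arguments described above might be combined as follows. The pattern "Content-Type", the table name, and the partition count are illustrative choices, and the call to sdf_nrow() assumes the return value is a Spark table reference.

```r
## Not run: 
library(sparklyr)
library(sparkwarc)
sc <- spark_connect(master = "local")

# Keep only lines matching a pattern and spread the table over 2 partitions.
filtered <- spark_read_warc(
  sc,
  name = "sample_warc_filtered",
  path = spark_warc_sample_path(),
  repartition = 2L,
  match_line = "Content-Type",
  parser = "r"
)

# Count the rows that survived the filter.
sdf_nrow(filtered)

spark_disconnect(sc)
## End(Not run)
```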

Loads the sample warc file in Spark

Description

Loads the sample warc file in Spark

Usage

spark_read_warc_sample(sc, filter = "", include = "")

Arguments

`sc`	An active `spark_connection`.
`filter`	A regular expression used to filter each WARC entry efficiently by running native code using `Rcpp`.
`include`	A regular expression used to keep only matching lines efficiently by running native code using `Rcpp`.
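A minimal sketch of loading the sample file into a local Spark session, assuming Spark is installed locally. The pattern "<html" is a hypothetical example, and the result is printed rather than assumed to have a particular schema.

```r
## Not run: 
library(sparklyr)
library(sparkwarc)
sc <- spark_connect(master = "local")

# Load the bundled sample, keeping only lines that match the pattern.
sample_tbl <- spark_read_warc_sample(sc, filter = "", include = "<html")
print(sample_tbl)

spark_disconnect(sc)
## End(Not run)
```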

Retrieves sample warc path

Description

Retrieves sample warc path

Usage

spark_warc_sample_path()
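For illustration, the returned path can be checked with base R before passing it to a reader. This sketch assumes only that the function returns a file path as a character string.

```r
library(sparkwarc)

# Locate the sample WARC file bundled with the package.
sample_path <- spark_warc_sample_path()

# Confirm the file is present and report its size in bytes.
file.exists(sample_path)
file.size(sample_path)
```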

sparkwarc

Description

Sparklyr extension for loading WARC Files into Apache Spark
