NotMongo
A Maybe DuckDB Extension for JSON APIs
DuckDB. Is there anything they can't do?!
Warning
This DuckB Extension is in early stage and not mature enough for public yet.
-- Using Zine's theme just because it's awesome and convenient. --
What?
I'm discovering how DuckDB works internally. I've been using it mainly for ad hoc stuff with the excellent cli. Also, I have a few use cases for accessing json data from api's. Some are very simple, others are similar to MongoDB.
I could have used the already excellent json extension with an external data fetcher, or even the shellfs extension with curl. Actually, that's how i do it today, use curl and pipe into the read_json /dev/stdin source. However, when I'm in the cli, I want to stay there. Especially when I'm running multiple consecutive queries for joins etc.
Why this site?
Documenting for future self. Also, there's this excellent at the boundary blog. Started at about the same time I was looking into table functions. There was no good documentation and examples at the time of this writing.
Where to start
- duckdb: build extensions
- at the boundary: explains how to get started with table functions with examples.
- read the duckdb extensions code: json, parquet
- table function source
Scalar functions
These return a single value. The extension template has the Quack example.
Table functions
Obviously, these return tables. Retrieve data, process schema and data types. There are no good starters for this yet. You can have a look at the parquet extension or the json extension, but they are a bit of a handful to get started with.
Supports threaded execution.
Anatomy of the table function
Register the TableFunction
- bind
- init
- table function
- filer pushdown (where)
- projection pushdown (columns)
- pushdown complex filter
- positional and named parameters
- extension options: PRAGMA/SET
- cardinality
static void LoadInternal(DatabaseInstance &instance) {
auto myTableFunc = TableFunction("query_json_api", {}, myTableFunction, myBindFunction, myInitFunction);
myTableFunc.filter_pushdown = false;
myTableFunc.projection_pushdown = true;
myTableFunc.cardinality = myCardinalityFunction;
myTableFunc.pushdown_complex_filter = pushdownComplexFilter;
myTableFunc.named_parameters["options"] = LogicalType::VARCHAR;
myTableFunc.named_parameters["api"] = LogicalType::VARCHAR;
ExtensionUtil::RegisterFunction(instance, myTableFunc);
auto &config = DBConfig::GetConfig(instance);
config.AddExtensionOption("rest_api_config_file", "REST API Config File Location", LogicalType::VARCHAR,
Value("rest_api_extension.json"));
}
Order of execution
bind
Here, we have access to positional and named arguments.
This is where I get the schema from the extension's configuration or from a separate endpoint. Return a struct with whatever we want to pass further down the line.init
Get the schema, column ids, filters, etc.
Prepare the Intermediate Representation of the query, analyze pushdown to the json queries and construct the query.execute table function
This is where we get the data from a json api (paged and threaded), parse and return.
doubts
The json extension is really really good and I don't want to duplicate code or reinvent the wheel. I haven't found a supported and documented way yet to reuse the extension or parts of it.