NotMongo

A Maybe DuckDB Extension for JSON APIs

DuckDB. Is there anything they can't do?!

Warning

This DuckB Extension is in early stage and not mature enough for public yet.

-- Using Zine's theme just because it's awesome and convenient. --

What?

I'm discovering how DuckDB works internally. I've been using it mainly for ad hoc stuff with the excellent cli. Also, I have a few use cases for accessing json data from api's. Some are very simple, others are similar to MongoDB.

I could have used the already excellent json extension with an external data fetcher, or even the shellfs extension with curl. Actually, that's how i do it today, use curl and pipe into the read_json /dev/stdin source. However, when I'm in the cli, I want to stay there. Especially when I'm running multiple consecutive queries for joins etc.

Why this site?

Documenting for future self. Also, there's this excellent at the boundary blog. Started at about the same time I was looking into table functions. There was no good documentation and examples at the time of this writing.

Where to start

Scalar functions

These return a single value. The extension template has the Quack example.

Table functions

Obviously, these return tables. Retrieve data, process schema and data types. There are no good starters for this yet. You can have a look at the parquet extension or the json extension, but they are a bit of a handful to get started with.

Supports threaded execution.

Anatomy of the table function

Register the TableFunction

    static void LoadInternal(DatabaseInstance &instance) {
            auto myTableFunc = TableFunction("query_json_api", {}, myTableFunction, myBindFunction, myInitFunction);
        myTableFunc.filter_pushdown = false;
        myTableFunc.projection_pushdown = true;    
        myTableFunc.cardinality = myCardinalityFunction;
        myTableFunc.pushdown_complex_filter = pushdownComplexFilter;

        myTableFunc.named_parameters["options"] = LogicalType::VARCHAR;
        myTableFunc.named_parameters["api"] = LogicalType::VARCHAR;

        ExtensionUtil::RegisterFunction(instance, myTableFunc);
        
        auto &config = DBConfig::GetConfig(instance);

        config.AddExtensionOption("rest_api_config_file", "REST API Config File Location", LogicalType::VARCHAR,
                                Value("rest_api_extension.json"));

    }

Order of execution

  1. bind

    Here, we have access to positional and named arguments.
    This is where I get the schema from the extension's configuration or from a separate endpoint. Return a struct with whatever we want to pass further down the line.

  2. init

    Get the schema, column ids, filters, etc.
    Prepare the Intermediate Representation of the query, analyze pushdown to the json queries and construct the query.

  3. execute table function

    This is where we get the data from a json api (paged and threaded), parse and return.

doubts

The json extension is really really good and I don't want to duplicate code or reinvent the wheel. I haven't found a supported and documented way yet to reuse the extension or parts of it.


jeroenflvr