User:Cook Me Plox/Bucket


Bucket is a MediaWiki extension that lets you store structured data on articles, which can be queried by other articles and APIs. It is meant as a replacement for how the RuneScape wikis use Semantic MediaWiki, and is (sort of) in the same general category of extensions as SMW, Cargo, or Wikibase.

Why?[edit | edit source]

Continuing to use Semantic MediaWiki poses three major risks to us.

The first is related to performance. In a couple days' worth of profiling, SMW was responsible for about 50% of all database reads on the wiki, and about 75% of all writes (even when no content has changed)! Semantic MediaWiki seems to generally be responsible for about 30% of the total parse time of our articles. This poses a serious obstacle to our attempts at making the wiki faster, more stable, and less expensive to operate. SMW also does its own purging of both the parser cache and HTML cache, which seriously complicates any attempts to more carefully orchestrate the relationship between (for example) Grand Exchange updates and Cloudflare purging.

SMW is also extremely complicated – the code base is about 210,000 lines of PHP, much of which is not fully understood by anyone still involved with the project. There is a series of performance improvements that could in theory be made to SMW, but development is stuck in a bit of a weird place where nobody really knows why particular things were implemented a particular way, which makes further changes riskier and slower. I came to the conclusion that it would take less time to implement a separate extension and migrate our usage to it. SMW is also complicated in another sense: very few people on our wikis understand how to do anything other than very basic queries with it.

The third risk is sort of related to the second. SMW is complex enough that I can totally envision a scenario where the latest SMW release is not compatible with the latest MediaWiki release, and this blocks us from staying up to date with MediaWiki. This is especially scary due to the upcoming core MediaWiki parser rework that will make article parsing happen in chunks smaller than a full page (incidentally this is the same thing that might break the Variables extension, but that's a topic for another day). SMW doesn't have a great deal of institutional support right now (it's possible we are the single biggest user of it currently), and we might need to get urgently involved with SMW development if that situation comes to pass.

Altogether, it feels like a pretty good reason to explore alternatives. One option we looked at was Cargo, which gives much more of a SQL-y means of access to the underlying data (which I quite like). Unfortunately there are a number of things Cargo does that I really don't like – tying table definitions to templates, making it so you can only "attach" to a single table from a template (which Cargo users have found horrible workarounds for), and most prominently, making it so that a page's data only updates either on "page save" or on re-creation of the underlying table (meaning that editing a module would not update a page's Cargo data). I really wanted to like Cargo, but I think it won't work for us.

Goals of the extension[edit | edit source]

  • Support for the minimal feature set of the RuneScape wikis' use of SMW, including reading entries from the API and some basic visualization on a special page.
  • Simplicity in implementation: try to write as little code as possible, and avoid unnecessary structures for caching.
  • Simplicity in use: In general there should be one canonical way to do things, and users should not have to worry about manually updating or invalidating entries.
  • High-performance: minimize database queries and latency, minimize unnecessary parses and purges. Try to ride on top of existing "refreshLinks" constructs as much as possible.

Explicit non-goals:

  • Full feature parity with SMW or Cargo, including map support, drilldowns, subobjects, concepts, subquery support (mostly).
  • Generic support for versions of MediaWiki that are not used on Weird Gloop, or for DBMSes that we don't use. This is primarily meant to stay simple and maintainable by us, and supporting a burgeoning ecosystem of different wiki installations with all sorts of different dependencies is not what we're signing up for right now. This could change if other wikis are seriously interested in using Bucket, and if the scope of changes is relatively small, but right now the focus is squarely on MediaWiki 1.37 and MySQL 8.0.18 or higher.

Basic usage[edit | edit source]

Data is stored in "buckets", which are database tables with specific columns that you can define. Imagine we wanted to define a basic table to store monster drops (similar to Property:Dropped item, Property:Drop type, Property:Drop Rarity, etc, in current SMW). Each call to {{DropsLine}} would add a new row to this table.

If we wanted a "bucket" called "drops", with columns for the dropped item, rarity and quantity, we could go to Bucket:Drops and put the following:

    "item": {
        "type": "PAGE"
    "quantity": {
        "type": "TEXT"
    "rarity": {
        "type": "TEXT"

Supported types are PAGE, TEXT, INTEGER, DOUBLE, and BOOLEAN. There will soon be support for REPEATED fields (where we'd want to store, say, all of the inputs to {{Recipe}} as an array).

There are a few "reserved" columns that are added to every table, including "page_name" and "page_id", referring to the page that the bucket "put" call is invoked on.

Currently the only way to write to and read from a bucket is via Lua calls on the "bucket" object. I quite like this (and you can certainly implement templates that write to it generically, so most editors never need to touch Lua), although some people might prefer to have actual parser functions to use.

To store data in that bucket, via Lua, you'd say

bucket.put("drops", {item="Rune scimitar", quantity="3-5", rarity="1/128"})

And that's it!

Querying has a lot more options, but the simplest thing would be something like

  .select("page_name", "item", "quantity", "rarity")
  .where("item", "Rune scimitar")

This gives you all of the monsters that drop the Rune scimitar. Full documentation of the select syntax will be forthcoming, but it includes filtering based on values, JOINing on other tables, selecting from arrays, and filtering based on the categories on the page.

Technical description[edit | edit source]

Bucket is made up of three main pieces:

  • Table definitions in the "Bucket" namespace, where you define a database table, and can modify it with additional fields
  • Implementation of "bucket.put" that does INSERT and UPDATE queries on those tables
  • Implementation of queries, which translate the Lua syntax into a SELECT statement

Table definition[edit | edit source]

Bucket (the extension) adds a new namespace called "Bucket", namespace ID 9592. The content model for this namespace is JSON, which should prevent people from saving anything there that is not valid JSON.

A table definition is a JSON object with the field names as keys, and values as JSON objects with the following keys:

  • type, which can take on values of PAGE, TEXT, INTEGER, DOUBLE, and BOOLEAN (note that under the hood nearly everything is MySQL TEXT type)
  • whether the field is repeated (default false)
  • whether the field should have an index (default true). If the field has an index, it can be accessed efficiently as part of a JOIN or WHERE clause, with the caveat that it takes up about twice as much database space. Most fields are okay to be indexed, unless they're some big JSON cluster that is never going to be part of a query condition.
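
Concretely, a full field definition might look like the following. Note that only "type" appears in the earlier example; the key names "repeated" and "index" are my guesses based on the descriptions above, not confirmed syntax:

```json
{
    "item": {
        "type": "PAGE"
    },
    "metadata": {
        "type": "TEXT",
        "repeated": false,
        "index": false
    }
}
```

Here "metadata" is marked unindexed, on the theory that a big blob of text that never appears in a WHERE or JOIN clause shouldn't pay the ~2x storage cost of an index.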

When someone attempts to save a page in the Bucket namespace, a number of validation checks are run; if any of them fail, the save is rejected. This happens in Bucket::createOrModifyTable.

  • There needs to be at least one column in the schema
  • All field names must be strings
  • All field names must be only alphanumeric characters, or _
  • Field names are converted to lowercase
  • All field names must be unique, after lowercasing
  • The type must be a supported type
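
The checks above can be sketched roughly as follows. This is an illustrative Python model of the rules, not the actual PHP in Bucket::createOrModifyTable; the function and error messages are made up:

```python
VALID_TYPES = {"PAGE", "TEXT", "INTEGER", "DOUBLE", "BOOLEAN"}

def validate_schema(schema):
    """Validate a bucket definition per the rules listed above.

    Returns a normalized {lowercase_name: definition} dict,
    or raises ValueError if any rule is violated.
    """
    if not isinstance(schema, dict) or len(schema) == 0:
        raise ValueError("Schema must have at least one column")
    normalized = {}
    for name, definition in schema.items():
        if not isinstance(name, str):
            raise ValueError("Field names must be strings")
        # Alphanumeric-or-underscore check; the real check is
        # presumably ASCII-only, which this sketch glosses over.
        if not all(c.isalnum() or c == "_" for c in name):
            raise ValueError("Invalid characters in field name: " + name)
        lowered = name.lower()
        if lowered in normalized:
            raise ValueError("Duplicate field name after lowercasing: " + lowered)
        if definition.get("type") not in VALID_TYPES:
            raise ValueError("Unsupported type for field: " + name)
        normalized[lowered] = definition
    return normalized
```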

If this table is new, the schema will then be converted to a CREATE TABLE statement.

If it already exists, any changes against the old schema will be converted into an ALTER TABLE statement, which could include ADD COLUMN, ADD INDEX, or DROP INDEX. For performance reasons, DROP COLUMN is not actually implemented, since it could take many seconds and would break existing SELECT statements. When a field is removed from a bucket, no action is taken on the table – the column is hidden rather than fully removed.
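
Computing that delta might look something like this sketch (Python standing in for the PHP; the "index" key name and its default of true are assumptions from the field-definition description, and type changes are ignored for simplicity):

```python
def schema_delta(old, new):
    """Compute ALTER TABLE actions between two bucket schemas.

    As described above, removed fields are hidden rather than dropped,
    so no DROP COLUMN is ever emitted. Returns (action, field) pairs.
    """
    actions = []
    for field, defn in new.items():
        if field not in old:
            actions.append(("ADD COLUMN", field))
            if defn.get("index", True):
                actions.append(("ADD INDEX", field))
        else:
            old_idx = old[field].get("index", True)
            new_idx = defn.get("index", True)
            if new_idx and not old_idx:
                actions.append(("ADD INDEX", field))
            elif old_idx and not new_idx:
                actions.append(("DROP INDEX", field))
    return actions
```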

In a transaction with the CREATE/ALTER TABLE, we also add an entry to the bucket_schemas table, containing a JSON representation of the table's schema.

<NOT IMPLEMENTED> Buckets can also be moved or deleted. Moves are implemented as a RENAME TABLE plus a VIEW to keep the old name around. Deletes are implemented by moving the table to a random name and creating a VIEW on the old name. You should not be allowed to create a new table if the remnant from deleting it or moving it still has usages (either via bucket.put or SELECT).

<NOT IMPLEMENTED> We want to also support REPEATED fields, which will be implemented as the JSON type, using multi-valued indexes and letting people use the MEMBER OF function in WHERE conditions. This is the reason it requires MySQL 8.0.18 – it's not clear if MariaDB will support similar structures in the future.

bucket.put[edit | edit source]

Calling bucket.put in Lua calls the BucketLuaLibrary::bucketPut method.

Notably, this does not immediately do an INSERT with the payload – instead, it adds the puts to a bucketPuts field on the parser output, which the Lua library has access to.

Then the magical part is that the insertion happens when MediaWiki's LinksUpdateComplete hook runs. This hook runs when a page is saved, when it is moved or deleted, or when it gets re-rendered because a transclusion it depends on has changed. In short, it's exactly the hook we'd hope would exist: it fires in precisely the set of situations where we'd want to examine the puts again. It also has access to the bucketPuts that we put on the parser output, so no additional parses need to happen at all. You can see this in BucketHooks::onLinksUpdateComplete.

This calls Bucket::writePuts, which removes any fields that won't be valid columns in the bucket and then inserts them into the relevant table. It also updates the bucket_pages table with all of the tables that are used on each page.

<NOT IMPLEMENTED> Since the vast majority of our LinksUpdates do not change the calls to bucket.put, we should avoid doing INSERTs that are identical to the existing values. SMW doesn't do this, and it's the reason it is responsible for 75% of our database writes. Therefore, we want to SELECT all of the existing entries in each bucket for the page we're parsing, diff the existing entries against the new ones, and only submit the ones that have changed or are new.
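
The diff step could look something like this (an illustrative Python sketch, not the extension's code; in practice the inputs come from a SELECT on the bucket table and the outputs drive INSERT/DELETE statements):

```python
def diff_puts(existing_rows, new_rows):
    """Compare a page's existing bucket rows against freshly parsed puts.

    Rows are dicts of column -> value. Returns (to_insert, to_delete);
    rows present in both sets are untouched, so an unchanged parse
    produces zero writes. Duplicate identical rows are collapsed here
    for simplicity.
    """
    def key(row):
        return tuple(sorted(row.items()))
    existing = {key(r): r for r in existing_rows}
    new = {key(r): r for r in new_rows}
    to_insert = [r for k, r in new.items() if k not in existing]
    to_delete = [r for k, r in existing.items() if k not in new]
    return to_insert, to_delete
```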

Querying[edit | edit source]

Documentation to be written...
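
In the meantime, here is a rough illustration (in Python, purely as a sketch; the real implementation is PHP) of how a chained .select/.where query might translate into a parameterized SELECT statement. The bucket__drops table-name prefix and the %s placeholder style are assumptions, not the extension's actual conventions:

```python
def build_select(bucket_name, selects, wheres):
    """Translate a Bucket-style query into a parameterized SQL string.

    `selects` is a list of column names; `wheres` is a list of
    (column, value) pairs. Table and column names are assumed to be
    whitelisted against the stored schema upstream, so only values
    are passed as bind parameters.
    """
    columns = ", ".join("`" + c + "`" for c in selects)
    sql = "SELECT " + columns + " FROM `bucket__" + bucket_name + "`"
    params = []
    if wheres:
        conditions = " AND ".join("`" + col + "` = %s" for col, _ in wheres)
        sql += " WHERE " + conditions
        params = [v for _, v in wheres]
    return sql, params
```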

Things left to do[edit | edit source]

Roughly in order of priority:

  • Fully implement the "PAGE"-type fields, with versions
  • Support repeated fields with MEMBER OF and multi-valued indexes
  • Do the insertion as a delta rather than just INSERT and DELETE all
  • Support deleting/renaming buckets
  • Implement the bucket_pages table, and properly remove bucket entries for a page if every single entry for that type is removed
  • Give a way to specify JOINs on fragment-less pages (maybe .join and .joinVersion?)
  • Prevent writes on unsupported namespaces (specify with wg variable)
  • Support ORDER BY
  • Support IS NULL as a .where condition
  • Make sure the allSchemas cache is getting invalidated at the end of the page parse and teardown
  • Prevent changing the content model in Bucket NS
  • Consider specifying a different database and different DB user for security purposes
  • Set MAX EXECUTION TIME (500ms?)
  • Per-page data display (?action=bucket maybe?)
  • Per-bucket data display (some ?action on Bucket: page?)
  • API query support (required for RSLookup)