User:Cook Me Plox/Exchange proposal

From Meta Wiki

tl;dr: Move historical data to an offsite database with API endpoint. Move current GE prices and volumes to only be in bulk modules, and rebrand Module:Exchange/ITEM pages to only include rarely-changing metadata.

The wiki's Grand Exchange data is used for two primary purposes:

  • Displaying current median prices in various places in wikitext, including infoboxes, prose, drop tables, etc
  • Displaying historical data for Grand Exchange graphs

Current issues[edit | edit source]

Our current way of putting GE prices on the wiki has its roots in the technology that was available to us in 2008 (wiki templates), 2012 (data as wikitext) and 2014 (Lua modules). These options were constrained by the fact that we didn't have any infrastructure of our own, and we felt obligated to fit what is essentially a database into wikitext.

The current approach has some huge downsides:

  • It requires about 18,000 edits a day by GazBot. This is approximately 90% of all edits that happen on both RS and OSRS.
    • This causes major performance issues for the web servers – pretty much every uptime hiccup we've had in the last year (that wasn't due to Cloudflare or OVH) was while GazBot was running.
    • GazBot's edits need to be spaced out in order to not overload the system. This has an unfortunate effect on the job queue where a single page might get purged multiple times over the course of the ~hour that GazBot takes to run. While it's obvious that the pages need to be updated (and caches invalidated) at least once per day when prices update, it's wasteful (and arguably misleading, even) to do it more than once.
    • It's just a LOT of revisions. It makes parsing the revisions table or recentchanges take much longer than it has any reason to, when so many of the edits are from GE updates.
  • Exchange-related bot edits are a huge portion of the total database size for en_rswiki, en_osrswiki and pt_rswiki. While the existential issue of quadratic blowup in the /Data pages' space usage was resolved by Kitty writing a script to auto-delete everything except the latest revision of the /Data pages, the revisions and associated text still add up, which affects database performance, cost, and portability (e.g. the ability to spin up a development version of the wiki).
  • Historical data is basically hardcoded into the HTML of the page by transclusion, and often the GE history is the majority of the byte count for item articles, exacerbated by our handling of switch infobox data and limit-report logs. Roughly 70% of the bytes on a typical item article are data for the GE charts.
  • It's basically a non-starter to get GE data to pt-br or any other potential non-English wikis, since we'd need to create a similar amount of new edits and storage, completely separately, over there. Doing some sort of transclusion seems dangerous and hard.

Part 1: put historical data offsite with API access[edit | edit source]

Note: The exact routes implemented are different than what was proposed. See here for the production URLs.

  • Create a new DigitalOcean droplet at [see open questions]
  • Move all Exchange history to sqlite[?] databases at rs.db and osrs.db.
    • en_names and pt_names
      • name[string, primary]
      • id[string]
    • history
      • id[string, primary]
      • timestamp[timestamp]
      • price[integer]
      • volume[integer, optional]
  • Create a read API endpoint, probably with node + express, at (e.g.)[all|sample|last90]
    • The last bit here depends on what sort of graph we want to show on page load. Should it be recent data, or all-time? I prefer all-time, and in that case we should take a uniform sample of, say, 100 equally spaced points over the price's history. Or we could just show the last 3 months or something.
    • Note: I'm intentionally not using query parameters, in order to force a canonical URL format that will be easier to cache.
  • Create a write API endpoint, requiring an API key, that GazBot will hit with bulk data, which will write to the database and purge Cloudflare. The writes should UPDATE based on the (id, timestamp) pair rather than just being a plain insert.
  • Turn on Cloudflare caching for this endpoint (36h TTL maybe?). I think just regular caching will be enough to seriously reduce the traffic to the point that we can serve just fine, but alternatively we could also add a workers KV layer. We also need code to purge when prices update.
  • Change GECharts gadget to read from this API rather than a div that it expects to have data already.
  • Stop writing to /Data pages, delete them, and remove all references.
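
A minimal sketch of the tables and write behavior described above, using Python's stdlib sqlite3 as a stand-in for whatever actually runs on the droplet. Table and column names follow the list above; the composite primary key, the upsert statement, and the `sample_history` helper (one possible implementation of the uniform-sample idea for the read API) are assumptions:

```python
import sqlite3

def init_db(conn):
    """Create the name-mapping and history tables sketched above.
    The (id, timestamp) composite key matches the proposed write
    semantics of updating on that pair rather than plain-inserting."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS en_names (
            name TEXT PRIMARY KEY,
            id   TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS history (
            id        TEXT NOT NULL,
            timestamp INTEGER NOT NULL,
            price     INTEGER NOT NULL,
            volume    INTEGER,          -- optional
            PRIMARY KEY (id, timestamp)
        );
    """)

def upsert_prices(conn, rows):
    """Write-API behavior: idempotent upsert keyed on (id, timestamp),
    so GazBot re-running a day's update doesn't duplicate rows.
    rows: iterable of (id, timestamp, price, volume) tuples."""
    conn.executemany("""
        INSERT INTO history (id, timestamp, price, volume)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (id, timestamp) DO UPDATE SET
            price  = excluded.price,
            volume = excluded.volume
    """, rows)

def sample_history(conn, item_id, n=100):
    """One take on the 'sample' route: n roughly equally spaced points
    over the item's full history, for an all-time chart on page load."""
    all_rows = conn.execute(
        "SELECT timestamp, price FROM history WHERE id = ? ORDER BY timestamp",
        (item_id,)).fetchall()
    if len(all_rows) <= n:
        return all_rows
    step = (len(all_rows) - 1) / (n - 1)
    return [all_rows[round(i * step)] for i in range(n)]
```

`ON CONFLICT ... DO UPDATE` needs SQLite 3.24+; any modern droplet image qualifies.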

Part 2: move all prices and volumes to bulk modules[edit | edit source]

We figured out about a year ago that putting all of the prices in one bulk data module had great performance characteristics, even when that module got to be 200KB or more. While having a page with only one price is imperceptibly faster (0.01 seconds) using single data pages instead of a module, this advantage disappears when you start to have multiple prices. I think the cut-off point where the bulk becomes better for runtime is about 2 or 3. The singles have an advantage for memory usage up until about 20 or so, but memory is rarely the resource of concern – it's hard to imagine we'd ever use more than about 8MB of memory on one page render with this approach, and the Lua limit is set to about 40 times that much.

[Image: Exchange timing.png]

The good news is that we're already doing bulk modules for prices, and it works pretty well and I haven't seen any complaints.

But...we're still updating the Module:Exchange/ITEM pages. The only things that are updated there from day to day are the price, lastPrice, and volume (and the associated dates for each). We could easily just make bulk modules for volume and lastPrice (or maybe even get rid of lastPrice?), and then the Module:Exchange/ITEM pages wouldn't need to be updated anymore. They could be purely for the other rarely-changing Exchange metadata like names, limits, categories, etc.

We should probably keep some semblance of a last updated date for the bulk modules. This can perhaps just be a separate key in the top-level data object.
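
Roughly what that could look like, sketched as a helper GazBot might use to render the bulk module. The `%LAST_UPDATE%` key name, the name-keyed table layout, and the date format are all assumptions, not the final design:

```python
import datetime

def build_bulk_module(prices, last_update):
    """Render a GEPrices/data-style Lua bulk module: one flat table of
    item name -> price, plus the last-updated date as a separate key in
    the top-level data object (key name here is an assumption)."""
    lines = ["return {"]
    lines.append('    ["%LAST_UPDATE%"] = "{}",'.format(
        last_update.strftime("%d %B %Y")))
    for name in sorted(prices):
        # Escape double quotes in item names for the Lua string literal
        safe = name.replace('"', '\\"')
        lines.append('    ["{}"] = {},'.format(safe, prices[name]))
    lines.append("}")
    return "\n".join(lines)
```

Since the date lives under its own key, Module:Exchange can read it with one extra table lookup instead of a separate page fetch.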

This would increase (probably still imperceptibly) the load time for singular Exchange pages, but this seems well worth it to get GazBot down from 18000 edits a day to 9-ish.

What about GazBot?[edit | edit source]

GazBot sadly will not be making 7 million edits a year anymore, but he'll still be very important! Going forward, he would still run on chisel, and do the following:

  • As before, check periodically for GE updates, and grab prices from the APIs.
  • Once all the data has been grabbed:
    • Hit the write API (can probably just be one request - probably <300KB payload)
    • Generate updated GEPrices/data, GELastPrices/data, GEVolumes/data for each relevant language, and put them on the wikis.
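
The shape of that run, sketched in Python (the payload layout, and the illustrative `lang:Module:.../data` page-title format, are assumptions; the actual HTTP call and wiki edits are left out):

```python
import json

def build_write_payload(prices, timestamp):
    """One bulk request to the write API (probably <300KB, as noted).
    prices: {item_id: (price, volume_or_None)} grabbed from the GE APIs."""
    return json.dumps({
        "rows": [
            {"id": item_id, "timestamp": timestamp,
             "price": price, "volume": volume}
            for item_id, (price, volume) in sorted(prices.items())
        ]
    })

def bulk_module_targets(languages=("en", "pt")):
    """The three bulk data modules GazBot regenerates per language.
    The 'lang:' prefix is just illustrative shorthand for 'on that
    language's wiki', not a real page-title scheme."""
    return [
        f"{lang}:Module:{name}/data"
        for lang in languages
        for name in ("GEPrices", "GELastPrices", "GEVolumes")
    ]
```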

How does this relate to the PHP extension?[edit | edit source]

There is a mostly-finished MediaWiki extension by Cqm that solves a lot of the same problems, and whose design influenced the history APIs proposed here. It has some benefits over this proposal (in that single database access for GE prices is probably better for performance). However, I prefer my proposal for a few reasons:

  • Building the history API as a standalone service means we aren't combining exchange behavior with the rest of the wiki application's logic and resources.
  • Keeping the main exchange behavior mostly the same as before, instead of turning it into a black-box-y extension backed by a database most people can't see, makes it more accessible to technical editors. Even the node/express history API is going to be more standard and modifiable by our technical users than a PHP extension that needs to be deployed by the tech peeps.
  • As far as I know, there are still ambiguities in how the extension will purge pages when prices change.
  • The extension requires a level of specialization that very few of us have, which is why it hasn't gotten to the finish line in 2+ years. I think we could knock out my proposal in about a week, especially if the on-wiki parts are helped along by our technical community.

Open questions[edit | edit source]

  • Do we let users update the GEPrices/data module manually? Leaning no – let's deprecate.
    • Resolved – we are removing the gemwupdate Gadget.
  • Do we want to keep the lastPrice and directional arrows on GEMW? Leaning no but open to feedback if people like them.
    • Resolved – lastPrice is sticking around as a separate bulk module.
  • Does the subdomain we put the history API on matter, for CORS reasons?
    • Resolved – CSP doesn't care about the subdomain. We are using
  • Should we run backups on the history database? I think it'll be about 2 GB.
    • Resolved – Running automatic backups with DigitalOcean
  • How do we handle name changes, or items that get removed? Just Python scripts against the sqlite db?
    • Resolved – We are only using the item ID (or in the case of indexes, a name like "Food Index"). We don't need to know anything about name changes.
  • How feasible or useful is it to permanently delete all of the Gaz GEBot and TyBot edits? It's about 60% of all-time en_rswiki edits, 70% of all-time en_osrswiki edits, and similar bots appear to be about 93% of pt_rswiki edits. Perhaps we could at least delete all consecutive edits from those bots other than the most recent in a streak (or maybe also creation?)
    • Resolved – We are going to purge old bot revisions.
  • Would this make it harder for GazBot to generate the bulk GELimits/data module? I assume right now the bot just grabs that when it reads the Module:Exchange/ITEM pages, which stops being a necessity once those are no longer regularly edited.

Work items[edit | edit source]

History[edit | edit source]

  • ✓ Set up exchange history server (Jayden)
  • ✓ Set up database + tables (Jayden)
  • ✓ Set up write API for history (Jayden)
  • ✓ Set up backups for history database[?] (Jayden)
  • ✓ Get initial dump of last 12 years' data into database (Gaz)
  • ✓ Set up read API for history (Jayden)
  • ✓ Set up Cloudflare domain caching (Jayden)
  • ✓ Add sample endpoint for read API (Cook)
  • ✓ Get index pages into database (Gaz)
  • ✓ Modify GazBot as proposed (Gaz)
  • ✓ Modify GECharts to read from API (Cook)
  • ✓ Deal with timestamp weirdness (Gaz)
  • ✓ Modify exchangeCreate and compare gadgets (Jayden)
  • ✓ Launch new GECharts (Cook)

Bulk modules[edit | edit source]

  • ✓ Modify GazBot to write bulk modules for volumes and maybe last prices (Gaz)
  • ✓ Convert Module:Exchange and Module:ExchangeDefault to not read these fields from Module:Exchange/ITEM pages (Cook)
  • ✓ Look through for other uses of Module:Exchange/ITEM (Cook)

Cleanup[edit | edit source]

  • ✓ Back up old Module:Exchange and Exchange: pages via Special:Export (Cook)
  • ✓ Add Exchange historical data to database (Gaz)

Remaining[edit | edit source]

  • ✓ Export Exchange (historical) to S3 (Cook)
  • ✓ GazBot update bulk modules on pt (Gaz)
  • ✓ Add Exchange: pages for pt (Cook)
  • ✓ Remove all non-current Exchange pages from pt (Cook)
  • ✓ Import modules + gadget to pt (Cook)
  • ✓ Delete all consecutive edits from exchange bots (Kitty?)
  • ✓ Remove non-nils from bulk modules (Gaz)
  • Remove switch infobox resources from RSW (Gaz)