Sunday, October 4, 2015

ORES: Hacking social structures by building infrastructure

So, I just crossed a major milestone on a system I'm building with my shoe-string team of mostly volunteers and I wanted to tell you about it.  We call it ORES.
The ORES logo

The Objective Revision Evaluation Service is one part response to a feminist critique of power structures and one part really cool machine learning and distributed systems project.  It's a machine learning service designed to take a very complex design space (advanced quality control tools for Wikipedia) and allow a more diverse set of standpoints to be expressed.  I hypothesize that systems like these will make Wikipedia fairer and more welcoming while also making it more efficient and productive.

Wikipedia's power structures

So...  I'm not going to be able to go into depth here, but there are some bits I think I can say plainly.  If you want a bit more, see my recent talk about it.  TL;DR: The technological infrastructure of Wikipedia was built through the lens of a limited standpoint, and it was not adapted to reflect a more complete account of the world once additional standpoints entered the popular discussion.  Basically, Wikipedia's quality control tools were designed for what Wikipedia editors needed in 2007, and they haven't changed in a meaningful way since.

Hacking the tools

I had some ideas about what kinds of changes to the available tools would be important.  In 2013, I started work in earnest on Snuggle, a successor system.  Snuggle implements a socialization support system that helps experienced Wikipedia editors find promising newcomers who need some mentorship.  Regrettably, the project wasn't terribly successful.  The system works great and I have a few users, but not as many as the system would need to do its job at scale.  In reflecting on this, I can see many reasons why, but I think the most critical one was that I couldn't sufficiently innovate a design that fit into the social dynamics of Wikipedia.  It was too big of a job.  It requires the application of many different perspectives and a conversation of iterations.  I was a PhD student -- one of the good ones, because Snuggle gets regular maintenance -- but this work required a community.

When I was considering where I went wrong and what I should do next, I was inspired by the sudden reach that Snuggle gained when the HostBot developer wanted to use my "promising newcomer" prediction model to invite fewer vandals to a new Q&A space.  My system went from 2-3 users interacting with ~10 newcomers per week to 1 bot interacting with ~2000 newcomers per week.  Maybe I got the infrastructure bit right.  Wikipedia editors do need the means to find promising newcomers to support after all!

Hacking the infrastructure

So, lately I've been thinking about infrastructure rather than direct applications of experimental technology.  Snuggle and HostBot helped me learn to ask the question, "What would happen if Wikipedia editors could find good new editors that needed help?" without imagining any one application.  The question requires a much more system-theoretic way of reasoning about Wikipedia, technology, and social structures.  Snuggle seemed to be interesting as an infrastructural support for Wikipedia.  What other infrastructural support would be important, and what changes might that enable across the system itself?

OK.  Back to quality control tools -- the ones that haven't changed in the past 7 years despite the well-known problems.  Why didn't they change?  Wikipedia has always had a large crowd of volunteer tool developers who are looking for ways to make Wikipedia work better.  I haven't measured it directly, but I'd expect that this tech community is as big and functional as it ever was.  There have been loads of non-technological responses to the harsh environment for newcomers (including the Teahouse and various WMF initiatives).  AFAICT, the tool I built in 2013 was the *only* substantial technological response.

Why is there not a conversation of innovation happening around quality control tools?  If you want to build a quality control tool for Wikipedia that works efficiently, you need a machine learning model that calls your attention to edits that are likely to be vandalism.  Such an algorithm can reduce the workload of reviewing new edits in Wikipedia by 93%, but standing one up is excessively difficult.  To do it well, you'll need an advanced understanding of computer science and some substantial engineering experience in order to get the thing to work in real time.
The "activation energy" threshold to building a new quality
control tool is primarily due to the difficulty of building a
machine learning model.

So, what would happen if Wikipedia editors could quickly find the good, the bad, and the newcomers in need of support?  I'm a computer scientist.  I can build up infrastructure for that and cut the peak off of that mountain -- or maybe cut it down entirely.  That's what ORES is.

What ORES is

ORES is a web service that provides access to a scalable computing cluster full of state-of-the-art machine learning algorithms for detecting damage, differentiating good-faith edits from bad, and measuring article quality.  All that is necessary to use the service is to request a URL containing the revision you want scored and the models you would like to apply to it.  For example, if you wanted to know if my first registered edit on Wikipedia was damaging, you could request the following URL.


Luckily, ORES does not think this is damaging in the slightest.

{
  "190057686": {
    "prediction": false,
    "probability": {
      "false": 0.9999998999999902,
      "true": 1.0000000994736041e-07
    }
  }
}
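
If you'd rather hit the service from code than from a browser, here's a rough sketch in Python using the requests library.  Treat the host and URL layout as placeholders for the scoring endpoint (wiki, model, and revision ID in the path); check the live service for the exact form.  The response has the shape shown above.

# Rough sketch: requesting a damaging score for one revision over HTTP.
# The host and path here are placeholders for the scoring endpoint --
# check the live service for the exact form.
import requests

rev_id = 190057686
url = "https://ores.wmflabs.org/scores/enwiki/damaging/{0}/".format(rev_id)

scores = requests.get(url, headers={"User-Agent": "ores-demo"}).json()

# The response keys on the revision ID, just like the JSON above.
print(scores[str(rev_id)]["prediction"])           # False
print(scores[str(rev_id)]["probability"]["true"])  # ~1e-07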
We use a distributed architecture to make scaling the system up to meet demand easy.  The system is built in Python.  It uses Celery to distribute processing load and Redis for a cache.  It is based on the revscoring library, which I wrote to generalize machine learning models of edits in Wikipedia.  This same library will allow you to download one of our model files and use it on your own machine -- or you can just use our API.
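
If you go the local route, scoring with a downloaded model file looks roughly like this.  This is a sketch based on revscoring's documented usage; the exact class and module names, and the model file name, are assumptions and may differ between versions.

# Sketch: scoring a revision locally with a downloaded model file.
# Import paths, class names, and the model file name below follow
# revscoring's documented usage and may vary between versions.
import mwapi
from revscoring import Model
from revscoring.extractors.api.extractor import Extractor

with open("enwiki.damaging.gradient_boosting.model") as f:
    scorer_model = Model.load(f)

# The extractor pulls the feature values the model needs from the
# MediaWiki API for a given revision.
extractor = Extractor(mwapi.Session(host="https://en.wikipedia.org",
                                    user_agent="revscoring demo"))

feature_values = list(extractor.extract(190057686, scorer_model.features))
print(scorer_model.score(feature_values))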



Our latest models are substantially more fit than the state of the art (0.84 AUC vs. our 0.90-0.95 AUC), and the system has been battle tested.  Last month, Huggle, one of the dominant yet unchanged quality control tools, started making use of our service.  We've seen tool devs and other projects leap at the ability to use it.
