Adding a data capture system to WhoWasInCommand
By tlongers on 20180328 for the DataMade crew
Description and wireframes for the continued development of a data creation process within WhoWasInCommand.
Our objective is to create an appropriate, easy-to-use data creation tool for SFM. Presently, SFM’s data creation is done in a Google Sheet. Three different worksheets cover the three main entities (persons, organizations, events); their inter-relations are created manually by Staff Researchers; and a single entry, such as a particular person, may have numerous spreadsheet rows corresponding to changes in their status (like a new posting, or a promotion).
The benefit of doing it this way is that it is, at root, quite a simple process. Further, we take advantage of Google’s high assurance service, good performance, and an interface that is familiar to most people. Our approach has worked successfully to date but now presents us with a number of serious challenges:
-
As the SFM dataset has grown, so has its complexity. The difficulty of adding, updating, cross-referencing and checking the quality of data has become a threat to the sustainability of our work.
-
Representing the multi-dimensional nature of SFM data in the flat structure of multiple worksheets means SFM Staff Researchers are doing by hand the things a database should do.
-
Analysing the substance of the data requires that it be substantially transformed each time, which is a non-trivial challenge for non-programmers.
-
Bringing new Staff Researchers on board is prohibitively difficult - not only do they have to learn the ropes of a difficult method, but also a fiddly toolset.
-
Effective data review processes are difficult to build in the current system, as there is no way to see how records have changed and to easily specify how they should change.
During the creation of WhoWasInCommand.com in mid-2017 we asked DataMade to develop a data creation system. Using this, the Staff Researcher would first enter details about a single source and then pass through a series of data capture screens where they enter data about organizations and persons (and their inter-relations), and incidents.
This “source first” approach suits the use case where a single source contains a lot of information about different entities. In keeping the Staff Researcher focussed on “juicing” a specific source, it also has a role in quality assurance. However, at the point of developing this, we had not adequately looked at our needs. The workflow breaks down when editing or removing data, which is less linear and requires the analyst to have access to all the sources associated with a datapoint. A number of complementary data management processes would be required to enable the Staff Researcher to accomplish this quickly, including views of groups of sources, groups of locations and so on.
In this document we build on the aspirations, specifications and wireframes laid out in What product is Security Force Monitor making? (July 2017) and the excellent progress made in developing WhoWasInCommand.com.
As a reminder, the system roles are:
-
A guest is anyone with a web browser who visits WhoWasInCommand.com
-
A staff researcher is a member of Security Force Monitor tasked with the day-to-day updating and improvement of the data. They could be permanent (staff), short term (interns), or potentially personnel from partner organizations.
-
An administrator is a member of the Security Force Monitor who is a task manager for staff researchers. They control who can do what on WhoWasInCommand, resolve problems with the platform’s use, and manage the overall data publication process.
We sketched out the ideas below as a live interactive using Moqups.
Building on the work done by DataMade last year, and following a closer look at our workflows, we propose a different approach to building up our data. It involves a number of adaptations to the WhoWasInCommand backend and user interface. It is based on the following six function sets:
-
Exposing sources and locations as freestanding entities.
-
Manually choosing and linking entities rather than using a single, integrated workflow.
-
Using a “Source Picker” to link/de-link existing sources with specific data points in a record.
-
Using version control to improve the provenance of the data.
-
Using a simple reviewing and publishing system.
-
Accessing data for use in other tools through daily and on-demand complete data dumps.
Currently, Staff Researchers have to enter a source or location into the spreadsheet every time it is used, irrespective of whether that source or location exists. This is obviously time-consuming and a source of error: we propose an “enter once, use many times” approach. The prototype data capture tool already has the functionality to query a copy of OSM and assign an OSM object as a location. With some tweaks, it will work to enable us to bring OSM objects into WWIC.
Regarding sources, our current approach also means we are not able to manage our collection of sources with the control and certainty we need. To date we have entered sources into the spreadsheet in a semi-colon separated, simple citation format. This means that multiple sources will be found in a single cell, which has low usability. We also can’t perform basic analysis on the corpus of sources, for example to see which publications we have used, the effect that removing a source might have, or how they are connected to different data types.
We have taken the first steps to address this by extracting all the sources from the dataset, cleaning and standardizing them in line with Dublin Core Metadata Standards. The source parsing scripts developed for WWIC and in our own Source Archive project keep the connection between a source and a specific datapoint, but only take us some of the way to getting a clean dataset - there are lots of things which need manual attention. However, this will be completed shortly, enabling the more dynamic use of sources in WWIC (and an error-free import!)
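As a rough illustration of the first extraction step, the sketch below splits semi-colon separated cells into distinct sources and counts how often each is used. It is not the actual WWIC or Source Archive parsing code: the function names and the sample cells are hypothetical, and it assumes a semi-colon never appears inside a citation itself, which is one of the things that needs manual attention in practice.

```python
from collections import Counter


def split_sources(cell: str) -> list[str]:
    """Split one spreadsheet cell into individual citation strings."""
    # Assumption: ";" is only ever a separator, never part of a citation.
    parts = (part.strip() for part in cell.split(";"))
    return [part for part in parts if part]


def count_source_usage(cells: list[str]) -> Counter:
    """Count how many cells cite each distinct source."""
    usage = Counter()
    for cell in cells:
        # A set ensures a source repeated within one cell is counted once.
        usage.update(set(split_sources(cell)))
    return usage


# Hypothetical cell contents, for illustration only.
cells = ["Report A; Article B", "Article B;  Decree C ;"]
usage = count_source_usage(cells)
```

With a registry like this, questions such as "which sources are used most?" become a simple lookup rather than a manual scan of the spreadsheet.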
A Guest will be able to:
-
Look at a source and see which records and data points it is linked to.
-
Look at a location and see which organizations and incidents are linked to it.
A Staff Researcher will be able to:
-
Add, update and remove distinct sources in WWIC.
-
Find an OSM object and bring it into WWIC as an object that can be linked to other entities.
-
Link or unlink locations that are in WWIC to relevant fields in organization and incident records.
-
Link or unlink sources that are in WWIC to specific data points in any other record.
-
Look at a location and see which organizations and incidents are linked to it.
-
Look at a source and see which records and data points it is linked to.
An Administrator gains all the above capabilities of a Staff Researcher.
The live version of this page is here.
Example screen 3: finding an OSM object, and making it a “Location” that can be linked to other records
The live version of this screen is here.
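The “enter once, use many times” idea for locations can be sketched as a small in-memory store. This is only an illustration under stated assumptions, not the WWIC data model: `Location`, `LocationStore`, the record references and the OSM id used in the usage note are all hypothetical. The one firm detail is that OSM ids are only unique within an element type (node, way or relation), so the type is kept as part of the key.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    osm_type: str                      # "node", "way" or "relation"
    osm_id: int
    name: str
    # Each linked record is a (record type, record id) pair.
    linked_records: set = field(default_factory=set)

    def link(self, record_type: str, record_id: str) -> None:
        self.linked_records.add((record_type, record_id))

    def unlink(self, record_type: str, record_id: str) -> None:
        # discard() does nothing if the link was never made.
        self.linked_records.discard((record_type, record_id))

    def records(self) -> list:
        """All records linked to this location, in a stable order."""
        return sorted(self.linked_records)


class LocationStore:
    def __init__(self) -> None:
        self._locations = {}

    def add_from_osm(self, osm_type: str, osm_id: int, name: str) -> Location:
        """Return the existing Location for this OSM object, or create it."""
        key = (osm_type, osm_id)
        if key not in self._locations:
            self._locations[key] = Location(osm_type, osm_id, name)
        return self._locations[key]
```

For example, calling `add_from_osm("node", 1, "Example Barracks")` twice returns the same object, so an organization and an incident linked through either call both appear in its `records()`.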
In exposing sources and locations as building blocks, we propose a data creation workflow that is based on wiring these together dynamically. For example:
Staff Researchers discover a new commander for an existing organization. They first add the relevant sources into WWIC, then create a new person record. They then create a “new post” for that person, linking the person to the relevant organization. Throughout, they use the Source Picker (about which more below) to associate the right source(s) with the right data point(s).
A Guest has no new capabilities as a result of this.
A Staff Researcher should be able to:
-
Quickly add, update and trash any entity type.
-
Use short workflows to link and unlink existing records.
-
Interact with simple views showing current relationships between existing records, to choose relationships to update or remove.
An Administrator gains all the above capabilities of a Staff Researcher, in addition to:
- Completely remove items from the trash.
The initial cut of this process will involve a bit of mechanical labour for Staff Researchers; our aim is to discover optimizations to the data creation process as we use it.
Example screen 4: choosing and editing a relationship between one organization and another, and choosing the right sources to evidence it
A live version of this screen is here.
All SFM’s data is drawn from publicly available sources - these are the evidence we use for our work. All the fields for organizations and units have a corresponding field where the sources for the values are stored. All values have at least one source, and over half have between 2 and 60 sources. During data creation, sources are added and removed frequently. Further, when updating and reviewing data, Staff Researchers need to be able to quickly sort through the sources used to evidence different data points.
A Guest gains no new capabilities as a result of these changes.
Staff Researchers should be able to:
-
Search existing sources in WWIC.
-
Select a field to see which sources are linked to the value in it.
-
Select a field and link an existing source to the value within it.
-
Select a field and remove the link an existing source has with the value within it.
An Administrator gains all the above capabilities of a Staff Researcher.
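The four Source Picker capabilities listed above can be sketched as follows. This is a minimal in-memory illustration, assuming a data point is identified by a (record id, field name) pair; the class name, method names and identifiers are hypothetical and do not describe the existing WWIC code.

```python
from collections import defaultdict


class SourcePicker:
    def __init__(self, sources: dict) -> None:
        # sources maps a source id to its title.
        self.sources = dict(sources)
        # Maps (record id, field name) to the set of linked source ids.
        self._by_field = defaultdict(set)

    def search(self, text: str) -> list:
        """Ids of sources whose title contains text, ignoring case."""
        needle = text.lower()
        return sorted(
            source_id
            for source_id, title in self.sources.items()
            if needle in title.lower()
        )

    def link(self, source_id: str, record_id: str, field_name: str) -> None:
        if source_id not in self.sources:
            raise KeyError(f"unknown source: {source_id}")
        self._by_field[(record_id, field_name)].add(source_id)

    def unlink(self, source_id: str, record_id: str, field_name: str) -> None:
        self._by_field[(record_id, field_name)].discard(source_id)

    def sources_for(self, record_id: str, field_name: str) -> list:
        """Source ids linked to the value in one field."""
        return sorted(self._by_field.get((record_id, field_name), ()))

    def fields_for(self, source_id: str) -> list:
        """Every (record id, field name) pair a source is linked to."""
        return sorted(
            key for key, ids in self._by_field.items() if source_id in ids
        )
```

`fields_for` is the reverse lookup a Guest would use to see which records and data points a source evidences.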
Example screen 6: choosing which sources to link to values in the “Other names” field of a Person record
The Source Picker is inspired by the song lyrics annotation widget on Genius.com (example). A live version of this screen is here.
The provenance of SFM’s data is critical to its credibility, but measures demonstrating this are insufficient to our needs. Currently, WhoWasInCommand.com displays a basic changelog that shows how records were updated between different data imports. We propose to extend this to show in more detail when and how records were updated. This will also form the basis of the review and publishing system (outlined in the following section).
A Guest should be able to:
-
Look at the complete history of any specific record in WWIC.
-
See what changed between versions of this record.
Staff Researchers should be able to:
-
Look at the complete history of any specific record in WWIC.
-
See what changed between versions of this record.
-
Revert a record back to any previous version of itself (for example to correct an error).
An Administrator gains all the capabilities of a Staff Researcher above.
This is copied verbatim from the version system in use on Democracy Club’s election candidate database (example). It provides a full JSON-formatted version of the record, along with a simple diff between versions. A live version of how we imagine this screen working is here (scroll to bottom to view).
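To make the idea concrete, here is a minimal sketch of a record that keeps every version as JSON-serializable data, reports a simple field-level diff, and supports reverting. It is an assumption-laden illustration, not Democracy Club’s implementation or the WWIC one: `VersionedRecord` and `diff_versions` are hypothetical names, and a field that is absent from one version is simply reported as `None`.

```python
import copy
import json


def diff_versions(old: dict, new: dict) -> dict:
    """Return {field: {"old": ..., "new": ...}} for every field that differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        # A field absent from one version is reported as None.
        if old.get(key) != new.get(key):
            changes[key] = {"old": old.get(key), "new": new.get(key)}
    return changes


class VersionedRecord:
    def __init__(self, data: dict) -> None:
        self.versions = [copy.deepcopy(data)]

    @property
    def current(self) -> dict:
        return self.versions[-1]

    def update(self, **changes) -> None:
        """Store a new version with the given fields changed."""
        new_version = copy.deepcopy(self.current)
        new_version.update(changes)
        self.versions.append(new_version)

    def revert_to(self, index: int) -> None:
        # Reverting appends a copy, so the history itself is never rewritten.
        self.versions.append(copy.deepcopy(self.versions[index]))

    def history_json(self) -> str:
        """The complete history as a JSON document."""
        return json.dumps(self.versions, indent=2, sort_keys=True)
```

Treating a revert as a new version, rather than deleting later ones, keeps the provenance trail complete: the mistaken edit and its correction both remain visible.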
All data created by SFM is reviewed before being published on WWIC. Presently this is accomplished in our spreadsheet through the use of “scratch” sheets, Google Sheets’ comments, a field indicating the review status of a row of data, along with a field for review notes. The visibility of data is controlled crudely: if it’s imported into WWIC, it’s public. As part of an overhauled data creation process, we propose the introduction of features to flag records for review by a second Staff Researcher and control whether a record is publicly visible or not.
A Guest gains no new capabilities as a result of these changes.
Staff Researchers should be able to:
-
Flag a record for review by a second pair of eyes.
-
Look at a list of records that have recently been added, updated or reverted.
-
Edit a record that has been flagged for review, and indicate they have reviewed it.
-
Publish or unpublish a record on WWIC.
An Administrator gains all the capabilities of a Staff Researcher as outlined above.
This is inspired by the “Danger Zone” in GitHub’s repository settings. This widget can be seen live on this page (and on any record page, when being edited).
Example screen 9: A dashboard showing a logged-in user records that have been created, updated or flagged for review
A live version of this screen can be found here.
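The review and publishing features proposed in this section could be modelled along these lines. This is a sketch under assumptions rather than a specification: the field and function names are hypothetical, the rule that a reviewer must differ from the last editor is our reading of “a second pair of eyes”, and publishing is kept independent of review because the proposal does not say one must precede the other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordStatus:
    record_id: str
    last_editor: str
    flagged_for_review: bool = False
    reviewed_by: Optional[str] = None
    published: bool = False


def flag_for_review(status: RecordStatus) -> None:
    status.flagged_for_review = True
    status.reviewed_by = None


def mark_reviewed(status: RecordStatus, reviewer: str) -> None:
    # Assumption: a review only counts if done by someone else.
    if reviewer == status.last_editor:
        raise ValueError("a record must be reviewed by a second researcher")
    status.flagged_for_review = False
    status.reviewed_by = reviewer


def set_published(status: RecordStatus, published: bool) -> None:
    status.published = published


def visible_to_guest(status: RecordStatus) -> bool:
    """Guests only ever see published records."""
    return status.published


def review_queue(statuses: list) -> list:
    """Ids of records currently waiting for review, for a dashboard."""
    return [s.record_id for s in statuses if s.flagged_for_review]
```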
SFM uses a range of tools to analyse and present its data, including GIS, graphing and statistical tools. We do not expect WhoWasInCommand to recreate them, but to provide data in a format that is straightforward to use in other tools. WhoWasInCommand already does a lot of heavy lifting to process data into different formats, and we should take more advantage of this. We introduced basic download functionality with the launch of WhoWasInCommand, but have left it fairly unloved to date; it could be adapted to provide a set of dumps that correspond to common types of analysis we have to do.
A Guest should be able to:
- Download a range of complete, segmented (e.g. country, security force branch) and simplified (e.g. persons only, no sources) versions of the data and metadata in WhoWasInCommand, generated daily.
The Staff Researcher should be able to:
-
Download a range of complete, segmented (e.g. country, security force branch) and simplified (e.g. persons only, no sources) versions of the data and metadata in WhoWasInCommand, generated daily.
-
Trigger the creation of any prebaked query on demand.
Out of laziness, we have not mocked this up yet.
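In lieu of a mock-up, the segmented and simplified dumps described above could be produced along these lines. This is a minimal sketch assuming each record is a flat dictionary with a `country` field; the function name, field names and sample rows are hypothetical, and a real dump would draw on the formats WhoWasInCommand already generates.

```python
import csv
import io


def make_dump(rows: list, country: str = None, drop_fields: tuple = ()) -> str:
    """Return rows as CSV text, optionally segmented and simplified.

    country     -- keep only rows for this country (segmented dump)
    drop_fields -- column names to leave out (simplified dump)
    """
    if not rows:
        return ""
    # Column order follows the first row, minus any dropped fields.
    fieldnames = [name for name in rows[0] if name not in drop_fields]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if country is None or row["country"] == country:
            writer.writerow({name: row[name] for name in fieldnames})
    return buffer.getvalue()


# Hypothetical rows, for illustration only.
rows = [
    {"name": "Person A", "country": "ng", "source": "S1"},
    {"name": "Person B", "country": "mx", "source": "S2"},
]
complete = make_dump(rows)
simplified_ng = make_dump(rows, country="ng", drop_fields=("source",))
```

A daily job could call a function like this once per prebaked query and write each result to a file for download.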