KristinJensen edited this page Apr 4, 2012

Notes from Juxta Camp (Day One)

Introduction

These are notes from Day One of Juxta Camp held at Performant Software's offices in Charlottesville, Virginia on July 11, 2011. These notes represent an attempt to paraphrase the discussion and should not be taken as verbatim quotations from the participants. Brackets and ellipses mark spots where your note taker missed details.

Notes from Juxta Camp Day One (July 11, 2011)

Abigail Firey: [your note taker missed the very beginning of the discussion] … transcription and edition

Medievalists think of an edition as something constructed from multiple witnesses etc—different from a working transcription

Dana Wheeles: two types of scholars: 1. Want working text to play with 2. Bibliographic, careful about specific witnesses etc.

Alex Gil: constraints of book—can only produce one text; electronic allows multiple texts, easier to visualize differences

Same information presented differently

Abigail Firey: we are going back to the original intention—allow user to understand what is going on in each witness

Nick Laiacona: not much movement in intention; movement in product

Abigail Firey: there will be a lot of new questions

I went to a meeting of the European Society for Textual Scholarship—they are grappling with implications of digital world

New questions aren’t formulated yet

Alex Gil: what do the new questions smell like?

Abigail Firey: my question: how we’re going to think about whether variants are significant or not

This has always been a problem

Automation allows us to show all variants all the time—we can get overwhelmed by what were traditionally called insignificant variants

So much was left off the printed page because of physical constraints but also conceptually—marginalia was marginal

People are now more interested in reader response e.g. interlinear glosses

Alex Gil: once we separate content from form you can generate as many views as you want

You can make insignificant stuff invisible

You can generate a series of documentary editions and critical editions on the fly

Each user can produce their own edition, play with it

Abigail Firey: that’s the dream

Anytime you lay hands on the text you’re intervening

Markup is a basic level of intervention

Preparation of texts for philologists and literary people: markup will play out in different ways

Developers and humanists could get into a dark alley if they try to build for all potential users at once

At CCL we decided we can’t build for philological analysis; that’s someone else’s project

Alex Gil: I’ve heard this for a while now: we can’t build for everyone

John Unsworth says there is a common denominator

One common denominator is transcriptions

Another is yes or no variations, basic, logical: difference, similarity

We share those things; everything else can be added on the level of the markup

Abigail Firey: there are transcription issues . . . you’re right, but to play devil’s advocate: most people reading the text will automatically expand Latin contractions—some people will care about distinction; linguists care about u versus v in Latin texts

There’s going to have to be negotiation

Developers can be aware that there are these swirling clouds

Nick Laiacona: the open source development model meshes with how scholars work, how tools need to develop

You can have common code

To the extent that they don’t agree they can build their own forks on the project

Starting from five years worth of development; add the tool that I need

Abigail Firey: agree, think it will happen at the tool level

Dana Wheeles: agree

Alex Gil: agree

Tools separate markup from transcription

You can have specialized marker, do search and replace

Abigail Firey: IF everyone can agree to the same standards

Standards are like toothbrushes, nobody wants to use anyone else’s

Every transcriber is doing things differently

You can make guidelines, but they get unwieldy

Nick Laiacona: TEI is a data format with no tools to read it

E.g. web browsers consume HTML

Software has its own gravitational field that bends the protocols

Abigail Firey: yes; ambivalent about TEI

We have to wrench our protocols around TEI in illogical ways

TEI presumes that any text is highly, systematically structured

Alex Gil: possibilities of Juxta

Most of what it can do is outside of TEI

It can be TEI-agnostic

The future of computational power resides in algorithms, not markup

Any tool for edition-building can be added

Nick Laiacona: the diff layer is TEI-agnostic; the analytical layer is not

Abigail Firey: you’re right to go around TEI

On the other hand, we’re putting in that markup for a reason

Some is about textual matters that matter in collation

How much is it about visual collation?

What if scribe changed a display line?

Rubric in display text versus main text

That would be a distinction in TEI

Is Juxta going to support font differentiation, visual tricks to show such distinctions?

Alex Gil: three solutions

I believe in bibliographic markup: not semantic, interpretive

My markup worries about line length, margins, etc.

That can be translated into HTML markup, compare margin sizes etc.

The way you transition to that to allow different communities to use it . . .

Create style sheet builder within Juxta

I am already working on style sheets

Second level: tool would allow users to indicate the text they want to render

As opposed to building style sheets from scratch

Third possibility: have Juxta generate standard views—flatten out everything but in a meaningful way

Each approach works for different users

Last one works for most users

Second works for specialized user

First for bibliographic users

Abigail Firey: adding forks between users . . .

Bibliographically inclined, physical artifact versus those who are doing semantic work

There are real forks there

TEI modules reflect that fork

We want to use msDesc tags but you can’t use them within the textual module

It’s assumed they describe not contain

But I have paleographic features that have implications for the text; can’t use those tags; need workarounds

Division within the community

Medievalists are an interesting body of people to consider because we have a balance of relating the physical artifact to the text . . . not unique but we push it in different ways

Because so much of our evidence is … [note taker missed the predicate here]

Manuscripts are different from printed books

There are overlaps but you fork into different presumptions

Dana Wheeles: this would be a good point to talk about where the Juxta code stands now

Latin noise filter

More useful?

Abigail Firey: yes

Nick Laiacona: have you seen 1.6?

Need dropbox access

Alex Gil: Juxta allows you to separate layers into virtual versions

Abigail Firey: that’s what we were working on with 1.5

Alex Gil: want Juxta to compare documents in different layers

Abigail Firey: some people are going to want to be able to collate glosses

Abigail Firey: sample text is anomalous

Texts are broken into smaller units (canons, statutes)

Those get shuffled, compilers add prefaces, etc.

Many times you’re not comparing texts that resemble each other

Not an easy problem to solve

We’re good for Juxta: we present extreme cases

Nick Laiacona: macrons . . .

Abigail Firey: in many manuscripts, there will be macrons in every word

If transcribers can’t expand an abbreviation, they’ll leave macrons

e.g. an n with a macron standing for "non" will appear in the transcription simply as "n"

Things frequently abbreviated in Latin manuscripts: inflection endings, verb endings, common words (quam, quoniam, enim), e.g. qm with macron = quam or quoniam

Prefixes e.g. per, pre, post, pro – scribes may use different abbreviations, p with different marks

We made the editorial decision that all abbreviations will be silently expanded; if there’s a problematic one—

Alex Gil: expanding at level of transcription? Really?

Abigail Firey: yes, that is standard practice

Philosophy of editions

Medieval scholars: edition is supposed to help you read the text in a way the manuscript doesn’t

Editor makes text more readable, not fight through abbreviations

It is accepted practice to silently expand abbreviations, note in preface

Can note if scribes consistently use/misuse an abbreviation

Exception: some editors and rich publishers will put expansions into italics

When you are doing that with every word—a nightmare for printers, very expensive

Most critical editions didn’t bother

Heard editors in Europe (Spaniards) proclaiming that every variation would be marked

Are you going to allow people like that to show expanded abbreviations?

Nick Laiacona: options: change font or color of text; put note on side . . .

Alex Gil: it would be easy to code abbreviations and expand them

Technologically speaking, having software run through them and let an operator select from options – not that hard
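The semi-automated expansion Alex describes could be sketched as a rule table mapping abbreviated forms to candidate expansions: unambiguous entries expand silently (the standard editorial practice Abigail mentions), ambiguous ones are deferred to an operator. The rule table and function names below are hypothetical illustrations of the idea, not Juxta's actual code.

```python
# Sketch of rule-based expansion of Latin scribal abbreviations.
# Unambiguous abbreviations expand silently; ambiguous ones return
# all candidates for an operator to choose from.
# The rule table and function names are illustrative, not Juxta's code.

ABBREVIATIONS = {
    "n\u0304": ["non"],               # n with macron: unambiguous
    "q\u0304m": ["quam", "quoniam"],  # qm with macron: ambiguous
}

def expand_token(token):
    """Return (expansion, ambiguous_candidates_or_None)."""
    candidates = ABBREVIATIONS.get(token)
    if candidates is None:
        return token, None          # not an abbreviation
    if len(candidates) == 1:
        return candidates[0], None  # silent expansion
    return token, candidates        # defer to the operator

def expand_text(tokens, choose=lambda tok, opts: opts[0]):
    """Expand a token list, asking `choose` to resolve ambiguities."""
    out = []
    for tok in tokens:
        expanded, options = expand_token(tok)
        out.append(choose(tok, options) if options else expanded)
    return out
```

In an interactive tool, `choose` would be the dialogue that lets the operator pick among options, which is the "not that hard" part Alex refers to.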

Abigail Firey: at a practical level, when I’m transcribing a large text, I don’t always see abbreviations; I just read them

I have to stop, think

Need convenient way to record

It’s easier to write “per” than the abbreviation

Some odd ones aren’t easily reproduced

It’s going to make transcription not practical

I think the Spaniards are nuts—they’re going to go at snail’s pace; will have to triple-proofread

Alex Gil: we would like Juxta to remain agnostic; allow both practices

Any meaning happens at the level of markup

If Juxta can have an abstract way of dealing with all markups to indicate how they’ll be represented—allow user to choose—Juxta generates representation

Abigail Firey: when we hit problematic abbreviation where you don’t know if it’s quam or quoniam use TEI abbrev tag; it has certainty level
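In TEI, an unresolved abbreviation of this kind is typically encoded with a `<choice>` pairing the abbreviated and expanded forms, with the global `cert` attribute recording confidence. A minimal sketch of reading such markup, assuming an illustrative fragment rather than CCL's actual schema:

```python
# Reading a TEI-style <choice> recording an uncertain expansion.
# Element and attribute names follow the TEI guidelines
# (<choice>, <abbr>, <expan>, @cert); the fragment is illustrative.
import xml.etree.ElementTree as ET

fragment = """
<choice>
  <abbr>qm</abbr>
  <expan cert="low">quam</expan>
  <expan cert="low">quoniam</expan>
</choice>
"""

root = ET.fromstring(fragment)
abbr = root.findtext("abbr")
expansions = [(e.text, e.get("cert")) for e in root.findall("expan")]
```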

Alex Gil: useful to think backwards from representation, not forward from manuscript

What do I want to show?

I go back to Cornell editions on paper of Yeats

They used system that captures …

I don’t think we can get better than that for representing layout

Abigail Firey: or the edition of The Waste Land

Alex Gil: yes

Start with HTML; how many steps do I need to get there from TEI

How TEI must work

How Juxta must generate this

At an abstract level, all that Juxta needs to do is recognize positions transform into CSS

Comparisons: recognize difference and similarity

Different categories: addition, deletion, move

Different category: abbreviation expansion

Expansion is still going to be compared under addition, deletion, move

All you really need Juxta to do is recognize categories at the tag level that you want highlighted

Whether you want Juxta to mark as difference, similarity, or ignore

Abigail Firey: there are a few cases where for some disciplines there’s additional material e.g. mark folio number—that’s neither addition, deletion, nor move

Alex Gil: there’s two algorithms and the absence

Difference, similarity (mark sameness)

Ignore function: absence of similarity and difference: e.g. for pages—you want to represent but you don’t want algorithms to run on those things

Can mark things to ignore but it really ignores

What if you want to half ignore: still be there, but not noticed by Juxta
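The three treatments under discussion, compare, fully ignore, and "half ignore" (kept for display but invisible to the algorithms), could be modeled by tagging each token with a policy and feeding only comparable tokens to the diff. A minimal sketch with hypothetical names:

```python
# Sketch of the compare / ignore / "half ignore" distinction:
# each token carries a policy; the collation stream drops ignored and
# half-ignored tokens, while the display stream keeps half-ignored ones
# (e.g. folio numbers) visible. Names are hypothetical, not Juxta's API.
COMPARE, IGNORE, HALF_IGNORE = "compare", "ignore", "half_ignore"

def collation_stream(tokens):
    """Tokens the diff algorithms actually run on."""
    return [text for text, policy in tokens if policy == COMPARE]

def display_stream(tokens):
    """Tokens the reader sees: everything but fully ignored material."""
    return [text for text, policy in tokens if policy != IGNORE]

witness = [
    ("[fol. 12r]", HALF_IGNORE),  # in the view, invisible to the diff
    ("quoniam", COMPARE),
    ("<pb/>", IGNORE),            # dropped everywhere
    ("dixit", COMPARE),
]
```

This captures Abigail's folio-number case: the marker is neither addition, deletion, nor move, so it belongs in the display stream but not the collation stream.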

Sometimes we don’t want to compare the whole text

[shows chart of text pieces]

Abigail Firey: yes; e.g. take out preface

This chart could represent my work

Might want to collate little chunks

Alex Gil: we need to re-imagine how Juxta handles workflow

There should be options

e.g. start slow

Juxta does not handle this complexity

We could handle my text if we added the function to allow user to specify parts to compare

Problem: a diff is a second level operation to a similarity

diff never comes first

It assumes alignment

But sequence in texts is unstable

Imagine a text with 66 moves …..
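The alignment problem raised here, that a diff presumes stable sequence while these texts shuffle their units, suggests a first pass that pairs similar fragments across witnesses before any diff runs, so a move becomes a matched pair rather than a delete plus an add. A toy sketch using Python's standard `difflib`, not Juxta's collation code:

```python
# Sketch of aligning shuffled textual units before diffing: score every
# pair of units across two witnesses with a similarity ratio, then pair
# greedily, so a "move" surfaces as a matched pair of units.
from difflib import SequenceMatcher

def align_units(units_a, units_b, threshold=0.5):
    """Greedily pair the most similar units; return (i, j, score) triples."""
    scores = [
        (SequenceMatcher(None, a, b).ratio(), i, j)
        for i, a in enumerate(units_a)
        for j, b in enumerate(units_b)
    ]
    pairs, used_a, used_b = [], set(), set()
    for score, i, j in sorted(scores, reverse=True):
        if score < threshold:
            break
        if i not in used_a and j not in used_b:
            pairs.append((i, j, score))
            used_a.add(i)
            used_b.add(j)
    return sorted(pairs)

a = ["in principio erat verbum", "et verbum erat apud deum"]
b = ["et verbum erat apud deum", "in principio erat uerbum"]  # swapped
```

Once units are paired, an ordinary diff can run within each pair; the unmatched remainder is what genuinely has no counterpart.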

Dana Wheeles: we need to think about our agenda

Hold off for your presentation

Alex Gil: for real life editors, the ability to handle texts that break down into many pieces is essential

Abigail Firey: agree

Medieval texts: authorial version doesn’t exist/matter

Texts are freely changed

Floating around in small units getting recombined

Alex Gil: applies across history, culture of reprinting

Abigail Firey: not reprinting the whole in a standard way

Mashups, remixing

Abigail Firey: Juxta has to be built for that presumption

Nick Laiacona: we assume a file with text, let’s diff it

Deeper problem: find common start points

Alex Gil: Juxta started by thinking of files as versions

Needs to think of texts as fragments, containers of fragments

Abigail Firey: have to retain model of different versions because a lot of editors will be doing that

Alex Gil: yes, necessary subset

Abigail Firey: a lot of Juxta users will be traditional, two versions …

Digital world: ability to edit texts we’ve never been able to edit because they’re mashups

Nick Laiacona: the developers of Juxta are hoping to find interface for web service

What kind of questions can you ask Juxta and what will answers look like?

Abigail Firey: user interface

Colleague was baffled by it

Not as transparent as one might think for first time user

Nick Laiacona: users are on the web: future of Juxta is on the web

It will be a long time before we can make the web version do everything the desktop version does

What do you put on web?

We’re talking about 2 interfaces: Java application, HTML comparisons in edition search

Alex Gil: you use Juxta to prepare an edition, but it is not the edition

You can make edition available—you wouldn’t want to use Juxta view

Abigail Firey: wondering about what gets lost in move to HTML

For a long time Classical Text Editor was stalled

CTE back in the game

They do full preparation for a print edition

Two functions for web Juxta:

  1. web replication of local software

  2. next stage of edition preparation

Alex Gil: no, not preparation stage, consumption stage

Abigail Firey: I think there are two different stages and I’m not sure you want to substitute the second for the first or consider them one

Nick Laiacona: scholar working with text versus reading version?

Abigail Firey: we have Drupal environment, individual spaces password protected

People can take bits from our database into individual work spaces

Work with Juxta, export from CCL editorial space into whatever format

Dana Wheeles: my concern: we’ve been thinking of web service as a pared down version

Difficult things would need to happen with downloaded Juxta

Question of how much we can do in a browser

Nick Laiacona: no not a question of how much you can do

Want to have everything in web environment eventually

Initially, you want to be able to share results on web

What’s the essential step that needs to be displayed

You seem concerned Abigail?

Abigail Firey: I think the current Juxta display of collations / differences / moves / notes would be useful to display on web, consumable

Nick Laiacona: I am confident we can reproduce that on the web, have prototype

Abigail Firey: what do you expect people to do on local, what on web

Nick Laiacona: That’s what we’re here to decide

Abigail Firey: I am having trouble imagining what users would do, what would be visible to whom

Nick Laiacona: we need your help to figure it out

Abigail Firey: I want all of it

Dana Wheeles: we need to prioritize

Are people going to be stripping XML, adapting schemas? (In private work spaces)

Abigail Firey: our users will use our online transcriptions in a Drupal database, XML files

Alex Gil: will they have the comparison view?

Nick Laiacona: yes

Alex Gil: imagine separating the process of creating from how you show them to users

Juxta will generate the view automatically

Abigail Firey: CCL does not determine which texts will be compared

The user determines

Clarification: end users, not just editors

Alex Gil: you want end users to use regular Juxta as interface

Dana Wheeles: assuming end users are scholars

What sorts of questions?

Abigail Firey: CCL is agnostic, we do not predetermine the questions

We make the corpus available for collation

Are texts related? Do canons change?

What if I choose a large collection of texts that all come from Paris, for example – do they read differently from the Berlin versions?

We don’t want to predetermine

We want scholars to explore

Then say it makes sense to edit 60 manuscripts from Paris (for example)

Alex Gil: so users are still working before the final product

Abigail Firey: yes, CCL does not host editions

It is an environment for people to find what they might want to edit

Juxta is part of the research environment, open to everyone, not just the editorial team

Dana Wheeles: that’s why I think this can work well

One can set up workflows without determining what work people can do

People can always ask for different capabilities

Give users certain abilities and they will gain confidence

Abigail Firey: people can take a copy of transcriptions

Get it into their own workspace, play with the text (fragment or whole)

Choose a comparison set

Start comparing

Open their workspace to other scholars: look at this comparison set, what happens when I change the base …

Once they figure out the most useful comparison set, then export selected data and work with it locally

Since transcriptions are on the web, workspace is on the web, it’s the easiest place to show comparisons to other people

Nick Laiacona: ultimately the whole thing should be on the web

But I don’t think we can do it this year

I think we can get side-by-side comparison on the web

Discover, load transcriptions, compare, share with others

Abigail Firey: what can’t go on web?

Nick Laiacona: it’s easier to say what can

Simple side-by-side comparison of two texts

The web service would allow you to get information about the entire set you’re comparing

What I’m thinking: modify desktop Juxta to make it easy to discover texts from CCL and load into comparison sets

Abigail Firey: if necessary we can instruct users to download to local

Uneasy about having them upload stuff to CCL site

What about heat map, histogram?

Nick Laiacona: those would be the next things

Abigail Firey: not in first web version?

Nick Laiacona: we could delay and make them part of the first web version … not hard once we get other parts working

Abigail Firey: what kind of export function?

Will you be able to generate a critical apparatus?

Dana Wheeles: you would want HTML, a link to place on web where you can see the comparison

What format makes sense?

An HTML apparatus is not useful

Abigail Firey: a “generate critical apparatus” function would be useful—presumes download to another environment

XML for people who want to retain XML markup for other web publications

We’re trying to make our schema standard for medieval canon law texts

Our markup may be useful for other medieval canon law projects

Something that sensibly strips the text and gives plain text reading view

Dana Wheeles: we’re on the way there

Within the .jxt file you have a copy of the files that are being collated in the original XML format

If they download the .jxt file of the comparison set they’ve got the markup schema

Abigail Firey: they didn’t have to export from Juxta—once they do the exploration and know what files they want…

Nick Laiacona: we can generate a plain reading copy

You upload a text, compare it with another

You should have a URL that goes right to that text

Download that in HTML

Produce a version that has no markup whatsoever

Abigail Firey: what have most users been doing after collating in Juxta?

Dana Wheeles: most people aren’t interested in critical apparatus

They take screenshots of critical parts of visualization

Nick Laiacona: they’re not sharing it effectively

Abigail Firey: they’re not editing

Dana Wheeles: they’re using it as an analytical tool

What Alex is talking about is beyond what we can do

Nick Laiacona: there used to be no editing function

Now you can choose what to keep and what to exclude

Abigail Firey: I’m struck that people are collating for reasons other than to produce an edition

Alex Gil: in theory Juxta should allow discovery AND generating editions

Generating an edition is just a matter of export

Dana Wheeles: but everyone will have a different idea of what they want

Alex Gil: the answer is style sheets

The data is same for everyone

Abigail Firey: I wonder if partnership with editorial software like CTE would supply that stage; export functions are already in CTE

….

Nick Laiacona: Juxta is living in the research world more than in publishing and consumption

To what extent should it extend into publishing and consumption?

Abigail Firey: or interface with existing software …

Dana Wheeles: it’s a tool for analysis right now

We want to make sure people can do all the analysis they want

Abigail Firey: how do you harvest that?

Alex Gil: analysis is useless if I have to retype the data to produce an edition

Why recode everything if Juxta performs the tasks?

Have had to add span tags to HTML and use JavaScript to align things

That’s what I’m publishing

I don’t want readers to mess with my stuff, just read it

Dana Wheeles: we’re talking about different things: a web service with some level of analysis, and a web interface to give a better representational tool for analysis

Alex Gil: an analytical tool can easily generate something that’s not a tool, an edition

Dana Wheeles: in order to develop Juxta for the web and get to that endpoint, how can we strategize what we want to build, to get sense of how people want to use the service?

Do we build it to be easier to compare two texts side by side, or do we start with a heavy version?

Alex Gil: I vote to start simple; anyone who wants to do heavy work can download the desktop version

Dana Wheeles: we want to build things in for CCL that will make it easy for users

Abigail Firey: heavy analysis = collation, side-by-side, histogram, heat map; generate critical apparatus—get a list of variants—from that you can build an edition

That’s the current “exit door”—not bad

It’s essential to have some way of capturing collation—that’s what a critical apparatus has always been

Users can do other things with that

Users can see a lot, analyze a lot, but can’t DO much when it’s all just visualized in Juxta—visualization is not the end point

We need to bridge Juxta and the scholar’s problem

Dana Wheeles: going to the web service: we’re not stuck with one form of critical apparatus

.jxt is the critical apparatus

It contains the texts under question

Abigail Firey: how do we transfer it from user to user?

Nick Laiacona: use the .jxt file to transport

The software runs algorithms

You’ve chosen what witnesses are included, starting points, etc.

It’s a way to share with someone else

Bridge from research environment to editions: bring .jxt file to own system; use Juxta in different way; take HTML representations

Transforming TEI into HTML is important

Abigail Firey: people can get TEI from us

Get analytical Juxta on the web

Give greater exposure to users and editors

We can make the case that we need funding for next stage of development: we’ll need a better interface for producing consumable editions—that feels like another major round of development for separate funding

Nick Laiacona: there’s the distinction about starting point for texts

Alex Gil: the dual view should be standardized

Edition building: a stage of funding

Dana Wheeles: I could imagine going to CCL and as we’re working on features for discovering similarities we could take small batches of texts and pre-run them and say hey, the computer has found interesting similarities—would you like to take a look?

Abigail Firey: CCL users are already panting to collate

They’re limited by the current size of the data set

Nick Laiacona: when you put a file in a comparison set—it automatically compares to all others

You could already have all cross-comparisons

Alex Gil: we are already primed to export an XML file that has differences produced by the algorithm encoded in it

Nick Laiacona: there’s an issue with web service of real time computation versus batched computation ahead of time

Perhaps there’s a way using somebody’s laptop to batch pre-compute their entire edition on their laptop and upload that .jxt file

Abigail Firey: I’m wondering about what happens when our corpus gets big

Alex Gil: once you do the work, you want to save it, not redo it

Nick Laiacona: Dana went to get lunch

Ron and Gregor are not here yet – we may see them at 2:30 or 3:00, or tomorrow

Let’s go on the assumption it’s going to be just us today

Let’s spend more time on discussions tomorrow instead of hacking

Alex Gil: shall we show and tell the new features?

Lou Foster: you saw spot editing in 1.5?

Have you seen templates?

Alex Gil: I never saw 1.5

Lou Foster: new features:

Parsing templates

When it sees a new schema, it pops up a dialogue and allows you to include or exclude different elements

Abigail Firey: we decided to have one global thing that says “exclude most of the header tags”

You don’t want to exclude the TEI tag or you get a blank screen

Lou Foster: template files

Nick Laiacona: you could create a “skip headers” template

Lou Foster: you can save / rename / change template

Templates are associated in two places: 1. Tied to Juxta file; 2. Library of templates not tied to Juxta file
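The parsing templates Lou demonstrates, a named, reusable choice of which elements to include or exclude when flattening a file for collation, could be modeled as a named set of excluded tags. The template name follows Nick's "skip headers" example; the code is an illustrative sketch, not Juxta's template engine.

```python
# Sketch of a parsing template: a named set of elements to exclude,
# applied while flattening TEI-ish XML into plain collatable text.
# Template and tag names are illustrative, not Juxta's engine.
import xml.etree.ElementTree as ET

TEMPLATES = {
    "skip headers": {"teiHeader", "note"},  # elements dropped entirely
}

def flatten(element, excluded):
    """Collect text from an element tree, skipping excluded elements."""
    if element.tag in excluded:
        return element.tail or ""
    parts = [element.text or ""]
    for child in element:
        parts.append(flatten(child, excluded))
    parts.append(element.tail or "")
    return "".join(parts)

def apply_template(xml_text, template_name):
    root = ET.fromstring(xml_text)
    return flatten(root, TEMPLATES[template_name]).split()

doc = ("<TEI><teiHeader><title>MS A</title></teiHeader>"
       "<text>quoniam dixit</text></TEI>")
```

Note that the root element itself must not be excluded, which is Abigail's point that excluding the TEI tag yields a blank screen.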

Alex Gil: can you edit on the fly?

Lou Foster: menu … [demonstrates]

Abigail Firey: could you add a “save as” or “generate new” button to dialogue?

Nick Laiacona: there are two sets of templates: current and archived

Lou Foster: spot editing

Edit button on lower right

Abigail Firey: that terrifies me. What about version control?

Lou Foster: there is none.

Alex Gil: you’re supposed to write in the editorial statement

Nick Laiacona: we could generate a new witness rather than editing the original

Alex Gil: down the road, you want to share files, make an open source edition;

Someone else can make changes to my file

What happens when the next person downloads?

Versioning is crucial

Abigail Firey: this is a problem for editors: we’re working with to-the-letter accuracy

You’re never going to remember your quick fixes

Now your file is different

You lose track of whether you’re looking at an edited file

Pop up a warning in red: you have changed this file, do you want to change its filename?

Nick Laiacona: do you want to make a new witness or replace this one?

Alex Gil: save it for history—at no point should you erase the history

Nick Laiacona: the origin of this feature was in the SHANTI funding; we were designing for students, not necessarily researchers

Alex Gil: as much as Dana insists that this is an analytical tool, it is already an editorial tool

It doesn’t exist in a vacuum

Abigail Firey: let’s not put this feature in the web version

It would invite world of trouble

I don’t like the thought of XML files being edited without version control

Not really files, but the way files are going to flow in . . . this is scary

Alex Gil: think about your case: users generate analysis

A user picks up an error

The user has to write it down and remind other users?

Or do you allow the editors to tweak, keep master file, save their own versions?

I’m not afraid of the masses

Idea of perfection . . .

Abigail Firey: I see where you’re going, but—there’s a question of scholarly responsibility and the transcriptions we deliver to users

Why shouldn’t the correction go to a master file? If someone inadvertently messes up the XML files …

Alex Gil: I have the solution. Think of Google Docs: there are three levels of users: private, reading, and public

Put your file there, give access to master file to people you trust

Nick Laiacona: I think Abigail is saying keep copies, don’t destroy anything

Abigail Firey: [anecdote about a person who saved too many copies, losing version control]

Alex Gil: you don’t want Juxta to go on the web without version control

I agree you don’t want shared tools without versioning

Abigail Firey: you should at least have a warning

Nick Laiacona: we’re certainly not going to put this in initially

Alex Gil: there could be a simple toggle of permissions – htaccess file

Nick Laiacona: this comes into play in the interface between Juxta desktop and the web service

Abigail Firey: there’s the issue of synchronization

If we update on the web, users are working with old files

Make web call from the local version to the web and deliver an alert message when appropriate: “witnesses you’ve selected have been updated”

Nick Laiacona: maybe when we load a file into the web service, there’s a URL where you can get the authoritative version

It would be cool if they could call out and get the latest version

Abigail Firey: we have a reporting system for entering changes for user corrections

Nick Laiacona: when you load the web service initially, you’re not loading it, you’re using a URL—you must refresh

Alex Gil: github model / stemmas – trees and branches

Scholars can protect their materials, other people can use same datasets but branch out, play

In github you can still trace back

Nick Laiacona: if we build in simple versioning control system—sounds important—that’s a whole big other piece, but may be necessary

Abigail Firey: I like idea of real-time calls to a URL with notice to the user that these files may have been updated / have been updated

Nick Laiacona: we could keep every copy ever put in and order them, assign a revision number

Abigail Firey: we get into storage …

Alex Gil: you’re not replicating files, you’re just keeping revisions
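Nick's suggestion (keep every copy ever put in, ordered, with a revision number) and Alex's point (keep revisions without replicating files) combine naturally: assign each upload an incrementing revision number, but store the full text only once per distinct content. A minimal sketch with hypothetical class and method names:

```python
# Sketch of lightweight witness versioning: every upload gets an
# incrementing revision number, nothing is ever erased, and identical
# re-uploads are deduplicated by content hash so revisions accumulate
# without replicating files. Names are hypothetical, not Juxta's API.
import hashlib

class RevisionStore:
    def __init__(self):
        self.blobs = {}      # content hash -> text (stored once)
        self.revisions = []  # revision number = index into this list

    def commit(self, text):
        """Record a new revision; return its revision number."""
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        self.blobs.setdefault(digest, text)
        self.revisions.append(digest)
        return len(self.revisions) - 1

    def get(self, revision):
        return self.blobs[self.revisions[revision]]

    def latest(self):
        return self.get(len(self.revisions) - 1)
```

This is essentially the content-addressed storage behind the GitHub model Alex invokes: branches and edits can always be traced back because no revision is ever overwritten.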

[lunch break]

Lou Foster: wrapping up spot editing

Edit, hit “update” and it will re-collate

Alex Gil: you can edit XML or text?

Lou Foster: yep, you can edit whatever

If you break it, it will throw up an error message

It won’t give an error if you put in well-formed XML that’s not part of the template

Note support: when it encounters a note, it will put it in a bubble off on the side

If you compare two documents and click on highlighted text it will switch to difference detail margin boxes; click anywhere that’s not a difference to go back to note view

Abigail Firey: what if noted text is one of the differences?

Lou Foster: it still works

Switch to slightly newer version

Show managing revisions

“Abc” button with strikethrough or pencil

Revisions = additions and deletions

Step through all the revisions in the document

When you click the edit button you can accept or reject revisions (scribal revisions) on a case-by-case basis

[Gregor and Ronald arrive]

If you hover over a site of change it will give you a tooltip

Abigail Firey: if you’ve changed it, will it give the tooltip?

Lou Foster: no, should it?

Abigail Firey: yes

Alex Gil: add the accept/reject function to the “ignore” control

Generalize it—give me the option to accept/reject certain tags

Lou Foster: once you’ve accepted all you want, hit the arrowy thing and all your choices will be incorporated in the document (.jxt file)

Abigail Firey: but they’re still marked so you can undo?

Lou Foster: yes

You can toggle versions at will

You can add different versions to your comparison set—just add the document again

Abigail Firey: we’ve nested add/del tags within set type = core

I see the option to allow users to decide letter by letter

Alex Gil: you want to be able to do this kind of manipulation with the markup that’s already there and the markup you will add

Abigail Firey: this keeps turning into “Lou can you add user options?”!

Nick Laiacona: the problem of too many options: it’s overwhelming for the user

Alex Gil: think of the three levels of working with photoshop: elements, medium, advanced

With this, you could allow on/off –what kind of user are you, basic or advanced?

Abigail Firey: software development is a process: we have to put something out and see what users want, either “we are pleading for granularity” or “the granularity drives me nuts”

Alex Gil: we don’t want to limit the software because of not wanting to overwhelm the users

Editors are expert users

Either/or alienates half of the user base

Nick Laiacona: let’s do Alex’s presentation, then Abigail’s, then Gregor and Ronald

Alex Gil: one more question for Lou Foster: are there changes in the algorithm?

Lou Foster: there are a couple changes in the parsing – to know which revisions you accepted, how to handle notes

Alex Gil: what happens when Juxta becomes not only analytical but a representational tool?

Rhetorical power

Eventually Juxta will produce editions—a finished product with its own interface

An idea has been evolving: re-imagine Juxta as content management system whose job is to produce scholarly editions

I think it’s helpful to think about it as CMS

There’s a separation between the data that all scholars share and their representation of that reality

A CMS like Wordpress has a database of posts, etc.; representation of that content is done differently with themes

Themes – not only style sheets, but also interface—different functionality

Separate representation from data because different editions have different users

You want a reading edition, communicate scholarly categories e.g. dual view

CMS would look like a Wordpress dashboard but more Juxta-y

Several things happen: Selection of texts

Three levels of users: basic, medium, advanced

Readability of text versus visibility

Readable = clean

Visible = analytical categories

The balance is always a problem for scholarly editions

Juxta needs its own standard theme

My work: an HTML replica of the original

Collaboration: all working on same edition

Also want other users to be able to borrow text and make something new, with a new theme

Different levels of permission, recirculate the data

Juxta does not understand chronology; it assumes that you are working with whole versions of texts that share a sequence

[presents his project]

I want Juxta to be able to start by matching approximate strings before it does the diff

  1. Organize the corpus; 2. Establish the relationship between two different states

Example of several fragments published separately, moved around in relation to each other, etc—a problem of recognizing approximate string matches

Abigail Firey: problematic case: imagine a text of ballads with a refrain—the refrain will cause havoc

Alex Gil: I have encountered this problem

The solution is to recognize the refrain and tell you how often it’s repeated

The problem is noise—it’s a statistics game, so there is noise; there’s also a problem of missing positives

There are very few false positives

You can’t run diff first; you have to find the matches first

Then run diff inside matching fragments

Parts that don’t match can be treated as deletions; you can ignore them rather than running diff on them
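The workflow Alex sketches (find the matching blocks first, then diff inside them) can be illustrated with a naive n-gram anchor finder. Everything here is a hypothetical sketch, not Juxta's algorithm:

```python
def shared_blocks(a, b, n=3):
    """Find n-gram anchors shared by two token lists, regardless of order.

    Unlike a plain diff, which needs a common sequence, this catches
    blocks that have been transposed, because it ignores where each
    n-gram sits. Note that a repeated refrain would collide in the dict
    below -- exactly the "havoc" problem raised in the discussion.
    """
    grams_b = {tuple(b[i:i + n]): i for i in range(len(b) - n + 1)}
    return [(i, grams_b[tuple(a[i:i + n])])
            for i in range(len(a) - n + 1)
            if tuple(a[i:i + n]) in grams_b]

a = "the law must hold the king said to the bishop".split()
b = "the king said to the bishop and the law must hold".split()

anchors = shared_blocks(a, b)
# Each anchor pairs a position in `a` with its (possibly transposed)
# position in `b`; a fine-grained diff could then run inside each block.
for i, j in anchors:
    print(a[i:i + 3], i, "->", j)
```

The anchors show that both halves of `a` survive in `b` in swapped order; unmatched stretches would be the deletions that need no further diffing.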

Nick Laiacona: let’s do 10 minutes of questions, then Abigail’s presentation

Abigail Firey: Alex’s diagram of his project reminds me of plectograms used by David Birnbaum to show the order of sermons in different manuscripts

[…]

Alex Gil: the graph could be generated with Ben Fry’s processing language if the relationships were encoded in the texts

Gregor Middell: we need to look at terminology: there’s a difference between sequence alignment and pattern matching

For diff, sequence is crucial—it looks for common sub-sequences

Diff is not able to cope with transpositions

That’s how sequence alignment works

Pattern matching has been tried; the problem is that correlating transpositions is an interpretive act; it can only be done by a human exercising insight

You can make a case that this cannot be solved with an algorithm

We can use pattern matching to establish possible relationship between texts

An editor then must filter what makes sense

Even if I do pattern matching, I cannot make the decision that is made from the editorial point of view

Alex Gil: I agree, I don’t think I said anything different

I can’t use Juxta; it breaks down

Sequence is not a given

Model of pattern matching followed by editorial intervention—not every editor is going to want to work with this workflow

Abigail Firey: I have had trouble with the move function: screen real estate issue: when bits are moved very far …

Juxta feels cramped

If we’re trying to figure out priority features, I wonder if the move function is one of the more problematic; also exciting and promising; room for improvement

Nick Laiacona: let’s pick up again at the end of Ronald’s presentation

Move on to Abigail’s presentation

[break]

Abigail Firey: I am not a developer—this will be at a rudimentary level.

In my experience, the strongest software is built when there is strong communication between scholars and developers

NEH study: the single biggest problem was communication between scholars and developers

I want to go over 3 areas of my project that might be useful to developers to bear in mind

First: some issues that arise for medievalists in working with manuscripts and texts

There are a lot of presumptions about texts based on printed texts that don’t hold for other fields

I also would like to go back to the question of textual structure and hierarchy

Finally: issues of collaboration and workflow

Two things make the CCL a good test case: we have an enormous corpus that is unexplored, and it has defied editing. And I am building the CCL on the presumption that I could be hit by a bus and people could still use the CCL collaboratively.

Considerations:

  1. Latin is an inflected language; every noun has eight forms, etc.

Customized XTF for CCL

Do we need to make Juxta speak Latin? No, if we make Juxta work for environments like CCL

  2. Orthography: not standardized—makes pattern matching hard

  3. Latin manuscripts heavily abbreviated: transcribers expand abbreviations

  4. Paleographic considerations: text as physical artifact

  5. Glosses: you can do a lot by hand that you can’t in print

There’s no good system of footnotes or call numbers relating point in text to gloss or note; it’s all spatial—the scribe will put a comment in the margin

It’s inappropriate editorial intervention to convert these to footnotes; we try to use a system that replicates marginal notes

  6. The distinction between different hands is very important. Is Juxta going to offer font differentiation to represent change of hand? Different font, different color? This is not a high priority, but medievalists may ask.

  7. There are very few autograph manuscripts; we have copies of copies of copies . . .

  8. There is very little stability. Things can move, get dropped out, replaced, etc.

Implications for Juxta:

Excited about filtering out “Latin noise”

A lot of small clutter—e.g. using c for t

It’s a question of where on the editorial scale you fall: I think those variants are insignificant. I hate it when Juxta turns blue with sea of differences between oe/ae and ligatures

Some colleagues would say it could be important. Philologists would care.

Make a sliding board—so the user can control which variants they see.

That is hard software to write.

There are no simple rules—oe/ae is an obvious switch, but they sometimes substitute for e; you can’t substitute ae for e …
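The "sliding board" could look something like this. The rules below are invented for illustration; as Abigail notes, real orthographic rules are far messier:

```python
# Invented normalization rules for "Latin noise", ordered from least to
# most aggressive. The "sliding board": the user chooses how far down
# the list to apply; philologists might stop at level 0.
RULES = [
    ("ligatures", [("æ", "ae"), ("œ", "oe")]),
    ("oe/ae",     [("oe", "ae")]),
    ("c-for-t",   [("cia", "tia"), ("cio", "tio")]),
]

def normalize(token, level):
    """Apply the first `level` rule groups to a token."""
    token = token.lower()
    for _, pairs in RULES[:level]:
        for old, new in pairs:
            token = token.replace(old, new)
    return token

# At level 0 the spellings differ; at full strength they collate as equal.
print(normalize("iusticia", 0) == normalize("iustitia", 0))  # False
print(normalize("cœpit", 3) == normalize("caepit", 3))       # True
```

Collating on the normalized forms while displaying the raw ones would let the same transcription serve both "pleading for granularity" and "the granularity drives me nuts."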

How far do you want to push Juxta to pick up paleographic features?

Add/del: substitute is not the same as correct. Probably won’t affect how Juxta handles things.

Example of one canon in progress: in a short text, 12 variants, many witnesses.

Editions of medieval texts tend to be heavy on apparatus and variants. We’re used to it. Keep that in mind.

On the CCL, we show a toggle between revision sites / scribal corrections: you can click on a word in red and it turns burgundy and shows variants.

Medievalists are used to seeing in a transcription the locus: what page are we on? Put that into light gray text: it will be easy to read past.

Square brackets with ellipses represent unclear parts of text.

Working with developers: we struggle with terminology and hierarchy

For medievalists the physical artifact (codex/manuscript) is unique. Always named by shelfmark.

A manuscript may contain more than one collection; a collection may appear in more than one manuscript.

A collection is a type of text. “Text” can mean anything from smallest atomic unit (individual canon) up to the whole.

Subdivisions of canon: rubric, title, number, body of statute, identifier

Juxta has potential for future development in how it handles the relationship between a larger unit of text and smaller atomic units.

Do we have to load in a transcription for an entire collection or could we load a bunch of canons and just collate those?

The hard thing about these canons: Alex said we imagine texts in eternal present. I have a lot of canons that do live in the eternal present. Canons get recycled and rearranged over and over even if their origins were at different moments.

Often I am wanting to collate something from different moments.

Then I want to learn about textual context.

I’m always moving between the fine units and the larger units.

Even though we all say that programming doesn’t care much about content and markup is agnostic, it’s actually helpful for developers to understand the type of textual material we’re working with, even if it involves looking at THE CHART

Developers started making their own chart of XML format details

We are feeding into Drupal/XTF transcriptions of entire canon collections—can be hundreds of canons.

But we’ve also allowed users to find individual canons.

We thought it would be cool to have a checkbox saying, “I want to put this in Drupal.”

We also offer translations of Latin

We’re using existing CMS to do that rather than ask Drupal to become a one-stop shop.

Nick Laiacona: [question about XTF and Drupal]

Abigail Firey: We’re finding neat ways to pipe between Drupal and XTF; it has involved a lot of code to bridge them

We’re doing things appropriate to each function.

E.g. search function built in XTF

How can we draw on the power of Drupal to deliver data in different ways?

Once you find data in the canons, what can you do with it?

We’re doing something better than “cobbling together” to draw on the potential of both.

We’re embedding an XTF iframe in the site.

Transcription tool, search engine, collation.

Nice standard toolbox.

Shows example of where Drupal is able to tap into XTF display.

Going back to the problem of glosses: medieval manuscripts get gloss-heavy, especially Bible and canon law.

Will it be possible to collate the glosses?

Nick Laiacona: You and Hans Gabler . . .

Abigail Firey: the sheer quantity of glosses shows that it is a text in its own right.

A lot of glosses are interlinear: different spatial relationship.

Our XTF guy said “no problem” but I didn’t like the result.

Switching to T-PEN for a moment: new transcription tool.

Many manuscript repositories are releasing digital images. What can people do besides look at them? Scholars want to do stuff with them.

We’re making agreements that T-PEN can pull images and put them in our repository for the transcription tool.

It’s not taking a copy—it redelivers the image.

Transcription is correlated to location in the image. You can annotate. Our team is getting crushed by markup in a large corpus. Now you can use custom XML tags, human usable. E.g. button for “gloss.” There’s a reminder in the corner to close that tag.

We figure we can get 80% of the necessary markup. What’s missing: TEI placeholders.

We can export in XML, the CCL team can pull it up in Oxygen and add the missing 20%, then put it back in Drupal.

Gregor Middell: how do you handle overlap of tagging?

Abigail Firey: in the 20% that we fix manually. We’ll see that the file doesn’t validate.

You can also upload plain text transcriptions or XML transcriptions and go right to the line and proofread.

There’s a good chance of resolving non-validation problems by reference to the images themselves.

We expect that files won’t validate.

A developer wondered why we have Oxygen in the loop: we can’t expect users to produce valid markup. But they can do the bulk of the work.

The CCL is unusual: it’s a collaborative environment. We’re not an editorial project. We’re providing accurate transcriptions made by experts to anyone who wants to use them at any time. That has implications for Juxta: we’re supporting an unknown universe of editors. We don’t know which files they’ll select for collation, what editorial approaches they’ll want to apply, how many will be working together, where they’ll be.

We’re crowdsourced (though it’s a small expert crowd)

People can contribute transcriptions and translations.

If it’s published raw we say it’s “in progress” and people can comment; when it’s approved we stamp it CCL approved.

Once data files are approved, we lock them down. It takes a lot of work.

But we want it to be possible for copies of those files to be placed in individual password-protected workspaces where people can manipulate those files in Juxta.

People want to work with fragments of texts not just full transcriptions.

You can break out chunks and reassemble them (with locus information). <xml:id> gives their position on the page.

We need to do some work on allowing people to break out pieces of texts.
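Breaking a single canon out of a full transcription by its xml:id might look like this. The element names here are hypothetical; only the xml:id mechanism is as described:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the built-in XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical collection markup with two canons.
SOURCE = """<collection>
  <canon xml:id="c1"><rubric>De baptismo</rubric>Text of canon one.</canon>
  <canon xml:id="c2"><rubric>De clericis</rubric>Text of canon two.</canon>
</collection>"""

root = ET.fromstring(SOURCE)
# Break out a single canon by its xml:id so it can be collated alone,
# while the locus information stays attached to the extracted element.
canon = next(el for el in root.iter("canon") if el.get(XML_ID) == "c2")
canon_text = " ".join(canon.itertext()).strip()
print(canon_text)
```

The same lookup works on a full TEI file once the atomic units carry identifiers, which is what makes canon-level (rather than collection-level) collation possible.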

Web-Juxta questions to think about at Juxta Camp:

Data flow: how are files going to be imported and exported? CCL and user workspaces

How do users get files, how do we protect CCL data?

Can you work with fragments as well as full transcriptions?

Can we extract single canons into individual workspaces?

If we make copies of files, where will they be stored? Where will .jxt files be stored? Will there be online help? Where will a web service run—iframe in Drupal?

How do we keep users’ data private or allow them to invite viewers in?

Alex Gil: the canon versus collection problem is generalizable. There are two textual units. One is a bibliographic category: people have bound it a particular way.

Within a collection, there is a stream of texts. In another collection, there may be matching canons. Imagine a collection where there’s only one matching canon. The relation between the two collections is irrelevant apart from matching canon. The large context gives canon its precise bibliographic marker. The other canon relates as a match; the bibliographic code doesn’t matter.

Have I lost you?

Abigail Firey: maybe I lost you. There’s the meaning of the individual canon. It obtains meanings intertextually from the different collections and from its real life application. I’m not sure that the distinction between bibliographic and semantic is useful—

Alex Gil: semantic is not part of it. I’m talking about bibliographic and linguistic similarity.

I can imagine a situation in which all my files will have the bibliographic marker distinguishing them as files. Within those files there are canons included. There may be different relationships between the files and each other and the canons and each other. There are two levels of operation here. It’s not useful to talk in terms of hierarchies; it’s better to talk in terms of networks.

Imagine the ideal software: when I look at canon x, the software shows me the matches to other items.

Abigail Firey: you can find that in the search engine. It will find all the canons with that search string.

Alex Gil: problem: imagine a canon where three or four words are changed. If you search for those words, you won’t find those canons.

Abigail Firey: we’re not actually providing transcriptions of coherently defined texts. If you find two pages that have never been recorded just send us those canons. We have all sorts of texts in all sizes.

Alex Gil: do you make sure each corresponds to a bibliographic unit?

Abigail Firey: we always have a record of shelfmark and folio.

Alex Gil: sometimes I just want to compare matching blocks. If I could click on a text and see the ones it’s related to …

You want to be able to compare segments. If you’re able to create relationships between blocks of texts you should be able to very easily—

Abigail Firey: I’m going back to human intervention. I can search and see that canons are related.

Nick Laiacona: you can’t just select an entire TEI file, you want to select a section of the file.

Alex Gil: load up collection, run an algorithm that will suggest a place to begin—

Nick Laiacona: we don’t have that algorithm right now—

Abigail Firey: if files of individual canons are already broken out…

Gregor Middell: we are facing a similar problem. We have some integration solutions. You can pull witnesses and tell Juxta “collate these.”

Abigail Firey: if it works on a large corpus I’m interested.

Gregor Middell: scalability is not a problem, but you might need to duplicate data.

Abigail Firey: you are duplicating?

Gregor Middell: yes.

We haven’t decided whether we will then get rid of the XML.

The basic problem is hierarchies. We are describing our source material from a manuscript-oriented perspective, trying to preserve spatial information. Then we do textual transcription in a different XML file. Different perspective. We end up with overlapping hierarchies.

If you go into collation, this problem gets exponential. If you ask Juxta to tokenize it—

This can be done on different granularity levels. (character, word, sentence)

There are layers of markup that you cannot fix in a single XML file.

Abigail Firey: we’ve been worried about tokenizing

Gregor Middell: you might have a physical object that contains multiple texts.

Abigail Firey: XTF could not handle multiple manuscript items.

We’re doing some programming to fix it. Quick fix: separate out into separate files.

Gregor Middell: we are transforming XML markup into a standard markup format able to handle arbitrary overlap and using the parsing step from Juxta. We are also using collation to blend different hierarchies into each other. If you have different transcriptions of the same text manuscript, you can collate them and get correlations between texts that are the same—you can blend different markups over each other. We use collation to semi-automatically blend them.

Abigail Firey: collation is helpful for cleaning up transcription errors.

Gregor Middell: you are getting all kinds of problems in standard markups. You cannot modify the file anymore without major tooling.

Abigail Firey: we feel that we have to have gloss display, toggle feature . . . every time we move around, we lose it.

Nick Laiacona: for first version of web service, side-by-side comparison: once you’ve marshaled the texts, fire up Juxta, do your work, share the .jxt file—can post to your workspace? Display on web service?

Lou Foster: [sorry, your note taker missed the question]

Nick Laiacona: … you could get back a stable URL and do editing functions in Juxta.

Abigail Firey: in Drupal we have stable URLs.

Dana Wheeles: we are going to need some kind of anchor or command: which two of your comparison set do you want to have up? Do you want to highlight a particular passage, jump to a piece of the text?

Nick Laiacona: marginal notes are similar to Alex’s floating boxes—texts are emerging in parallel; we’re looking at them wrong, in a linear way

Abigail Firey: with collections, you can’t tell where they end. Scribes add a few more canons. Can’t tell how or why they got there. Additions, interpolations, full form?

Alex Gil: these things are operating at different dimensions. Juxta should be aware of the dimensions. It’s important to keep in mind that what we have here is a system of relationships, just like Collex. This block in a manuscript has a relationship to XYZ.

Abigail Firey: we wanted to see the relationship of the texts within the relationships. But if you imagine the scribe putting together a collection from other collections, the relationships between any two collections are not the ones that matter—you need to collate at the canon level.

Alex Gil: we’re in complete agreement.

For example, a canon can have five relationships. Choose one. It has seven relationships.

We’re talking about networks, organically connected.

Abigail Firey: there is no single text.

Alex Gil: this doesn’t mean that every single item can’t have a set of fixed relationships. Every box in my chart has finite number of lines leaving.

Nick Laiacona: those lines are interpretations.

Alex Gil: those are not arbitrary. I figured them out. I did it manually, but there was a percentage overlap; it was obvious these things were blocks. But there were a fixed number of relationships. There will always be finite sets of relationships and they can be computed.

[break]

Ronald Dekker: the Gothenburg model for collation. An abstract model for collation.

Intro: My name is Ronald Dekker. I work at an institute in The Hague. I head a development team; I am not a researcher myself, but I am surrounded by researchers.

We have software for a publication tool. We support annotations.

Transcription, annotation, collation, publication.

I do the technical direction and implementation.

At the 2009 collation summit there were experts from Juxta, TextGrid, CollateX, eXist, Interedition, etc.

The idea of the meeting was to come up with a generalized model of collation so we could connect software, work together instead of compete.

Challenges of making a collation tool: (mainly from the perspective of getting as much usage of your collation tool as possible; not from the perspective of the end user, but how can I get the largest community around this tool?)

How do I support source material in different file formats? Relational databases, TEI, etc. How do I get existing data into a collating tool?

There are many ways to visualize a collation result. Alignment tables, etc.

Source material can consist of different (national) languages—Latin, Greek, etc.

Abigail Firey made the point earlier that developers tend to ignore the differences. Not every language works with every tool.

Abigail also mentioned spelling variation. I have noticed it in medieval Dutch, Greek, German.

Organizations use different platforms (Windows, Unix, Mac; different servers; different desktops). A collation tool needs to run on multiple platforms.

Developers use different programming languages.

Requirements: Support multiple file formats, support multiple languages, etc.

Source material stored in different file formats.

* Make sure you can get existing material into your new collation tool: import or convert.

* Make the barrier to entry as low as possible. How?

* See what is the least amount of structure you require.

* Alignment table: one of the simplest output formats.

* Simplest model: you need a set of witnesses and some identifier for each.

* Plain text content is the simplest you can have.

* You can convert TEI into plain text.

* Basic pipeline for collation: witnesses > aligner > alignment table

* This pipeline will grow more elaborate by the end of the presentation.

Source material consists of different languages

* Example: tokenization of Armenian text does not split on white space.

* How do you deal with that in a generic model?

* Extend the pipeline with a tokenizer: source > tokenizer > witnesses > aligner > alignment table

* You can customize the tokenizer and the alignment (two steps).

* Tokenizers support different file formats; all produce a set of witnesses; each witness contains an array of tokens. A token has two representations: its actual content and a normalized form.

Need a way to visualize results.

* Extend the pipeline: source > tokenizer > witnesses > aligner > variant graph > visualizer > alignment table

* Three steps that can be customized.

Need a way to handle spelling variations

* Language specific.

* Happens a lot.

* Obvious to humans, not to a computer.

* Needs a custom matching function.

* Extended pipeline: source > tokenizer > witnesses > matcher > matches > aligner > variant graph > visualizer > alignment table

* Rules can be applied to specific tokens.

* Different languages need different rules, but the rest of the pipeline can be reused.
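A minimal sketch of the tokenizer and aligner steps in Python (CollateX itself is written in Java; all names here are invented). Each token carries both its actual content and a normalized form, so "Compare" and "compare" can align:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass(frozen=True)
class Token:
    content: str     # the actual reading from the witness
    normalized: str  # the form the matcher compares

def tokenize(text):
    # Tokenizer step: white-space split with lowercasing as the
    # normalization; another language would swap in another function
    # while producing the same Token shape.
    return [Token(w, w.lower().strip(".,;")) for w in text.split()]

def align(a, b):
    # Aligner step for two witnesses: match on normalized forms, then
    # report original content side by side, "-" marking a gap. (A real
    # aligner also builds a variant graph and handles transposition.)
    sm = SequenceMatcher(a=[t.normalized for t in a],
                         b=[t.normalized for t in b])
    rows = []
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        for i in range(max(a2 - a1, b2 - b1)):
            rows.append((a[a1 + i].content if a1 + i < a2 else "-",
                         b[b1 + i].content if b1 + i < b2 else "-"))
    return rows

w1859 = tokenize("when we compare the individuals")
w1860 = tokenize("when we Compare individuals")
rows = align(w1859, w1860)
for left, right in rows:
    print(f"{left:12} {right}")
```

Because only the tokenizer and the matching are language-specific, swapping `tokenize` leaves the rest of the pipeline untouched, which is the point of the model.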

This is the model that we’re using now.

Technical stuff.

The pipeline is an abstract concept; there are many ways to implement it.

Each of the parts is a web service, using HTTP and REST

Different institutions can host different web services.

When you have a collation problem, you can write your own solution.

You can also use a pipeline that stands alone on your own computer.

Our pipeline runs in Cocoon.

CollateX is written in Java; Java VM is platform independent.

JSON and XML are platform independent.
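The payload such a web service exchanges can be very small. This sketch follows the shape of CollateX's documented JSON input (a list of witnesses, each with a sigil and plain-text content); treat the exact field names as an assumption:

```python
import json

# A request body for a REST collation web service: the minimum
# structure is an identifier and plain-text content per witness.
request = {
    "witnesses": [
        {"id": "1859", "content": "when we compare the individuals"},
        {"id": "1860", "content": "when we compare individuals"},
    ]
}
body = json.dumps(request, indent=2)
print(body)
```

Any platform that can produce this JSON can use the service, which is how the "different institutions host different web services" idea stays language- and OS-independent.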

Developers on my team work on different platforms; it doesn’t matter.

Sum up: how to achieve interoperability.

* Split functionality into small, self-contained parts.

* Determine input and output formats for each part.

* Require the minimum amount of structure in the input; you can add more.

* Bring experts together.

* Build a community of users and developers.

Alex Gil: What is collation, for you?

Ronald Dekker: finding the similarities and differences between texts. There is nothing specific to texts about it.

Nick Laiacona: linearly sequenced …

Alex Gil: sequence might not mean anything—

Ronald Dekker: there can be transposition; I don’t mean the order is fixed; but there is an order in which you read the text

You still have words that read left to right. There are bits, objects, that you read in a certain—alignment is about order.

Nick Laiacona: where are you going with this, Alex?

Alex Gil: two matching sequences each belong to a higher-level sequence.

Ronald Dekker: I think you’re going into my next presentation. CollateX doesn’t use the diff algorithm. It tries to detect sequences.

There is some order in the material. If you have blocks of text, the words are in a certain order. If the blocks are moved around, the words are still in order, and CollateX tries to find that.

Nick Laiacona: we’ve been talking about how you choose your witnesses.

Abigail Firey: I’ve been thinking about transposition, trying to get the machines to think like humans. At a certain point, a human will recognize a paraphrase; it’s not the same text transposed. It’s an editorial judgment call. How to resolve that in terms of pattern matching? Google searches don’t care about order—they find the match. Sometimes you get bad matches because they’re not close enough to what you’ll recognize as a match with transposition. I don’t know how you set the threshold.

Gregor Middell: Google doesn’t collate; it indexes.

You can think of this as pattern matching. What you lose is the notion of editions. If you talk about editions, you do it in sequential context. If you don’t have that order and look for loosely correlated things between texts, you get rid of the problem of transposition, but you get rid of the idea of editions. There has to be a match above and below a word in order to recognize a deletion.

Alex Gil: I call those limiting matches.

My own solution to the matching problem: you can come up with delimiters.

The perfect situation for the diff algorithm is one in which there are no transpositions, just additions and deletions. From token A to Z there is a sequence. Absence and presence.

Nick Laiacona: There should also be a nice ratio between unique tokens and the total size of the text.

Alex Gil: If there were a loose network of texts: if we could identify all instances in which diff could run perfectly, we would have isolated mathematically what we mean by “blocks of texts.”

It can have meaning for humans.

Finding those things will make us create relationships.

Abigail Firey: the importance of transposition in pattern matching is language specific. A Latin transposition doesn’t make much difference; English is order-dependent.

Gregor Middell: example of omission: doesn’t pattern-match; we see the omission; the question is where it was omitted.

Dana Wheeles: everyone sees the usefulness of pattern matching; we’re talking about how to refine collation for text cases. You (Alex) are thinking about a larger question: if I could go across the world’s texts on the web . . .

Alex Gil: I am thinking of a specific corpus. I am worried about editing one book that has a series of relationships to chunks of texts of different sizes. Those sets of relationships are the only thing we know perfectly. Before the collation tool can work on my texts this needs to be in place.

Nick Laiacona: there’s a step to the left of that “source” box, that’s a search. We take for granted that you have two witnesses that are suitable to be compared.

Alex Gil: in Juxta, all the files are flat. I know in my head the relationships. Imagine I would be able to mark all the moves—those moves still don’t mean anything in relation to the bibliographical items on the side. For example, a block moves between two big texts; that chunk is also published separately. Juxta misses that relationship. If I want to do an edition of everything that ever happened to that poem . . .

Abigail Firey: It’s a question of expectations. I expect a lot more human intervention at various stages of the process. Whenever you develop software and all the investment—is this really better than doing it the old-fashioned, human, manual way? Are we expending a lot of money and effort to do something that my brain does pretty well?

Thinking about how best to extend Juxta: there is the question of scholarly intuition where you have an idea about how a text works and it’s just not worth trying to get the machine to show everybody that. You have to make an editorial statement and say the preponderance of evidence shows it is this way.

Nick Laiacona: the search engine naturally inserts itself before the collation step.

Alex Gil: as advanced as Juxta is, I use it only to catch my own errors in transcription. In order to get that going, I have to break things by hand into blocks so that diff can work on them. It’s an enormous amount of work. The only useful result I’m getting out of Juxta is to prevent transcription errors.

Using screenshots to show comparisons: people used to do that by quoting.

Abigail Firey: I’m using it to discover differences: whoa, “homilies” / “homilies” [?] ! Different meaning! Different word!

I’m not doing the editorial work, I’m trying to figure out what the work will be.

Alex Gil: Juxta promises to be able to generate a complete apparatus.

Ronald Dekker: I understand what you’re getting at, but that’s a different problem about relationships between different texts.

Alex Gil: I thought that was what it was all about.

Ronald Dekker: up to a certain limit. A computer can do things faster; it can’t do a semantic interpretation.

Gregor Middell: I met with the International New Testament group at a boot camp in Berlin. They are using Collate (Peter Robinson’s Collate). We said: show us how you use it.

One of the scholars walked us through the collation steps: assembling apparatus by hand. Token by token. Stepping through the text.

Their project has come to a scale where they can’t handle it manually any more. We have 100-150 documents to make sense of, and we can’t.

Alex Gil: we don’t disagree. What we’re seeking is a balance between the automatic functions and manual functions. I like your pipeline. But we still need to remember the total goal: establish relationships between texts.

Technology is useful to deal with scale, eliminate gruntwork. I’m not saying the manual should not be there. But we want automated functions to do as much as possible. If it’s all left to manual it’s not going to work.

Philosophical point: we need to re-conceptualize how we think about relationships between texts. Sometimes a bibliographic unit is a very small block of text within the larger unit.

Dana Wheeles: Now would be a good point to hear more about CollateX.

Nick Laiacona: Gregor, did you want to finish a point?

Gregor Middell: I’m working in a group writing the critical apparatus chapter of the TEI Guidelines.

This is my main interest from a software development perspective. We want to encode in TEI P5.

There are two competing camps: software developers want to generate the apparatus automatically, while scholars are used to assembling it manually. The requirements differ: philologists care about human readability, and the output should be manually tweakable. They must be able to fully control the output or they won’t use the tool.

There’s a group of users who want to use output as a means to intervene in the editorial process; software developers want to use an automatically generated apparatus to feed back into the electronic files, e.g. create back-references.

Abigail Firey: we’re circling back to my admiration for Juxta’s current “generate apparatus” function.

[break]

Ronald Dekker: to introduce myself, I am the lead developer of CollateX.

Review pipeline: general model

Second presentation: CollateX implementation

CollateX 1.1

Contents: what it does; how it works

First: show CollateX to the end user

Second: behind the scenes

Intended to be a successor to the old Collate written by Peter Robinson

Collate was used to collate The Origin of Species: use case

Source: simple XML file

CollateX gives back an alignment table

Each column has an identifier (year text was published)

The last three versions have some changes in common

The last two versions have “compare” in common

Some cells in the table represent parallel texts. E.g. “when we” occurs in all texts

You can have parallel segments that occur only in some versions

CollateX doesn’t conclude whether it’s an addition or omission

There is no base text

If you want to use a reference text, you can

Darker gray areas: cells where variations occur

This result is done without any human intervention at all: straight from source file to result table.

This is not perfect, but the idea is to get results fast.

Great starting point to explore your material.

You can already see that the last two versions of The Origin of Species have more in common with each other than the previous four

You can also get the output in XML format for further edition work.
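The alignment table described above can be sketched as a simple data structure. This is a minimal, hypothetical Python illustration, not CollateX’s actual output format; the tokens are loosely based on the Origin of Species example from the notes, where “when we” occurs in all texts and only the later versions share “compare”:

```python
# Hypothetical sketch of an alignment table: each witness (identified here
# by publication year, as in the Origin of Species demo) is a row of cells,
# and each column holds the tokens the witnesses have at that point.
# None marks a gap; the table itself never says "addition" or "omission",
# and no witness is treated as the base text.
alignment_table = {
    "1859": ["when", "we", "look",    None],
    "1866": ["when", "we", "compare", "them"],
    "1872": ["when", "we", "compare", "them"],
}

def variant_columns(table):
    """Return the column indexes where the witnesses disagree."""
    rows = list(table.values())
    ncols = len(rows[0])
    return [i for i in range(ncols)
            if len({row[i] for row in rows}) > 1]

print(variant_columns(alignment_table))  # -> [2, 3]
```

The darker-gray cells Dekker points to correspond to columns like 2 and 3 here: places where the set of readings has more than one member.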

How it works:

Why write another collation tool? What makes it different?

It does multiple-witness collation; diff usually compares only two texts.

Baseless collation: no assumption about reference text

Parallel segmentation: it lines up pieces of text that occur in multiple versions.

Gothenburg model (pipeline)

Client server: lives on the web.

If you want to make a desktop application with it, you can.

Collation micro-services model

Examples of preparation: tokenization, regularization. (in the example, it ignores capitalization)
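The preparation steps can be sketched like this: a minimal illustration assuming word-level tokens and case-folding as the only regularization (the tokenizer and normalizer names are mine, and real CollateX preparation is configurable and richer):

```python
import re

def tokenize(text):
    """Split a witness into word tokens (a naive word-boundary split)."""
    return re.findall(r"\w+", text)

def normalize(token):
    """Regularize a token for matching. Here we only ignore capitalization,
    as in the example; spelling normalization etc. could also go here."""
    return token.lower()

witness = "When we compare the individuals"
tokens = [normalize(t) for t in tokenize(witness)]
print(tokens)  # -> ['when', 'we', 'compare', 'the', 'individuals']
```

The point of keeping these as separate pipeline stages is that either function can be swapped out per project without touching the alignment step.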

Alignment: doesn’t use diff

Goes through the whole text, comparing each token to every other token

Diff tries to go for the longest coherent sequence; CollateX tries to go for the smallest number of sequences.
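To make the contrast concrete, here is what the pairwise-diff side looks like with Python’s standard `difflib`, which anchors on the longest matching runs between exactly two token sequences; a multi-witness aligner has to work without that pairwise assumption. The sample sentences are invented for illustration:

```python
import difflib

# A classic pairwise diff: SequenceMatcher repeatedly finds the longest
# matching block between two token sequences. CollateX, by contrast,
# aligns many witnesses at once and aims for the smallest number of
# segments, storing the result in a variant graph (sketched only in
# prose here).
a = "the quick brown fox".split()
b = "the brown fox jumps".split()
matcher = difflib.SequenceMatcher(a=a, b=b)
for block in matcher.get_matching_blocks():
    if block.size:  # skip the zero-length sentinel block
        print(a[block.a:block.a + block.size])
# -> ['the']
# -> ['brown', 'fox']
```

Note how the diff immediately commits to “brown fox” as the longest shared run; with three or more witnesses that greedy pairwise strategy can paint the aligner into a corner, which is why a different strategy is needed.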

Stores the result in a graph.

Omissions, additions, transpositions, changes.

Abigail Firey: how does visualization of transmission work in grid?

Ronald Dekker: we’re working on that. It’s a hard visualization problem.

Abigail Firey: how does it work on non-standardized orthography?

Ronald Dekker: that’s in the matching part of the pipeline—it is not perfect.

You can change this.

When you have a near match, it will see it as a repeated token; it has to look at more context to figure out what is going on there.

Nick Laiacona: our hope was that with the Gothenburg model, by breaking collation down into distinct steps, you could always use the best tool for each step. E.g. use a good tokenizer, Juxta’s Latin noise filter, etc.

Gregor Middell: there are two places you can intervene: matching and comparison. You can get handed every possible pair of tokens and decide whether you think they’re the same. Not manually—you can tie it to a database.
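That intervention point can be sketched as a pluggable match predicate. This is a hypothetical illustration, assuming a rule-based matcher backed by a small lookup table of spelling equivalents (standing in for the database mentioned); none of these names come from the actual CollateX API:

```python
# Hypothetical matching hook: the pipeline hands us a candidate pair of
# tokens and asks whether they should count as the same reading. The
# rule here is exact match (ignoring case) or membership in a table of
# known spelling equivalents -- a stand-in for a real lexical database.
EQUIVALENTS = {("colour", "color"), ("grey", "gray")}

def is_match(token_a, token_b):
    """Rule-based matcher: case-insensitive equality or a known pair."""
    a, b = token_a.lower(), token_b.lower()
    return a == b or (a, b) in EQUIVALENTS or (b, a) in EQUIVALENTS

print(is_match("Colour", "color"))  # -> True
print(is_match("fox", "box"))       # -> False
```

As Dekker notes next, the decision could equally be delegated to a human instead of a rule; the pipeline only cares that some predicate answers the question.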

Ronald Dekker: either rule based or human determination.

Nick Laiacona: so if you build your whole pipeline around one of these tools … you can use a good algorithm.

Abigail Firey: (admiring collation graph of internal CollateX logic) parsing the witnesses is the hard thing in traditional collation

Ronald Dekker: visualization step: alignment table or XML output

Abigail Firey: can you move the columns in the alignment table?

Ronald Dekker: the idea is that it gives you a pretty good result in a small amount of time …

Abigail Firey: I wondered if there was a “click the button, rearrange the columns” function

Ronald Dekker: no, we don’t have the usability of Juxta yet

Gregor Middell: (examples of use cases)

Ronald Dekker: we prioritized getting the Gothenburg model working

I am hopeful that we will integrate more tools

There is nothing in the model that prohibits you from doing what you want; it just isn’t made yet

Alex Gil: differences between CollateX and Juxta?

Ronald Dekker: Juxta works from a base; CollateX doesn’t.

Nick Laiacona: we were going to wrap up at 5:00; it’s 5:18 now … I don’t want to shut down that conversation …

Ronald Dekker: I’ve never really done a direct comparison; I’ve been focusing on getting this model running.

Gregor Middell: part of the problem is there is no official benchmark for collation tools to determine whether a collation is correct.

Nick Laiacona: thanks for attending. I’m looking forward to day 2.

We’ll reconvene at 9:00 tomorrow morning and start with Jim from Bamboo.