Juxtacamp20110711
These are notes from Day One of Juxta Camp held at Performant Software's offices in Charlottesville, Virginia on July 11, 2011. These notes represent an attempt to paraphrase the discussion and should not be taken as verbatim quotations from the participants. Brackets and ellipses mark spots where your note taker missed details.
Abigail Firey: [your note taker missed the very beginning of the discussion] … transcription and edition
Medievalists think of an edition as something constructed from multiple witnesses etc—different from a working transcription
Dana Wheeles: two types of scholars: (1) those who want a working text to play with; (2) bibliographic scholars, careful about specific witnesses, etc.
Alex Gil: constraints of book—can only produce one text; electronic allows multiple texts, easier to visualize differences
Same information presented differently
Abigail Firey: we are going back to the original intention—allow user to understand what is going on in each witness
Nick Laiacona: not much movement in intention; movement in product
Abigail Firey: there will be a lot of new questions
I went to a meeting of the European Society for Textual Scholarship—they are grappling with implications of digital world
New questions aren’t formulated yet
Alex Gil: what do the new questions smell like?
Abigail Firey: my question: how we’re going to think about whether variants are significant or not
This has always been a problem
Automation allows us to show all variants all the time—we can get overwhelmed by what were traditionally called insignificant variants
So much was left off the printed page because of physical constraints but also conceptually—marginalia was marginal
People are now more interested in reader response e.g. interlinear glosses
Alex Gil: once we separate content from form you can generate as many views as you want
You can make insignificant stuff invisible
You can generate a series of documentary editions and critical editions on the fly
Each user can produce their own edition, play with it
Abigail Firey: that’s the dream
Anytime you lay hands on the text you’re intervening
Markup is a basic level of intervention
Preparation of texts for philologists and literary people: markup will play out in different ways
Developers and humanists could get into a dark alley if they try to build for all potential users at once
At CCL we decided we can’t build for philological analysis; that’s someone else’s project
Alex Gil: I’ve heard this for a while now: we can’t build for everyone
John Unsworth says there is a common denominator
One common denominator is transcriptions
Another is yes or no variations, basic, logical: difference, similarity
We share those things; everything else can be added on the level of the markup
Abigail Firey: there are transcription issues . . . you’re right, but to play devil’s advocate: most people reading the text will automatically expand Latin contractions—some people will care about distinction; linguists care about u versus v in Latin texts
There’s going to have to be negotiation
Developers can be aware that there are these swirling clouds
Nick Laiacona: the open source development model meshes with how scholars work, how tools need to develop
You can have common code
To the extent that they don’t agree they can build their own forks on the project
Starting from five years’ worth of development; add the tool that I need
Abigail Firey: agree, think it will happen at the tool level
Dana Wheeles: agree
Alex Gil: agree
Tools separate markup from transcription
You can have specialized markers, do search and replace
Abigail Firey: IF everyone can agree to the same standards
Standards are like toothbrushes, nobody wants to use anyone else’s
Every transcriber is doing things differently
You can make guidelines, but they get unwieldy
Nick Laiacona: TEI is a data format with no tools to read it
E.g. web browsers consume HTML
Software has its own gravitational field that bends the protocols
Abigail Firey: yes; ambivalent about TEI
We have to wrench our protocols around TEI in illogical ways
TEI presumes that any text is highly, systematically structured
Alex Gil: possibilities of Juxta
Most of what it can do is outside of TEI
It can be TEI-agnostic
The future of computational power resides in algorithms, not markup
Any tool for edition-building can be added
Nick Laiacona: the diff layer is TEI-agnostic; the analytical layer is not
Abigail Firey: you’re right to go around TEI
On the other hand, we’re putting in that markup for a reason
Some is about textual matters that matter in collation
How much is it about visual collation?
What if scribe changed a display line?
Rubric in display text versus main text
That would be a distinction in TEI
Is Juxta going to support font differentiation, visual tricks to show such distinctions?
Alex Gil: three solutions
I believe in bibliographic markup: not semantic, interpretive
My markup worries about line length, margins, etc.
That can be translated into HTML markup, compare margin sizes etc.
The way you transition to that to allow different communities to use it . . .
Create style sheet builder within Juxta
I am already working on style sheets
Second level: tool would allow users to indicate the text they want to render
As opposed to building style sheets from scratch
Third possibility: have Juxta generate standard views—flatten out everything but in a meaningful way
Each approach works for different users
Last one works for most users
Second works for specialized user
First for bibliographic users
Abigail Firey: adding forks between users . . .
Bibliographically inclined, physical artifact versus those who are doing semantic work
There are real forks there
TEI modules reflect that fork
We want to use msDesc tags but you can’t use them within the textual module
It’s assumed they describe not contain
But I have paleographic features that have implications for the text; can’t use those tags; need workarounds
Division within the community
Medievalists are an interesting body of people to consider because we have a balance of relating the physical artifact to the text . . . not unique but we push it in different ways
Because so much of our evidence is … [note taker missed the predicate here]
Manuscripts are different from printed books
There are overlaps but you fork into different presumptions
Dana Wheeles: this would be a good point to talk about where the Juxta code stands now
Latin noise filter
More useful?
Abigail Firey: yes
Nick Laiacona: have you seen 1.6?
Need Dropbox access
…
Alex Gil: Juxta allows you to separate layers into virtual versions
Abigail Firey: that’s what we were working on with 1.5
Alex Gil: want Juxta to compare documents in different layers
Abigail Firey: some people are going to want to be able to collate glosses
…
Abigail Firey: sample text is anomalous
Texts are broken into smaller units (canons, statutes)
Those get shuffled, compilers add prefaces, etc.
Many times you’re not comparing texts that resemble each other
Not an easy problem to solve
We’re good for Juxta: we present extreme cases
Nick Laiacona: macrons . . .
Abigail Firey: in many manuscripts, there will be macrons in every word
If transcribers can’t expand an abbreviation, they’ll leave macrons
e.g. if n-macron for “non” it will appear in transcription as “n”
Things frequently abbreviated in Latin manuscripts: inflection endings, verb endings, common words (quam, quoniam, enim), e.g. qm with macron = quam or quoniam
Prefixes e.g. per, pre, post, pro – scribes may use different abbreviations, p with different marks
We made the editorial decision that all abbreviations will be silently expanded; if there’s a problematic one—
Alex Gil: expanding at level of transcription? Really?
Abigail Firey: yes, that is standard practice
Philosophy of editions
Medieval scholars: edition is supposed to help you read the text in a way the manuscript doesn’t
Editor makes text more readable, not fight through abbreviations
It is accepted practice to silently expand abbreviations, note in preface
Can note if scribes consistently use/misuse an abbreviation
Exception: some editors and rich publishers will put expansions into italics
When you are doing that with every word—a nightmare for printers, very expensive
Most critical editions didn’t bother
Heard editors in Europe (Spaniards) proclaiming that every variation would be marked
Are you going to allow people like that to show expanded abbreviations?
Nick Laiacona: options: change font or color of text; put note on side . . .
Alex Gil: it would be easy to code abbreviations and expand them
Technologically speaking, having software run through them and let an operator select from options – not that hard
Abigail Firey: at a practical level, when I’m transcribing a large text, I don’t always see abbreviations; I just read them
I have to stop, think
Need convenient way to record
It’s easier to write “per” than the abbreviation
Some odd ones aren’t easily reproduced
It’s going to make transcription not practical
I think the Spaniards are nuts—they’re going to go at snail’s pace; will have to triple-proofread
Alex Gil: we would like Juxta to remain agnostic; allow both practices
Any meaning happens at the level of markup
If Juxta can have an abstract way of dealing with all markups to indicate how they’ll be represented—allow user to choose—Juxta generates representation
Abigail Firey: when we hit a problematic abbreviation where you don’t know if it’s quam or quoniam, use the TEI abbr tag; it has a certainty level
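[Editor’s sketch: combining Alex’s point that expansion is easy to automate with Abigail’s point about recording uncertainty. A minimal, hypothetical Python illustration; the abbreviation table is invented, and the choice/abbr/expan encoding with a cert attribute should be checked against the current TEI Guidelines before use.]

```python
# Hypothetical sketch, not Juxta or CCL code: expand Latin abbreviations,
# and when the expansion is ambiguous, preserve the uncertainty in TEI
# <choice>/<abbr>/<expan> markup with a cert attribute.
UNAMBIGUOUS = {"n\u0304": "non"}               # n + combining macron -> non
AMBIGUOUS = {"q\u0304m": ["quam", "quoniam"]}  # q-macron + m: two candidates

def expand(token):
    if token in UNAMBIGUOUS:
        return UNAMBIGUOUS[token]              # silent expansion
    if token in AMBIGUOUS:
        # An interactive tool would let the operator choose; encoding the
        # alternatives keeps the decision visible in the markup instead.
        expans = "".join(f'<expan cert="medium">{e}</expan>'
                         for e in AMBIGUOUS[token])
        return f"<choice><abbr>{token}</abbr>{expans}</choice>"
    return token

print(" ".join(expand(t) for t in ["q\u0304m", "n\u0304", "dicit"]))
```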
Alex Gil: useful to think backwards from representation, not forward from manuscript
What do I want to show?
I go back to the Cornell paper editions of Yeats
They used system that captures …
I don’t think we can get better than that for representing layout
Abigail Firey: or the edition of The Waste Land
Alex Gil: yes
Start with HTML; how many steps do I need to get there from TEI
How TEI must work
How Juxta must generate this
At an abstract level, all that Juxta needs to do is recognize positions and transform them into CSS
Comparisons: recognize difference and similarity
Different categories: addition, deletion, move
Different category: abbreviation expansion
Expansion is still going to be compared under addition, deletion, move
All you really need Juxta to do is recognize categories at the tag level that you want highlighted
Whether you want Juxta to mark as difference, similarity, or ignore
Abigail Firey: there are a few cases where for some disciplines there’s additional material e.g. mark folio number—that’s neither addition, deletion, nor move
Alex Gil: there are two algorithms, and the absence of both
Difference, similarity (mark sameness)
Ignore function: absence of similarity and difference: e.g. for pages—you want to represent but you don’t want algorithms to run on those things
Can mark things to ignore but it really ignores
What if you want to half ignore: still be there, but not noticed by Juxta
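[Editor’s sketch: one way to picture the treatment categories Alex describes (difference, similarity, ignore) plus his “half ignore.” The table, tag names, and default below are hypothetical configuration, not Juxta’s actual format.]

```python
# Hypothetical tag-treatment table: each markup element is classed as
# "difference", "similarity", "ignore", or "display-only" (the half-ignore:
# rendered for the reader, but invisible to the algorithms).
TREATMENT = {
    "add":  "difference",
    "del":  "difference",
    "pb":   "display-only",   # page/folio breaks: show, never collate
    "note": "ignore",         # fully invisible to the comparison
}

def collatable(tag):
    """Should text inside this tag reach the collation algorithms?"""
    return TREATMENT.get(tag, "difference") not in ("ignore", "display-only")

assert collatable("add") and not collatable("pb")
```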
Sometimes we don’t want to compare the whole text
[shows chart of text pieces]
Abigail Firey: yes; e.g. take out preface
This chart could represent my work
Might want to collate little chunks
Alex Gil: we need to re-imagine how Juxta handles workflow
There should be options
e.g. start slow
Juxta does not handle this complexity
We could handle my text if we added the function to allow user to specify parts to compare
Problem: a diff is a second level operation to a similarity
diff never comes first
It assumes alignment
But sequence in texts is unstable
Imagine a text with 66 moves …
Dana Wheeles: we need to think about our agenda
Hold off for your presentation
Alex Gil: for real life editors, the ability to handle texts that break down into many pieces is essential
Abigail Firey: agree
Medieval texts: authorial version doesn’t exist/matter
Texts are freely changed
Floating around in small units getting recombined
Alex Gil: applies across history, culture of reprinting
Abigail Firey: not reprinting the whole in a standard way
Mashups, remixing
Abigail Firey: Juxta has to be built for that presumption
Nick Laiacona: we assume a file with text, let’s diff it
Deeper problem: find common start points
Alex Gil: Juxta started by thinking of files as versions
Needs to think of texts as fragments, containers of fragments
Abigail Firey: have to retain model of different versions because a lot of editors will be doing that
Alex Gil: yes, necessary subset
Abigail Firey: a lot of Juxta users will be traditional, two versions …
Digital world: ability to edit texts we’ve never been able to edit because they’re mashups
Nick Laiacona: the developers of Juxta are hoping to find interface for web service
What kind of questions can you ask Juxta and what will answers look like?
Abigail Firey: user interface
Colleague was baffled by it
Not as transparent as one might think for first time user
Nick Laiacona: users are on the web: future of Juxta is on the web
It will be a long time before we can make the web version do everything the desktop version does
What do you put on web?
We’re talking about 2 interfaces: Java application, HTML comparisons in edition search
Alex Gil: you use Juxta to prepare an edition, but it is not the edition
You can make edition available—you wouldn’t want to use Juxta view
Abigail Firey: wondering about what gets lost in move to HTML
For a long time Classical Text Editor was stalled
CTE back in the game
They do full preparation for a print edition
Two functions for web Juxta:
- web replication of local software
- next stage of edition preparation
Alex Gil: no, not preparation stage, consumption stage
Abigail Firey: I think there are two different stages and I’m not sure you want to substitute the second for the first or consider them one
Nick Laiacona: scholar working with text versus reading version?
Abigail Firey: we have Drupal environment, individual spaces password protected
People can take bits from our database into individual work spaces
Work with Juxta, export from CCL editorial space into whatever format
Dana Wheeles: my concern: we’ve been thinking of web service as a pared down version
Difficult things would need to happen with downloaded Juxta
Question of how much we can do in a browser
Nick Laiacona: no not a question of how much you can do
Want to have everything in web environment eventually
Initially, you want to be able to share results on web
What’s the essential step that needs to be displayed
You seem concerned Abigail?
Abigail Firey: I think the current Juxta display of collations / differences / moves / notes would be useful to display on web, consumable
Nick Laiacona: I am confident we can reproduce that on the web, have prototype
Abigail Firey: what do you expect people to do on local, what on web
Nick Laiacona: That’s what we’re here to decide
Abigail Firey: I am having trouble imagining what users would do, what would be visible to whom
Nick Laiacona: we need your help to figure it out
Abigail Firey: I want all of it
Dana Wheeles: we need to prioritize
Are people going to be stripping XML, adapting schemas? (In private work spaces)
Abigail Firey: our users will use our online transcriptions in a Drupal database, XML files
Alex Gil: will they have the comparison view?
Nick Laiacona: yes
Alex Gil: imagine separating the process of creating from how you show them to users
Juxta will generate the view automatically
Abigail Firey: CCL does not determine which texts will be compared
The user determines
Clarification: end users, not just editors
Alex Gil: you want end users to use regular Juxta as interface
Dana Wheeles: assuming end users are scholars
What sorts of questions?
Abigail Firey: CCL is agnostic, we do not predetermine the questions
We make the corpus available for collation
Are texts related? Do canons change?
What if I choose a large collection of texts that all come from Paris, for example – do they read differently from the Berlin versions?
We don’t want to predetermine
We want scholars to explore
Then say it makes sense to edit 60 manuscripts from Paris (for example)
Alex Gil: so users are still working before the final product
Abigail Firey: yes, CCL does not host editions
It is an environment for people to find what they might want to edit
Juxta is part of the research environment, open to everyone, not just the editorial team
Dana Wheeles: that’s why I think this can work well
One can set up workflows without determining what work people can do
People can always ask for different capabilities
Give users certain abilities and they will gain confidence
Abigail Firey: people can take a copy of transcriptions
Get it into their own workspace, play with the text (fragment or whole)
Choose a comparison set
Start comparing
Open their workspace to other scholars: look at this comparison set, what happens when I change the base …
Once they figure out the most useful comparison set, then export selected data and work with it locally
Since transcriptions are on the web, workspace is on the web, it’s the easiest place to show comparisons to other people
Nick Laiacona: ultimately the whole thing should be on the web
But I don’t think we can do it this year
I think we can get side-by-side comparison on the web
Discover, load transcriptions, compare, share with others
Abigail Firey: what can’t go on web?
Nick Laiacona: it’s easier to say what can
Simple side-by-side comparison of two texts
The web service would allow you to get information about the entire set you’re comparing
What I’m thinking: modify desktop Juxta to make it easy to discover texts from CCL and load into comparison sets
Abigail Firey: if necessary we can instruct users to download to local
Uneasy about having them upload stuff to CCL site
What about heat map, histogram?
Nick Laiacona: those would be the next things
Abigail Firey: not in first web version?
Nick Laiacona: we could delay and make them part of the first web version … not hard once we get other parts working
Abigail Firey: what kind of export function?
Will you be able to generate a critical apparatus?
Dana Wheeles: you would want HTML, a link to place on web where you can see the comparison
What format makes sense?
An HTML apparatus is not useful
Abigail Firey: a “generate critical apparatus” function would be useful—presumes download to another environment
XML for people who want to retain XML markup for other web publications
We’re trying to make our schema standard for medieval canon law texts
Our markup may be useful for other medieval canon law projects
Something that sensibly strips the text and gives plain text reading view
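[Editor’s sketch: a minimal, hypothetical version of the “sensibly strips the text” export. Element names in DROP are examples only; a real export would need per-project rules, e.g. whether del readings belong in a reading text.]

```python
# Hypothetical plain-reading-view export from a TEI transcription.
import xml.etree.ElementTree as ET

DROP = {"note", "teiHeader"}            # subtrees omitted from the view

def collect(elem, parts):
    if elem.tag.split("}")[-1] in DROP:  # ignore namespace prefixes
        return                           # skip the whole subtree
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        collect(child, parts)
        if child.tail:                   # tail text belongs to the parent flow
            parts.append(child.tail)

def reading_text(xml_string):
    parts = []
    collect(ET.fromstring(xml_string), parts)
    return " ".join(" ".join(parts).split())  # normalize whitespace

print(reading_text("<p>ut <add>nunc</add> dicit<note>a gloss</note>.</p>"))
```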
Dana Wheeles: we’re on the way there
Within the .jxt file you have a copy of the files that are being collated in the original XML format
If they download the .jxt file of the comparison set they’ve got the markup schema
Abigail Firey: they didn’t have to export from Juxta—once they do the exploration and know what files they want…
Nick Laiacona: we can generate a plain reading copy
You upload a text, compare it with another
You should have a URL that goes right to that text
Download that in HTML
Produce a version that has no markup whatsoever
Abigail Firey: what have most users been doing after collating in Juxta?
Dana Wheeles: most people aren’t interested in critical apparatus
They take screenshots of critical parts of visualization
Nick Laiacona: they’re not sharing it effectively
Abigail Firey: they’re not editing
Dana Wheeles: they’re using it as an analytical tool
What Alex is talking about is beyond what we can do
Nick Laiacona: there used to be no editing function
Now you can choose what to keep and what to exclude
Abigail Firey: I’m struck that people are collating for reasons other than to produce an edition
Alex Gil: in theory Juxta should allow discovery AND generating editions
Generating an edition is just a matter of export
Dana Wheeles: but everyone will have a different idea of what they want
Alex Gil: the answer is style sheets
The data is same for everyone
Abigail Firey: I wonder if partnership with editorial software like CTE would supply that stage; export functions are already in CTE
…
Nick Laiacona: Juxta is living in the research world more than in publishing and consumption
To what extent should it extend into publishing and consumption?
Abigail Firey: or interface with existing software …
Dana Wheeles: it’s a tool for analysis right now
We want to make sure people can do all the analysis they want
Abigail Firey: how do you harvest that?
Alex Gil: analysis is useless if I have to retype the data to produce an edition
Why recode everything if Juxta performs the tasks?
Have had to add span tags to HTML and use JavaScript to align things
That’s what I’m publishing
I don’t want readers to mess with my stuff, just read it
Dana Wheeles: we’re talking about different things: a web service with some level of analysis, and a web interface to give a better representational tool for analysis
Alex Gil: an analytical tool can easily generate something that’s not a tool, an edition
Dana Wheeles: in order to develop Juxta for the web and get to that endpoint, how can we strategize what we want to build, to get sense of how people want to use the service?
Do we build it to be easier to compare two texts side by side, or do we start with a heavy version?
Alex Gil: I vote to start simple; anyone who wants to do heavy work can download the desktop version
Dana Wheeles: we want to build things in for CCL that will make it easy for users
Abigail Firey: heavy analysis = collation, side-by-side, histogram, heat map; generate critical apparatus—get a list of variants—from that you can build an edition
That’s the current “exit door”—not bad
It’s essential to have some way of capturing collation—that’s what a critical apparatus has always been
Users can do other things with that
Users can see a lot, analyze a lot, but can’t DO much when it’s all just visualized in Juxta—visualization is not the end point
We need to bridge Juxta and the scholar’s problem
Dana Wheeles: going to the web service: we’re not stuck with one form of critical apparatus
.jxt is the critical apparatus
It contains the texts under question
Abigail Firey: how do we transfer it from user to user?
Nick Laiacona: use the .jxt file to transport
The software runs algorithms
You’ve chosen what witnesses are included, starting points, etc.
It’s a way to share with someone else
Bridge from research environment to editions: bring .jxt file to own system; use Juxta in different way; take HTML representations
Transforming TEI into HTML is important
Abigail Firey: people can get TEI from us
Get analytical Juxta on the web
Give greater exposure to users and editors
We can make the case that we need funding for next stage of development: we’ll need a better interface for producing consumable editions—that feels like another major round of development for separate funding
Nick Laiacona: there’s the distinction about starting point for texts
Alex Gil: the dual view should be standardized
Edition building: a stage of funding
Dana Wheeles: I could imagine going to CCL and as we’re working on features for discovering similarities we could take small batches of texts and pre-run them and say hey, the computer has found interesting similarities—would you like to take a look?
Abigail Firey: CCL users are already panting to collate
They’re limited by the current size of the data set
Nick Laiacona: when you put a file in a comparison set—it automatically compares to all others
You could already have all cross-comparisons
Alex Gil: we are already primed to export an XML file that has differences produced by the algorithm encoded in it
Nick Laiacona: there’s an issue with web service of real time computation versus batched computation ahead of time
Perhaps there’s a way to batch pre-compute an entire edition on somebody’s laptop and upload that .jxt file
Abigail Firey: I’m wondering about what happens when our corpus gets big
Alex Gil: once you do the work, you want to save it, not redo it
Nick Laiacona: Dana went to get lunch
Ron and Gregor are not here yet – we may see them at 2:30 or 3:00, or tomorrow
Let’s go on the assumption it’s going to be just us today
Let’s spend more time on discussions tomorrow instead of hacking
Alex Gil: shall we show and tell the new features?
Lou Foster: you saw spot editing in 1.5?
Have you seen templates?
Alex Gil: I never saw 1.5
Lou Foster: new features:
Parsing templates
When it sees a new schema, it pops up a dialogue and allows you to include or exclude different elements
Abigail Firey: we decided to have one global thing that says “exclude most of the header tags”
You don’t want to exclude the TEI tag or you get a blank screen
Lou Foster: template files
Nick Laiacona: you could create a “skip headers” template
Lou Foster: you can save / rename / change template
Templates are associated in two places: (1) tied to the Juxta file; (2) a library of templates not tied to any Juxta file
Alex Gil: can you edit on the fly?
Lou Foster: menu … [demonstrates]
Abigail Firey: could you add a “save as” or “generate new” button to dialogue?
Nick Laiacona: there are two sets of templates: current and archived
Lou Foster: spot editing
Edit button on lower right
Abigail Firey: that terrifies me. What about version control?
Lou Foster: there is none.
Alex Gil: you’re supposed to write in the editorial statement
Nick Laiacona: we could generate a new witness rather than editing the original
Alex Gil: down the road, you want to share files, make an open source edition;
Someone else can make changes to my file
What happens when the next person downloads?
Versioning is crucial
Abigail Firey: this is a problem for editors: we’re working with to-the-letter accuracy
You’re never going to remember your quick fixes
Now your file is different
You lose track of whether you’re looking at an edited file
Pop up a warning in red: you have changed this file, do you want to change its filename?
Nick Laiacona: do you want to make a new witness or replace this one?
Alex Gil: save it for history—at no point should you erase the history
Nick Laiacona: the origin of this feature was in the SHANTI funding; we were designing for students, not necessarily researchers
Alex Gil: as much as Dana insists that this is an analytical tool, it is already an editorial tool
It doesn’t exist in a vacuum
Abigail Firey: let’s not put this feature in the web version
It would invite world of trouble
I don’t like the thought of XML files being edited without version control
Not really files, but the way files are going to flow in . . . this is scary
Alex Gil: think about your case: users generate analysis
A user picks up an error
The user has to write it down and remind other users?
Or do you allow the editors to tweak, keep master file, save their own versions?
I’m not afraid of the masses
Idea of perfection . . .
Abigail Firey: I see where you’re going, but—there’s a question of scholarly responsibility and the transcriptions we deliver to users
Why shouldn’t the correction go to a master file? If someone inadvertently messes up the XML files …
Alex Gil: I have the solution. Think of Google Docs: there are three levels of users: private, reading, and public
Put your file there, give access to master file to people you trust
Nick Laiacona: I think Abigail is saying keep copies, don’t destroy anything
Abigail Firey: [anecdote about a person who saved too many copies, losing version control]
Alex Gil: you don’t want Juxta to go on the web without version control
I agree you don’t want shared tools without versioning
Abigail Firey: you should at least have a warning
Nick Laiacona: we’re certainly not going to put this in initially
Alex Gil: there could be a simple toggle of permissions – htaccess file
Nick Laiacona: this comes into play in the interface between Juxta desktop and the web service
Abigail Firey: there’s the issue of synchronization
If we update on the web, users are working with old files
Make web call from the local version to the web and deliver an alert message when appropriate: “witnesses you’ve selected have been updated”
Nick Laiacona: maybe when we load a file into the web service, there’s a URL where you can get the authoritative version
It would be cool if they could call out and get the latest version
Abigail Firey: we have a reporting system for entering changes for user corrections
Nick Laiacona: when you load the web service initially, you’re not loading it, you’re using a URL—you must refresh
Alex Gil: github model / stemmas – trees and branches
Scholars can protect their materials, other people can use same datasets but branch out, play
In github you can still trace back
Nick Laiacona: if we build in simple versioning control system—sounds important—that’s a whole big other piece, but may be necessary
Abigail Firey: I like idea of real-time calls to a URL with notice to the user that these files may have been updated / have been updated
Nick Laiacona: we could keep every copy ever put in and order them, assign a revision number
Abigail Firey: we get into storage …
Alex Gil: you’re not replicating files, you’re just keeping revisions
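[Editor’s sketch: the “keep every revision, assign a revision number” idea from just above, as a hypothetical append-only store with a staleness check (cf. the earlier suggestion of a web call that warns when witnesses have been updated). Names are illustrative; this is not Juxta’s storage model.]

```python
# Hypothetical append-only revision store for witness files.
class WitnessStore:
    def __init__(self):
        self._revisions = {}                # witness name -> list of texts

    def save(self, name, text):
        revs = self._revisions.setdefault(name, [])
        revs.append(text)                   # never erase the history
        return len(revs)                    # 1-based revision number

    def is_stale(self, name, revision):
        """Is a client holding this revision behind the latest one?"""
        return revision < len(self._revisions[name])

store = WitnessStore()
r1 = store.save("witness-A", "in principio")
r2 = store.save("witness-A", "In principio")   # a correction arrives
assert store.is_stale("witness-A", r1)
assert not store.is_stale("witness-A", r2)
```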
[lunch break]
Lou Foster: wrapping up spot editing
Edit, hit “update” and it will re-collate
Alex Gil: you can edit XML or text?
Lou Foster: yep, you can edit whatever
If you break it, it will throw up an error message
It won’t give an error if you put in well-formed XML that’s not part of the template
Note support: when it encounters a note, it will put it in a bubble off on the side
If you compare two documents and click on highlighted text it will switch to difference detail margin boxes; click anywhere that’s not a difference to go back to note view
Abigail Firey: what if noted text is one of the differences?
Lou Foster: it still works
Switch to slightly newer version
Show managing revisions
“Abc” button with strikethrough or pencil
Revisions = additions and deletions
Step through all the revisions in the document
When you click the edit button you can accept or reject revisions (scribal revisions) on a case-by-case basis
[Gregor and Ronald arrive]
If you hover over a site of change it will give you a tooltip
Abigail Firey: if you’ve changed it, will it give the tooltip?
Lou Foster: no, should it?
Abigail Firey: yes
Alex Gil: add the accept/reject function to the “ignore” control
Generalize it—give me the option to accept/reject certain tags
Lou Foster: once you’ve accepted all you want, hit the arrowy thing and all your choices will be incorporated in the document (.jxt file)
Abigail Firey: but they’re still marked so you can undo ?
Lou Foster: yes
You can toggle versions at will
You can add different versions to your comparison set—just add the document again
Abigail Firey: we’ve nested add/del tags within set type = core
I see the option to allow users to decide letter by letter
Alex Gil: you want to be able to do this kind of manipulation with the markup that’s already there and the markup you will add
Abigail Firey: this keeps turning into “Lou can you add user options?”!
Nick Laiacona: the problem of too many options: it’s overwhelming for the user
Alex Gil: think of the three levels of working with Photoshop: elements, medium, advanced
With this, you could allow on/off: what kind of user are you, basic or advanced?
Abigail Firey: software development is a process: we have to put something out and see what users want, either “we are pleading for granularity” or “the granularity drives me nuts”
Alex Gil: we don’t want to limit the software because of not wanting to overwhelm the users
Editors are expert users
Either/or alienates half of the user base
Nick Laiacona: let’s do Alex’s presentation, then Abigail’s, then Gregor and Ronald
Alex Gil: one more question for Lou Foster: are there changes in the algorithm?
Lou Foster: there are a couple changes in the parsing – to know which revisions you accepted, how to handle notes
Alex Gil: what happens when Juxta becomes not only analytical but a representational tool?
Rhetorical power
Eventually Juxta will produce editions—a finished product with its own interface
An idea has been evolving: re-imagine Juxta as content management system whose job is to produce scholarly editions
I think it’s helpful to think about it as CMS
There’s a separation between the data that all scholars share and their representation of that reality
A CMS like Wordpress has a database of posts, etc.; representation of that content is done differently with themes
Themes – not only style sheets, but also interface—different functionality
Separate representation from data because different editions have different users
You want a reading edition that communicates scholarly categories, e.g. dual view
CMS would look like a Wordpress dashboard but more Juxta-y
Several things happen: Selection of texts
Three levels of users: basic, medium, advanced
Readability of text versus visibility
Readable = clean
Visible = analytical categories
The balance is always a problem for scholarly editions
Juxta needs its own standard theme
My work: an HTML replica of the original
Collaboration: all working on same edition
Also want other users to be able to borrow text and make something new, with a new theme
Different levels of permission, recirculate the data
Juxta does not understand chronology; it assumes that you are working with whole versions of texts that share a sequence
[presents his project]
I want Juxta to be able to start by matching approximate strings before it does the diff
1. Organize corpus; 2. establish the relationship between two different states
Example of several fragments published separately, moved around in relation to each other, etc—a problem of recognizing approximate string matches
Abigail Firey: problematic case: imagine a text of ballads with a refrain—the refrain will cause havoc
Alex Gil: I have encountered this problem
The solution is to recognize the refrain and tell you how often it’s repeated
The problem is noise—it’s a statistics game, so there is noise; there’s also a problem of missed positives
There are very few false positives
You can’t run diff first; you have to find the matches first
Then run diff inside matching fragments
Parts that don’t match, consider as a deletion; ignore it, you don’t need to run diff
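[Editor’s sketch: a toy version of the match-then-diff workflow Alex describes, with Python’s difflib standing in for whatever matcher a real tool would adopt. Paragraph chunking and the 0.5 threshold are arbitrary assumptions.]

```python
# Hypothetical "find approximate matches first, then diff inside them".
from difflib import SequenceMatcher, unified_diff

def match_then_diff(text_a, text_b, threshold=0.5):
    frags_a = [f for f in text_a.split("\n\n") if f.strip()]
    frags_b = [f for f in text_b.split("\n\n") if f.strip()]
    for a in frags_a:
        # Best approximate counterpart anywhere in B: order-independent,
        # so transposed blocks are still found (unlike a plain diff).
        scored = [(SequenceMatcher(None, a, b).ratio(), b) for b in frags_b]
        score, best = max(scored, default=(0.0, None))
        if best is None or score < threshold:
            print("no counterpart (treat as deletion):", a[:40])
            continue
        # Only now run a fine-grained diff, inside the matched pair.
        for line in unified_diff(a.split(), best.split(), lineterm=""):
            print(line)
```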
Nick Laiacona: let’s do 10 minutes of questions, then Abigail’s presentation
Abigail Firey: Alex’s diagram of his project reminds me of plectograms used by David Birnbaum to show the order of sermons in different manuscripts
[…]
Alex Gil: the graph could be generated with Ben Fry’s Processing language if the relationships were encoded in the texts
Gregor Middell: we need to look at terminology: there’s a difference between sequence alignment and pattern matching
For diff, sequence is crucial—it looks for common sub-sequences
Diff is not able to cope with transpositions
That’s how sequence alignment works
Pattern matching has been tried; the problem is that correlating transpositions is an interpretive act; it can only be done by a human exercising insight
You can make a case that this cannot be solved with an algorithm
We can use pattern matching to establish possible relationship between texts
An editor then must filter what makes sense
Even if I do pattern matching, I cannot make the decision that is made from the editorial point of view
Alex Gil: I agree, I don’t think I said anything different
I can’t use Juxta; it breaks down
Sequence is not a given
Model of pattern matching followed by editorial intervention—not every editor is going to want to work with this workflow
Abigail Firey: I have had trouble with the move function: screen real estate issue: when bits are moved very far …
Juxta feels cramped
If we’re trying to figure out priority features, I wonder if the move function is one of the more problematic; also exciting and promising; room for improvement
Nick Laiacona: let’s pick up again at the end of Ronald’s presentation
Move on to Abigail’s presentation
[break]
Abigail Firey: I am not a developer—this will be at a rudimentary level.
In my experience, the strongest software is built when there is strong communication between scholars and developers
NEH study: the single biggest problem was communication between scholars and developers
I want to go over 3 areas of my project that might be useful to developers to bear in mind
First: some issues that arise for medievalists in working with manuscripts and texts
There are a lot of presumptions about texts based on printed texts that don’t hold for other fields
I also would like to go back to the question of textual structure and hierarchy
Finally: issues of collaboration and workflow
Two things that make the CCL a good test case: we have an enormous corpus that is unexplored; it has defied editing. And I am building the CCL on the presumption that I could be hit by a bus and people could still use the CCL collaboratively.
Considerations:
- Latin is an inflected language; every noun has eight forms, etc. Customized XTF for CCL. Do we need to make Juxta speak Latin? No, if we make Juxta work for environments like CCL.
- Orthography: not standardized—makes pattern matching hard.
- Latin manuscripts are heavily abbreviated: transcribers expand abbreviations.
- Paleographic considerations: text as physical artifact.
- Glosses: you can do a lot by hand that you can’t in print. There’s no good system of footnotes or call numbers relating a point in the text to a gloss or note; it’s all spatial—the scribe will put a comment in the margin. It’s inappropriate editorial intervention to convert these to footnotes; we try to use a system that replicates marginal notes.
- The distinction between different hands is very important. Is Juxta going to offer font differentiation to represent change of hand? Different font, different color? This is not a high priority, but medievalists may ask.
- There are very few autograph manuscripts; we have copies of copies of copies . . .
- There is very little stability. Things can move, get dropped out, replaced, etc.
Implications for Juxta:
Excited about filtering out “Latin noise”
A lot of small clutter—e.g. using c for t
It’s a question of where on the editorial scale you fall: I think those variants are insignificant. I hate it when Juxta turns blue with a sea of differences between oe/ae and ligatures
Some colleagues would say it could be important. Philologists would care.
Make a sliding board—so the user can control which variants they see.
That is hard software to write.
There are no simple rules—oe/ae is an obvious switch, but they sometimes substitute for e; you can’t substitute ae for e …
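[Editor’s sketch: the “sliding board” as tiered normalization rules, applied only up to the level the user picks. The rules below are deliberately crude examples; as Abigail says just above, no simple rule set is philologically safe.]

```python
# Hypothetical tiered Latin-noise filter; rules and tiers are illustrative.
TIERS = [
    [("v", "u"), ("j", "i")],       # level 1: u/v and i/j leveling
    [("oe", "e"), ("ae", "e")],     # level 2: diphthong leveling (lossy!)
    [("ti", "ci")],                 # level 3: c-for-t spellings (nacio/natio)
]

def normalize(word, level):
    word = word.lower()
    for tier in TIERS[:level]:
        for old, new in tier:
            word = word.replace(old, new)
    return word

# At level 1, "uita" and "vita" collate as the same token; at level 0 they
# remain distinct, so a philologist can keep seeing the variation.
assert normalize("vita", 1) == normalize("uita", 1)
assert normalize("vita", 0) != normalize("uita", 0)
```

A real implementation would attach rule sets to languages and let the board slide per comparison rather than globally.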
How far do you want to push Juxta to pick up paleographic features?
Add/del: substitute is not the same as correct. Probably won’t affect how Juxta handles things.
Example of one canon in progress: in a short text, 12 variants, many witnesses.
Editions of medieval texts tend to be heavy on apparatus and variants. We’re used to it. Keep that in mind.
On the CCL, we show a toggle between revision sites / scribal corrections: you can click on a word in red and it turns burgundy and shows variants.
Medievalists are used to seeing in a transcription the locus: what page are we on? Put that into light gray text: it will be easy to read past.
Square brackets with ellipses represent unclear parts of text.
Working with developers: we struggle with terminology and hierarchy
For medievalists the physical artifact (codex/manuscript) is unique. Always named by shelfmark.
A manuscript may contain more than one collection; a collection may appear in more than one manuscript.
A collection is a type of text. “Text” can mean anything from smallest atomic unit (individual canon) up to the whole.
Subdivisions of canon: rubric, title, number, body of statute, identifier
Juxta has potential for future development in how it handles the relationship between a larger unit of text and smaller atomic units.
Do we have to load in a transcription for an entire collection or could we load a bunch of canons and just collate those?
The hard thing about these canons: Alex said we imagine texts in eternal present. I have a lot of canons that do live in the eternal present. Canons get recycled and rearranged over and over even if their origins were at different moments.
Often I am wanting to collate something from different moments.
Then I want to learn about textual context.
I’m always moving between the fine units and the larger units.
Even though we all say that programming doesn’t care much about content and markup is agnostic, actually it’s helpful for developers to understand type of textual material we’re working with, even if it involves looking at THE CHART
Developers started making their own chart of XML format details
We are feeding into Drupal/XTF transcriptions of entire canon collections—can be hundreds of canons.
But we’ve also allowed users to find individual canons.
We thought it would be cool to have a checkbox saying, “I want to put this in Drupal.”
We also offer translations of Latin
We’re using existing CMS to do that rather than ask Drupal to become a one-stop shop.
Nick Laiacona: [question about XTF and Drupal]
Abigail Firey: We’re finding neat ways to pipe between Drupal and XTF; it has involved a lot of code to bridge them
We’re doing things appropriate to each function.
E.g. search function built in XTF
How can we draw on the power of Drupal to deliver data in different ways?
Once you find data in the canons, what can you do with it?
We’re doing something better than “cobbling together” to draw on the potential of both.
We’re embedding an XTF iframe in the site.
Transcription tool, search engine, collation.
Nice standard toolbox.
Shows example of where Drupal is able to tap into XTF display.
Going back to the problem of glosses: medieval manuscripts get gloss-heavy, especially Bible and canon law.
Will it be possible to collate the glosses?
Nick Laiacona: You and Hans Gabler . . .
Abigail Firey: the sheer quantity of glosses shows that it is a text in its own right.
A lot of glosses are interlinear: different spatial relationship.
Our XTF guy said “no problem” but I didn’t like the result.
Switching to T-PEN for a moment: new transcription tool.
Many manuscript repositories are releasing digital images. What can people do besides look at them? Scholars want to do stuff with them.
We’re making agreements that T-PEN can pull images and put them in our repository for the transcription tool.
It’s not taking a copy—it redelivers the image.
Transcription is correlated to location in the image. You can annotate. Our team is getting crushed by markup in a large corpus. Now you can use custom XML tags, human usable. E.g. button for “gloss.” There’s a reminder in the corner to close that tag.
We figure we can get 80% of the necessary markup. What’s missing: TEI placeholders.
We can export in XML, the CCL team can pull it up in Oxygen and add the missing 20%, then put it back in Drupal.
Gregor Middell: how do you handle overlap of tagging?
Abigail Firey: in the 20% that we fix manually. We’ll see that the file doesn’t validate.
You can also upload plain text transcriptions or XML transcriptions and go right to the line and proofread.
There’s a good chance of good resolution of non-validation problems with reference to the images themselves.
We expect that files won’t validate.
A developer wondered why we have Oxygen in the loop: we can’t expect users to produce valid markup. But they can do the bulk of the work.
The CCL is unusual: it’s a collaborative environment. We’re not an editorial project. We’re providing accurate transcriptions made by experts to anyone who wants to use them at any time. That has implications for Juxta: we’re supporting an unknown universe of editors. We don’t know which files they’ll select for collation, what editorial approaches they’ll want to apply, how many will be working together, where they’ll be.
We’re crowdsourced (though it’s a small expert crowd)
People can contribute transcriptions and translations.
If it’s published raw we say it’s “in progress” and people can comment; when it’s approved we stamp it CCL approved.
Once data files are approved, we lock them down. It takes a lot of work.
But we want it to be possible for copies of those files to be placed in individual password-protected workspaces where people can manipulate those files in Juxta.
People want to work with fragments of texts not just full transcriptions.
You can break out chunks and reassemble them (with locus information). xml:id attributes give their position on the page.
We need to do some work on allowing people to break out pieces of texts.
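[Editor’s sketch: breaking a single canon out of a full collection transcription by its xml:id, assuming each canon carries one. The file name and id are hypothetical.]

```python
# Hypothetical extraction of one canon from a collection transcription.
from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def extract_canon(path, canon_id):
    for elem in etree.parse(path).iter():
        if elem.get(XML_ID) == canon_id:
            return etree.tostring(elem, encoding="unicode")
    raise KeyError(f"no element with xml:id={canon_id!r}")

# fragment = extract_canon("collection.xml", "canon-042")  # names invented
```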
Web-Juxta questions to think about at Juxta Camp:
Data flow: how are files going to be imported and exported? CCL and user workspaces
How do users get files, how do we protect CCL data?
Can you work with fragments as well as full transcriptions?
Can we extract single canons into individual workspaces?
If we make copies of files, where will they be stored? Where will .jxt files be stored? Will there be online help? Where will a web service run—iframe in Drupal?
How do we keep users’ data private or allow them to invite viewers in?
Alex Gil: the canon versus collection problem is generalizable. There are two textual units. One is a bibliographic category: people have bound it a particular way.
Within a collection, there is a stream of texts. In another collection, there may be matching canons. Imagine a collection where there’s only one matching canon. The relation between the two collections is irrelevant apart from matching canon. The large context gives canon its precise bibliographic marker. The other canon relates as a match; the bibliographic code doesn’t matter.
Have I lost you?
Abigail Firey: maybe I lost you. There’s the meaning of the individual canon. It obtains meanings intertextually from the different collections and from its real life application. I’m not sure that the distinction between bibliographic and semantic is useful—
Alex Gil: semantic is not part of it. I’m talking about bibliographic and linguistic similarity.
I can imagine a situation in which all my files will have the bibliographic marker distinguishing them as files. Within those files there are canons included. There may be different relationships between the files and each other and the canons and each other. There are two levels of operation here. It’s not useful to talk in terms of hierarchies; it’s better to talk in terms of networks.
Imagine the ideal software: when I look at canon x, the software shows me the matches to other items.
Abigail Firey: you can find that in the search engine. It will find all the canons with that search string.
Alex Gil: problem: imagine a canon where three or four words are changed. If you search for those words, you won’t find those canons.
Abigail Firey: we’re not actually providing transcriptions of coherently defined texts. If you find two pages that have never been recorded just send us those canons. We have all sorts of texts in all sizes.
Alex Gil: do you make sure each corresponds to a bibliographic unit?
Abigail Firey: we always have a record of shelfmark and folio.
Alex Gil: sometimes I just want to compare matching blocks. If I could click on a text and see the ones it’s related to …
You want to be able to compare segments. If you’re able to create relationships between blocks of texts you should be able to very easily—
Abigail Firey: I’m going back to human intervention. I can search and see that canons are related.
Nick Laiacona: you can’t just select an entire TEI file, you want to select a section of the file.
Alex Gil: load up collection, run an algorithm that will suggest a place to begin—
Nick Laiacona: we don’t have that algorithm right now—
Abigail Firey: if files of individual canons are already broken out…
Gregor Middell: we are facing a similar problem. We have some integration solutions. You can pull witnesses and tell Juxta “collate these.”
Abigail Firey: if it works on a large corpus I’m interested.
Gregor Middell: scalability is not a problem, but you might need to duplicate data.
Abigail Firey: you are duplicating?
Gregor Middell: yes.
We haven’t decided whether we will then get rid of the XML.
The basic problem is hierarchies. We are describing our source material from a manuscript-oriented perspective, trying to preserve spatial information. Then we do textual transcription in a different XML file. Different perspective. We end up with overlapping hierarchies.
If you go into collation, this problem gets exponential. If you ask Juxta to tokenize it—
This can be done on different granularity levels. (character, word, sentence)
There are layers of markup that you cannot fix in a single XML file.
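[Editor’s illustration of the overlapping-hierarchies problem Gregor describes, under the assumption of a simple two-perspective transcription:]

```python
# The same words belong to a physical hierarchy (pages, lines) and a
# textual one (sentences); the two cannot nest in one well-formed XML file.
physical = "<page n='1'><line>Lorem ipsum</line><line>dolor</line></page>"
textual  = "<s>Lorem ipsum dolor</s>"
# A sentence running across the line break would force <s> to overlap
# <line>, which XML forbids in a single hierarchy; hence separate
# transcriptions per perspective, later correlated (e.g. semi-automatically
# via collation, as Gregor describes).
```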
Abigail Firey: we’ve been worried about tokenizing
Gregor Middell: you might have a physical object that contains multiple texts.
Abigail Firey: XTF could not handle multiple manuscript items.
We’re doing some programming to fix it. Quick fix: separate out into separate files.
Gregor Middell: we are transforming XML markup into a standard markup format able to handle arbitrary overlap and using the parsing step from Juxta. We are also using collation to blend different hierarchies into each other. If you have different transcriptions of the same text manuscript, you can collate them and get correlations between texts that are the same—you can blend different markups over each other. We use collation to semi-automatically blend them.
Abigail Firey: collation is helpful for cleaning up transcription errors.
Gregor Middell: you are getting all kinds of problems in standard markups. You cannot modify the file anymore without major tooling.
Abigail Firey: we feel that we have to have gloss display, toggle feature . . . every time we move around, we lose it.
Nick Laiacona: for first version of web service, side-by-side comparison: once you’ve marshaled the texts, fire up Juxta, do your work, share the .jxt file—can post to your workspace? Display on web service?
Lou Foster: [sorry, your note taker missed the question]
Nick Laiacona: … you could get back a stable URL and do editing functions in Juxta.
Abigail Firey: in Drupal we have stable URLs.
Dana Wheeles: we are going to need some kind of anchor or command: which two of your comparison set do you want to have up? Do you want to highlight a particular passage, jump to a piece of the text?
Nick Laiacona: marginal notes are similar to Alex’s floating boxes—texts are emerging in parallel; we’re looking at them wrong, in a linear way
Abigail Firey: with collections, you can’t tell where they end. Scribes add a few more canons. Can’t tell how or why they got there. Additions, interpolations, full form?
Alex Gil: these things are operating at different dimensions. Juxta should be aware of the dimensions. It’s important to keep in mind that what we have here is a system of relationships, just like Collex. This block in a manuscript has a relationship to XYZ.
Abigail Firey: we wanted to see the relationship of the texts within the relationships. But if you imagine the scribe putting together a collection from other collections, the relationships between any two collections are not the ones that matter—you need to collate at the canon level.
Alex Gil: we’re in complete agreement.
For example, a canon can have five relationships. Choose one. It has seven relationships.
We’re talking about networks, organically connected.
Abigail Firey: there is no single text.
Alex Gil: this doesn’t mean that every single item can’t have a set of fixed relationships. Every box in my chart has finite number of lines leaving.
Nick Laiacona: those lines are interpretations.
Alex Gil: those are not arbitrary. I figured them out. I did it manually, but there was a percentage overlap; it was obvious these things were blocks. But there were a fixed number of relationships. There will always be finite sets of relationships and they can be computed.
[break]
Ronald Dekker: the Gothenburg model for collation. An abstract model for collation.
Intro: My name is Ronald Dekker. I work at an institute in The Hague. I am the head of a development team; I am not a researcher, but I am surrounded by them.
We have software for a publication tool. We support annotations.
Transcription, annotation, collation, publication.
I do the technical direction and implementation.
At the 2009 collation summit there were experts from Juxta, TextGrid, CollateX, eXist, Interedition, etc.
The idea of the meeting was to come up with a generalized model of collation so we could connect software, work together instead of compete.
Challenges of making a collation tool: (mainly from the perspective of getting as much usage of your collation tool as possible; not from the perspective of the end user, but how can I get the largest community around this tool?)
How do I support source material in different file formats? Relational databases, TEI, etc. How do I get existing data into a collating tool?
There are many ways to visualize a collation result. Alignment tables, etc.
Source material can consist of different (national) languages—Latin, Greek, etc.
Abigail Firey made the point earlier that developers tend to ignore the differences. Not every language works with every tool.
Abigail also mentioned spelling variation. I have noticed it in medieval Dutch, Greek, German.
Organizations use different platforms (Windows, Unix, Mac; different servers; different desktops). A collation tool needs to run on multiple platforms.
Developers use different programming languages.
Requirements: Support multiple file formats, support multiple languages, etc.
Source material stored in different file formats.
- Make sure you can get existing material into your new collation tool: import or convert.
- Make the barrier of entry as low as possible. How?
- See what is the least amount of structure you require.
- Alignment table: one of the simplest output formats.
- Simplest model: you need a set of witnesses and some identifier for each.
- Plain text content is the simplest you can have.
- You can convert TEI into plain text.
- Basic pipeline for collation: witnesses > aligner > alignment table
- This pipeline will become more complicated by the end of the presentation.
Source material consists of different languages
- Example: tokenization of Armenian text, which does not split on white space.
- How do you deal with that in a generic model?
- Extend pipeline with a tokenizer: source > tokenizer > witnesses > aligner > alignment table
- You can customize tokenizer and alignment (two steps).
- Tokenizers support different file formats; all produce a set of witnesses; each witness contains an array of tokens. A token has two representations: actual content and normalized form.
Need a way to visualize results.
- Extend pipeline: source > tokenizer > witnesses > aligner > variant graph > visualizer > alignment table
- Three steps that can be customized.
Need a way to handle spelling variations
- Language specific.
- Happens a lot.
- Obvious to humans, not to a computer.
- Need a custom matching function.
- Extended pipeline: source > tokenizer > witnesses > matcher > matches > aligner > variant graph > visualizer > alignment table
- Can apply rules to specific tokens.
- Need different rules for different languages, but can reuse the rest of the pipeline.
This is the model that we’re using now.
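[Editor’s sketch: the pipeline just described, reduced to replaceable parts in Python. The function bodies are toy stand-ins, not CollateX code; only the shape of the stages follows the model.]

```python
# Hypothetical Gothenburg-style pipeline: each stage is a separate,
# swappable part (tokenizer, matcher, aligner).
from dataclasses import dataclass

@dataclass
class Token:
    content: str      # the actual reading, e.g. "Ihesus"
    normalized: str   # the form used for matching, e.g. "iesus"

def tokenize(text):
    """Tokenizer stage: here, naive whitespace splitting plus lowercasing."""
    return [Token(w, w.lower()) for w in text.split()]

def match(a, b):
    """Matcher stage: language-specific rules (spelling variation) go here."""
    return a.normalized == b.normalized

def align(toks_a, toks_b):
    """Aligner stub: pairs tokens positionally. A real aligner builds a
    variant graph and handles gaps and transpositions."""
    return [(a, b, match(a, b)) for a, b in zip(toks_a, toks_b)]

# source > tokenizer > witnesses > matcher > aligner > visualizer
for a, b, same in align(tokenize("In principio erat"),
                        tokenize("In principio fuit")):
    print(f"{a.content:12}{b.content:12}{'=' if same else '!='}")
```

Swapping in a different tokenizer or matcher leaves the rest of the pipeline untouched, which is the interoperability point of the model.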
Technical stuff.
The pipeline is an abstract concept; there are many ways to implement it.
Each of the parts is a web service, using HTTP and REST
Different institutions can host different web services.
When you have a collation problem, you can write your own solution.
You can also use a pipeline that stands alone on your own computer.
Our pipeline runs in Cocoon.
CollateX is written in Java; Java VM is platform independent.
JSON and XML are platform independent.
Developers on my team work on different platforms; it doesn’t matter.
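[Editor’s sketch: what a platform-neutral JSON call to such a web service might look like. The endpoint URL is a placeholder and the payload shape is an assumption; consult the CollateX documentation for the real endpoint and exact request/response schema.]

```python
# Hypothetical HTTP/JSON call to a collation web service.
import json
import urllib.request

payload = {"witnesses": [
    {"id": "A", "content": "In principio erat verbum"},
    {"id": "B", "content": "In principio fuit verbum"},
]}

req = urllib.request.Request(
    "https://example.org/collate",               # placeholder endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Accept": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     alignment_table = json.load(resp)          # e.g. an alignment table
```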
Sum up: how to achieve interoperability.
- Split up functionality in small, self-contained parts.
- Determine input and output formats for each part.
- Require the minimum amount of structure in the input; you can add more.
- Bring experts together.
- Build a community of users and developers.
Alex Gil: What is collation, for you?
Ronald Dekker: finding similarities and differences between texts. There is nothing specific about texts.
Nick Laiacona: linearly sequenced …
Alex Gil: sequence might not mean anything—
Ronald Dekker: there can be transposition; I don’t mean the order is fixed; but there is an order in which you read the text
You still have words that read left to right. There are bits, objects, that you read in a certain—alignment is about order.
Nick Laiacona: where are you going with this, Alex?
Alex Gil: two matching sequences each belong to a higher-level sequence.
Ronald Dekker: I think you’re going into my next presentation. CollateX doesn’t use the diff algorithm. It tries to detect sequences.
There is some order in the material. If you have blocks of text, the words are in a certain order. If the blocks are moved around, the words are still in order, and CollateX tries to find that.
Nick Laiacona: we’ve been talking about how you choose your witnesses.
Abigail Firey: I’ve been thinking about transposition, trying to get the machines to think like humans. At a certain point, a human will recognize a paraphrase; it’s not the same text transposed. It’s an editorial judgment call. How to resolve that in terms of pattern matching? Google searches don’t care about order—they find the match. Sometimes you get bad matches because they’re not close enough to what you’ll recognize as a match with transposition. I don’t know how you set the threshold.
Gregor Middell: Google doesn’t collate; it indexes.
You can think of this as pattern matching. What you lose is the notion of editions. If you talk about editions, you do it in sequential context. If you don’t have that order and look for loosely correlated things between texts, you get rid of the problem of transposition, but you get rid of the idea of editions. There has to be a match above and below a word in order to recognize a deletion.
Alex Gil: I call those limiting matches.
My own solution to the matching problem: you can come up with delimiters.
The perfect situation for the diff algorithm is one in which there are no transpositions, just additions and deletions. From token A to Z there is a sequence. Absence and presence.
Nick Laiacona: There should also be a nice ratio between unique tokens and the total size of the text.
Alex Gil: If there were a loose network of texts: if we could identify all instances in which diff could run perfectly, we would have isolated mathematically what we mean by “blocks of texts.”
It can have meaning for humans.
Finding those things will make us create relationships.
Abigail Firey: the importance of transposition in pattern matching is language specific. A Latin transposition doesn’t make much difference; English is order-dependent.
Gregor Middell: example of omission: doesn’t pattern-match; we see the omission; the question is where it was omitted.
Dana Wheeles: everyone sees the usefulness of pattern matching; we’re talking about how to refine collation for text cases. You (Alex) are thinking about a larger question: if I could go across the world’s texts on the web . . .
Alex Gil: I am thinking of a specific corpus. I am worried about editing one book that has a series of relationships to chunks of texts of different sizes. Those sets of relationships are the only thing we know perfectly. Before the collation tool can work on my texts this needs to be in place.
Nick Laiacona: there’s a step to the left of that “source” box: a search. We take for granted that you have two witnesses that are suitable to be compared.
Alex Gil: in Juxta, all the files are flat. I know in my head the relationships. Imagine I would be able to mark all the moves—those moves still don’t mean anything in relation to the bibliographical items on the side. For example, a block moves between two big texts; that chunk is also published separately. Juxta misses that relationship. If I want to do an edition of everything that ever happened to that poem . . .
Abigail Firey: It’s a question of expectations. I expect a lot more human intervention at various stages of the process. Whenever you develop software and all the investment—is this really better than doing it the old-fashioned, human, manual way? Are we expending a lot of money and effort to do something that my brain does pretty well?
Thinking about how best to extend Juxta: there is the question of scholarly intuition where you have an idea about how a text works and it’s just not worth trying to get the machine to show everybody that. You have to make an editorial statement and say the preponderance of evidence shows it is this way.
Nick Laiacona: the search engine naturally inserts itself before the collation step.
Alex Gil: as advanced as Juxta is, I use it only to catch my own errors in transcription. In order to get that going, I have to break things by hand into blocks so that diff can work on them. It’s an enormous amount of work. The only useful result I’m getting out of Juxta is to prevent transcription errors.
Using screenshots to show comparisons: people used to do that by quoting.
Abigail Firey: I’m using it to discover differences: whoa, “homilies” / “homilies” [?] ! Different meaning! Different word!
I’m not doing the editorial work, I’m trying to figure out what the work will be.
Alex Gil: Juxta promises to be able to generate a complete apparatus.
Ronald Dekker: I understand what you’re getting at, but that’s a different problem about relationships between different texts.
Alex Gil: I thought that was what it was all about.
Ronald Dekker: up to a certain limit. A computer can do things faster; it can’t do a semantic interpretation.
Gregor Middell: I met with the International New Testament group at a boot camp in Berlin. They are using Collate (Peter Robinson’s Collate). We said: show us how you use it.
One of the scholars walked us through the collation steps: assembling apparatus by hand. Token by token. Stepping through the text.
Their project has reached a scale where they can’t handle it manually any more: they have 100-150 documents to make sense of, and they can’t.
Alex Gil: we don’t disagree. What we’re seeking is a balance between the automatic functions and manual functions. I like your pipeline. But we still need to remember the total goal: establish relationships between texts.
Technology is useful to deal with scale, eliminate gruntwork. I’m not saying the manual should not be there. But we want automated functions to do as much as possible. If it’s all left to manual it’s not going to work.
Philosophical point: we need to re-conceptualize how we think about relationships between texts. Sometimes a bibliographic unit is a very small block of text within the larger unit.
Dana Wheeles: Now would be a good point to hear more about CollateX.
Nick Laiacona: Gregor, did you want to finish a point?
Gregor Middell: I’m working in a group to write the critical apparatus chapter of the TEI Guidelines.
This is my main interest from a software development perspective. We want to encode in TEI P5.
There are two competing groups: software developers, who want to generate the apparatus automatically, and scholars who are used to assembling the apparatus manually. They have different requirements: philologists care about human readability, and the output should be manually tweakable. They must be able to fully control the output or they won’t use it.
There’s a group of users who want to use output as a means to intervene in the editorial process; software developers want to use an automatically generated apparatus to feed back into the electronic files, e.g. create back-references.
Abigail Firey: we’re circling back to my admiration for Juxta’s current “generate apparatus” function.
[break]
Ronald Dekker: to introduce myself: I am the lead developer of CollateX
Review pipeline: general model
Second presentation: CollateX implementation
CollateX 1.1
Contents: what it does; how it works
First: show CollateX to the end user
Second: behind the scenes
Intended to be a successor to the old Collate written by Peter Robinson
Use case: Collate was used to collate The Origin of Species.
Source: simple XML file
CollateX gives back an alignment table
Each column has an identifier (the year that text was published).
The last three versions have some changes in common
The last two versions have “compare” in common
Some cells in the table represent parallel texts. E.g. “when we” occurs in all texts
You can have parallel segments that occur only in some versions
CollateX doesn’t conclude whether it’s an addition or omission
There is no base text
If you want to use a reference text, you can
Darker gray areas: cells where variations occur
This result is done without any human intervention at all: straight from source file to result table.
This is not perfect, but the idea is to get results fast.
Great starting point to explore your material.
You can already see that the last two versions of The Origin of Species have more in common with each other than the previous four
You can also get the output in XML format for further edition work.
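A schematic alignment table with invented witnesses (not Darwin's text) may help picture the output: columns are witnesses, rows are aligned tokens, and an empty cell is a gap that CollateX does not label as either addition or omission, since there is no base text.

```
           W1      W2      W3
  row 1:   the     the     the     <- parallel segment in all witnesses
  row 2:   black   white   white   <- variation (darker gray cells)
  row 3:   cat     cat             <- gap: addition or omission? CollateX does not say
```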
How it works:
Why write another collation tool? What makes it different?
It does multiple-witness collation; diff usually compares only two.
Baseless collation: no assumption about reference text
Parallel segmentation: it lines up pieces of text that occur in multiple versions.
Gothenburg model (pipeline)
Client server: lives on the web.
If you want to make a desktop application with it, you can.
Collation micro-services model
Examples of preparation: tokenization, regularization. (in the example, it ignores capitalization)
Alignment: doesn’t use diff
Goes through the whole text, comparing each token to every other token
Diff tries to go for the longest coherent sequence; CollateX tries to go for the smallest number of sequences.
Stores the result in a graph.
Omissions, additions, transpositions, changes.
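A sketch of what such a graph could look like as a data structure, with illustrative names rather than CollateX internals: nodes hold token content, and each directed edge records which witnesses follow that path from a shared start node to a shared end node.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class VariantGraph {
    static class Node {
        final String text;
        // successor node -> witnesses whose text takes that path
        final Map<Node, Set<String>> edges = new LinkedHashMap<>();

        Node(String text) { this.text = text; }

        void connect(Node to, String witness) {
            edges.computeIfAbsent(to, n -> new LinkedHashSet<>()).add(witness);
        }
    }

    public static void main(String[] args) {
        // Witness A: "the black cat"; witness B: "the white cat".
        Node start = new Node("#start"), end = new Node("#end");
        Node the = new Node("the"), black = new Node("black"),
             white = new Node("white"), cat = new Node("cat");
        start.connect(the, "A");  start.connect(the, "B");  // shared segment
        the.connect(black, "A");  the.connect(white, "B");  // variation
        black.connect(cat, "A");  white.connect(cat, "B");
        cat.connect(end, "A");    cat.connect(end, "B");
    }
}
```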
Abigail Firey: how does visualization of transmission work in the grid?
Ronald Dekker: we’re working on that. It’s a hard visualization problem.
Abigail Firey: how does it work on non-standardized orthography?
Ronald Dekker: that’s in the matching part of the pipeline—it is not perfect.
You can change this.
When you have a near match, it will see it as a repeated token; it has to look at more context to figure out what is going on there.
Nick Laiacona: our hope with the Gothenburg model was that by breaking collation down into distinct steps, you could always use the best tool for each step, e.g. a good tokenizer, Juxta’s Latin noise filter, etc.
Gregor Middell: there are two places you can intervene: matching and comparison. You can get handed every possible pair of tokens and decide whether you think they’re the same. Not manually—you can tie it to a database.
Ronald Dekker: either rule based or human determination.
Nick Laiacona: so if you build your whole pipeline around one of these tools … you can use a good algorithm.
Abigail Firey: (admiring collation graph of internal CollateX logic) parsing the witnesses is the hard thing in traditional collation
Ronald Dekker: visualization step: alignment table or XML output
Abigail Firey: can you move the columns in the alignment table?
Ronald Dekker: the idea is that it gives you a pretty good result in a small amount of time …
Abigail Firey: I wondered if there was a “click the button, rearrange the columns” function
Ronald Dekker: no, we don’t have the usability of Juxta yet
Gregor Middell: (examples of use cases)
Ronald Dekker: we prioritized getting the Gothenburg model working
I am hopeful that we will integrate more tools
There is nothing in the model that prohibits you from doing what you want; it just hasn’t been built yet
Alex Gil: differences between CollateX and Juxta?
Ronald Dekker: Juxta works from a base; CollateX doesn’t.
Nick Laiacona: we were going to wrap up at 5:00; it’s 5:18 now … I don’t want to shut down that conversation …
Ronald Dekker: I’ve never really done a direct comparison; I’ve been focusing on getting this model running.
Gregor Middell: part of the problem is there is no official benchmark for collation tools to determine whether a collation is correct.
Nick Laiacona: thanks for attending. I’m looking forward to day 2.
We’ll reconvene at 9:00 tomorrow morning and start with Jim from Bamboo.