Schematron Exercise: Digital Mitford Project

Preliminaries

To work on this assignment, you will need to find and do the following:

Information resources at the ready: Review our Schematron tutorial, and read more about the XPath functions and syntax we describe below either on the web (see w3Schools’ “XSLT, XPath, and XQuery Functions”, Obdurodon’s “The XPath Functions We Use the Most”) or through offline searching with the index of the Michael Kay book.
XML file that needs better schema: Right-click to save this TEI file locally and open it in <oXygen/>: Sample for Digital Mitford Site Index. You will need to associate your Schematron file with this document in addition to the currently associated TEI schema lines.
Open a new Schematron document in <oXygen/> by going to File → New and typing “Schematron” in the “Type filter text” box, or by going to File → New → New Document → (scroll to Schematron in the alphabetized list) → Schematron. Once opened, you will keep the default xml line at the top, but you will delete everything from <sch:schema> down. To write Schematron rules for a document in the TEI namespace, you will then replace this with:
```
<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns="http://purl.oclc.org/dsdl/schematron">
    <!-- Binds the tei: prefix to the TEI namespace for use in your XPath -->
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>

</schema>
```
Write your Schematron patterns inside the <schema> root element.
Important: You must use the tei: prefix before each of your elements since we are working with a document in the TEI namespace; otherwise none of your schema rules involving elements will fire! However, we do not use that prefix before attributes because the attributes are in no namespace.
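As a minimal sketch of where rules sit and how the prefix is used, the skeleton above might be filled in along the following lines. The pattern id person-basics and the rule itself are illustrative placeholders, not rules from this assignment. Notice that the element name carries the tei: prefix, while @xml:id uses only the built-in xml: prefix.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
    <!-- Hypothetical pattern id, used only to label this group of rules -->
    <pattern id="person-basics">
        <!-- Element names take the tei: prefix -->
        <rule context="tei:person">
            <!-- Attribute names do not take the tei: prefix -->
            <assert test="@xml:id">Each person entry should carry an @xml:id.</assert>
        </rule>
    </pattern>
</schema>
```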

The goal:

The Digital Mitford project is working on a collection of prosopography data: a record of people, places, organizations, published works, and other named entities relevant to British author Mary Russell Mitford’s world in the nineteenth century. After some years of collaborative research, the collection of data (which we call our Site Index) contains thousands of entries, and it keeps growing as members of our project team contribute batches of new entries in the course of their research. It’s common for our editors to make typographical errors as they enter details about historical people in particular, since these entries can be especially complicated! That is why we need to write some Schematron rules to help people find and correct the common errors they will likely make as they are coding.

As you work on the rules below, think about how to group them logically into related pattern elements. You can use an @id on pattern elements to help label them and organize your work. Also, be sure to associate your Schematron file with the XML file you are testing as soon as you write your first rule so you can see if it is responding as you expect.
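For reference, a Schematron association is usually recorded as an xml-model processing instruction near the top of the XML file. The line below is a sketch that assumes your schema is saved as mitford-rules.sch (a hypothetical filename) in the same folder as the TEI file. oXygen’s Associate Schema option typically writes a line like this for you.

```xml
<!-- Placed in the prolog of the TEI file, after the existing TEI xml-model lines.
     "mitford-rules.sch" is a hypothetical filename standing in for your own schema. -->
<?xml-model href="mitford-rules.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
```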

Survey the code

Skim through the Digital Mitford project XML you downloaded, and get a sense of how it is organized and the way we have nested information about individuals inside each person element. Notice:

Each tei:person has an @xml:id whose value is a distinct identity marker.
Inside the tei:person elements there are tei:persName elements, some of which contain nested tei:surname and tei:forename elements.
There are also elements for tei:birth and tei:death with attributes and contents telling us about when and where a person was born and died.
Finally, most tei:person elements contain a biographical tei:note element with more information. These notes sometimes include references (made with @ref attributes) to people, places, books, and more listed elsewhere in the site index.

Schematron rules to write and test

On the tei:person element, we want to check the way its @xml:id is written. In our project when a historical person is given a unique identifier, that @xml:id value is supposed to begin with the most distinctive part of the person’s name, their last name. Since we code the tei:surname element as a descendant of tei:person, you may write a Schematron rule that tests whether the @xml:id starts with the contents of the TEI's surname element. Hint: To specify an XML node (an element or attribute) as an argument in an XPath function, simply give the element name (without quotation marks) instead of a specific string.
Sometimes our editors don’t capitalize proper names! Check that all the tei:forename, tei:surname, and tei:placeName elements, as well as any tei:persName elements that hold text and do not wrap around forename and surname elements, start with capital letters. Hints:
- You can do that with one rule by setting multiple contexts with the union operator, or pipe (|). You last used the pipe when writing Relax NG; in Schematron (and XSLT) contexts, it joins multiple context items together in one rule.
Now let’s take a look at the dates in this file, which are coded in the tei:birth and tei:death elements. All death dates need to be later than birth dates, but surprisingly, the TEI does not have a built-in way of checking this. Write a Schematron rule to flag when the dates coded in the @when attributes on any tei:birth and tei:death elements don’t make sense. For the purposes of this homework, it is fine to concentrate only on the @when attributes coded on tei:birth and tei:death (you can ignore other attributes containing dates).
- How to test for this: Some dates are given as full ISO dates (yyyy-mm-dd), while others are only partial. The partial ones, alas, will NOT convert to a machine-readable date with xs:date(), so we do not want to use that function here. Instead, we recommend that you work with the tokenize() function to isolate the piece we really need to look at: the four-digit year that sits in front of the first hyphen. To capture this piece reliably, write tokenize() so that it breaks the attribute value into pieces at the hyphens, and add a position predicate to grab the first token. (Note: tokenize() is a wonderfully adaptable function! Even if the date value lacks hyphens and contains only a year, it will still return that year as a single, unbroken token.)
- Remember, you are testing to see when a birth year is later than a death year, so you need to write a test that uses comparison operators.
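
The three checks described above can be sketched as one small schema with a pattern for each. This is only one possible approach, not an official solution. The pattern ids and message wording are made up for illustration. The sketch also assumes that tei:birth and tei:death are direct children of tei:person and that the first tei:surname inside a person is the one the @xml:id is built from, so adjust the XPath if the sample file is organized differently.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>

    <!-- Hypothetical id: does the @xml:id begin with the surname? -->
    <pattern id="person-ids">
        <rule context="tei:person[descendant::tei:surname]">
            <!-- (…)[1] keeps a single surname so starts-with() receives one item;
                 treating the first surname as the relevant one is an assumption -->
            <assert test="starts-with(@xml:id, normalize-space((descendant::tei:surname)[1]))">
                The @xml:id should begin with the person's surname.
            </assert>
        </rule>
    </pattern>

    <!-- Hypothetical id: do names start with a capital letter? -->
    <pattern id="name-capitals">
        <!-- The pipe joins several contexts into one rule -->
        <rule context="tei:forename | tei:surname | tei:placeName
                       | tei:persName[not(tei:forename or tei:surname)]">
            <!-- \p{Lu} matches any uppercase letter, accented ones included -->
            <assert test="matches(normalize-space(.), '^\p{Lu}')">
                Names should begin with a capital letter.
            </assert>
        </rule>
    </pattern>

    <!-- Hypothetical id: is the birth year later than the death year? -->
    <pattern id="life-dates">
        <rule context="tei:person[tei:birth/@when and tei:death/@when]">
            <!-- tokenize(…, '-')[1] isolates the year in front of the first hyphen;
                 number() turns it into a value that gt can compare -->
            <report test="number(tokenize((tei:birth/@when)[1], '-')[1])
                          gt number(tokenize((tei:death/@when)[1], '-')[1])">
                The birth year is later than the death year.
            </report>
        </rule>
    </pattern>
</schema>
```

The first two patterns use assert, which fires when the test is false. The third uses report, which fires when the test is true, because here the test describes the error itself. Keeping the two tei:person rules in separate patterns matters: within a single pattern, a node is checked only against the first rule whose context it matches.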

Submission

Upload your completed Schematron schema AND the si-Add-MRMsample.xml file with your Schematron associated to Courseweb, and follow our standard filenaming conventions for homework assignments uploaded to Courseweb.