newtFire {dh|ds}
Maintained by: Elisa E. Beshero-Bondar (ebb8 at pitt.edu). Creative Commons License. Last modified: Sunday, 06-Aug-2017 18:16:15 EDT. Powered by firebellies.

Meet Schematroll, the Schematron mascot! Schematroll is a cross between a bilby and a bettong.

Preliminaries

To work on this assignment, you will need to find and do the following:

Analysis of the task

The goal:

The Digital Mitford project is working on a collection of prosopography data, that is, a record of people, places, organizations, published works, and other named entities relevant to British author Mary Russell Mitford’s world in the nineteenth century. After some years of collaborative research the collection (which we call our Site Index) contains thousands of entries, all contributed in batches by members of the editing team in the course of their research. It’s common for our editors to make typographical errors as they enter details about historical people in particular, since these entries can be especially complicated! Your task is to write a helpful Schematron file to guide the editors in their process, flag errors if they reverse date ranges like birth and death dates, check for white space errors and other common problems, and check to see that the referencing of @xml:id attributes is correct. We hope that learning these things will give you ideas for writing Schematron to guide your own projects.

As you work on the rules below, think about how to group them logically into related pattern elements. You can use an @id on pattern elements to help label them and organize your work. Also, be sure to associate your Schematron file with the XML file you are testing as soon as you write your first rule so you can test it to make sure it is working.
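As a starting point, here is a minimal sketch of how a Schematron file might be organized into labeled patterns and associated with the XML file. Several things here are assumptions rather than project requirements: the filename myRules.sch, the pattern ids, the tei:TEI root element used for the smoke test, and the binding of the tei: prefix to the TEI namespace (matching the tei: prefix used in the rule template in this assignment).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- To associate this schema, a line like the following (hypothetical filename)
     goes near the top of the XML file being tested, after its XML declaration:
     <?xml-model href="myRules.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
-->
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <!-- Bind the tei: prefix so that rule contexts can match TEI elements -->
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>

    <!-- A throwaway pattern to confirm the association works:
         it should report once if the root element is tei:TEI (an assumption) -->
    <pattern id="smoke-test">
        <rule context="tei:TEI">
            <report test="true()" role="info">Schematron is associated and running.</report>
        </rule>
    </pattern>

    <!-- One pattern per group of related rules; these ids are hypothetical labels -->
    <pattern id="person-attributes">
        <!-- rules about the person element go here -->
    </pattern>

    <pattern id="dates">
        <!-- rules about date ranges go here -->
    </pattern>
</schema>
```

Once the smoke-test message appears in your validation results, you know the association is working and can delete that pattern and begin filling in your own rules.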

Rules to write and test

  1. We want to close up extra white spaces that our editors inevitably type at the start of their elements. Write a Schematron rule that checks for leading white space inside any element. (That is, raise a warning when an element starts with a white space.) Hint: you may want to look at the starts-with() function, one of the family related to contains().
  2. We are keeping track of the sex of each person in the historical persons lists so that we can study the proportion of males to females in Mitford’s social network. To mark this, we are using the @sex attribute on the person element. There are four standard ISO codes for sex that we have decided to use: 1 (for males), 2 (for females), 0 (for unknown), and 9 (for not applicable). Though these standard code numbers have understandably provoked controversy, our team decided that the simple set of numbers was sufficient for the nineteenth-century context of our project. But our editors sometimes forget to code or miscode the @sex attribute. They might also forget to apply the standard @xml:id to that element. To check for these problems, write and test two Schematron rules:
    • A rule that checks to make sure the @sex and @xml:id attributes are present on the person element, and
    • A rule that establishes the permitted values of the @sex attribute. Note: the format for coding a series of permitted attribute values is very similar to the way we do it in Relax-NG, with a comma-separated list that we wrap in parentheses, thus:
      <rule context="tei:element">
          <assert test="@attName = ('val1', 'val2', 'val3')">Your message here</assert>
      </rule>
                              
  3. While you’re working on Schematron tests for the person element, we want to check the way its @xml:id is written. In our project when a historical person is given a unique identifier, that @xml:id value is supposed to begin with the most distinctive part of the person’s name, their last name. Since we code the surname element as a descendant of person, you may write a Schematron rule that tests whether the @xml:id starts with the contents of the surname element. There is one complication mainly for women in the list: they frequently have two different kinds of surname, a <surname type="paternal"> and a <surname type="married">, and the @xml:id could start with either one (depending on how Mitford knew the person). Write a rule to test if the @xml:id on each person element starts with either of these types of surname, and check to see if your rule is working.
  4. Sometimes our editors don’t capitalize proper names! Check that all TEI forename, surname, and placeName elements, as well as any persName elements that hold text and do not wrap around forename and surname elements, start with capital letters (you can do that with one rule).
  5. Now let’s work on the dates. Notice how we have paired @from and @to attributes to indicate a date range. Write a Schematron rule to check that the dates in a pair of @from and @to attributes are plausible: no @from should be later than its @to. (Note: Applying xs:date() won’t work here because these are entered simply as four-digit years and won’t be recognized as ISO dates.)
  6. Similarly, all death dates need to be later than birth dates. Write a Schematron rule to flag when these dates don’t make sense. Note: this is a little more complicated than the previous rule you wrote because some dates are given as full ISO dates (yyyy-mm-dd) and others are only partial and won’t translate with xs:date(). We dealt with this by using the tokenize() function to isolate the year as the piece that we really need to look at, that is, the four-digit year that sits in front of the first hyphen. To reliably capture this piece, write the tokenize() function to break the attribute values in pieces around hyphens (tokenize on the hyphen) and write a position predicate to grab the first of the tokens. (Note: Even if the date value lacks any hyphens and only contains a year, this will still return that year since the token just won’t break off!)
  7. For our last two rules, you will need to consult our guide on Coding with Unique Identifiers and Testing Them with Schematron. First of all, it is very important for our site index file that @xml:ids not contain white spaces and not begin with a leading hashtag (#), since (as our guide explains) the hashtag is reserved for @ref attributes that point to @xml:ids. Write Schematron rule(s) to test and flag errors here.
  8. Finally, carefully following our guide on testing unique identifiers, test to see whether the @ref and @resp attribute values, following their hashtags, actually match up to a defined @xml:id in this file or in the Digital Mitford Site Index at http://digitalmitford.org/si.xml. (Note that this rule will also ensure that these values actually begin with a hashtag!) Following our guide, you will learn how to write a let statement to define a variable that points to another file’s @xml:ids, and then refer to that variable in your Schematron test. Also, it is perfectly legal in our project for there to be multiple values on an @ref or @resp, separated by white space, just as you see in our guide, so you should follow our lead to adapt our code there.
  9. Bonus Challenge: We need a more sophisticated way than we used in number 4 to check the way people type out full names in the persName elements. Can we test for errors like these?
    Dorothy wordsworth
    or
    Percy bysshe Shelley
    Of course we can, by adapting the tokenize() we have been using here to break on white space, and to test each token in turn to see if it is capitalized. You can do this by applying the for $i in (sequence) return … (or for-loop XPath feature) so we can walk through each token in the full sequence. To see how to write the code, consult our guide on testing unique identifiers: Look at our let statement, defining a variable containing a sequence of tokens, and then consider how we processed each one in turn in our assert @test. Can you adapt that code to tokenize the parts of a name, and test to see if each part is capitalized? Write your Schematron rule!
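The two uses of tokenize() described in rules 6 and 9 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's own code: it assumes that birth and death dates sit in @when attributes on tei:birth and tei:death (one of each per person), that full names to be checked sit in persName elements with no child elements, and that the tei: prefix is bound to the TEI namespace. The pattern ids and variable names are hypothetical. Check each assumption against the actual file before adapting any of it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>

    <!-- ASSUMPTION: dates are in @when on tei:birth and tei:death, one of each per person -->
    <pattern id="year-sketch">
        <rule context="tei:person[tei:birth/@when][tei:death/@when]">
            <!-- tokenize() splits on hyphens; [1] keeps the year,
                 whether the value is yyyy-mm-dd or only yyyy -->
            <let name="birthYear" value="number(tokenize(tei:birth/@when, '-')[1])"/>
            <let name="deathYear" value="number(tokenize(tei:death/@when, '-')[1])"/>
            <assert test="$deathYear ge $birthYear">
                Check these dates: the death year (<value-of select="$deathYear"/>)
                should not be earlier than the birth year (<value-of select="$birthYear"/>).
            </assert>
        </rule>
    </pattern>

    <!-- ASSUMPTION: the names to check are persName elements holding only text -->
    <pattern id="token-loop-sketch">
        <rule context="tei:persName[not(*)]">
            <!-- normalize-space() collapses runs of white space so each token is one word -->
            <let name="tokens" value="tokenize(normalize-space(.), ' ')"/>
            <!-- The for-loop returns one true/false per token;
                 the assert fails if any of them is false -->
            <assert test="not((for $i in $tokens return matches($i, '^\p{Lu}')) = false())">
                Every part of this name should start with a capital letter.
            </assert>
        </rule>
    </pattern>
</schema>
```

Note that the year comparison uses ge rather than gt, since a person may be born and die in the same year, and that the capitalization test will also flag legitimate lowercase particles such as "de" or "von", so you may want to raise it as a warning rather than an error.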

Submission

Upload to Courseweb both your completed Schematron schema AND the si-Add-MRMsample.xml file with your Schematron associated, and follow our standard filenaming conventions for homework assignments uploaded to Courseweb.