newtFire {dh}
Creative Commons License Last modified: Monday, 12-Oct-2020 13:40:25 UTC. Maintained by: Elisa E. Beshero-Bondar (eeb4 at psu.edu). Powered by firebellies.

Consult the following resources as you work with Regular Expressions:

Get the source text ready in <oXygen/>

We calculated a list of the Fall 2020 semester dates in a spreadsheet program, and copied the results into a plain-text file, which you can download from our site here: fa2020semesterDates.txt.

Prepare a Step File

The task

Your goal is to produce an XML version of the semester dates file by using the search-and-replace techniques we discussed in class, and to record each step you take in a plain text or markdown file so that others can reproduce exactly what you did. (In a real-life project, you may need to share the steps you took in up-converting plain text documents to XML on your GitHub repo, written in GitHub’s markdown (the same markdown that we write on the GitHub Issues board). In that case you would save the file with a .md extension, like this sample regex instructions file.)

Your up-converted XML output should look something like fa2020semesterDates.xml. This involves putting each class period date in its own element and reformatting it to hold the full date information in an attribute. It also involves wrapping the three class period dates for each week (M, W, and F) in an element that marks off the week.

Your Steps file needs to be detailed enough to indicate each step of your process: what regular expression patterns you attempted to find, and what expressions you used to replace them. You might record the number of finds you get and even how you fine-tuned your steps when you were not finding everything you wanted to at first. Note: we strongly recommend copying and pasting your find and replace expressions into your Steps file instead of retyping them, since it is easy to introduce errors when retyping.

How to proceed

There are several ways to get to the target output, but the starting points are standard:

Starting work:

First of all, for any up-conversion of plain text, you must check for the special reserve characters: the ampersand & and the angle brackets < and >. You need to search for those and, if they turn up, replace them with their corresponding XML entities, so that these will not interfere with well-formed XML markup.

Search for:    Replace with:
&              &amp;
<              &lt;
>              &gt;

Note that you need to process the special XML reserve characters in the correct order. Why is it important that you search and replace the & first?
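To see why the order matters, here is a minimal sketch in Python (not part of the <oXygen/> workflow; the function names and the sample string are made up for illustration). It runs the same three replacements in the recommended order and in a wrong order and compares the results:

```python
def escape_reserved(text):
    """Replace the XML reserve characters, ampersand first."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_wrong_order(text):
    """Same replacements, but with the ampersand handled last."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    # This step also hits the ampersands we just created in &lt; and &gt;
    text = text.replace("&", "&amp;")
    return text


sample = "R&D <draft>"  # hypothetical sample text
right = escape_reserved(sample)
wrong = escape_wrong_order(sample)
print(right)  # R&amp;D &lt;draft&gt;
print(wrong)  # R&amp;D &amp;lt;draft&amp;gt;
```

In the wrong order, the ampersands that belong to the freshly written entities get escaped a second time, so the angle brackets can no longer be recovered correctly.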

To perform regex searching, you need to check the box labeled Regular expression at the bottom of the <oXygen/> find-and-replace dialog box, which you open with Control-f (Windows) or Command-f (Mac). If you don’t check this box, <oXygen/> will just search for what you type literally, and it won’t recognize that some characters in regex have special meaning. You don’t have to check anything else yet. Be sure that Dot matches all is unchecked, though; we’ll explain why below.

Inside out or outside in?

We can create our markup either from the outside in (wrap the whole document in a root element, then week wrappers, then date lines) or from the inside out (first the date lines, then wrap those in a week, and lastly set the document root element). Either strategy can be made to work, but we generally find it easier to work from the inside out.

Dates (as table rows)

Our markup is eventually going to be a table, so in our planning we used <tr> tags (standing for table row) to wrap around each date. We’ll start by tagging every date as a <tr>, and we want to go from this format in the text file:

2020-08-24	M

to this format in our XML conversion:

<tr id="2020-08-24">M 08-24</tr>

The way regex really thinks of this process is, match every date line, delete it, and replace it with remixed pieces of itself wrapped in <tr> tags. That is, regex doesn’t think about leaving the date in place and inserting something before and after it; it thinks about matching each date line, deleting it, and then putting pieces of it back, with the tags that you desire. You start by writing a regex to match the date lines, but how do we remix the date information to put part of it in an attribute value, and part of it as element contents?

The answer is to set capturing groups by placing parentheses in the regular expression you use in the Find. Your Find expression needs to hold (in parentheses) the portions of the match that you want to use in your Replace. We divided the date information into three parts using capturing groups: the first set of parentheses holds just the year, the next wraps the rest of the numerical date, and the last wraps around the capital letter designating the weekday. Then in the Replace, we need to refer to the capturing groups using a special regular expression. The sequence \1 points to the first capturing group, ordered from left to right. \2 refers to the second capturing group, and \3 to the third. The expression \0 refers to the entire match regardless of the capturing groups. Try experimenting with Find and Replace using capturing groups in various ways until you set down the tagging you want. (The Undo button in oXygen is under the Edit menu, and we use it frequently when we are experimenting like this!)
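If you want to test the capturing-group idea outside of <oXygen/>, the following is a minimal sketch in Python. It is one possible pattern, not the only solution, and it assumes that each line holds a date, then a tab or spaces, then a single capital M, W, or F. The three sample lines are typed in by hand for illustration rather than read from the real file. Note that Python writes the whole match as \g<0> rather than \0.

```python
import re

# Three hand-typed sample lines in the format of the source text file
text = "2020-08-24\tM\n2020-08-26\tW\n2020-08-28\tF\n"

# Group 1: the year; group 2: the rest of the date; group 3: the weekday letter
pattern = r"^(\d{4})-(\d{2}-\d{2})[ \t]+([MWF])$"

# Rebuild each line from its captured pieces, wrapped in <tr> tags
replacement = r'<tr id="\1-\2">\3 \2</tr>'

# re.subn returns the new text and the number of replacements made,
# which is handy for recording the number of finds in a Steps file
result, count = re.subn(pattern, replacement, text, flags=re.MULTILINE)

print(result)
print(count, "lines replaced")
```

The re.MULTILINE flag makes ^ and $ match at the start and end of every line, so each date line is matched and rebuilt separately.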

Finding and wrapping the weeks

Finding the weeks means understanding how each week begins (with an M for Monday). If we can locate every Monday in the file, we have found a pattern that we can use to wrap each week. What is tricky for Find and Replace of the weeks is that you want to create an element structure to wrap around each week, like this:

           <table class="week">
               <tr id="2020-08-24">M 08-24</tr>
               <tr id="2020-08-26">W 08-26</tr>
               <tr id="2020-08-28">F 08-28</tr>
           </table>   
       

We recommend not working too hard to accomplish this. You do not need to search for the three lines that constitute a full week to wrap a table element around them. Instead, we recommend a close-open strategy: once you have found all of the Mondays, set down a closing tag for each table element before the opening tag: </table>\n<table class="week">. Try this and see what it does. You will have an extra </table> close tag at the top of the document, and you will just need to manually move it to the end to complete a series of table elements wrapping each week.
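The close-open strategy can be sketched in Python as follows. This is an illustration under assumptions: the date lines have already been tagged as <tr> elements in the format shown above, the sample rows are typed in by hand, and the pattern used to find the Mondays is just one possibility.

```python
import re

# Two hand-typed sample weeks of already-tagged date lines
rows = (
    '<tr id="2020-08-24">M 08-24</tr>\n'
    '<tr id="2020-08-26">W 08-26</tr>\n'
    '<tr id="2020-08-28">F 08-28</tr>\n'
    '<tr id="2020-08-31">M 08-31</tr>\n'
    '<tr id="2020-09-02">W 09-02</tr>\n'
    '<tr id="2020-09-04">F 09-04</tr>\n'
)

# Find the start of every Monday row and put a close tag, then an open tag,
# in front of it (\1 puts the matched text back)
monday = r'^(<tr id="[^"]+">M )'
wrapped = re.sub(
    monday,
    r'</table>\n<table class="week">\n\1',
    rows,
    flags=re.MULTILINE,
)

# The first </table> has nothing to close, so move it to the end
stray = "</table>\n"
if wrapped.startswith(stray):
    fixed = wrapped[len(stray):] + stray
else:
    fixed = wrapped

print(fixed)
```

The last four lines of code do by program what you would do by hand in <oXygen/>: cut the stray close tag from the top and paste it at the bottom.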

Cleaning up and checking your results

Save your text file now as an XML file by saving it with a .xml extension. You will need to close and reopen the document so that oXygen actually recognizes and reads the file as an XML document and can tell you whether it is well-formed. It probably is not well-formed, because you still need to wrap the document in a root element. Do that and inspect the document for well-formedness. To check for well-formedness in the XML file, you can use Control+Shift+W on Windows, Command+Shift+W on Mac, or click on the arrow next to the red check mark in the icon bar at the top and choose Check well-formedness. If you see regular patterns of something that you can fix with regular expressions, use them and document your steps.
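To illustrate why the root element matters, here is a small Python sketch that uses the standard library parser as a stand-in for the <oXygen/> well-formedness check. The helper function, the shortened sample fragment, and the root element name <root> are all assumptions made for the illustration; use whatever root element your target output calls for.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two week tables side by side, with no element wrapping them both
fragment = (
    '<table class="week"><tr id="2020-08-24">M 08-24</tr></table>\n'
    '<table class="week"><tr id="2020-08-31">M 08-31</tr></table>'
)

without_root = is_well_formed(fragment)
with_root = is_well_formed("<root>" + fragment + "</root>")

print(without_root)  # False: more than one top-level element
print(with_root)     # True
```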

General

As we mention above, there are several ways to get to the target output, and whatever works is legitimate, as long as you make meaningful use of computational tools, including regular expressions (where appropriate), and don’t just tag everything manually. As you saw in class, there are ways to build your own regular expressions to match whatever patterns you need to identify, and the regex language is complex and often difficult to read. The way we would approach this task is by figuring out what we need to match and then looking up how to match it. In addition to the mini-tutorial above, there is more comprehensive tutorial information at http://www.regular-expressions.info/tutorialcnt.html. If you decide to look around for alternative reference sites and find something that seems especially useful, please post the URL on the discussion boards, so that your classmates can also consult it.

What to submit

On Canvas, upload two things (or a zip directory containing the following):

  1. a step file in markdown or txt (a step-by-step description of what you did), and
  2. your results file (the XML document)

If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As you are working on this, post any questions on our class GitHub Issues board!