Merge pull request #29 from jpallen22/main
Update 2024-01-04-Introducing-the-ACDC-Project--Part-I--Tr.md
jpallen22 authored Jan 5, 2024
2 parents 61b683f + 3d8efa7 commit 2ab335b
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions _posts/2024-01-04-Introducing-the-ACDC-Project--Part-I--Tr.md
@@ -17,6 +17,11 @@ One of OpenITI's major deliverables in our most recent round of work is the [Aut

Our team member David Smith (Northeastern University), the main driver behind the ACDC project, has just released a video introduction and tutorial to the project and the ensuing tool, wherein he explains the logic of the project and gives a detailed walk-through of use: [Automatic Collation for Diversifying Corpora (ACDC) Tutorial](https://www.youtube.com/watch?v=kNx4GyH5HSo).

[![]({{ "/images/blogs/2024-01-04/Introducing-the-ACDC-Project--Part-I--TrJonathan Parkes Allen/media/image4.jpeg" | absolute_url }})]({{ "/images/blogs/2024-01-04/Introducing-the-ACDC-Project--Part-I--TrJonathan Parkes Allen/media/image4.jpeg" | absolute_url }})

Gulistān, Library of Congress PK6450 .G2 1593
{: .figcaption }

My primary contribution to this project was in leading the production of training data: while the goal of ACDC is to limit the amount of necessary training data for the production of valuable HTR through its text alignment method, in order to construct the tool itself some training data was necessary, albeit not on the same scale as our previous work on optical character recognition (thank goodness, as manuscript transcription is a whole other order of difficulty!). Our goal in compiling a training data set was to reflect the internal diversity of the Islamicate manuscript tradition as manifest in particular texts common across the corpus, existing in many copies, and hence especially useful for training the alignment method. We wanted to have script diversity, obviously, meaning texts that could be found from the Maghrib to southeast Asia, but we also wanted a good representation of layout diversity, from multi-column poetry to extensive marginal annotations, many lines per page to a few widely spaced ones, and so forth. This meant identifying and then collecting many examples of texts that obtained a 'canonical' status in medieval and early modern Islam (primarily, though not all of these texts were of an exclusively Islamic nature). What do we mean by 'canonical' in these cases? Perhaps it would be better to demonstrate via an exploration of the five-text corpus we chose to structure our training data set, as each text became canonical for different reasons, with the manuscripts of these texts displaying the particular forms of use that canonicity involved, and through which it was generated.

Perhaps the most visually striking text in terms of layout in our corpus comes from copies of Sa'd al-Dīn al-Taftāzānī's *Sharḥ al-ʻAqāʼid al-Nasafīya*. One of the most important introductory theological texts, if not the most important, of the late medieval and early modern Islamicate world, al-Taftāzānī's commentary on the short *'aqā'id* ('creed' as it is often translated, albeit not precisely accurately) text of the medieval theologian Abū Ḥafṣ \'Umar al-Nasafī elaborated further on the principles of Islamic philosophical theology, with many later authors adding their own super-commentaries to al-Taftāzānī's initial *sharḥ*. We chose this text not just because it exists in many, many copies, having become a mainstay of madrasa 'curriculum' across the Islamicate lands (but especially in the Ottoman world), but also because like many such texts employed in a madrasa context, it is often very complex layout-wise, composed with wide interlinear space and ample margins, both elements designed for additional annotation by students or for the addition of marginal commentary. And like other typical madrasa texts, the overwhelming majority of manuscript copies are decidedly non-prestige, featuring no illumination or decoration, employing scribal hands of a very 'workaday' mien, and hence encompassing a great deal of diversity given the chronological and geographic reach of this text.
@@ -42,10 +47,7 @@ University of Michigan, Islamic Ms. 221

Alongside the previous three Arabic texts we selected two Persian texts (for this project we limited ourselves to those two languages due to the sheer number of manuscripts in both, followed by Ottoman Turkish and, in time, Urdu), one of poetry, the other primarily of prose, both circulating not just in the historically Iranian lands but far beyond in the global 'Persianate.' Especially redolent of the far reach of Persian and its status as a language of cultivation is the famous *Gulistān* of Sa'dī, a broadly moralistic work of stories and lessons, written in what we might describe as poetic prose. Its canonical status was generated not just by appreciation for its contents and their moral instructional value but even more by the utility of this text for learning Persian. As such many surviving manuscripts have a similar layout to copies of al-Taftāzānī's *sharḥ*, reflecting, if not a madrasa context, something very similar in Persian language learning environments (which might have taken place in a sufi *tekke* or in the private home of a scholar, less often in madrasas themselves). Interlinears in Ottoman Turkish are especially common, as this was the primary text for the learning of Persian among Turkish speakers in the early modern world.

[![]({{ "/images/blogs/2024-01-04/Introducing-the-ACDC-Project--Part-I--TrJonathan Parkes Allen/media/image4.jpeg" | absolute_url }})]({{ "/images/blogs/2024-01-04/Introducing-the-ACDC-Project--Part-I--TrJonathan Parkes Allen/media/image4.jpeg" | absolute_url }})

Library of Congress PK6450 .G2 1593
{: .figcaption }

Unlike al-Taftāzānī's *sharḥ*, prestige copies of the *Gulistān* also exist, as its canonicity, while underlined by its pedagogical role, was more broadly based. It was a text that graced the libraries of the elite, and thus attracted extensive illumination and illustration programs, examples of which we included in our data set.
