Page Title: Restoring ancient text using deep learning: a case study on Greek epigraphy

  • This webpage makes use of the TITLE meta tag - this is good for search engine optimization.

Page Description: Ancient History relies on disciplines such as Epigraphy, the study of ancient inscribed texts, for evidence of the recorded past. However, these texts, “inscriptions”, are often damaged over the centuries, and illegible parts of the text must be restored by specialists, known as epigraphists. This work presents PYTHIA, the first ancient text restoration model that recovers missing characters from a damaged text input using deep neural networks. Its architecture is carefully designed to handle long-term context information, and deal efficiently with missing or corrupted character and word representations. To train it, we wrote a non-trivial pipeline to convert PHI, the largest digital corpus of ancient Greek inscriptions, to machine actionable text, which we call PHI-ML. On PHI-ML, PYTHIA’s predictions achieve a 30.1% character error rate, compared to the 57.3% of human epigraphists. Moreover, in 73.5% of cases the ground-truth sequence was among the Top-20 hypotheses of PYTHIA, which effectively demonstrates the impact of this assistive method on the field of digital epigraphy, and sets the state-of-the-art in ancient text restoration.

  • This webpage makes use of the DESCRIPTION meta tag - this is good for search engine optimization.

Page Keywords:

  • This webpage DOES NOT make use of the KEYWORDS meta tag. Search engines nowadays do not put much emphasis on this meta tag, but including it in your website does no harm.

Page Text:

Abstract

Ancient History relies on disciplines such as Epigraphy, the study of ancient inscribed texts, for evidence of the recorded past. However, these texts, “inscriptions”, are often damaged over the centuries, and illegible parts of the text must be restored by specialists, known as epigraphists. This work presents PYTHIA, the first ancient text restoration model that recovers missing characters from a damaged text input using deep neural networks. Its architecture is carefully designed to handle long-term context information, and deal efficiently with missing or corrupted character and word representations. To train it, we wrote a non-trivial pipeline to convert PHI, the largest digital corpus of ancient Greek inscriptions, to machine actionable text, which we call PHI-ML. On PHI-ML, PYTHIA’s predictions achieve a 30.1% character error rate, compared to the 57.3% of human epigraphists. Moreover, in 73.5% of cases the ground-truth sequence was among the Top-20 hypotheses of PYTHIA, which effectively demonstrates the impact of this assistive method on the field of digital epigraphy, and sets the state-of-the-art in ancient text restoration.

Authors' notes

Historians rely on different sources to reconstruct the thought, society and history of past civilisations. Many of these sources are text-based – whether written on scrolls or carved into stone, the preserved records of the past help shed light on ancient societies. However, these records of our ancient cultural heritage are often incomplete, due to deliberate destruction or to erosion and fragmentation over time. This is the case for inscriptions: texts written on a durable surface (such as stone, ceramic, metal) by individuals, groups and institutions of the past, and which are the focus of the discipline called epigraphy.

Thousands of inscriptions have survived to our day; but the majority have suffered damage over the centuries, and parts of the text are illegible or lost (Figure 1). The reconstruction ("restoration") of these documents is complex and time consuming, but necessary for a deeper understanding of civilisations past.

One of the issues with discerning meaning from incomplete fragments of text is that there are often multiple possible solutions. In many word games and puzzles, players guess letters to complete a word or phrase – the more letters that are specified, the more constrained the possible solutions become. But unlike these games, where players have to guess a phrase in isolation, historians restoring a text can estimate the likelihood of different possible solutions based on other context clues in the inscription – such as grammatical and linguistic considerations, layout and shape, textual parallels, and historical context. Now, by using machine learning trained on ancient texts, we’ve built a system that can furnish a more complete and systematically ranked list of possible solutions, which we hope will augment historians’ understanding of a text.

Figure 1: Damaged inscription: a decree of the Athenian Assembly relating to the management of the Acropolis (dating 485/4 BCE). IG I3 4B. (CC BY-SA 3.0, WikiMedia)

Pythia

Pythia – which takes its name from the woman who delivered the god Apollo's oracular responses at the Greek sanctuary of Delphi – is the first ancient text restoration model that recovers missing characters from a damaged text input using deep neural networks. Bringing together the disciplines of ancient history and deep learning, the present work offers a fully automated aid to the text restoration task, providing ancient historians with multiple textual restorations, as well as the confidence level for each hypothesis.

Pythia takes a sequence of damaged text as input, and is trained to predict character sequences comprising hypothesised restorations of ancient Greek inscriptions (texts written in the Greek alphabet dating between the seventh century BCE and the fifth century CE). The architecture works at both the character- and word-level, thereby effectively handling long-term context information, and dealing efficiently with incomplete word representations (Figure 2). This makes it applicable to all disciplines dealing with ancient texts (philology, papyrology, codicology) and to any language (ancient or modern).

Figure 2: Pythia processing the phrase μηδέν ἄγαν (Mēdèn ágan) "nothing in excess," a fabled maxim inscribed on Apollo’s temple in Delphi. The letters "γα" are the characters to be predicted, and are annotated with ‘?’. Since ἄ??ν is not a complete word, its embedding is treated as unknown (‘unk’). The decoder correctly outputs "γα".

Experimental evaluation

To train Pythia, we wrote a non-trivial pipeline to convert the largest digital corpus of ancient Greek inscriptions (PHI Greek Inscriptions) to machine actionable text, which we call PHI-ML. As shown in Table 1, Pythia’s predictions on PHI-ML achieve a 30.1% character error rate, compared to the 57.3% of evaluated human ancient historians (specifically, these were PhD students from Oxford). Moreover, in 73.5% of cases the ground-truth sequence was among the Top-20 hypotheses of Pythia, which effectively demonstrates the impact of this assistive method on the field of digital epigraphy, and sets the state-of-the-art in ancient text restoration.

Table 1: Pythia's predictive performance on PHI-ML.

The importance of context

To evaluate Pythia’s receptiveness to context information and visualise the attention weights at each decoding step, we experimented with the modified lines of an inscription from the city of Pergamon (in modern-day Turkey)*.
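
The two figures quoted in the experimental evaluation, character error rate and Top-20 coverage, can be made concrete with a short sketch. The text does not spell out how either number is computed, so the version below rests on assumptions: the character error rate is taken to be the total Levenshtein edit distance between predictions and ground truth divided by the total number of ground-truth characters, and Top-k coverage is taken to be the share of cases whose ground truth appears among the first k hypotheses. The function names and the transliterated sample strings are hypothetical and are not taken from the Pythia code base.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(predictions: Sequence[str],
                         targets: Sequence[str]) -> float:
    """Assumed definition: total edit distance / total target characters."""
    total_chars = sum(len(t) for t in targets)
    if total_chars == 0:
        raise ValueError("targets must contain at least one character")
    total_errors = sum(edit_distance(p, t)
                       for p, t in zip(predictions, targets))
    return total_errors / total_chars


def top_k_accuracy(hypotheses: Sequence[List[str]],
                   targets: Sequence[str],
                   k: int = 20) -> float:
    """Share of cases whose ground truth is among the first k hypotheses."""
    if not targets:
        raise ValueError("targets must not be empty")
    hits = sum(1 for hyps, t in zip(hypotheses, targets) if t in hyps[:k])
    return hits / len(targets)


if __name__ == "__main__":
    # Hypothetical, transliterated toy data (not drawn from PHI-ML).
    targets = ["ga", "apollodor"]
    best_guess = ["ga", "artemidor"]
    ranked = [["ga", "ge"], ["artemidor", "apollonio"]]
    print(character_error_rate(best_guess, targets))
    print(top_k_accuracy(ranked, targets, k=20))
```

On this toy data the first gap is restored exactly and the second differs from its target in five characters, so the sketch reports a character error rate of 5/11 and a Top-20 coverage of 0.5. When comparing real polytonic Greek, both strings would first need the same Unicode normalisation, otherwise visually identical letters can count as errors.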

In the text of Figure 3, the last word is a Greek personal name ending in -ου. We set ἀπολλοδώρου ("Apollodorou") as the personal name, and hid its first 9 characters. This name was specifically chosen because it already appeared within the input text. Pythia attended to the contextually-relevant parts of the text - specifically, ἀπολλοδώρου. The sequence ἀπολλοδώρ was predicted correctly. As a litmus test, we substituted ἀπολλοδώρου in the input text with another personal name of the same length: ἀρτεμιδώρου ("Artemidorou"). The predicted sequence changed accordingly to ἀρτεμιδώρ, thereby illustrating the importance of context in the prediction process.

Figure 3: Visualisation of the attention weights for the decoding of the first 4 missing characters. To aid visualisation, the weights within the area of the characters to be predicted (‘?’) are in green, and in blue for the rest of the text; the magnitude of the weights is represented by the colour intensity. The ground-truth text ἀπολλοδώρ appears in the input text, and Pythia attends to the relevant parts of the sequence.

Future research

The combination of machine learning and epigraphy has the potential to meaningfully impact the study of inscribed texts, and widen the scope of the historian’s work. For this reason, we have open-sourced an online Python notebook, Pythia, and PHI-ML’s processing pipeline at https://github.com/sommerschield/ancient-text-restoration, collaborating with scholars at the University of Oxford. By so doing, we hope to aid future research and inspire further interdisciplinary work.

*Specifically, lines b.8-c.5 of the inscription MDAI(A) 32 (1907) 428, 275.

Authors

Yannis Assael, Thea Sommerschield*, Jonathan Prag*

Venue

  • This webpage has 1099 words which is between the recommended minimum of 250 words and the recommended maximum of 2500 words - GOOD WORK.

Header tags:

  • It appears that you are using header tags - this is a GOOD thing!

Spelling errors:

  • This webpage has 1 word which may be misspelt.

Possibly mis-spelt word: epigraphists

Suggestion: telegraphists
Suggestion: calligraphists
Suggestion: epigraphs
Suggestion: telegraphist

Broken links:

  • This webpage has 1 broken link.

Broken image links:

  • This webpage has no broken image links that we can detect - GOOD WORK.

CSS over tables for layout?:

  • It appears that this page uses DIVs for layout - this is a GOOD thing!

Last modified date:

  • We were unable to detect what date this page was last modified

Images that are being re-sized:

  • This webpage has no images that are being re-sized by the browser - GOOD WORK.

Images missing width and height:

  • This webpage has no images that are missing their width and height - GOOD WORK.

Mobile friendly:

  • After testing this webpage it appears NOT to be mobile friendly - this is NOT a good thing!

Links with no anchor text:

  • This webpage has no links that are missing anchor text - GOOD WORK.

W3C Validation:

Print friendly?:

  • It appears that the webpage does NOT use CSS stylesheets to provide print functionality - this is a BAD thing.

GZIP Compression enabled?:

  • It appears that the server does NOT have GZIP Compression enabled - this is NOT a good thing!