Yes okay – but what were they doing?

How a lexical approach to event extraction can provide some first answers

To enable our users to search not only for entities, such as places, people, or organizations, but also for the actions or processes they were involved in, we work on the task known as event extraction. This means we try to identify and classify references to, for example, ships leaving, people being enslaved, or wars being started. The last blogpost on event extraction discussed the start of our search for the right framework for our event extraction module. Through a detailed process of trial annotations, discussions and research, we ultimately developed our own event framework: a schema containing over 80 event types, all defined specifically for our corpus.

We published a paper explaining and motivating the ideas behind this approach to event modelling, set up our wiki where we provide definitions of our event classes and finalized our event annotation guidelines. We also started mapping out how our event classes relate to other event modelling schemas used in different projects. The event classes we define range from fairly general concepts, such as Transportation, to concepts more specific to our corpus, like Mutiny. Other examples of event types are Killing, TakingUnderControl, Trade, FallingIll, Production and many more. The event types are organized in a taxonomical structure so that they can be grouped. 

We continued annotating according to this schema. This means members of our team labelled words that referred to a specific event class in a selection of documents. For example, in “het affgevaerdigde schip”, ‘affgevaerdigde’ refers to the event class Leaving. The point is to gather examples of different words, in various contexts, that refer to the same event class: not only ‘affgevaerdigde’, but also ‘vertrocken’, ‘gelichte ankers’, ‘afscheid’, etc., refer to Leaving.

We gather these examples to train machine learning systems. To reduce the annotation workload, we started developing a lexicon: a list of words that we are confident refer to the same event class in the vast majority of cases, regardless of context. For example, we automatically label ‘vertrecken’ as a reference to a Leaving event. On the other hand, we do not include ‘verseeckeren’ as a reference to an Occupation event in our lexicon, even though we know it is sometimes used that way,[1] because ‘verseeckeren’ can also refer to many other things.[2] Apart from serving as an automatic pre-annotation method, which means we annotate documents automatically with the lexicon before sending them to human annotators, the lexicon also provides a good baseline against which to compare any machine learning system. Once we have a fully automatic (probabilistic) method for linking words in the corpus to our event classes, it will be useful to see how it performs compared to a controlled lexical approach.
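As a rough illustration of what lexicon-based pre-annotation amounts to, here is a minimal Python sketch. The three-entry lexicon, the function name and the simple whitespace tokenisation are assumptions made for this example; they are not the project's actual code or data format.

```python
# Hypothetical mini-lexicon mapping a token to an event class.
# The released lexicon is much larger; these entries are for illustration only.
LEXICON = {
    "vertrecken": "Leaving",
    "vertrocken": "Leaving",
    "affgevaerdigde": "Leaving",
}


def pre_annotate(text):
    """Return (token, event_class) pairs; event_class is None for unknown tokens."""
    annotations = []
    for raw in text.split():
        # Strip surrounding punctuation and lowercase before the lookup.
        token = raw.strip(".,;:!?()\"'").lower()
        if token:
            annotations.append((token, LEXICON.get(token)))
    return annotations


labelled = pre_annotate("het affgevaerdigde schip is vertrocken.")
for token, event_class in labelled:
    print(token, event_class)
```

An ambiguous word such as ‘verseeckeren’ is simply absent from the dictionary, so it stays unlabelled and is left to the human annotator.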

The lexicon grew and grew, until it reached a version that we thought worthwhile to apply to our complete corpus to see what we would find. With this lexical approach we could already find over 1.8 million instances of the event Arriving, for example. Other examples: almost 200,000 instances of BeingInConflict, over 47,000 instances of BeingInDebt and over 22,000 instances of Destroying. We think this is useful for researchers and therefore decided to release this version of the lexicon.

Building the lexicon

The lexicon was created through an iterative process combining three techniques: i) annotation analysis, ii) expert input and feedback, and iii) a search for synonyms and spelling variations using a semantic model.

Annotation analysis

After two manual annotation rounds, we analyzed all annotations and extracted words that were annotated with the same event class twice or more. We went through them manually and decided which ones should be in the lexicon. We also directly added some token-type pairs that we, as experts working with the corpus all the time, knew might be relevant. This combination resulted in 150 token-type pairs.
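The first step, extracting words annotated with the same event class at least twice, can be sketched as follows. The annotation list is invented for illustration and the function name is hypothetical; the manual review described above still decides which candidates end up in the lexicon.

```python
from collections import Counter

# Hypothetical (token, event_class) annotations from the manual rounds.
annotations = [
    ("vertrocken", "Leaving"),
    ("vertrocken", "Leaving"),
    ("verwoesten", "Destroying"),
    ("verwoesten", "Destroying"),
    ("verwoesten", "Destroying"),
    ("verseeckeren", "Occupation"),  # seen only once, so not a candidate
]


def candidate_pairs(annotations, min_count=2):
    """Return token-type pairs annotated at least min_count times, sorted."""
    counts = Counter(annotations)
    return sorted(pair for pair, n in counts.items() if n >= min_count)


candidates = candidate_pairs(annotations)
print(candidates)
```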

Expert feedback

We used the first version of the lexicon for pre-annotation of documents and gathered annotators’ feedback on the quality of the token-type pairs, adding and deleting token-type pairs here and there. These changes were essential, though not drastic.

Adrianus Canter Visscher, Two young princes receive the blessing, c.1675-c.1725. Rijksmuseum Amsterdam, CC0.

Word2Vec

The biggest addition to this release of the lexicon came from an elaborate search using a word2vec model trained on the VOC corpus. A word2vec model is a collection of word embeddings: a computational representation of meaning in which words that often appear in the same contexts are clustered together and treated as similar. An example to explain this concept: “I could really use a warm cup of [blank] right now”. As humans, we have an intuition of what should fill the blank: something like tea, coffee, or hot chocolate. These are all liquids that humans enjoy drinking at the same temperature (hot), so they are similar concepts. A word2vec model does not know anything about hot chocolate, but it can learn the contexts in which the word is used and thus infer that it is similar to coffee.

A word2vec model represents the semantics it encounters in its training data. For example, if you trained a word2vec model on text from fashion magazines, it would know the word ‘clutch’ and know that it is similar to a tote bag but not to a beret; if you trained it on cookbooks, it probably would not. It therefore matters which text you train your word2vec model on.

We used the words gathered through annotation analysis and expert input as ‘seeds’ for this word2vec search. This means we entered a word we knew to refer to a specific event class and looked for the most similar words according to the word2vec model. Each similar word the model suggested was evaluated, consulting the WNT[3] when necessary, and added to the lexicon if it was deemed similar enough. Sometimes, words we came across by chance, either through word2vec or through the WNT, were in turn used as new seeds.
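To make the seed search more concrete, the sketch below ranks words by cosine similarity to a seed word. The three-dimensional vectors are toy values invented for illustration; a real word2vec model learns much larger vectors from the corpus, and libraries such as gensim offer a comparable nearest-neighbour lookup. The function names are hypothetical.

```python
import math

# Toy vectors invented for illustration only; not taken from any trained model.
VECTORS = {
    "vertrecken": [0.9, 0.1, 0.0],
    "vertrocken": [0.8, 0.2, 0.1],
    "afscheid": [0.7, 0.3, 0.0],
    "verwoesten": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(seed, topn=2):
    """Return the topn words closest to the seed, most similar first."""
    scores = [
        (word, cosine(VECTORS[seed], vector))
        for word, vector in VECTORS.items()
        if word != seed
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


suggestions = most_similar("vertrecken")
print([word for word, _ in suggestions])
```

Each suggestion is only a candidate: as described above, a person still decides whether it belongs in the lexicon.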

This expanded the lexicon to around the size it is now: from 150 to 610 token-type pairs. During subsequent annotations, we continued the iterative process of adding and deleting token-type pairs according to expert feedback. 

Limitations 

The lexicon is limited in three ways. Firstly, by design: we only want to include token-type pairs that we are fairly sure are stable, so we try not to include ambiguous tokens. Secondly, unintentionally: we only include token-type pairs that we came across through our creation method, which definitely does not cover all possible token-type pairs for our selection of event types. We also try to cover as many spelling variations as possible, but we realize our coverage is not complete. Thirdly, the current version of the lexicon only matches single tokens to an event type. It does not link “het hazenpad nemen” to Leaving, but it does link “hazenpad”.

Using the lexicon

The lexicon contains 610 unique strings referring to 37 unique event types. This means we have no token-type pairs for the remaining event types in our framework. Apparently these events do not occur often or, as far as we know, are not referred to with unambiguous words. Though this is a very limited approach to event detection and very much a first step, it allows us to take a first peek into event-centered search in the VOC corpus. It allows us to try out and evaluate what it means to gather events into concepts and search for a concept rather than a word.

At the end of the blogpost we provide a table of links to perform a search for each of the events in the lexicon. An example: clicking on this link to the GLOBALISE transcription viewer allows you to search for instances of the event Destroying, which means looking for any occurrence of each of the following words:

verdestrueert; verdestrueren; verdestrueeren; destructie; destrueren; destrueeren; gelicht; demolieren; demolieeren; verwoesting; verwoesten; verwoestingen; verwoestinge; uijtgeplundert; uitplunderen; uijtplunderen; uijtgeplundert; plonderen; geplondert; uijtgeplondert; verdelging; verdelginge; verdelgen; verdelgende; verdelgt; verdelgd; verdelgde
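Searching for a concept rather than a word can be sketched locally as one pattern that matches any word from an event's list. The example below uses a subset of the Destroying words above; the sample sentence is made up for illustration (it is not a quotation from the corpus), and the regular expression is an assumption of this sketch, not the query syntax of the transcription viewer.

```python
import re

# Subset of the Destroying word list quoted above.
DESTROYING = ["destructie", "destrueren", "verwoesting", "verwoesten", "verdelgen"]


def build_pattern(words):
    """Compile one case-insensitive pattern matching any of the words as a whole token."""
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def find_event_mentions(text, words):
    """Return every matched word in order of appearance, lowercased."""
    return [match.group(0).lower() for match in build_pattern(words).finditer(text)]


# Invented sample sentence, for illustration only.
sample = "De verwoesting van het fort en de destructie van het dorp."
hits = find_event_mentions(sample, DESTROYING)
print(hits)
```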

We performed an evaluation measuring how well the lexicon can make out which tokens refer to events and which ones do not in 15 (parts of) documents covering different years and subjects. According to this evaluation, the current version of the lexicon scores a precision of 0.8 and a recall of 0.2. This means that when the lexicon indicates a token refers to an event, in 80% of cases it is right. However, the lexicon only captures 20% of all tokens that refer to events. 
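For readers less familiar with these measures, the sketch below computes precision and recall from two sets of token positions. The numbers are hypothetical and chosen only so that the result reproduces the scores reported above; they are not the actual evaluation counts.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of token positions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: 20 tokens truly refer to events (gold).
# The lexicon flags 5 tokens, of which 4 are correct.
gold = set(range(20))
predicted = {0, 1, 2, 3, 100}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

In this toy case, four of the five flagged tokens are correct (precision 0.8), but only four of the twenty event tokens are found (recall 0.2).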

In light of this we would advise against using (this version of) the lexicon to draw any conclusions about longitudinal occurrences of events throughout the corpus. For example, the intuition might be that plotting the occurrences of the event Production over time might give you insight into the periods in which most crops were produced in a certain region. However, since we do not know to what extent the way people referred to crop production in language changed over time, there is no way to know whether the search is representative (i.e., we might only have words referring to Production in our lexicon that were used in the second half of the 18th century but not before). 

The lexicon in CSV format, along with a more technical description, is available on our GitHub page.

Let’s make this a community-built resource!

Let’s Collaborate!

Adrianus Canter Visscher, Two Moorish ladies, c.1675-c.1725. Rijksmuseum Amsterdam, CC0.

Throughout this blogpost we have mentioned the limitations of any lexical approach, as well as those of this specific lexical approach. We think our lexical approach could become very useful and powerful if we gather enough feedback from historians with different expertise. 

If you have criticism or suggestions, or know of tokens that should be deleted or added, we are all ears. If you know of words that should be added: we are looking for any words that refer to one of our events. Please let us know by filling in our feedback form.

Event Type | Description | Transcription viewer query
Arriving | The event of reaching a destination. | Query
Attacking | The act of initiating aggressive action against someone or something. | Query
BeingAtAPlace | The state of an entity being present at a specific location. | Query
BeingDead | The state of no longer being alive. | Query
BeingInARelationship | The state of being connected or associated with someone or something. | Query
BeingInConflict | The state of being in a disagreement or clash between opposing parties. | Query
BeingInDebt | The state of owing money or favors to another party. | Query
BeingLeader | The state of holding a position of authority or control over a group or organization. | Query
Buying | The action of acquiring a product or service in exchange for money or another product. | Query
Collaboration | The action of working jointly towards a common goal. | Query
Communication | The act of conveying information. | Query
Damaging | The act of causing harm or injury to something. | Query
Destroying | The action of damaging something beyond repair or causing something to cease to exist; putting an end to something. | Query
Dying | The change of state from being alive to being dead; approaching death; ceasing to live. | Query
FallingIll | The event of becoming sick or unwell or developing a disease. | Query
FinancialTransaction | The act of exchanging something for money. | Query
ForceToAct | The act of compelling someone to do something, regardless of their willingness or volition. | Query
Getting | The act of obtaining or receiving something, either through force or in agreement. | Query
Giving | The action of transferring something to someone else. | Query
HavingInPossession | The state of owning or holding on to something. | Query
Invasion | The act of entering a place, often forcefully, to occupy or control it. | Query
Killing | The act of causing the death of a living being. | Query
Leaving | The act of departing from a place. | Query
Production | The process of creating or manufacturing goods or services. | Query
Punishing | The act of inflicting a penalty, sanction or torturous act on someone for a supposed offense or wrongdoing. | Query
Replacing | The act of substituting one person in a position of power for another (succession). | Query
Request | An event where an individual or entity asks for something to be given or done. | Query
Selling | The act of exchanging a product or service for money or another product. | Query
Sinking | The event of a vessel or object descending below the surface, typically of water. | Query
TakingUnderControl | The act of gaining dominance or command over something or someone. | Query
Trade | The action of buying, selling, or exchanging goods or services between parties, either involving a monetary exchange or not. | Query
Translocation | The process where something or someone moves from one place to another. | Query
Transportation | The movement of goods or individuals from one location to another. | Query
Unrest | A state of dissatisfaction, disturbance, or agitation in a community. | Query
Uprising | An act of resistance or rebellion against an established authority. | Query
ViolentContest | An event characterized by aggressive competition or conflict. | Query
Voyage | A long journey involving travel on sea. | Query

  1. “en op 2: a 3: plaetsen leegerneemen om in schijn de Twee landen te verseeckeren”, NL-HaNA, VOC (1.04.02), inv. no. 1308, fo. 45r (scan no. 98), transcription GLOBALISE project (https://globalise.huygens.knaw.nl/), March 2024.
  2. “Verseeckeren deselve in seer goeden stant te weesen”, NL-HaNA, VOC (1.04.02), inv. no. 4021, fo. 1153v, transcription GLOBALISE project (https://globalise.huygens.knaw.nl/), March 2024.
  3. Woordenboek der Nederlandsche Taal [Dictionary of the Dutch language], containing nearly 400,000 (historical) Dutch words from 1500 to 1976, including examples of how they are used. See: https://wnt.inl.nl.