A Guide to Event Data: Past, Present, and Future
1. Introduction
Within quantitative conflict studies, two primary forms of data exist: structural data and event data. The central goal of this paper is to provide a broad overview of the history, trends, challenges, and uses of event data—both the process of data collection and analysis and the events themselves as a specific data type—in the study of conflict. To do so, we begin with a discussion of the history and limitations of the more commonly used structural data, which helps illustrate the vital role that event data should play in the future of empirical studies of political violence.
In the 1960s and 1970s, international relations scholars began introducing empirical datasets and quantitative methodologies, most notably through the Correlates of War (COW) project. Largely motivated by attempts to understand the two major drivers of international relations at the time—the causes of the two world wars and the bipolar dynamic of US-USSR relations—political scientists in the realist tradition began accounting for monadic and systemic level conditions, such as the number of major states and alliances in the system; measures of power such as population, GDP, coal and steel production, and various measures of military strength; and relatively fixed characteristics such as shared borders, language, and religion. In the 1980s and 1990s, interstate units of analysis moved to the dyadic and directed-dyad level, with new structural variables reflecting neo-liberal concepts, such as shared alliances, regime type, trade, and joint institutions. Most recently, in response to the civil wars in Somalia, Afghanistan, Yugoslavia, and Rwanda, scholars created a set of civil war variables covering the late 1990s to the present, including ethnic diversity, religious diversity, natural resource wealth, former colonial status, and terrain (for example, the Armed Conflict Location and Event Dataset (ACLED), developed in 2010). Despite the increasing level of statistical nuance, most empirical studies of conflict of the last 30 years—both inter- and intrastate—have continued to rely almost exclusively on state-year structural variables.
Structural data are well suited to answer the types of empirical questions that have dominated the empirical conflict literature for the last 30 years, which for the most part derive from the relatively static theories of realism and its later variations:
- What conditions make interstate conflict more/less likely in a given year?
- If an interstate conflict occurs, what (directed) dyadic level variables will increase/decrease the intensity or duration of the conflict?
- If an interstate conflict occurs, what state-level variables will increase/decrease the intensity or duration of the conflict?
Structural variable-based studies have provided many insights into these types of questions and have uncovered important findings regarding the effects of variables such as regime type, natural resources, ethnicity, and geography on inter- and intrastate conflict. However, studies based solely on structural indicators have a number of analytical limitations. For example, structural variables usually change very slowly (if at all) and are either measured at the yearly level of temporal aggregation or simply fixed for the period being studied. This situation not only leads to potential problems in testing causal processes but also restricts predictions to the yearly level, which are often unhelpful to the policy community.
In addition, structural datasets do not account for the interactions that constantly occur between actors of interest in a specific location at a specific time. In many important contexts, analyses of these event interactions drive the relevant actors’ planning for future interactions. For example, in any intrastate conflict, all actors with a stake in the conflict and its outcome (including at a minimum the conflict participants, the civilians in the conflict area, and increasingly, allies and the broader international community) form their future strategies based on their interpretation of past events between important players. While structural-level characteristics may condition the interpretation of those events, they are not the primary drivers of actors’ strategic planning. Therefore, questions similar to the ones listed below, which are becoming increasingly important for the policy community and academia alike, cannot be answered with structural, state-year-level data:
- What is the likelihood that rebel group X will intensify its attacks against civilians next week/month/year?
- What is the likelihood that revolution in country X will spread to country Y within the next three months?
- If country Z experiences civil war, which rebel group is likely to initiate it?
- In what month is that civil war most likely to occur?
- How will foreign investors react in the short term to increases in terrorist attacks?
- What is the likelihood that a crisis between X and Y will escalate or be resolved? What specific actions by outside mediators will change those probabilities?
If these questions are to be answered using quantitative models, event data is required. The following example helps to define event data as a process and illustrate its importance: Consider an analyst tasked with writing a report about the likelihood of an attack on an embassy in country A in the upcoming month. This analyst will likely first read as much available information as possible about politically relevant activities in the region. Have rebels been increasing attacks on other types of government buildings? Are recruiting efforts increasing? Have threats been made? Have past attacks occurred, and if so, what was the government response? After building a mental timeline of events (and often physically creating chronological timelines), the analyst uses his or her cognitive powers to synthesize that information to make a series of predictions. Indeed, this is how we as humans form almost all of our beliefs about future interactions: we collect and evaluate ‘data’ on past interactions and use our cognitive powers to make predictions based on that data.
Unfortunately, this informal approach to data collection, interpretation, and analysis does not provide consistent and accurate predictions of outcomes of interest; it is inherently subjective and tends to be highly inconsistent. Consequently, there is room for more systematic modeling using either statistical or computational pattern-recognition methods. Such models require data that has been sampled at a much finer temporal grain than is found in the structural data, and this is where event data finds a role.
‘Event-data modeling’ in political science is both a process and a specific data type. Event data as a process is the formalization of the same three-step process that human analysts use intuitively and informally: 1) obtain stories, 2) code stories for relevant information, and 3) aggregate and analyze the resulting output with quantitative methodologies to understand trends and generate predictions. Event data as a specific data type is a set of records, each reflecting a specific event and containing codes indicating who | did what | to whom | when | [and sometimes] where |.
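For concreteness, a minimal sketch of what one such record might look like as a data structure; the field names and example values are illustrative, not drawn from any particular project's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """One event record: who | did what | to whom | when | [and sometimes] where."""
    date: str                       # when, e.g. "1998-03-12"
    source: str                     # who, e.g. "ISRMIL" (Israeli military)
    code: str                       # did what, e.g. event code "190"
    target: str                     # to whom, e.g. "PALINS" (Palestinian insurgents)
    location: Optional[str] = None  # where (not all datasets record this)

record = Event(date="1998-03-12", source="ISRMIL", code="190", target="PALINS")
```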
This paper provides a brief history of the way in which event-data projects have addressed the three steps in the event-data process. In doing so, we address debates within the discipline as well as emerging trends likely to dictate future adaptation in the field of event data.
2. Obtaining Stories
In general, we learn about an event through three media: we can visually witness it firsthand or on television, we can listen to the event from a radio report or word-of-mouth, or we can read about the event in a written account. To date, almost all international event datasets have focused on written accounts. Due largely to technological limitations of the era (i.e., the lack of electronic articles and computational power), McClelland’s World Event Interaction Survey (WEIS) and Azar’s Conflict and Peace Data Bank (COPDAB) projects relied on human analysts to physically collect newspaper clippings, press reports, and summary accounts from Western news sources to obtain news stories. In Leng’s Behavioral Correlates of War (BCOW) and in Militarized Interstate Disputes (MIDs), news reports were combined with archive material such as books and chronologies. Although coders were instructed about what types of articles to gather, they relied on their subjective judgment to determine whether an article was ‘relevant’ and warranted inclusion into the archive of articles from which events were derived.
This manual approach began to be replaced with automated coding with the first iteration of the Kansas Event Data System (KEDS) project in the late 1980s. By this time, two major computing developments had occurred. First, the advent of large-scale data aggregators such as LexisNexis, and, later, the Internet, allowed news reports to be obtained in machine-readable form. Second, computational power and natural language processing (NLP) methods had advanced to the point where processing large quantities of information was possible using personal computers. In its earliest version, the KEDS project automatically downloaded and archived Reuters leads from the NEXIS service (a precursor to LexisNexis) into an electronic database, then coded these using a custom computer program. Following the success of KEDS, other event-data programs, such as the Protocol for the Assessment of Nonviolent Direct Action (PANDA) project, adopted an automated data collection process.
By 2000, virtually all large-scale event-data projects in political science relied on automated news collection. In addition to data collection efforts becoming almost exclusively electronic and automated, the scope of media coverage also increased. However, until recently, KEDS and other academic event-data projects with global coverage relied primarily on Reuters and Agence France Presse (AFP) for news content. Only with the creation of the Defense Advanced Research Projects Agency (DARPA)-funded Integrated Conflict Early Warning System (ICEWS) project in 2009, which draws articles from 29 international and regional news sources, did an event dataset with global coverage attempt to utilize a more comprehensive list of global news outlets. The key difference between the ICEWS event-data coding efforts and those of earlier National Science Foundation (NSF)-funded efforts was the scale. As O'Brien, the ICEWS project director, notes:
. . . the ICEWS performers used input data from a variety of sources. Notably, they collected 6.5 million news stories about countries in the Pacific Command (PACOM) AOR [area of responsibility] for the period 1998-2006. This resulted in a dataset about two orders of magnitude greater than any other [of] which we are aware. These stories comprise 253 million lines of text and came from over 75 international sources (AP, UPI, and BBC Monitor) as well as regional sources (India Today, Jakarta Post, Pakistan Newswire, and Saigon Times).
More recently, the massive Global Database of Events, Language, and Tone (GDELT) has been developed; it is similar to ICEWS in that it draws on a comprehensive list of electronic news sources. However, instead of directly accessing local or small regional news sources, GDELT indirectly accesses stories from hundreds of smaller news outlets by collecting all articles from Google News (with the exception of sports stories).
2.1. Obtaining stories: trends and challenges
If the emergence of the Internet was the first wave of electronically available information about politically relevant events, the rise of social networking sites in the last five years reflects a second wave. Not only have Facebook, Twitter, and blogs drastically increased the amount of available information, they have also decreased the amount of time that transpires between an event occurring and a written account of that event appearing online.
Consider the recent Arab Spring. The most effective way to obtain information about protest events, inter- and intra-group communications, popular sentiments, and potential diffusion of the uprisings in Egypt, Tunisia, Libya, Bahrain, Syria, and other countries was through processing information from Facebook and Twitter feeds. Furthermore (in theory at least) these media should reflect future political change because sentiment and organization necessarily occur before collective action. If current trends persist, social media will continue to play an increasingly important role in the spread of information.
However, although networking platforms contain large quantities of quality information, the majority of data is random noise—“wanna getta pizza?”—and at least some of it is deliberately false, planted by governments in an effort to disrupt resistance efforts. Moreover, unlike articles published by established news outlets, the useful information often does not follow standard journalistic structure, which further complicates the data acquisition process. Among the most pressing challenges to moving forward with automated data collection efforts will be to devise a method of parsing through noisy Facebook status updates and Tweets to extract quality information.
In addition to social media, three trends are evident in processing sources. The first is using text classification tools to eliminate stories that do not contain political events—for example, sports stories, movie reviews, and historical chronologies—before these go further into the system, where they either might be miscoded (by machine) or will waste human coders’ time. The ICEWS, MID, and GDELT systems all use this approach. Second, the availability of news reports directly from the web (rather than through aggregators such as LexisNexis) makes automated, near-real-time coding systems possible, although web sources do not provide archives. Third, some projects are beginning to experiment with using machine translation to code material in languages other than English, while still using English-language coding programs such as Text Analysis by Augmented Replacement Instructions (TABARI), developed by the KEDS project. This method will probably be more efficient than writing language-specific coders and dictionaries, although these might still be useful for high-priority languages such as Chinese, Arabic, and Spanish.
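As a rough illustration of the first trend, the sketch below trains a toy relevance filter that discards stories unlikely to contain political events. The two-story training set and labels are placeholders; a production filter would be trained on many thousands of labeled stories.

```python
# Toy pre-filter: a bag-of-words classifier that flags stories unlikely to
# contain political events (e.g., sports reports) before event coding.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Rebel forces shelled government positions near the capital.",   # political
    "The striker scored twice as the home side won the cup final.",  # sports
]
train_labels = [1, 0]  # 1 = keep (political event), 0 = discard

relevance_filter = make_pipeline(TfidfVectorizer(), LogisticRegression())
relevance_filter.fit(train_texts, train_labels)

story = "Militants attacked a police convoy on Tuesday."
if relevance_filter.predict([story])[0] == 1:
    print("forward to event coder")
```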
3. Processing Articles
After acquiring and storing electronic news sources, the second step in the event-data process is to extract the relevant information from each article to build the actual event dataset. This step comprises two essential aspects: 1) developing a coding scheme or ontology and 2) using a systematic coding process to apply the coding rules to the news articles and enter the results into a dataset.
3.1. Coding ontologies
All news articles provide information regarding multiple actors, actions, places, and times; these must be extracted and coded from articles in an objective and replicable fashion to form the event datasets. Here, Gerner et al.’s formal definition of an event is useful:
An event is an interaction which can be described in a natural language sentence which has as its subject and direct or indirect object an element of a set of actors, and as the verb an element of a set of actions, all of which are transitive verbs, and which can be associated with a specific point in time.
Schrodt et al. further specify that “[i]n event coding, the subject of the sentence is the source of the event, the verb determines the event code, and the object of the verb is the target.” Coding ontologies or schemes are the rules by which the source, the object, and the verb presented in natural language in articles are converted into categorical actor and event codes suitable for empirical aggregation and analysis. Generally, two ontologies are used, one for actors and one for verbs, although other ontologies have been developed to code other characteristics of the sentence—for example, COPDAB and PANDA code for political ‘issues,’ while the Conflict and Mediation Event Observations (CAMEO) framework and the Integrated Data for Event Analysis (IDEA, built on PANDA) system have ontologies for general ‘agents.’
McClelland’s WEIS (1976) and Azar’s COPDAB (1982) were the first event-data ontologies. Reflecting the status of international relations at the time, both followed the realist tradition in assuming that states operate as unitary actors. This thinking means that all events between individuals are treated as occurring between the states of each individual’s respective citizenship. For example, if a group of Pakistani rebels attacks Indian civilians across the countries’ common border, both WEIS and COPDAB treat this as an attack of Pakistan against India.
Consequently, the WEIS and COPDAB event-coding ontologies are also structured to capture important interstate interactions. The WEIS event ontology is based on 22 distinct cue (or parent) categories of actions (such as “Consult,” “Reward,” and “Warn”) that take on two-digit codes, and 63 sub-categories that indicate additional information and take on three-digit codes. For example, “Threaten” is one of WEIS’s cue categories, and its two-digit code is 17. However, when an article provides more information regarding the type of threat, the event may be coded by sub-category, such as 172 if the threat involves specific nonmilitary sanctions, or 173 if the threatened force is specified.
Table 1 - Example of the WEIS coding ontology

| Code | Event category |
|------|----------------|
| 17   | THREATEN (cue category) |
| 171  | Threat without specific negative sanctions |
| 172  | Threat with specific nonmilitary negative sanctions |
| 173  | Threat with force specified |
| 174  | Ultimatum: threat with negative sanctions and time limit specified |
The Conflict and Peace Data Bank utilizes a similar verb typology to capture interstate events, but uses 16 cue categories instead of 22 and places them on a conflict-cooperation continuum to facilitate empirical analyses.
While WEIS and COPDAB were the most commonly used ontologies in the first phase of event-data analysis, quite a few additional systems have been developed. For example, the BCOW dataset codes historical as well as contemporary crises and has more than 100 distinct event codes, including “Assume foreign kingship.” The Comparative Research on the Events of Nations (CREON) dataset was customized for coding foreign policy behaviors, and the SHERFACS (named after its developer) and CASCON (Computer-Aided System for Analysis of Local Conflicts) datasets code crisis behavior using a crisis-phase framework.
Although WEIS and COPDAB deserve much credit for spearheading the entrance of event data into mainstream political science, a number of shortcomings have become apparent over time. Gerner et al. report that the state-centric focus of WEIS and COPDAB makes them ill-suited to account for sub-state level events between domestic actors. The scholars also explain that the WEIS and COPDAB verb typologies contain too few event categories:
For instance, WEIS has only a single cue category of Military engagement that must encompass everything from a shot fired at a border patrol to the strategic bombing of cities…COPDAB contains just 16 event categories, spanning a single conflict-cooperation continuum that many researchers consider inappropriate.
Reacting to these shortcomings, Bond et al. constructed the first version of the Protocol for the Assessment of Nonviolent Direct Action in 1988. The leading motivation behind PANDA was to more thoroughly account for domestic events, especially the non-violent types of action often found in protests and demonstrations but overlooked by the WEIS and COPDAB schemes. Ten years later, Bond et al. created the more comprehensive IDEA, incorporating codes from Taylor's World Handbook of Social and Political Indicators, WEIS, and MID. IDEA also added codes for economic events, biomedical phenomena such as epidemic disease, and various jurisprudence and electoral events.
In 2002, Gerner et al. released the CAMEO coding framework. Like PANDA and IDEA, CAMEO is designed to capture sub-state events and nuanced actor attributes. However, there are two differences between CAMEO and IDEA. First, while IDEA’s extensions preserve backwards compatibility with multiple earlier systems, CAMEO began only with WEIS (plus some of IDEA’s extensions) and combines WEIS categories such as “Warn”/“Threaten” and “Promise”/“Reward,” which were difficult to disambiguate in machine coding. Second, CAMEO’s actor codes utilize a hierarchical structure of one or more three-character codes that reflect the country or nation of origin and as much supplementary information as the article provides regarding region, ethnic/religious group, and domestic role (military, government, etc.). Recently, the ICEWS project—using a variety of sources such as the CIA World Factbook’s national government lists and lists of IGOs, NGOs, multinational corporations, and militarized groups—built on CAMEO’s actor dictionary, eventually collecting over 40,000 names of political figures from countries around the world who held a position of prominence between 1990 and 2011. The GDELT project also relies on the CAMEO ontology.
3.2. Coding processes
In the early stages of event coding, the lack of readily available electronic news stories and of sufficient computing power to support machine-coded efforts meant that human coding was the only viable option. That process was relatively straightforward. Coders—generally low-paid or unpaid undergraduate and graduate students—applied the rules from codebooks governing actor and event ontologies to a series of articles and manually recorded events of interest. The entries were transferred to punch cards and eventually to magnetic tape.
Human coding has three main shortcomings: it is slow, expensive, and subjective. The average human coder can code around six to ten stories an hour on a sustained basis, and few people can reliably code more than a few hours a day because the process is so mind-numbingly boring. At such a rate, it takes a team of 10 coders at least three person-years to code 80,000 news stories. Paying coders $10 an hour would cost $100,000, and the costs for training, re-training, cross-checking, and management would at least double that figure. Additionally, due to the inherently subjective nature of human analytical processes, inter-coder reliability rarely exceeds 70% and often falls in the 30% to 40% range, particularly when coding is done across institutions and over long periods of time.
By the late 1980s, computer technology, both in terms of the availability of electronic news articles and the computational power needed to run automated coding software, had advanced to the point that machine coding became feasible. The KEDS project was the first attempt within academia to use a computer to parse electronic text and code relevant events into an event database, relying on dictionary-driven ‘sparse parsing’ based on the WEIS typology.
Sparse parsing relies primarily on simple pattern matching in the text of an article to find specific words (e.g., “Israel,” “attack,” “bomb”) or sets of words (e.g., “United Nations Secretary General,” “promised to provide aid,” “promised to seek revenge”) that match entries in dictionaries corresponding to the actor and event ontologies. The system also knows some basic rules of English grammar: for example, it knows that a phrase in the form “Representatives of the US and France will meet with Israeli negotiators” involves two events—“US meets Israel” and “France meets Israel”—and that the passive voice construction “A US convoy was attacked by Iraqi insurgents” reverses the usual subject-verb-object ordering of English sentences so that it corresponds to “Iraq insurgents-attack-US.”
Consider the following hypothetical sentence:
March 12, 1998 – Israeli troops launched offensive attacks against Palestinian insurgents on Monday, in the first of what is expected to be a new wave of counter-terrorism efforts.
Using the CAMEO verb typology and actor dictionaries, as well as rules that automatically concatenate the proper nouns “Israeli” and “Palestinian” with the generic agents “troops” and “insurgents,” the TABARI-derived output for the example is presented below:
Table 2 - Example of CAMEO coding

| Date | Source | CAMEO Code | Target | CAMEO Event |
|------|--------|------------|--------|-------------|
| 19980312 | ISRMIL | 190 | PALINS | (Use conventional military force) |
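To make the mechanics concrete, here is a toy sketch of dictionary-driven sparse parsing applied to the example sentence above. The actor and verb dictionaries are tiny illustrative fragments, not real CAMEO dictionary files, and TABARI itself implements far richer pattern matching and grammar rules.

```python
# Toy dictionary-driven 'sparse parsing': match actor and verb phrases
# against small dictionaries and emit a coded event.
ACTORS = {
    "israeli troops": "ISRMIL",
    "palestinian insurgents": "PALINS",
}
VERBS = {
    "launched offensive attacks": ("190", "Use conventional military force"),
}

def code_sentence(date, sentence):
    s = sentence.lower()
    # actors in order of appearance: first match is the source, second the target
    actors = sorted((s.find(p), code) for p, code in ACTORS.items() if p in s)
    for phrase, (code, label) in VERBS.items():
        if phrase in s and len(actors) >= 2:
            return (date, actors[0][1], code, actors[1][1], label)
    return None

print(code_sentence(
    "19980312",
    "Israeli troops launched offensive attacks against Palestinian insurgents on Monday.",
))
# -> ('19980312', 'ISRMIL', '190', 'PALINS', 'Use conventional military force')
```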
By the late 1990s, machine coding had almost entirely replaced human coding. With this method, almost all time and costs are upfront, in the dictionary and software development phase; because the dictionaries and software are open source, they are easily adopted and upgraded. By 2000, the KEDS project’s software had become the dominant machine-coding system in event data, and extensions of TABARI’s sparse-parsing approach are used to code the PANDA, IDEA, WEIS, and CAMEO ontologies. Automated event coding has proved to be fast, accurate, replicable, inexpensive, and easily updatable.
As of November 2011, TABARI was able to code 26 million stories for the ICEWS project in six minutes using a small parallel processing system. Numerous tests also demonstrated that it could match the accuracy of human coders. Since computers rigidly apply coding rules, results are perfectly replicable. Moreover, because TABARI is open source, it is free to install and easily modified to include customized dictionaries or coding rules; for the same reason, GDELT utilizes it to process articles.
Additionally, GDELT builds on standard TABARI coding to assign specific latitude and longitude coordinates to each event. To do this, GDELT implements a ‘cross-walked’ approach, which first identifies CAMEO events in a [who | did what | to whom] format, and then scans the text to find the place name located nearest to the verb. According to Leetaru and Schrodt, tests against ground-truthed datasets suggest that this approach works well.
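A minimal sketch of the nearest-place idea, assuming a tiny invented gazetteer and a pre-computed character offset for the event's verb phrase; GDELT's actual geocoding pipeline is considerably more elaborate:

```python
# Pick the gazetteer place name whose mention lies nearest (in character
# offset) to the event's verb phrase. Gazetteer entries are invented.
GAZETTEER = {"gaza": (31.50, 34.47), "jerusalem": (31.77, 35.21)}

def nearest_place(text, verb_offset):
    best = None
    lowered = text.lower()
    for name, coords in GAZETTEER.items():
        pos = lowered.find(name)
        if pos >= 0:
            dist = abs(pos - verb_offset)
            if best is None or dist < best[0]:
                best = (dist, name, coords)
    return (best[1], best[2]) if best else None

text = "Troops clashed with militants near Gaza before withdrawing."
print(nearest_place(text, text.lower().find("clashed")))
# -> ('gaza', (31.5, 34.47))
```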
3.3. Processing articles: trends and challenges
Despite the current dominance of machine coding and its ability to increase coding speed by four or five orders of magnitude while maintaining high levels of accuracy and perfect reliability, debates over coding processes still exist.
For example, humans and computers are often reported to achieve the same level of coding accuracy, but closer inspection of the types of errors that each make reveals some important differences. Because humans are forced to make inherently subjective ‘judgment’ calls when coding events, an individual coder is rarely able to correctly code more than 70% of events when compared to a ‘master’ file. Machine-coded efforts often achieve similar levels of accuracy—around 70%—when compared to the same master files, and when dealing with a complex sentence containing compound subjects and objects, machine coding will almost always construct all of the logically possible events, whereas a human coder may miss some.
However, when human analysts miscode an event, their incorrect code almost always reflects some aspects of the reality of the event, that is, it is still partially correct. In contrast, when a machine makes a coding error, it is often a complete departure from reality. For example, a machine may mistakenly code “The Cuban fighter battled the Mexican until delivering a decisive, explosive knock-out blow in the fourth round to secure victory” as an interstate attack launched by Cuba against Mexico, when it really is a report from a boxing match. A human coder would almost never make that level of mistake. For these reasons, certain governmental agencies continue to rely on human coding. Human coding is also required when relevant information—for example, the identity of a likely but unproven perpetrator of a massacre—is spread across several sentences or when information needs to be summarized from multiple articles that may disagree on some details of the event.
Despite its imperfections, machine coding has several key advantages. First, with the major increases in the volume of news reports available on the web, only machine coding can maintain near-real-time updates. Second, machine coding allows for easy experimentation with new coding schemes: ICEWS went through multiple refinements of the CAMEO coding ontologies, particularly for actors, but was able to easily recode the entire dataset after each of these, whereas human recoding would have been prohibitively slow and expensive. Finally, machine coding does not require maintaining a large team of coders for when datasets need updating: once a final set of dictionaries has been developed, new data can be coded indefinitely with almost no additional effort.
Although machine-coding software is imperfect, major advancements have occurred in NLP and full syntactic parsing in recent years. For example, IBM’s Watson system, which soundly defeated human contestants on Jeopardy! in early 2011, can correctly interpret the vast majority of the natural-language questions posed to it. Google’s translation software is sufficiently accurate that it is used to perform real-time translations in combat situations. While these systems require far more computing power than is available to academic coding projects, no-cost, open-source NLP software that runs on personal computers can provide considerable pre-processing (for example, ensuring that the object of the verb is correctly identified), which can both simplify the coding process and make it more accurate. Lockheed Martin’s Jabari-NLP, which pairs a Java-based successor to the original TABARI software with several open-source parsers, is an example of this approach.
4. Aggregating and Analyzing Event-Data Output
The previous two sections focused on the first two steps of the event-data process—obtaining news sources and coding their relevant content in a replicable and objective manner. These steps culminate in event data as a data type, an example of which is provided below.
Table 3 - Sample Event Codes from TABARI's Turkey Dataset

| Date | Source | Target | CAMEO Code | CAMEO Event |
|------|--------|--------|------------|-------------|
| 920104 | WST | KURREB | 72 | (Provide military aid) |
| 920107 | TUR | IGOUNO | 30 | (Express intent to cooperate) |
| 920108 | IRQ | IGOUNO | 10 | (Make statement) |
| 920113 | TURMIL | KURREB | 190 | (Use conventional military force) |
The third and final step in the event-data process involves analyzing the event-data output to gain insight into important trends and form accurate forecasts about events of interest. As Table 3 illustrates, event data as a data type is a combination of string and numerical components that reflect an event. However, most empirical models used in the social sciences are not equipped to handle this level of heterogeneity in data. Consequently, prior to performing quantitative analyses, researchers must first aggregate raw event data into a usable format.
4.1. Aggregating event data
To prepare an event dataset for quantitative models, researchers historically needed to address three primary aspects of event-data aggregation; with GDELT’s provision of latitude and longitude coordinates, a fourth aspect now exists. (A short code sketch of the action and temporal steps follows the list.)
- Actors—Contemporary event datasets such as those used in ICEWS provide a broad coverage of actors, and not all of these are relevant to the analysis. For example, a model attempting to forecast Arab-Israeli violence would likely not include events occurring between sub-Saharan African or East Asian states. In this case, the researcher may wish to focus only on events occurring between key actors in the Middle East, such as Israel, Palestine, Egypt, Lebanon, and the US.
- Actions—In most empirical analyses, it is beneficial to aggregate the verb codes, for example, by using a scale that reflects the level of contentiousness or using a series of counts that indicates the number of important types of events that have occurred. The Goldstein Scale, which places all event codes on a conflict-cooperation continuum from -10 to +10, with -10 reflecting the most conflictual and +10 indicating pure cooperation, is the most common strategy of action aggregation. Due to complications arising from scaling, other studies convert verb codes to count variables. For example, Thompson and Duval build counts that reflect whether each event is an act of material conflict, material cooperation, verbal conflict, or verbal cooperation, an approach that the GDELT dataset calls a “quad count.” Jenkins and Bond calculate ratios of counts that reflect more complicated concepts such as “conflict carrying capacity.”
- Temporal—Researchers must determine the temporal unit(s) across which to aggregate the data. Researchers tend to aggregate across traditional demarcations of time, including daily, weekly, monthly, quarterly, and annual levels. The literature has yet to settle on firm rules regarding temporal aggregation, meaning that researchers must rely on theoretical and empirical considerations on a study-by-study basis. A number of studies, including Shellman’s and Alt et al.’s, demonstrate that the level of temporal aggregation can actually drive empirical findings. As such, studies using event datasets should perform robustness checks using a different level of temporal aggregation.
- Geo-spatial— Because latitude and longitude coordinates are highly specific, scholars tend to aggregate up to a coarser level of geo-spatial aggregation to facilitate analysis. Two primary approaches exist. The first approach geo-spatially aggregates events according to sub-state administrative units (such as municipalities, provinces, or districts) because most countries are divided into such units. The second approach ignores administrative units and constructs sub-state, geo-spatial units (generally polygons) centered around the specific location where an event occurs.
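A minimal pandas sketch of the action and temporal aggregation steps, using a three-event toy dataset; the Goldstein-style scores and quad-category assignments are placeholders rather than official values:

```python
# Aggregate raw events to monthly mean 'Goldstein' scores and quad counts.
import pandas as pd

events = pd.DataFrame({
    "date":   pd.to_datetime(["1998-03-12", "1998-03-20", "1998-04-02"]),
    "source": ["ISRMIL", "PSE", "ISR"],
    "target": ["PALINS", "ISR", "PSE"],
    "code":   ["190", "030", "010"],
})

goldstein = {"190": -10.0, "030": 4.0, "010": 0.0}  # placeholder scores
quad = {"190": "material_conflict", "030": "verbal_cooperation",
        "010": "verbal_cooperation"}                # placeholder categories

events["score"] = events["code"].map(goldstein)
events["quad"] = events["code"].map(quad)

# temporal aggregation: mean score and event count per month
monthly = events.set_index("date").resample("MS").agg(
    {"score": "mean", "code": "count"})

# action aggregation: 'quad counts' per month
quad_counts = (events
               .groupby([events["date"].dt.to_period("M"), "quad"])
               .size()
               .unstack(fill_value=0))
print(monthly)
print(quad_counts)
```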
4.2. Empirical analyses
The final step in event data as a process is to apply empirical models to the properly aggregated event dataset. Because event data as a data type provides large quantities of fine-grained information, it allows researchers the flexibility to analyze a large range of issue areas, including but not limited to:
- general interstate conflict
- Arab-Israeli conflict
- equity market fluctuations
- migration
- mediation
- Yugoslavian conflict
The large amount of nuanced data also allows researchers to predict the outcomes above using sophisticated methodologies beyond the linear regression models that dominate the empirical conflict literature (a minimal sketch of one such model follows the list), such as:
- time series
- hidden Markov models
- sequence analysis
- vector auto regression (VAR)
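For illustration, a minimal vector autoregression sketch fit to synthetic monthly event scores for two directed dyads; the random data stand in for the aggregated series a real study would use:

```python
# Fit a VAR to two synthetic monthly dyadic score series and forecast ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
scores = pd.DataFrame({
    "A_to_B": rng.normal(size=120),  # e.g., monthly mean Goldstein score, A toward B
    "B_to_A": rng.normal(size=120),
}, index=idx)

results = VAR(scores).fit(2)                               # two monthly lags
forecast = results.forecast(scores.values[-2:], steps=3)   # 3-month-ahead path
print(forecast)
```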
4.3. Aggregating and analyzing event-data output: trends and challenges
Primarily due to the larger number of observations and high degree of nuance found in event datasets relative to datasets comprised of more static, structural variables, researchers utilizing the former tend to be at the forefront of methodological sophistication in the social sciences. In the future, it will be important for researchers to continue to innovate methodologically, especially in the following three areas:
First, data mining approaches—including k-nearest neighbors (kNN), support vector machines (SVMs), and random forests—are well suited to finding potential non-linear clusters within event datasets. Moreover, certain data mining approaches, such as random forests, the lasso, and principal components analysis (PCA), can address the dimensionality problems that may arise as researchers build increasing numbers of features to aid in prediction.
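As a sketch of the random-forest case, assuming synthetic lagged event features and a made-up conflict-onset label:

```python
# Random forest predicting a binary conflict-onset label from event features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))  # e.g., lagged quad counts, monthly mean scores
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)  # synthetic onset label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```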
Second, agent-based models (ABMs) are becoming increasingly prominent tools to predict social interactions. However, ABM programmers often struggle to base agent parameters on actual human behavior. Event datasets may be able to inform agents in an ABM environment, thereby leading to more realistic simulations.
Third, the rapid expansion of Facebook, Twitter, and blogs has contributed to the fast growth of social network analysis. Methodological techniques able to merge event data and social network analysis are likely to become increasingly important in the future.
Despite continually increasing levels of methodological sophistication in event-data studies, human intuition still holds a number of strengths over more rigorous empirical approaches.
Table 4 - Comparison of Empirical Model and Human Analyses

| Empirical models | Human analyses |
|------------------|----------------|
| Objective: empirical approaches are perfectly replicable. | Subjective: a human analyst may interpret the exact same events differently depending on whether he/she is sick, under stress, needing to make a quick decision, and so forth. |
| Rigid: empirical models are only able to account for the variables they are provided during training. | Flexible: human analysts can interpret sentiments that are difficult to quantify. Additionally, humans can generate more nuanced predictions, such as the specific content of a dictator's upcoming speech. |
| Struggle to handle 'rare' events: empirical models struggle to predict the effects of events that occur too infrequently in a dataset. Although empirical fixes for rare events exist, if an event occurs extremely rarely (fewer than 10 times in a 100,000-observation dataset), these approaches are no longer relevant. | Able to interpret 'rare' events: humans are able to predict broad ranges of consequences of important events that may happen infrequently, such as a global financial crisis or a large-scale terrorist attack on the US. Human analysts usually also have an extended professional education that involves learning a great deal of history, and thus they can put events into context over a period of time. |
| Inexpensive and fast after initial calibration: after selecting and training initial model specifications, empirical approaches can be implemented quickly and inexpensively on new data. | Ongoing expenses: human analysts require ongoing salaries to form predictions. |
| Struggle to predict 'new' phenomena: because empirical models must be calibrated on a set of training data, if the 'new' outcome of interest does not exist in the training data, it is impossible to calibrate a model to predict occurrences of the new event in the future. | Able to predict 'new' phenomena: the flexibility and subjectivity of human cognition allows analysts to predict outcomes that have not previously occurred, such as conflict diffusion across online social networks. Analogical reasoning based on historical archetypes can easily, if not always accurately, generalize past cases to new ones even when these do not match exactly. |
| Provide clear and falsifiable predictions: empirical models provide a specific point prediction (often with confidence intervals) and are unable to retroactively justify incorrect predictions. | Tend to make conditional, non-falsifiable predictions: human analysts often avoid making specific predictions that can be proven wrong. Instead, they prefer conditional "if x happens, y will follow, but if z happens, k will follow" predictions. When they are wrong, humans attempt to retroactively explain/justify their false predictions. |
In the future, it is unlikely that empirical approaches will ever fully duplicate human analysts. Instead, human and empirical approaches will be used as complements to each other, ideally integrating the strengths of each approach.
5. Conclusion
‘Event-data modeling’ refers to both a process and a specific data type. Event data as a process is the attempt to formalize the three general steps used to make predictions about difficult social events: 1) collect as much meaningful information as possible, 2) identify and extract the relevant events, and 3) analyze those events and form a prediction. Event data as a specific data type is the structure of data that results from the second step in this process, which contains rows (often 100,000+ observations) of daily-level events with information regarding who | did what | to whom | where | and when. Unlike structural data, which tend to overlook actual interactions (such as meetings or threats) between important actors and are generally aggregated at the state-year level, event datasets focus exclusively on the actions that occur between actors relevant to a specific question, because these actions tend to drive future outcomes.
Efforts by McClelland’s WEIS project and Azar’s COPDAB in the 1960s and 1970s spearheaded the use and acceptance of event data in the empirical conflict literature. However, due primarily to the computational limitations of the time, these projects relied heavily on human analysts to collect and code physical news articles. In the late 1980s and early 1990s, advances in computing power and the rise of the Internet allowed Schrodt’s KEDS project to automate both the data collection and event coding processes. By downloading electronic articles and coding their content with sparse-parsing machine-coding software, KEDS was able to drastically reduce the time and costs of building an event dataset while increasing replicability and maintaining accuracy levels similar to those of human coders. Soon after, other prominent projects, such as Bond’s PANDA, adopted the KEDS approach to automate the first two steps of the event-data process.
The third step, aggregating and analyzing the event-data output, has also become increasingly sophisticated. The large size and fine-grained nuance of event datasets provide two main advantages: first, researchers are able to identify trends and make predictions at sub-annual temporal resolutions. Second, researchers can move beyond the traditional linear models that dominate the empirical conflict literature to sophisticated machine-learning algorithms capable of uncovering more complex patterns within the data.
As the types of questions that interest scholars of political violence continue to become increasingly nuanced, event data as both process and type will likely play an increasingly important role in academia and in the policy world. However, the future importance of event data may be contingent on the ability of practitioners to make improvements in three general areas: first, the data collection process must expand to cover social networking sites, which provide information about event planning, sentiments, and network structures not found in traditional media. Second, machine-coding approaches need to become increasingly sophisticated to not only code actors and verbs more accurately but also to parse through the massive amounts of “noise” on Facebook, Twitter, and blogs. Third, methodologists should increasingly leverage machine learning, ABMs, and social network analysis approaches with event data to uncover patterns and form predictions that more traditional statistical approaches may be less equipped to handle.