1  Data and research data management

1.1 Introduction to Research Data

1.1.1 What is data? What is it for?

Data refers to factual information, often quantitative or qualitative, used as a basis for reasoning, discussion, or calculation. It is the foundational element of analysis and supports decision-making across diverse fields. In research, data helps to generate new insights, verify hypotheses, and contribute to the overall body of knowledge in a specific area.

1.1.2 Archaeological Data: Particularities

In archaeology, data encompasses various types, such as field notes, artefacts, images, geospatial coordinates, etc. By systematically collecting, organizing, and analysing this data, researchers can reconstruct past human behaviours, understand environmental contexts, and explore cultural practices.

Archaeological data is unique due to its diverse formats and the complexity of its collection. It often includes material remains like pottery, bones, tools, and structures, as well as environmental data like pollen samples and soil types. Since archaeological data often comes from excavation sites, it is usually non-renewable; once excavated, a site cannot be restored to its original state.

Archaeologists rely heavily on careful documentation to preserve as much information as possible for future study. Furthermore, the context in which artefacts are found is crucial, as it helps interpret their use, significance, and the broader cultural setting.

1.1.3 Open Science

Open Science is a general framework that promotes transparency and collaboration in research by making methodologies, data, and findings accessible to the public and other researchers. In archaeology, it strongly encourages sharing excavation reports, datasets, and analysis methods to facilitate broader understanding and scrutiny of findings and interpretations.

The Open Science agenda, widely agreed upon and recognised by national and international organisations, includes open research data and open source as main areas of practice and policy.

Pillars of Open Science, UNESCO (2021)


1.1.4 FAIR Principles

FAIR stands for Findable, Accessible, Interoperable, and Reusable (“What Is FAIR?” n.d.) and applies to data, metadata, and supporting infrastructure to ensure data is systematically organized and readily available for ongoing use and analysis.

Metadata refers to data that provides information about other data rather than the content of the data itself. It describes various attributes such as the data’s structure, format, location, and context (“Metadata” 2024; Editor n.d.). Examples include creation dates, file sizes, authorship details, and keywords, which help organize, understand, and retrieve the associated data (“What Is Metadata and How Does It Work?” n.d.).

Example:

  • Data: identification number, site of origin, and dates of items of a museum exhibition.
  • Metadata: title and description of this data, author(s), date of last update, criteria for determining the site of origin, method(s) for establishing the dates.

Notice that, in this case, we could also choose to record the information about the dating method in the dataset itself, provided that we want to, or are able to, specify it for each item independently.
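To make the distinction tangible, here is a minimal sketch in R (the item identifiers, sites, and dates are invented for illustration): the data lives in a table, while the metadata is recorded separately as information about that table.

```r
# Data: one row per item in the museum exhibition (invented values)
items <- data.frame(
  id          = c("MUS-001", "MUS-002", "MUS-003"),
  site_origin = c("Site A", "Site B", "Site A"),
  date_bce    = c(5500, 1600, 2300)
)

# Metadata: information about the dataset as a whole
items_metadata <- list(
  title           = "Items of the museum exhibition",
  description     = "Identification, site of origin and dating of exhibited items",
  authors         = "J. Doe",
  last_update     = "2024-06-01",
  origin_criteria = "Assigned from excavation records",
  dating_method   = "Radiocarbon dating"
)
```

If the dating method varied from item to item, it could instead be added as an extra column in the data table, as noted above.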

A short breakdown of the concepts behind FAIR:

  • Findable: Data should be easy to locate for humans and machines. This includes having:

    • A unique and persistent identifier.
    • Descriptive metadata.
    • Metadata that explicitly includes the identifier of the data it describes.
    • Registration in searchable resources.
  • Accessible: Data should be easy to access once found. This includes:

    • Retrieval via a standardized, open protocol.
    • Support for authentication and authorization when necessary.
    • Availability of metadata even if data itself is no longer available.
  • Interoperable: Data should integrate with other data and tools. This requires:

    • Using a standardized, accessible language for data representation.
    • Using FAIR-aligned vocabularies.
    • Including references to related data.
  • Reusable: Data should be well-described for reuse and replication in various contexts. This involves:

    • Detailed, relevant metadata.
    • Clear data usage licensing.
    • Documentation of data provenance.
    • Conformance to community standards.

These principles aim to enhance the usefulness of digital assets by ensuring they can be easily located, understood, and utilized by others (Lamprecht et al. 2020; Wilkinson et al. 2016). Applying FAIR principles in archaeology involves creating well-documented datasets that other researchers can readily access, either openly or, in the case of sensitive data, through transparent authentication procedures (Hiebel et al. 2021; Lien-Talks 2024; Nicholson et al. 2023).

A summary breakdown of the FAIR data principles (reproduction from Lien-Talks (2024), Figure 1)
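To give a more concrete, purely illustrative impression of what FAIR-oriented metadata can look like, the sketch below builds a small deposit-level record in R and writes it out as JSON. The field names and the placeholder DOI are hypothetical, not a prescribed schema; repositories such as Zenodo or the ADS each define their own metadata forms.

```r
library(jsonlite)  # used here to write the record as JSON

# Hypothetical metadata record for a deposited dataset
record <- list(
  identifier = "https://doi.org/10.5281/zenodo.0000000",    # persistent identifier (placeholder)
  title      = "Survey finds from the 2023 field season",
  creators   = c("Doe, Jane", "Roe, Richard"),
  license    = "CC-BY-4.0",                                  # clear usage licence
  keywords   = c("archaeology", "field survey", "ceramics"),
  provenance = "Collected during pedestrian survey; see the methods report",
  related    = "https://doi.org/10.5281/zenodo.0000001"      # reference to related data (placeholder)
)

writeLines(toJSON(record, pretty = TRUE, auto_unbox = TRUE), "metadata.json")
```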

1.1.5 CARE Principles

The CARE Principles for Indigenous Data Governance were defined more recently (Carroll et al. 2020) and emphasize ethical and responsible use of data:

  • Collective Benefit: Data should benefit Indigenous communities collectively.
  • Authority to Control: Indigenous communities should have control over how their data is collected, used, and shared.
  • Responsibility: Data practices must uphold the well-being and interests of Indigenous peoples.
  • Ethics: Data use must be guided by ethical considerations, respecting cultural protocols and privacy.

These principles aim to protect Indigenous rights to self-determination and ensure data contributes positively to community development and innovation. Researchers involved in archaeology should remain sensitive to the concerns raised by CARE, especially when conceptualising a new research project and later, when publishing data and research outputs.

Applying the CARE principles in archaeological research will strongly depend on whether an indigenous group (self-identified and/or legally recognised) is directly involved as researchers or research enablers (e.g., creators of material culture, subjects of surveys and interviews, consulting experts, etc.). In countries with a recent colonial past, this criterion tends to be clear for all parties involved. However, the concept of indigeneity is hardly consensual when speaking about cultural groups that do not identify themselves in opposition to a colonising cultural group. Initiatives such as the Traditional Knowledge Labels by Local Contexts continue to be developed to help clarify this and other points.

  • CARE Principles for Indigenous Data Governance (2024)
  • CARE Principles for Indigenous Data Governance (n.d.)
  • CARE Principles for Indigenous Data Governance ARDC (2022)

1.1.6 Reproducibility

Reproducibility refers to the ability to replicate a study’s results using the original author’s assets or following their methodology. It ensures that findings can be independently verified under similar conditions, reinforcing the reliability of scientific research (National Academies of Sciences et al. 2019).

In scientific research, reproducibility is crucial for confirming the validity of experimental findings and building upon existing knowledge. It enhances transparency and trustworthiness in scientific practices, promoting better peer review and collaboration (GRN · German Reproducibility Network, n.d.).

Most archaeological research does not necessarily involve controlled experiments, and archaeological surveys and excavations are destructive and thus unrepeatable. However, the reproducibility of data collection, processing, and analysis is not a trivial concern.

Making research reproducible should be understood as a spectrum of practices along which researchers ought to strive to improve, despite the challenges (Marwick 2017).

Ben Marwick, CC-By Attribution 4.0 International, available at OSF
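As a small sketch of what this spectrum can mean in practice (the file name, column names, and analysis are invented), a fully scripted workflow records every step from raw data to result so that others can rerun it:

```r
# Minimal reproducible workflow sketch (invented file and column names)
set.seed(42)                                   # fix the random seed, relevant if any step samples or simulates

sherds <- read.csv("sherd_counts.csv")         # raw data stays unchanged on disk
sherds$density <- sherds$count / sherds$area   # derived variable computed in code, not by hand

model <- lm(density ~ phase, data = sherds)    # the analysis itself, rerunnable by anyone
summary(model)

sessionInfo()                                  # record the R and package versions used
```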

1.1.7 Open Source

Open Source refers to software and tools that are freely available for anyone to use, modify, and distribute (“Open Source” 2024; “The Open Source Definition” n.d.).

Key criteria include:

  • Access to Source Code: Users can access the software’s source code.
  • Free Redistribution: The software can be freely redistributed.
  • Modifications and Derived Works: Users can modify the software and create derived works based on it.
  • Integrity of the Author’s Source Code: License terms ensure modifications do not restrict access to the original code.
  • No Discrimination Against Persons or Groups: The license must not discriminate against any person or group.
  • No Discrimination Against Fields of Endeavor: The license must apply to all fields of use.
  • Distribution of License: The rights attached to the software must apply to all who receive it without needing to acquire additional licenses.

For archaeologists, open-source tools can offer affordable solutions for data analysis, visualization, and management, promoting a more inclusive research environment. Additionally, there is a growing community of archaeological software engineers who identify entirely with open-source principles (Batist and Roe 2024; Open Source Archaeology: Ethics and Practice 2015).

1.1.8 The lay of the land: Organizations, tools, data repositories and where to find them

Several organizations and platforms support Open Science and the use of FAIR and Open Source principles, providing archaeologists with access to valuable tools and datasets.

Here is a list of those that will be most useful during this course:

  • Organizations:

  • Data Repositories and Data Management Tools:

    • ORCID: ORCID, which stands for Open Researcher and Contributor ID, is a unique alphanumeric code that identifies researchers and contributors in scholarly activities.
    • Zenodo: A repository where researchers can deposit datasets, software, and publications.
    • OSF (Open Science Framework): A free, open platform to support research and enable collaboration.
    • Archaeology Data Service (ADS): An archive for archaeological data from the UK, which provides access to various datasets, including excavation reports and geospatial data.
    • the Digital Archaeological Record (tDAR): An archive for archaeological data hosted in the USA, which provides access to various datasets, including excavation reports and geospatial data.
    • Open Context: A repository offering access to archaeological data from various global sources, adhering to FAIR principles.
    • Hypothes.is: an open-source platform designed to facilitate collaborative annotation of web content. It is used as a browser extension or plug-in.
  • Open Source Tools:

    • Markdown: Markdown is a lightweight markup language designed for creating formatted text using a plain-text editor (“Markdown Guide” n.d.).
    • R and Python: Programming languages with extensive libraries for data analysis, which are widely used in archaeology for statistical analysis and modeling (“Welcome to Python.org” 2024). See more about these and other programming languages below.
    • RStudio: RStudio IDE is an integrated development environment for R, a programming language for statistical computing and graphics.
    • Quarto: An open-source scientific and technical publishing system.
    • Git: Git is a distributed version control system designed to track changes in source code during software development.
    • GitHub: GitHub is a developer platform that allows developers to create, store, manage, and share their code using Git software and a series of additional automated services.
    • GitLab: GitLab is a developer platform that allows developers to create, store, manage, and share their code using Git software and a series of additional automated services.
    • Zotero: Zotero is a free, easy-to-use tool to help collect, organize, annotate, cite, and share references and their metadata.
    • Zotero Style Repository: An open database of citation styles, which can be used in various reference managers and other tools for formatting references.
    • QGIS: A free, open-source GIS tool useful for mapping and spatial analysis in archaeology (“Spatial Without Compromise · QGIS Web Site” n.d.).
    • Arches: Arches is an open-source software platform, freely available for cultural heritage organizations to deploy independently in order to manage their cultural heritage data (“Arches Project Open Source Data Management Platform” 2015).
    • OpenAtlas: An open-source database software developed primarily to acquire, edit, and manage research data from various fields of the humanities, such as history, archaeology, and cultural heritage, as well as related scientific data (OpenAtlas n.d.).
    • open-archaeo: An extensive list of open source archaeological software and resources (“Open-Archaeo” n.d.).

By utilizing these resources, archaeologists can ensure that their research aligns with Open Science, FAIR, and Open Source principles, ultimately enhancing the transparency, accessibility, and longevity of their work.

1.2 Introduction to Programming for Research

1.2.1 What is Programming?

Programming is designing and implementing instructions a computer can follow to perform specific tasks. It involves writing code in various programming languages that communicate with the computer’s hardware and software to solve problems, process data, and automate repetitive tasks. At its core, programming is about creating a sequence of steps, known as an algorithm, which achieves a desired outcome (“What Is Programming? And How To Get Started” 2024).
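As a minimal illustration (the measurements are invented), the following R snippet encodes such a sequence of steps, an algorithm that a computer can execute: take a set of measurements, convert their units, and report their mean.

```r
# A tiny algorithm: convert measurements and summarise them
lengths_cm <- c(12.5, 30.0, 7.8, 22.1)    # step 0: the input data (invented values)

lengths_m <- lengths_cm / 100             # step 1: convert centimetres to metres
mean_length_m <- mean(lengths_m)          # step 2: compute the mean

print(mean_length_m)                      # step 3: report the result
```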

1.2.2 Importance of Learning Programming

For researchers, learning programming offers several significant advantages:

  • Efficiency and Automation: Programming can help automate data collection, processing, and analysis, saving time and reducing human error. It also enables researchers to handle large datasets and complex calculations with ease (illustrated in the sketch after this list).

  • Reproducibility: Writing scripts to perform analysis allows other researchers to replicate experiments, thus ensuring results can be verified and reproduced, a fundamental aspect of scientific research.

  • Access to Powerful Tools: With programming skills, researchers can access a wide range of tools for data visualization, statistical analysis, machine learning, and simulation. These tools can enhance the scope and quality of research projects.
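To make the first point more concrete, the sketch below automates a repetitive task: reading every CSV file in a folder and combining them into a single table. The folder name is hypothetical, and the files are assumed to share the same columns.

```r
# Read every CSV file in a (hypothetical) folder and combine them into one table
files  <- list.files("survey_exports", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)        # read each file in turn
all_data <- do.call(rbind, tables)       # stack them into a single data frame

nrow(all_data)                           # how many records were collected in total
```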

1.2.3 Overview of Common Programming Languages

Several programming languages are popular in research due to their specific features and libraries (“Introduction to Programming Languages” 2018):

  • Python: Known for its readability and versatility, Python is widely used for data analysis, machine learning, and automation. It has extensive libraries such as NumPy, pandas, and matplotlib, particularly useful for data-intensive research.

  • R: A language developed explicitly for statistical computing and graphics, R is preferred in data science, bioinformatics, and fields requiring extensive statistical analysis. The Comprehensive R Archive Network (CRAN) offers thousands of packages that can handle various analytical tasks.

  • MATLAB: Commonly used in engineering and scientific research, MATLAB excels at numerical computing, simulation, and algorithm development. It is prevalent in fields like physics, engineering, and finance. In contrast to R and Python, MATLAB is not open source. A MATLAB license can be obtained through the campus license of the RUB.

  • JavaScript: A web development language also used in research to develop interactive data visualizations and web-based applications.

1.2.4 Concept of Research Software: Tools and Scripts

Research software comprises tools, libraries, and scripts that aid in conducting and managing research activities. It can range from simple data cleaning and preprocessing scripts to complex software packages for statistical analysis and simulation. Although any software used in research could be considered research software, advanced training focuses on programming or scripting skills, not graphical user interface (GUI) operations (e.g., creating a plot in R or Python rather than in Microsoft Excel).

  • Tools: Research software tools like Jupyter Notebooks, RStudio, PyCharm, Visual Studio Code, etc., offer integrated development environments (IDEs) tailored to a specific range of languages, enhancing productivity and facilitating code sharing and version control.

  • Scripts: Scripts are text files encoding programs written to perform specific tasks. In research, scripts are often used for data cleaning, model training, and result visualization. Scripts can even be used to generate and format an entire dataset (have a peek at how the course schedule has been built with R code). Researchers can share these scripts to ensure that analyses are reproducible and consistent (a minimal sketch follows below).
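As a minimal sketch of such a script (the file and column names are hypothetical), a short data-cleaning step in R might look like this:

```r
# Minimal data-cleaning script (hypothetical file and column names)
finds <- read.csv("finds_raw.csv", stringsAsFactors = FALSE)

# Drop rows without an identifier and standardise the material labels
finds <- finds[!is.na(finds$find_id), ]
finds$material <- tolower(trimws(finds$material))

# Save the cleaned table for the next step of the analysis
write.csv(finds, "finds_clean.csv", row.names = FALSE)
```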

Learning programming for research allows scientists to leverage these tools and scripts effectively, making research more efficient, reproducible, and insightful.

1.2.5 Markdown: a language in between

As already mentioned, Markdown is a markup language with minimal syntax, designed for creating formatted text using a plain-text editor (“Markdown Guide” n.d.). Markdown is easily readable and editable for us humans, just like plain text. At the same time, it is a formal, machine-readable language: its marks are instructions which, once compiled (or rendered) by the right software, produce the desired output in the form of a formatted text. An advanced text editor, such as Microsoft Word, allows you to do exactly the same (and more), but only through a sequence of interactions with the graphic user interface (GUI).

Writing in Markdown, or “writing Markdown code”, is straightforward and will only require some getting used to, especially if you are accustomed to advanced text editors. For example, instead of marking a text fragment in bold font through the usual steps, we would do it by typing ** (two asterisks) before and after the fragment.

To illustrate this, consider the rendering and source code of a markdown text:

“Live as if you were to die tomorrow. Learn as if you were to live forever.” - Mahatma Gandhi, Source: BrainyQuote

"**Live** *as if* you were to **_die tomorrow_**. **Learn** as if you were to **_live forever_**."

Mahatma Gandhi, Source: [BrainyQuote](https://www.brainyquote.com/citation/quotes/mahatma_gandhi_133995)

The simplicity of this language might seem unnecessary or even annoying, but it is actually a trait that opens a wide range of potential applications. Today, Markdown is indispensable for many practices related to open science and can offer a good entry point to other, more complex programming languages.

  • Obregon (2024)
  • “What Is Markdown? Syntax, Examples, Usage, Best Practices” (2021)
  • “Functional Documentation with Markdown and Version Control” (2017)
  • “Basic Syntax | Markdown Guide” (n.d.)

1.2.6 Where to learn (and keep learning)


Hands-on Practice