Information and data architecture in Ecology: Folder structure, metadata and project management through GitHub
This serves as an open-access repository for the forthcoming lecture entitled 'Information and data architecture in environmental sciences: Folder structure, metadata and project management through GitHub'.
Within this repository, you can find all the course materials, references, and lecture handouts. Additionally, I envision this repository serving as a practical demonstration of how to build and structure a course from the ground up in an open and reproducible environment.
Please fill in the pre-lecture form, which you can find here.
Throughout my academic journey in the field of ecology, I have had the privilege of participating in numerous research projects and meeting extraordinary collaborators, many of whom became friends. Frequently, I found myself actively seeking out discussions about ecological research, with a particular emphasis on the computational dimension of our field. These interactions allowed me to explore and engage with the intricate aspects of ecological science, broadening my perspective and deepening my understanding of this dynamic discipline. Once I had the opportunity to debate with my classmate Yves. He was noticeably vexed by an assignment for our master's program, and his frustration led to a moment that was rather comical for him and very inspirational for me: “I don't understand this. When I decided to do ecology I thought I would be in the jungle or in a natural ecosystem collecting data or observing pandas. I want action, not programming and statistics”. Since then, I have met amazing ecologists, highly skilled in fieldwork but very often frustrated with the amount of statistical knowledge they had to cope with. Over the course of four years dedicated to ecology, I began to discern a broader difficulty and frustration among junior ecologists in dealing with the computational aspect of our field.

The way I approach this observation is tied to advancements and changes that took place during the last decade. Worldwide data volume doubled nine times between 2006 and 2011, and this trend is expected to continue exponentially this decade (Farley et al. 2018). Ecology, as an earth science, has entered the era of big data as well. Ecological data are increasing due to a) continuous data accumulation from earth-observing systems (satellites), b) aggregation of small scientific projects into global ones, with collaborations expanding across continents (NutNet, DRAGNet), c) increasing interest in, and funding for, long-term ecological monitoring networks (LTER, LFDP), and d) citizen science (e.g. iNaturalist, GBIF). This overaccumulation of data poses several challenges for ecologists, whose background is mainly in biology rather than computer science (Strømme et al. 2022). As a consequence, traditional and popular approaches to statistical analysis and data cleaning are slowly becoming either tedious or unusable: viewing and removing NAs in a simple dataset using spreadsheets, distributing large spreadsheets via e-mail, or working on a local computer with low computing power.

Additionally, advancements in artificial intelligence (AI) and the public availability of AI tools to everyone (such as Bard and ChatGPT) pose further challenges to ecologists. To get a head start, we should openly address the issues emerging from AI advancements and find out how these tools can be utilized to improve ecologists’ workflow (Poisot et al. 2023). Frequently, ecologists, in their pursuit of understanding the natural world, tend to give less attention to improving their workflow and scripts, even though their results depend on code written in programming languages (e.g. R, Python or Julia). From my point of view, the challenges we encounter are not solely related to statistical analyses but also to project organization principles. With this course, I aim to discuss and point out some common mistakes we make in our everyday workflow, as well as to introduce some tools and methods that can ease some of the arising problems.
First, I will discuss ways to structure a project within folders, then address some principles relevant to file naming, and finally I will introduce GitHub as a tool for collaborative and reproducible science.
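As a small preview of the folder-structure part, the sketch below shows one way to create a basic project skeleton directly from the R console. The folder names are only an example of a common convention, not a prescription.

```r
# Minimal example of a project skeleton; adapt the folder names to your project.
folders <- c("data/raw", "data/clean", "scripts", "outputs/figures", "docs")

# recursive = TRUE creates intermediate directories,
# showWarnings = FALSE keeps reruns quiet if the folders already exist.
for (f in folders) {
  dir.create(f, recursive = TRUE, showWarnings = FALSE)
}
```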
In order to set up the R language environment on your computer, you first have to install R. To get started with R, look here.
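A quick way to confirm that the installation worked is to open an R console and check the version you are running:

```r
# Prints the installed R version, e.g. "R version 4.3.1 (...)".
R.version.string

# Reports the platform and attached packages as well,
# which is handy when asking collaborators for help.
sessionInfo()
```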
Once you have installed R, you should set up an integrated development environment (IDE). RStudio (developed by the company now known as Posit) has served this purpose for many years and remains a great choice. If you have an educational or Pro GitHub account, you can now (Oct 2023) benefit from GitHub Copilot through the RStudio interface. A great alternative that is gaining more attention is Microsoft Visual Studio Code. Unlike Visual Studio, Visual Studio Code (yes, they are two separate platforms) is free. I would highly recommend giving it a try because of its extensions and the integration of GitHub Copilot.
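If you decide to try Visual Studio Code, its R extension relies on a couple of helper packages installed from within R. Treat the snippet below as a sketch of the commonly suggested setup rather than an official guide:

```r
# 'languageserver' powers autocompletion and diagnostics for the R extension;
# 'httpgd' (optional) provides an interactive plot viewer in the editor.
install.packages("languageserver")
install.packages("httpgd")
```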
Once you have set up R and the IDE of your choice, it is time to set up GitHub. First, you should create your personal account here. As a student, you have free access to GitHub's premium tier, which can be obtained with your academic email address. I highly recommend taking advantage of these premium services as soon as you can, since they let you follow courses and give you free access to GitHub Copilot. Remember that GitHub can also serve as a portfolio or CV. Very often, when you apply for jobs, you can showcase your coding skills by linking your GitHub account in your CV. For many job openings, having experience with Git and GitHub is considered a significant advantage. That definitely goes beyond its intended purpose, but in a highly competitive environment, it's always good to be prepared. In case you are having trouble, you can take a look at this video on how to sign up.
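Once your account exists, Git on your computer needs to know who you are and how to authenticate with GitHub. One possible route from within R is the `usethis` and `gitcreds` packages; the name and e-mail below are placeholders you should replace with the ones tied to your GitHub account.

```r
# Install the helper packages once.
install.packages(c("usethis", "gitcreds"))

# Tell Git your name and e-mail (placeholders shown here).
usethis::use_git_config(user.name = "Jane Doe",
                        user.email = "jane.doe@university.edu")

# Opens the browser to create a personal access token,
# then stores it so R can push to GitHub on your behalf.
usethis::create_github_token()
gitcreds::gitcreds_set()
```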
Assuming you have signed up for GitHub and claimed your account, you can now download the desktop application. GitHub Desktop offers an easy-to-use and navigable working environment that doesn't require advanced coding skills. You can find a nice introduction on how to install GitHub Desktop and set up your first README file here.
Setting up a local GitHub repository is a straightforward process that can be accomplished through the GitHub Desktop interface in just a few simple steps. However, I'd like to highlight an alternative approach that could prove useful to many. Once you begin using GitHub to organize your projects, especially projects involving coding, it can be hard to go back to basic cloud services. Nevertheless, there may be situations where your collaborators are unable to join you on GitHub, even though it could save a significant amount of time. In such cases, you may find yourself collaborating via cloud services like Dropbox or Google Drive, or even via email. If you are still collaborating primarily through email, I'm really sorry, but I won't be able to help. However, if you are using cloud services, there's a 'hack' that can change the way you collaborate: you can place your local repository inside the folder of your chosen cloud service (e.g., iCloud, Google Drive). This allows other users to access the files even if they are not using GitHub. Additionally, it provides an extra layer of security: if something goes wrong with either the cloud servers or GitHub, you will still have access to your project.
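As a rough sketch of this setup, assuming the `usethis` package and a Dropbox folder at the example path shown below (both the path and the project name are placeholders), you could create a version-controlled project inside the synced folder like this:

```r
library(usethis)

# Create a project folder inside the cloud-synced directory.
create_project("~/Dropbox/shared-ecology-project", open = FALSE)

# Point usethis at the new folder and initialise Git there.
proj_set("~/Dropbox/shared-ecology-project")
use_git()

# use_github()   # run this later if/when you also want a remote on GitHub
```

Collaborators without a GitHub account then see the files through the cloud service, while you keep the full Git history locally (and, optionally, on GitHub).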
Here, you can access a list of references and a compilation of courses I have previously attended, which have served as an inspiration for the content of this lecture.
- A simple kit to use computational notebooks for more openness, reproducibility, and productivity in research and the corresponding GitHub repository with free code!
- Not just for programmers: How GitHub can accelerate collaborative and reproducible research in ecology and evolution
- Evaluating the popularity of R in ecology
- Close to open—Factors that hinder and promote open science in ecology research and education
- Open science, reproducibility, and transparency in ecology
- Low availability of code in ecology: A call for urgent action
- The future of ecological research will not be (fully) automated