Establishing a B cell and T cell receptor data commons using next-generation sequencing
Vaccine Insights 2023; 2(8), 323–328
How did the AIRR Community come about?
FB: My wife, Jamie Scott, and I both worked at Simon Fraser University for many years. She studied HIV vaccine development. In 2013, she organized a symposium in association with the Antibody Society on defining and delimiting clones in B cell repertoires. At the time, everybody defined those things differently (they pretty much still do).
We recognized a need for a common language for these huge datasets that were just starting to be produced, thanks to next-generation high-throughput sequencing being applied to B cell or T cell repertoires. In 2015 in Vancouver, we had our first international meeting to bring together immunologists, computer scientists, and experts in ethics of data sharing, and the Adaptive Immune Receptor Repertoire (AIRR) Community was born. Now, our eighth international meeting is planned for June of next year in Europe.
One of the first things we did was come up with minimal standards for curating these types of data, as these datasets are huge, often including 1 million or even 10 million sequences for each sample. We wanted to establish the minimal data, or metadata, needed to accompany a dataset so that a researcher would know how the samples were processed, sequenced, and analyzed bioinformatically. Eighty columns of minimal standards were established as a common way of describing the samples. These minimal standards would make it easy to share these kinds of data among researchers, which was a key motivation for establishing the AIRR Community.
Sometimes it is hard to get people to agree. But by remaining a grass-roots, community initiative all these years, rather than taking a top-down approach with a few leaders dictating how things should be done, we have been able to establish community-approved protocols for curation and sharing of these data.
What is the goal and structure of the AIRR Community?
FB: The overarching goal is to be able to share data effectively. The main work of the AIRR Community is accomplished by seven working groups that anyone can join, which meet about once a month. We work by consensus. Once a standard has been developed, we publish it in a scientific publication, and everybody in the community has a chance to vote to approve that publication. That is one way we maintain community control of the standards that are produced. The working group I am most involved in is the common repository working group, which is building the AIRR Data Commons. Other working groups focus on aspects such as software and immunoglobulin and T cell receptor germline genes.
The AIRR Community is an open community. Anybody can join the working groups, and membership is free for students and postdocs. If Vaccine Insights readers would like to get involved, they can find us at The Adaptive Immune Receptor Repertoire Community of The Antibody Society [1].
How is the AIRR Data Commons helping researchers?
FB: One part of the vision is to be able to compare big vaccine studies easily. This has previously required downloading studies from the Sequence Read Archive (SRA) as raw data. These files are often not in very good shape: there might be bits missing or unexplained quirks. These raw data then must be annotated against germline immunoglobulin and T cell receptor genes, using several different algorithms with various assumptions.
In the AIRR Data Commons, we store data in a usable form, with comprehensive annotations and metadata in a common format; for example, gender should always appear in column fourteen. If you store the data according to the minimal standards in an AIRR-compliant manner, then anyone can access and query these data from different repositories.
The AIRR Data Commons has always embraced a distributed repository model. The data sets are huge and often have data risk constraints, so it is often best to keep data at the home institution. However, if it is all in the same format, you can either write a program to do queries or use a science gateway, such as iReceptor [2], which does those queries for you across the distributed repositories. Researchers can also access the AIRR Data Commons in a similar fashion through VDJServer [3]. Having to re-annotate the data and reformat the metadata so you can do statistics on it can take a long time. We are not doing anything that researchers cannot do on their own, but we are facilitating it so it can be done in a few hours rather than 6 months. The vision of the AIRR Data Commons is to facilitate that work, sharing immunological data in a common format.
What is the biggest challenge with this work?
FB: It can be hard to convince people to take that extra step to make their data more easily shareable within the whole community. We need to establish a data-sharing culture. Researchers talk about wanting to share their data, but sometimes it can be too hard to do in practice. The AIRR Community has developed these tools to easily add data into the AIRR Data Commons in a common format. With COVID, it was good to see researchers excited about making their data publicly available through the AIRR Data Commons.
We have also seen scientists at commercial organizations starting to use the AIRR data-compliant format. They might not put the data into the AIRR Data Commons and make it public, but this common format makes it easier for their researchers to query the public data and compare it to their own.
How are these large data sets curated and made accessible to researchers around the world?
FB: There are about 5 billion receptor sequences in the AIRR Data Commons from around 80 studies. Although the goal is to have geographically distributed repositories, most of those 5 billion are in two repositories: the iReceptor Public Archive at Simon Fraser University and VDJServer, an NIH-funded group at the University of Texas Southwestern Medical Center. These two groups have curated the data from public sources, reformatted the metadata, and re-annotated it according to an annotation program. Motivated by the pandemic, a large amount of data from COVID studies was curated in collaboration with COVID researchers. We also have a repository at the German Cancer Research Center (Deutsches Krebsforschungszentrum, DKFZ) and one at the University of Münster, Germany. We are pushing for different groups to develop their own repositories.
On the iReceptor gateway, the researcher will not see that these data are coming from different repositories. It is all federated into an integrated display. You can do a quick search on a CDR3, one of the important pieces of the receptor molecule, for example, and see whether that CDR3 shows up in other diseases or other individuals, or you can search the metadata for repertoires from specific diseases, such as HIV or flu.
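As a rough illustration of the federated CDR3 search described here, the sketch below sends one query to several repositories and merges the answers into a single result list. It is not the gateway's implementation: the repository names, the `fetch` callable, the sample records, and the field names (`junction_aa`, `disease`, `subject_id`) are assumptions made for the example, and the JSON-style filter is only modeled on the general shape such queries tend to take.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
# A fetcher takes (repository_name, query) and returns the matching records.
Fetcher = Callable[[str, dict], List[Record]]


def cdr3_query(cdr3_aa: str) -> dict:
    """Build one query object that every repository receives unchanged."""
    return {
        "filters": {
            "op": "=",
            "content": {"field": "junction_aa", "value": cdr3_aa},
        }
    }


def federated_search(repos: Iterable[str], query: dict, fetch: Fetcher) -> List[Record]:
    """Send the same query to each repository and merge the answers."""
    merged: List[Record] = []
    for repo in repos:
        for record in fetch(repo, query):
            # Tag each hit with its origin; a gateway could hide this from the user.
            merged.append({**record, "repository": repo})
    return merged


# In-memory stand-ins for two repositories (hypothetical data).
FAKE_DATA: Dict[str, List[Record]] = {
    "repo_a": [
        {"junction_aa": "CASSLGTDTQYF", "disease": "HIV", "subject_id": "s1"},
    ],
    "repo_b": [
        {"junction_aa": "CASSLGTDTQYF", "disease": "influenza", "subject_id": "s7"},
        {"junction_aa": "CASSPGQGNEQFF", "disease": "influenza", "subject_id": "s8"},
    ],
}


def fake_fetch(repo: str, query: dict) -> List[Record]:
    """Evaluate the single '=' filter against the in-memory data."""
    content = query["filters"]["content"]
    return [r for r in FAKE_DATA[repo] if r.get(content["field"]) == content["value"]]


hits = federated_search(FAKE_DATA, cdr3_query("CASSLGTDTQYF"), fake_fetch)
diseases = sorted({h["disease"] for h in hits})
```

Because every repository stores the same fields under the same names, the same query object works everywhere; swapping `fake_fetch` for a function that makes real network requests would not change `federated_search`.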
What projects are you excited about right now?
FB: We are collaborating with Monica Westley, Founder of the(sugar)science, who is working tirelessly on getting type 1 diabetes researchers to share their data. She has convinced a lot of large labs to share their data publicly, and iReceptor is working with them to put those into the AIRR Data Commons. You could compare public versus private clonotypes at the push of a button or determine whether patients with worse outcomes are characterized by particular clones. To be able to do that, you have to look at a large group of diverse individuals and compare data beyond type 1 diabetes to other autoimmune diseases.
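The public-versus-private comparison mentioned here can be sketched as a small set computation. The sketch assumes a clonotype is identified by an identical CDR3 amino acid string and that "public" means carried by at least two individuals; both are simplifying assumptions for illustration, since definitions of a clone vary between groups. The subject identifiers and sequences are made up.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def split_public_private(
    repertoires: Dict[str, Set[str]], min_subjects: int = 2
) -> Tuple[Set[str], Set[str]]:
    """Return (public, private) clonotype sets.

    A clonotype counts as public when at least `min_subjects`
    individuals carry it; every other clonotype is private.
    """
    carriers: Dict[str, Set[str]] = defaultdict(set)
    for subject, clonotypes in repertoires.items():
        for clonotype in clonotypes:
            carriers[clonotype].add(subject)
    public = {c for c, subjects in carriers.items() if len(subjects) >= min_subjects}
    private = set(carriers) - public
    return public, private


# Hypothetical repertoires keyed by subject.
repertoires = {
    "s1": {"CASSLGTDTQYF", "CASSPGQGNEQFF"},
    "s2": {"CASSLGTDTQYF"},
    "s3": {"CASRDRGYEQYF"},
}
public, private = split_public_private(repertoires)
```

With data pooled from many individuals and diseases, the same function could be run per disease group to see which shared clonotypes are specific to one condition.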
Another aspect is that single-cell work is becoming popular. Right now, most data sets are based on bulk sequencing of 1 million to 10 million sequences per sample. The AIRR Data Commons web Application Programming Interface (API) now allows for single-cell immune profiling, which includes sequences of the two chains of the receptor, the gene expression data, and some of the phenotypic markers from every single cell. We only have three single-cell datasets in the AIRR Data Commons right now, but with these, you could search the different samples to find those expressing certain genes at a high level, and then correlate these with particular immune receptors.
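To make the single-cell search concrete, here is a minimal sketch that picks out cells expressing a gene above a threshold and tallies the paired receptors those cells carry. The `Cell` structure, the threshold, and the expression values are hypothetical and do not reflect how the AIRR Data Commons stores or serves single-cell data.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Cell(NamedTuple):
    cell_id: str
    chain_1_cdr3: str             # e.g. heavy or beta chain
    chain_2_cdr3: str             # e.g. light or alpha chain
    expression: Dict[str, float]  # gene name -> expression value


def receptors_in_high_expressers(
    cells: Iterable[Cell], gene: str, threshold: float
) -> Counter:
    """Count paired receptors among cells whose `gene` value meets the threshold."""
    counts: Counter = Counter()
    for cell in cells:
        if cell.expression.get(gene, 0.0) >= threshold:
            counts[(cell.chain_1_cdr3, cell.chain_2_cdr3)] += 1
    return counts


# Made-up cells for illustration only.
cells = [
    Cell("c1", "CASSLGTDTQYF", "CAVRDSNYQLIW", {"GZMB": 9.1, "IL7R": 0.2}),
    Cell("c2", "CASSLGTDTQYF", "CAVRDSNYQLIW", {"GZMB": 7.4}),
    Cell("c3", "CASSPGQGNEQFF", "CAASGGSYIPTF", {"GZMB": 0.3, "IL7R": 5.0}),
]
top = receptors_in_high_expressers(cells, gene="GZMB", threshold=5.0)
```

A receptor pair that recurs among the high-expressing cells, as the first pair does here, is the kind of correlation between expression state and receptor identity that single-cell data makes visible.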
With single-cell data, each cell has a 25,000-count matrix associated with it. It is going to be a big challenge to curate that much data for each study, but it holds exciting potential. The ability to look at different types of cells producing a receptor and the physiological state of each of those cells can help determine the important groups of cells for diagnostics or therapeutics. We expect such single-cell studies to be a growth area for the AIRR Data Commons and for immunogenetics researchers.
What are your hopes for the future of AIRR?
FB: I want to share and integrate datasets, get more data faster, and get everybody to agree that gender goes in column fourteen!
We are working to integrate information in the AIRR Data Commons with databases that curate germline genes. The information in each individual’s expressed B cell and T cell receptor repertoires can be used to infer genetic polymorphisms in their immunoglobulin (for B cells) and T cell receptor germline genes. We are also working to link with the Immune Epitope Database (IEDB), which is a large database of epitopes and antigen specificity data for many of these B and T cell receptors. The big dream is to integrate from germline polymorphisms all the way up to phenotype, including disease phenotype, in order to predict propensity for diseases and understand the molecular underpinnings of disease.
We are working with the International Union of Immunological Societies to expand the view of the AIRR Community initiative for shared metadata to other immunological data types, such as flow and microbiome. The real vision is to make it easier to share and analyze all of these data types and get a complete picture, rather than having them in silos or difficult-to-navigate data lakes.
We have also talked about having some sort of digital object identifier (DOI) or stamp on each data set, to make it possible to count how many times your data has been downloaded or used in an AIRR-compliant analysis program. Right now, if somebody uses your data, you might not know about it for 2–3 years, until it is used in a publication. There is great value in other people using your data, and it can be useful to know and be rewarded (e.g., by tenure committees or funding institutions) when it is happening.
What keeps you motivated?
FB: I am an evolutionary geneticist by training; I worked in beetles, toads, and guppies until about 15 years ago when my wife got me interested in human immunogenetics. Working in an area that could have positive outcomes for human health and patient care is both exciting and satisfying.
We are currently in the Wild West phase in terms of the analysis of these immune cell receptor repertoires. We do not know what information is in there, or how important that information will be, but there is a lot of potential. It is exciting to be working with one of the first groups to look at this in a very systematic way, combining data across labs, institutions, and diseases.
Felix Breden is trained broadly in evolutionary genetics, including population genetics, behavioral analysis, and molecular genetics. In 2003, Dr Jamie Scott introduced him to the wonderful world of immunology and immunogenetics, and since then much of his effort has been dedicated to developing community and bioinformatic resources for studying the fascinating evolutionary dynamics of the adaptive immune system and how understanding these systems can lead to new therapies and diagnostics. The resources he is developing for curating, analyzing, and sharing immunogenetic data, through the iReceptor Project and the Adaptive Immune Receptor Repertoire (AIRR) Community, will facilitate research in cancer and in autoimmune and infectious diseases.
Simon Fraser University
AIRR Community Executive Sub-committee
1. The Adaptive Immune Receptor Repertoire Community of The Antibody Society.
2. iReceptor.
3. VDJServer Documentation.
Authorship & Conflict of Interest
Contributions: The named author takes responsibility for the integrity of the work as a whole, and has given his approval for this version to be published.
Disclosure and potential conflicts of interest: The author has no conflicts of interest.
Funding declaration: The author has received NSERC Operating Grants.
Article & copyright information
Copyright: Published by Vaccine Insights under Creative Commons License Deed CC BY NC ND 4.0 which allows anyone to copy, distribute, and transmit the article provided it is properly attributed in the manner specified below. No commercial use without permission.
Attribution: Copyright © 2023 Felix Breden. Published by Vaccine Insights under Creative Commons License Deed CC BY NC ND 4.0.
Article source: This article is based on an interview with Felix Breden carried out on Jun 13, 2023.
Interview held: Jun 13, 2023; Revised manuscript received: Aug 22, 2023. Publication date: Aug 30, 2023.