As glycoscience advances, labs will increasingly want to ask questions about glycosylation sites on a protein or the structure of a sugar, says Raja Mazumder, a bioinformatician at George Washington University. They might ask, for example, whether there are glycosyltransferases that are expressed in the liver but not in the heart, or which ones are overexpressed by a factor of three in more than two cancers. Such questions require infrastructure building, he says, because right now there is no mechanism that allows such queries. But he and others are building such capabilities. Mazumder and William York at the University of Georgia are starting to build a glycoscience informatics portal.
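To make the two example questions concrete, here is a minimal Python sketch of what such queries could look like once the data sit in one place. It is an illustration only, not the portal's interface: the gene names (`GT_A`, `GT_B`, `GT_C`), the cancer labels, the numbers and both function names are made up for this example.

```python
from collections import defaultdict

# Toy, made-up records: (gene, tissue, is_expressed). Not real measurements.
EXPRESSION = [
    ("GT_A", "liver", True),
    ("GT_A", "heart", False),
    ("GT_B", "liver", True),
    ("GT_B", "heart", True),
    ("GT_C", "liver", False),
    ("GT_C", "heart", True),
]

# Toy, made-up records: (gene, cancer, fold change versus normal tissue).
FOLD_CHANGE = [
    ("GT_A", "cancer_1", 3.5),
    ("GT_A", "cancer_2", 4.0),
    ("GT_A", "cancer_3", 3.1),
    ("GT_B", "cancer_1", 5.0),
    ("GT_B", "cancer_2", 1.2),
    ("GT_C", "cancer_1", 3.0),
    ("GT_C", "cancer_2", 3.0),
    ("GT_C", "cancer_3", 6.2),
]


def expressed_in_but_not(records, present, absent):
    """Genes expressed in `present` and not recorded as expressed in `absent`.

    Note: a gene with no record at all for `absent` also passes this filter.
    """
    tissues = defaultdict(set)
    for gene, tissue, is_expressed in records:
        if is_expressed:
            tissues[gene].add(tissue)
    return sorted(g for g, t in tissues.items() if present in t and absent not in t)


def overexpressed_in_many(records, min_fold=3.0, min_cancers=3):
    """Genes with fold change >= min_fold in at least min_cancers cancers.

    "More than two cancers" is read here as at least three.
    """
    cancers = defaultdict(set)
    for gene, cancer, fold in records:
        if fold >= min_fold:
            cancers[gene].add(cancer)
    return sorted(g for g, c in cancers.items() if len(c) >= min_cancers)


liver_not_heart = expressed_in_but_not(EXPRESSION, "liver", "heart")
broadly_overexpressed = overexpressed_in_many(FOLD_CHANGE)
```

In practice the hard part is not the filter itself but getting expression, tissue and disease data to share identifiers, which is where the ontologies come in.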
Mazumder wants to leverage existing ontologies in the developer community to build systems that can be queried on a large scale. For example, Mazumder is working with Cathy Wu at Georgetown University, who is developing the Protein Ontology. Such ontologies are collected by, among others, the non-profit OBO Foundry. To allow flexible querying, the computational resources will draw on different ontologies: ones that relate to glycans, genes, proteins, tissues, diseases and more.
Ontologies are part of the team’s effort to build application programming interfaces (APIs) that expose the data in a given database to incoming queries. Given how complex sugars are, the informatics framework has to be well organized for both human and machine-based querying, says Mazumder.
When using the resource, a researcher will receive results that also document the search process itself, such as the version of the queried database. “You need to be able to tell where you got that information from,” says Mazumder. Tracking data provenance matters especially in an age when databases continuously integrate information emerging in the literature.
For the Food and Drug Administration, Mazumder is developing computational standards for high-throughput sequencing, which he also wants to apply to glycoscience. His ‘biocompute object’ captures the computational workflow a lab used to generate results: the software used, the databases queried and their versions, and identifiers of data inputs and outputs. These biocompute objects are intended to help regulatory scientists interpret submitted work. They can also help scientists generally see if, for example, the version of software they used worked as it should, says Mazumder.
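As a rough illustration of the kind of record described here, the following Python sketch bundles the software used, the databases queried with their versions, and the input and output identifiers into one object that can be written out as JSON. It is a sketch under stated assumptions, not the actual biocompute object format: the class names, field names, tool names, versions and identifiers are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VersionedResource:
    """A piece of software or a database, pinned to a version (hypothetical)."""
    name: str
    version: str


@dataclass
class WorkflowRecord:
    """Hypothetical stand-in for a workflow record; field names are assumptions."""
    software: list      # VersionedResource entries for the tools that were run
    databases: list     # VersionedResource entries for the databases queried
    input_ids: list     # identifiers of the data that went in
    output_ids: list    # identifiers of the results that came out

    def to_json(self):
        # asdict() also converts the nested VersionedResource entries to dicts.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder names, versions and identifiers, invented for this example.
record = WorkflowRecord(
    software=[VersionedResource("example-aligner", "1.2.0")],
    databases=[VersionedResource("example-glycan-db", "2024-01")],
    input_ids=["sample-001"],
    output_ids=["result-001"],
)

record_json = record.to_json()
```

Because the record is plain JSON, a second lab or a reviewer could read off exactly which database version produced a result without rerunning anything.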
Too often labs use computational tools without benchmarking them, says Mazumder. “It would be unthinkable for a wet-lab scientist to not have a positive and negative control,” he says. In informatics, developers benchmark their software but users often do not have these habits. “They don’t even know: if I don’t find anything, is it because my software did not run well or not?”
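The wet-lab habit of running controls carries over to software fairly directly. The Python sketch below wraps a tool so that it must find a hit in an input known to contain one (positive control) and find nothing in an input known to be empty (negative control) before its answer on the real query is trusted. The tool here is a stand-in chosen for illustration: a simplified search for the N-glycosylation sequon (Asn, any residue except Pro, then Ser or Thr). The function names and test sequences are assumptions for this example, not anything Mazumder's group uses.

```python
import re

# Simplified N-glycosylation sequon: N, then any residue except P, then S or T.
# The lookahead lets overlapping sites be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 0-based start positions of candidate sequons (stand-in tool)."""
    return [m.start() for m in SEQUON.finditer(sequence)]


def run_with_controls(tool, query, positive, negative):
    """Run `tool` on `query` only after both controls behave as expected."""
    if not tool(positive):
        raise RuntimeError("positive control gave no hit; the tool may not be working")
    if tool(negative):
        raise RuntimeError("negative control gave a hit; the tool may be over-calling")
    return tool(query)


POSITIVE_CONTROL = "AANGSAA"  # contains N-G-S, so a hit is expected
NEGATIVE_CONTROL = "AANPSAA"  # N-P-S is excluded, so no hit is expected

# An empty result here can be read as "nothing found", because the controls passed.
hits = run_with_controls(find_sequons, "MKLAAGG", POSITIVE_CONTROL, NEGATIVE_CONTROL)
```

With the controls in place, an empty result answers the question in the quote: the software ran as expected, and there was simply nothing to find.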
As labs move to big data analysis in genomics and, eventually, in glycoscience, such checks become ever more important, says Mazumder. In his view, biocompute objects will help glycobiology researchers communicate with one another about their results, such as where on a protein they found a sugar with a given structure. More generally, they will give glycoscientists a better way to connect the available sugar resources as they pursue their questions of interest.