Thursday, 14 February 2013

Making Data Public (and a small matrix-related rant)

In a world of constant debate over the merits and disadvantages of Open Access journals and science, we are often bombarded with blogs and posts about it. I am generally a silent proponent of Open Access journals: I agree that it is important, but I am not particularly versed in all of the politics, so I tend to keep quiet. That being said, I have recently stumbled upon a related issue that has affected me in the last few weeks: the importance of making your data public.

Although my primary research interest is in pterosaurs, I am currently writing up a manuscript from my undergraduate thesis, which was on the ceratopsian dinosaur Centrosaurus. Much to my surprise, the most recent discussion with my former supervisor (and the senior/co-author) went in a different direction than I was expecting: he wanted me to develop a character matrix and do a phylogenetic analysis. Now, I've never done this before. Although I've taken several courses and have a good basic understanding of the concept, I've never actually developed a matrix and done my own analysis. After discussing it with him, we decided that I would use several published matrices and merge them together, taking several characters from each matrix.

Looking through recently published matrices, I came across the Farke et al. (2011) paper in which Spinops sternbergorum was described. I sent Andy Farke an email, and he was very happy to share the matrix and character descriptions (although those are available from the supplementary information of the paper), and he sent along the .nex file. Super helpful, because then I already had it as a matrix that I could open, copy, paste, edit, etc. Thanks so much for that, Andy! Then I had to add some taxa that were published more recently, like Xenoceratops foremostensis (Ryan et al. 2012) and Pachyrhinosaurus perotorum (Fiorillo and Tykoski 2012). The Xenoceratops matrix was published directly in the paper as a table (available, but not as easy to follow the correct character number), while the P. perotorum matrix was found in the supplementary material (in a more easily viewable format). The best, however, came when I looked up a paper on Anchiceratops (Mallon et al. 2011). On the downside, the paper is published in a non-open access journal, which means not everyone can access it. On the BIG upside, included in the supplementary material is the actual .nex matrix file, which allows you to see all the characters, states, and taxa, right in the format you want. It is soooo much easier and quicker when these files are available at your fingertips, without having to send many emails to people asking for them. There are several other (mainly older, to be fair) phylogenetic papers that don't post the matrix or the characters used, which makes it really difficult to figure out how they've done things.
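To give a sense of why a .nex file is so much handier than a printed table, here is a minimal sketch of pulling the taxa and their character states out of one with a few lines of Python. It assumes a very simple, non-interleaved NEXUS file with one taxon per line, taxon names without spaces, and no polymorphic states in brackets; real files are often messier. The function name `read_nexus_matrix` and the three-taxon example matrix are made up for illustration and are not taken from any of the papers above.

```python
def read_nexus_matrix(text):
    """Return {taxon: state_string} from the MATRIX block of a simple NEXUS file.

    Assumes one taxon per line and no interleaving (a simplification).
    """
    rows = {}
    in_matrix = False
    for line in text.splitlines():
        line = line.strip()
        if not in_matrix:
            # Skip everything until the MATRIX keyword.
            if line.upper().startswith("MATRIX"):
                in_matrix = True
            continue
        ended = line.endswith(";")          # a semicolon closes the block
        line = line.rstrip(";").strip()
        if line:
            taxon, states = line.split(None, 1)
            rows[taxon] = states.replace(" ", "")
        if ended:
            break
    return rows


# Hypothetical matrix: 3 taxa, 4 characters, "?" for missing data.
example = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=4;
  FORMAT DATATYPE=STANDARD MISSING=? GAP=-;
  MATRIX
    Taxon_A 0101
    Taxon_B 1?01
    Taxon_C 0011
  ;
END;
"""

matrix = read_nexus_matrix(example)
for taxon, states in matrix.items():
    print(taxon, states)

# Pull out one character (column) across all taxa, e.g. character 2.
character_2 = {taxon: states[1] for taxon, states in matrix.items()}
print(character_2)
```

With a table printed in a PDF, every one of those states has to be retyped by hand and the character numbers checked by eye, which is exactly where mistakes creep in.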

Unrelated to my story, and covered well in other places so I won't go into detail here, is the wonderful story of a recent publication that used previously published data in a huge analysis. Larson and Currie (2013) were able to study over 1000 small theropod teeth from southern Alberta, using both previously published data and new data. A study of this scale would clearly have taken a lot longer if they had had to sit down and do all the measurements on 1183 small teeth. Fortunately for them (and us), they were able to spend their time analysing the data already available, rather than painstakingly measuring every tooth. They determined that the number of small theropods present in this area has been greatly underestimated, and that many species are known only from teeth. Cool! For more information, you can check out this blog post by Jon Tennant.

Take home message: make your data open to everyone! For the most part, I have dealt with people who are extremely open and willing to email me stuff if it isn't posted. But wouldn't it be better if you didn't have to email every time? If you could just go online and access it? It shouldn't be some top-secret information. Post it!

And finally, a small rant on matrices. I know that there are disagreements about characters, so not every published matrix is going to use exactly the same characters, but WHY do people insist on changing character states around in a way that just makes things difficult?? For example, there are several characters in Fiorillo and Tykoski (2012) that are just different enough from all other matrices I've looked at that you can't just directly copy the states. Why is it necessary to switch from comparing the postorbital horncore height to the basal skull length (which every other paper does) to comparing it to the length of the face? Or to change the numbers slightly, so that in one paper a character is considered to be long if it's 0.8 or more, while in another it's 0.75? Pretty sure that is unnecessary! Make it easy, people!
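As a rough illustration of why those slightly different cutoffs matter when merging matrices, here is a small Python sketch that scores the same ratio under a 0.8 cutoff and a 0.75 cutoff. The taxon names and ratio values are invented for the example (they are not real measurements), and the helper names `score_ratio` and `score_all` are hypothetical.

```python
def score_ratio(ratio, cutoff):
    """Return state 1 ("long") if the ratio meets the cutoff, else state 0 ("short")."""
    return 1 if ratio >= cutoff else 0


def score_all(ratios, cutoff):
    """Score every taxon's ratio against a single cutoff."""
    return {taxon: score_ratio(r, cutoff) for taxon, r in ratios.items()}


# Invented ratios for three placeholder taxa, NOT real measurements.
ratios = {"Taxon A": 0.70, "Taxon B": 0.78, "Taxon C": 0.85}

matrix_1 = score_all(ratios, 0.80)  # paper using "long if 0.8 or more"
matrix_2 = score_all(ratios, 0.75)  # paper using "long if 0.75 or more"

# Taxa whose state differs between the two scoring schemes.
conflicts = [taxon for taxon in ratios if matrix_1[taxon] != matrix_2[taxon]]
print(conflicts)  # ['Taxon B']
```

Any taxon that falls between the two cutoffs gets a different state depending on whose definition you follow, so its states can't simply be copied across and it has to be rescored from the original measurements.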

References:
Farke, A.A., et al. 2011. A new centrosaurine from the Late Cretaceous of Alberta, Canada, and the evolution of parietal ornamentation in horned dinosaurs. Acta Palaeontologica Polonica 56: 691-702. Freely accessible here.
Fiorillo, A.R. and Tykoski, R.S. 2012. A new Maastrichtian species of the centrosaurine ceratopsid Pachyrhinosaurus from the North Slope of Alaska. Acta Palaeontologica Polonica 57: 561-573. Freely accessible here.
Larson, D.W. and Currie, P.J. 2013. Multivariate analyses of small theropod dinosaur teeth and implications of paleoecological turnover through time. PLoS ONE 8: e54329. Freely accessible here.
Mallon, J.C., et al. 2011. Variation in the skull of Anchiceratops (Dinosauria, Ceratopsidae) from the Horseshoe Canyon Formation (Upper Cretaceous) of Alberta. Journal of Vertebrate Paleontology 31: 1047-1071.
Ryan, M.J., et al. 2012. A new ceratopsid from the Foremost Formation (middle Campanian) of Alberta. Canadian Journal of Earth Sciences 49: 1251-1262.

7 comments:

  1. As a layman, one good reason for the author staying a gatekeeper of the data is they get to ask people who want the data what they want it for, what kind of work they are doing... you know basic job related socialization that can cross-pollinate ideas and validate/invalidate assumptions.

    I don't think this is a good enough reason since publishing and conferences are supposed to do the same thing, but it's de-centralized and informal which can be good things.

    Just a layman's opinion.

  2. I see what you're getting at, but I've made more connections from the single conference I've been to than I have asking people for their data. I think in general people/scientists have gotten pretty good at emailing each other about things like collaboration.

    Another thing I just thought of about your point would be that this is where a lot of the blogging, and activity on social media comes from. By far (other than actually meeting people at conferences) the most useful form of networking for me has been a) Twitter; and b) my blog. I think that's a much better way of getting your name out there and doing some socialisation and networking. It's open to everyone, AND you tend to be able to follow lots of different people in different disciplines. Most of the people I follow on Twitter are palaeo related people!

  3. Nice article Liz! Yep, it can be frustrating all of this - imagine how Ross Mounce feels when his PhD involves extracting this information from thousands of matrices!

    One solution is just to have .nex files all freely available in their standard formats. I don't know if you've seen this but Graeme Lloyd has pretty much bossed all of Palaeontology and done this: http://www.graemetlloyd.com/matrdino.html - pretty sweet! Maybe your data lies in there?

    Thanks for the call out too :)

  4. Holy crap Jon, I wish I had seen that website before! That would have been extremely helpful, and I'll remember it from now on. Unfortunately, I still can't figure out why I'm having a problem with the matrix that I painstakingly made... Even with a very similar one in the right format from Graeme Lloyd's site!

  5. Sadly pterosaur workers in particular are utterly awful at making their matrices available. I have great sympathy for you.

    We *easily* have the technology, infrastructure and mechanisms to share this data post-publication in appropriate immediately re-usable data formats, yet few seem to do it. It's utterly lamentable.

    I hope you have more luck in getting this data than I have.

  6. See also the Panton Principles for open data in science:

    Science is based on building on, reusing and openly criticising the published body of scientific knowledge.

    For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.

    By open data in science we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain

    http://pantonprinciples.org/


  7. Pterosaur workers are bad? Hmmm... If ever I do anything with phylogeny of pterosaurs, I will try to be good and post it online!

    I completely agree with everything you said about the open data stuff, I just kind of didn't mention it in this post. I guess I was thinking more about why, as a scientist, it is good to put your data/matrices somewhere accessible online. Even if you don't care about the public, and don't think they need to be able to see it, think about how great it is for science.
