Wikipedia:Wikipedia Signpost/2009-05-18/Chemistry data

Chemistry data

WikiChemists and Chemical Abstracts announce collaboration

Vinegar like most people have never seen it – crystals of pure acetic acid.

Chemicals – love 'em or hate 'em, but you couldn't live without 'em. Water, glucose and sodium chloride are pretty essential for all of us. Other chemicals make our clothes, or colour them, or provide jobs for millions of workers. Still more chemicals (sometimes the same ones as before) can do us some pretty nasty harm if we're not careful, or degrade the environment for our children and grandchildren. Some chemicals are only really of interest to a professional chemist, or to someone who is easily amused by silly names – arsole, moronic acid and vomitoxin all have Wikipedia articles, after all. But information about chemical compounds is big business, worth several billion U.S. dollars annually worldwide, and the Chemical Abstracts Service (CAS, a division of the American Chemical Society) is a leading player, "the global leader" as they prefer to put it!

But if there's something interesting that can be said about a chemical, then sooner or later someone is going to write an article about it… That's why the Chemicals WikiProject slaves away at over five thousand articles about individual chemical compounds (nearly double that number if you count all the drugs), trying to improve the content that we have and to fill in any obvious gaps in our coverage. Five to ten thousand compounds is pretty small compared to other collections of chemical information (CAS boasts 46 million compounds) but, as in many other areas, Wikipedia stands out because of its slightly idiosyncratic choice of subject matter. Free-access databases such as ChemSpider have millions of entries but, as ChemSpider CEO Antony Williams told us, "for some of those compounds, Wikipedia is the only accessible online source of data."

Most chemicals are white powders. Most chemicals that aren't white powders are black or grey powders. Sodium dichromate is an orange powder (and carcinogenic): it's also one of the chemicals in the collaboration between WP Chem and CAS (number [10588-01-9]).

Getting the numbers right

Of course, none of that is worth anything unless the information is reasonably accurate. Water boils at 100 °C (212 °F) and if you say it boils at 212 °C (100 °F), you're nearer to an 'F-grade' than a 'C-grade'… About eighteen months ago, after launching an informal survey of chemical information professionals, WP Chem embarked on a mammoth project (still ongoing) of hand-checking certain types of data in the infoboxes. Obviously, the work would be wasted without a clear record of what had been checked and what the correct data was, so WikiChemist and administrator Beetstra wrote CheMoBot, which logs any changes to data in the infoboxes and highlights changes to verified data, all with a feed to the WP Chem IRC channel on freenode (join us for publicly logged meetings most Tuesdays at 1600 UTC).
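
To give a rough idea of what "logging changes and highlighting changes to verified data" means, here is a minimal sketch in Python. It is not CheMoBot's actual code: the function name, the field names and the dictionary layout are all hypothetical, chosen only to illustrate the comparison.

```python
def flag_changes(old, new, verified):
    """Compare two versions of an infobox (hypothetical dict layout).

    Returns one tuple per changed field:
    (field, old_value, new_value, touches_verified)
    where touches_verified is True when the field has a hand-checked
    value on record and the new value no longer matches it.
    """
    changes = []
    for field in sorted(set(old) | set(new)):
        before, after = old.get(field), new.get(field)
        if before == after:
            continue  # nothing changed, nothing to log
        touches_verified = field in verified and after != verified[field]
        changes.append((field, before, after, touches_verified))
    return changes


# Hypothetical example: someone swaps the Celsius and Fahrenheit values for water.
old_box = {"Name": "Water", "BoilingPtC": "100"}
new_box = {"Name": "Water", "BoilingPtC": "212"}
checked = {"BoilingPtC": "100"}  # the hand-verified value on record

report = flag_changes(old_box, new_box, checked)
```

Every change ends up in the report, but only those that move a field away from its checked value carry the flag, which is the part a human would want highlighted.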

One of the types of data we wanted to verify was the CAS registry number, a sort of ID number for chemical compounds. CAS registry numbers can be found from a wide variety of sources, but the sources often don't agree with one another. The ideal solution would be to check with CAS, the organisation that issues them, and a couple of editors with access to the relevant databases offered to run some checks in their spare time.
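
One check that needs no database access at all: the last digit of a CAS registry number is a check digit. Reading the other digits from right to left, multiply them by 1, 2, 3 and so on, add up the products, and take the remainder on division by 10. The sketch below (function names are our own, for illustration) applies that rule. A number that passes is merely well-formed; the test catches typos, but it cannot tell you whether the number belongs to the compound the article is about.

```python
import re


def cas_check_digit(body_digits):
    """Check digit for the digits of a CAS number, excluding the final one."""
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas(cas):
    """True if the string has the CAS layout and a matching check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    return cas_check_digit(body) == int(match.group(3))


# Sodium dichromate, as quoted in the caption above:
# 1*1 + 0*2 + 8*3 + 8*4 + 5*5 + 0*6 + 1*7 = 89, and 89 mod 10 = 9.
print(is_valid_cas("10588-01-9"))  # True
print(is_valid_cas("10588-01-8"))  # False: wrong check digit
```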

The rapid response from CAS, and the controversy it caused in the wider chemical community, at least proved to us that chemical information professionals really do read Wikipedia. The first response from CAS was that anyone using its databases to find information for Wikipedia is breaching its terms of access[1] (the databases are not public). After a hectic week of emails and posts on various blogs and mailing lists (and Wikipedia talk pages) – many thanks to all those chemists who are not involved with Wikipedia but who still stood up for the project – the door was open: CAS were more than willing to help WP Chem, but we needed to agree on how.[2]

The credit for keeping the negotiations moving forward, for calmly explaining to people on each side that the other side couldn't do a deal without this or that (and, most importantly, why), is all due to WP Chem editor Walkerma. It took a long time, but by last autumn the talking was mostly over and the hard work could begin.

CAS has provided the WikiChemists with over six thousand CAS registry numbers for the compounds it considers most interesting to the chemical community as a whole (mostly compounds that have had more than 1000 scientific papers written about them), along with the other information we need, such as structure diagrams (in the right format) and its version of the chemical name (CAS uses its own chemical nomenclature system). ChemSpider stepped in and generated International Chemical Identifiers (InChIs, another widely used ID system for some types of compound) for each of the compounds and added them to the dataset. And a committed group of editors has been working through the list one by one, checking the data in the Wikipedia articles. If the basic data has been checked – that is, if the article really is about the compound it says it is – the CAS registry number appears bolded in the infobox.

More importantly, CAS has just released the data on a dedicated website, commonchemistry.org, so that anyone can access it, not just WP Chem editors. This was an important condition for WikiChemists, as the data has to be verifiable, but it is a completely new departure for CAS, who have built the site from scratch. The site is not meant to be static, and more information and compounds should be added in the future. For the moment, WP Chem is still digesting the data we've got, but we've already freed that data for anyone else to use if they wish.

Looking to the future, looking for the structures

It may seem strange to go to all this trouble over some numbers that are meaningless to most people (human chemists included). However, chemical identifiers (of which CAS registry numbers are only one) are the key to finding and classifying chemical information on the internet. Much of that information is graphical, yet the vast majority of chemical images online are completely meaningless to a non-human. The two structures shown here are codeine (the active ingredient in many cough medicines, left) and heroin (an illegal narcotic, right): if you can't tell the difference, then neither can a computer. The relevant chemical data is used by the software that creates the image, but is thrown away when the image is saved in a browser-compatible format because there's no generally accepted standard for chemical metadata. WikiChemists have had discussions with external partners on ways to solve the problem, not just for single molecules but also for reaction schemes. For the moment, we're still talking, but maybe we'll have another dispatch this time next year…

Is it really WP Chem's business to be doing all this? Shouldn't we be chalking up little gold stars instead? Well, there's certainly nothing wrong with writing great articles, but neither is there anything wrong with Wikipedians playing an active role in the wider intellectual community. Our outside contacts have been wonderful sources of advice, keeping WP Chem from trying to reinvent the wheel or from wasting time on information that very few people want or need. We also have a vested interest in solutions that are free, especially faced with the giants of the chemical information business: free so that Wikipedia can benefit from them and free so that everybody can benefit from them.

References

  • Rovner, Sophie L. (May 19, 2009), "CAS Launches Free Online Database", Chemical & Engineering News.