Genomic photo-fits from Facebook genomes

May 17, 2010

An article in Scientific American reminded me of a conversation I had with a patent attorney in about 2001.

I had just seen a presentation on Affymetrix 10k SNP genotyping arrays, and the presenter focussed on something they thought would become a useful tool as the arrays got denser and denser: copy-number analysis. The presentation made me think about how we might also use arrays like this to augment or even replace DNA fingerprinting.

The patent attorney did not respond well to my idea of using these chips to predict commonly used identifying features such as skin, hair or eye colour, likely weight, height and physique, or even ethnic background. He thought it was science-fiction and would never be fact.

It has been a long while now since DNA fingerprinting solved its first case in 1986, in Leicester, England. There has even been comment that DNA evidence could be manufactured by oligo-synthesis or PCR. It would be pretty simple for a well-educated crook to buy a PCR machine and an SGM+ kit, amplify another person’s profile and use it as a spray-on, similar to deodorant, masking their own DNA profile with a massively amplified signature.

As more and more people get their genomes sequenced and the price continues to drop, we will get to a day when Facebook has your genome on it. It would be an interesting project for a grad-school student to download all these genomes and the tagged photos of the individuals to produce a pretty comprehensive photo-fit tool. I wonder what pioneers like Craig Venter, James Watson, George Church, Stephen Quake, Eric Lander, et al. would make of that?

The Costs of Illumina Sequencing…

April 26, 2010

With the release of HiSeq, Illumina stepped up another rung on the ladder to the $1000 genome, but just how much does it cost to sequence a genome today?

The first thing to establish is how big your genome is, but rather than do this I will use the cost of one Gbp of sequence data, as this should be comparable between platforms from one or many suppliers. Secondly, I want to make it clear that I have used list prices and, to keep things simple, this is a guesstimate of HiSeq costs. Thirdly, the cost is very much dependent on the read-length used for your run; longer reads are cheaper because the per-base cost is made up of fixed and variable costs. The fixed cost includes the maintenance contract on the machine, staff costs and the flowcell. The variable cost is the amount of sequencing reagent used, and this is wholly dependent on read-length.

To keep this as simple as possible I have used a fixed figure of 1500 lanes per year for the maintenance and staff costs. This allows us to amortize these big numbers over a reasonable estimate of the samples to be run on one instrument over a year (in this case about 150-200 flowcells). This is hugely dependent on the type of sequencing a machine is used for, as you can run up to 10x as many SE36 as PE150 runs. This per-lane cost comes in at $100, which is low enough that a variation in how the machine is used is tolerable.

I have three numbers now which get me to the final figures.

  1. My fixed cost of $100 per lane.
  2. The flowcell cost, $2500 SE or $4000 PE.
  3. Sequencing costs of $4.50 per lane per cycle.

Looking at how Single end and Paired end sequencing costs drop as read-length increases is very interesting so I put together a handy little graph.
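For anyone who wants to play with the three numbers, here is a rough Python sketch of how they might be combined. The post does not state how many lanes share a flowcell or how many reads a lane yields, so `LANES_PER_FLOWCELL` and `READS_PER_LANE` are my own assumptions (the latter purely hypothetical), and the function names are made up for illustration. Treat the output as a way to see the shape of the curve, not as a reproduction of the figures quoted in this post.

```python
# Rough per-Gbp cost model built from the three numbers listed above.
FIXED_COST_PER_LANE = 100.0                    # maintenance + staff, from the post
FLOWCELL_COST = {"SE": 2500.0, "PE": 4000.0}   # flowcell list price, from the post
REAGENT_COST_PER_LANE_CYCLE = 4.50             # sequencing reagent, from the post

LANES_PER_FLOWCELL = 8        # ASSUMPTION: flowcell price is shared by 8 lanes
READS_PER_LANE = 30_000_000   # ASSUMPTION: hypothetical cluster yield per lane


def cost_per_lane(mode, read_length):
    """Dollar cost of one lane for mode 'SE' or 'PE' at a given read length."""
    ends = 2 if mode == "PE" else 1
    cycles = read_length * ends  # paired-end runs sequence both ends
    return (FIXED_COST_PER_LANE
            + FLOWCELL_COST[mode] / LANES_PER_FLOWCELL
            + REAGENT_COST_PER_LANE_CYCLE * cycles)


def cost_per_gbp(mode, read_length, reads_per_lane=READS_PER_LANE):
    """Dollar cost per Gbp, given an assumed number of clusters per lane."""
    ends = 2 if mode == "PE" else 1
    gbp_per_lane = reads_per_lane * read_length * ends / 1e9
    return cost_per_lane(mode, read_length) / gbp_per_lane


if __name__ == "__main__":
    for mode in ("SE", "PE"):
        for read_length in (36, 50, 75, 100, 150):
            print(f"{mode}{read_length}: "
                  f"${cost_per_lane(mode, read_length):.2f} per lane, "
                  f"${cost_per_gbp(mode, read_length):.2f} per Gbp")
```

Because the fixed and flowcell costs are spread over more bases as the reads get longer, the per-Gbp figure falls with read length whatever yield you plug in.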

The bottom line is how cheap Illumina sequencing is, as low as $20.00 per Gbp! For the same price as a single Sanger read ($5-10) you can get over 250Mbp of data back from an Illumina sequencer!!!

There is a big battle going on over sequencing costs but combining Illumina’s extremely high-quality sequencing with their very low cost data generation certainly makes for a fantastic package. Of course this is not to say that costs should not continue to drop!

Cancer Genome Atlas, a waste of money?

April 1, 2010

The recent quartet of opinions in Nature on the tenth anniversary of the completion of the Human Genome has made me think about where current Cancer Genomics efforts might be going wrong. Please don’t interpret this piece as me trying to rub shoulders with Collins, Venter, Golub or Weinberg, although of the four articles I would say Robert Weinberg’s is the one that resonates best with me.

The Cancer Genome Atlas project to sequence 20,000 matched tumour and normal samples from 20 common cancers is a project that will take us forward in improving public health. It seems clear that a project on this scale will have benefits to thousands of individuals and result in much better management of the diseases. While it will not result in a cure for Cancer (probably never going to happen in the public understanding of a cure), it will turn many cancers into a chronic disease like diabetes.

But should we be funding the generation of such a large amount of data for such a specific use? The cost of doing this is huge with massive international investments in next-gen sequencing instruments, reagents and staff. There are also equally large investments in bioinformatics and computational biology.

I think that diverting this cash to individual labs with specific hypothesis-driven research could have a wider benefit. The Cancer Genome Atlas would still happen, albeit in a slightly longer time frame. As long as the individual research projects were directed in the way samples were handled and data presented then 20,000 samples will still get sequenced and probably from many of the same diseases as already listed. But I think that a diversification will allow smaller labs to focus on the questions more closely, do more thorough validation and more likely identify key pathways and mechanisms that are druggable.

It could be argued that the larger Genome centres are an easier place to develop technologies that allow the costs to keep falling. However, many smaller labs are capable of working in partnership with commercial developers, and as long as these labs are well funded for basic biology they will push the boundaries in a very similar manner. In fact, smaller labs are probably able to think just as creatively as the large Genomonsters governments have created in the US and UK. Perhaps their dominance will even stifle thinking as they become the centres where this kind of science gets done and their voice is a difficult one to talk above.

If I ran the NIH or Wellcome I would concentrate on computational biology and divert the wet-lab money from projects like Cancer Genome Atlas to smaller labs. They have the ideas and the ability to follow up, but bioinformatics is often their biggest headache. Whatever happens, it is only a matter of time now before we truly understand some of these different diseases lumped together as the big C. Decisions will still need to be made on when we stop work on one disease, e.g. Breast cancer, and focus efforts on another, e.g. Prostate. Eventually the cost of continued large-scale investigations becomes exorbitant for the impact on public health, and other diseases become more economically important to focus on.

The full cost of HiSeq 2000

March 24, 2010

How much will it cost to run your HiSeq 2000? A lot, so if they give you a free iPhone make sure you get the 32Gb one and three years free contract!

The cost of the sequence data though is so incredibly low it is amazing. A few philanthropic individuals could sequence 1000s of complete Human genomes for the same price as a house. God only knows what a baseball team sum could return from a scientific point of view. Interestingly if someone bought a ton of stock BEFORE placing their order for instruments and reagents they would probably get quite a lot of cash back on their investments. Of course this might be frowned on in some circles.

The sums:

A HiSeq and cBot cost about £475,000. Each flowcell generates 0.5B reads. You can run around 300 SE36bp or 75 PE100bp flowcells per year (my estimates based on reports of how long runs take and assuming it works all year). List price reagents mean an SE36bp run costs £2000 (£0.06/Mb) and a PE100bp run is about £6500 (£0.03/Mb).

If you amortise your instrument over three years this adds £500 to each of those SE36 runs and £2000 to each PE100 run. Each and every run costs a lot more than the reagents alone, and I have not taken people into account. My maths from all of this makes a HiSeq cost £2.4M for three years of SE36bp runs, which works out to 15Tbp of sequence. It costs £2.0M for three years of PE100bp runs, which works out to 25Tbp of sequence.

If BGI run all 128 of their instruments for three years it works out to £256M for PE100 only. Not as much as I thought the result would be!
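If you want to check or rework these sums, here is a small Python sketch of the arithmetic. It makes two assumptions that the numbers above do not spell out: that a "run" is a single flowcell, and that "0.5B reads" means 0.5B clusters, each giving two reads on a paired-end run. It also ignores staff and maintenance, so the totals come out close to, but not exactly at, my rounded figures. The names are made up for illustration.

```python
# Three-year HiSeq 2000 sums, using the list-price figures quoted above.
INSTRUMENT_COST = 475_000             # GBP, HiSeq + cBot
CLUSTERS_PER_FLOWCELL = 500_000_000   # ASSUMPTION: "0.5B reads" = clusters
YEARS = 3

# ASSUMPTION: one run = one flowcell.
RUN_TYPES = {
    "SE36": {"flowcells_per_year": 300, "reagents": 2000,
             "read_length": 36, "ends": 1},
    "PE100": {"flowcells_per_year": 75, "reagents": 6500,
              "read_length": 100, "ends": 2},
}


def three_year_summary(name):
    """Return instrument + reagent cost and sequence yield over three years."""
    run = RUN_TYPES[name]
    flowcells = run["flowcells_per_year"] * YEARS
    bases = (flowcells * CLUSTERS_PER_FLOWCELL
             * run["read_length"] * run["ends"])
    return {
        "flowcells": flowcells,
        "amortisation_per_flowcell": INSTRUMENT_COST / flowcells,
        "total_cost": INSTRUMENT_COST + flowcells * run["reagents"],
        "total_tbp": bases / 1e12,
    }


if __name__ == "__main__":
    for name in RUN_TYPES:
        s = three_year_summary(name)
        print(f"{name}: {s['flowcells']} flowcells, "
              f"£{s['amortisation_per_flowcell']:.0f} amortisation each, "
              f"£{s['total_cost']:,} total, {s['total_tbp']:.1f} Tbp")
    # A fleet of 128 instruments running PE100 only
    fleet_cost = 128 * three_year_summary("PE100")["total_cost"]
    print(f"128 instruments, PE100 only: £{fleet_cost:,}")
```

Swapping in your own instrument price, run counts or reagent costs is just a matter of editing the constants at the top.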

Please can everyone start to talk about single slides or flowcells. It is really difficult to determine the true costs of next-gen sequencing as it can be difficult to determine what someone means by a run. If a whole slide is used as the ‘currency’ rather than runs or lanes or octets it would make it a whole lot easier to work this sort of stuff out.

Disclaimer: these numbers have been collected from all over the place and while the list prices are easy to get making sense of the rest is not so simple. If I am wrong point it out but don’t be too mean.

Really examining the next-gen Sample Prep ‘Ecosystem’

March 23, 2010

Keith over at Omics! Omics! has posted on the second-gen sample prep “ecosystem.” Basically Keith is writing about the possibilities in the next-gen sample prep kit market, one that has been taking off for suppliers of Illumina kits. This is probably due to the cost of kits from Illumina; either they or whoever supplies their reagents is making a packet! Most of the users I know are using home-brew kits.

An interesting aside, and one now open to investigation, is the ‘ecosystem’ of next-gen libraries themselves. All next-gen libraries (exceptions are the Sanger’s full-length adapter libraries and possibly Helicos) require PCR amplification and we all know about PCR contamination. A nice way to assess this would be to check submitted sequence data for the presence of PhiX which must be floating around all next-gen labs. I wonder if a swab of the canteen at Broad or WashU would yield an amplifiable PhiX genome?
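As a very rough idea of what such a check could look like, here is a Python sketch that flags reads sharing most of their k-mers with a reference. Everything in it is an assumption for illustration: `TOY_REFERENCE` is a made-up stand-in and not the real PhiX sequence, the function names are hypothetical, and the `k` and `min_fraction` values are arbitrary. A real screen would align reads to the actual PhiX genome with a proper aligner.

```python
# Toy k-mer screen: what fraction of reads look like a known contaminant?
# ASSUMPTION: TOY_REFERENCE is an invented sequence, not the PhiX genome.
TOY_REFERENCE = "GATTACAGGCTTAACCGTAGCTAGGATCCGTTAGCAATCGGTACCTGA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reference, k):
    """Index k-mers from both strands so read orientation does not matter."""
    return kmers(reference, k) | kmers(reverse_complement(reference), k)


def looks_like_reference(read, index, k, min_fraction=0.8):
    """True if at least min_fraction of the read's k-mers are in the index."""
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not read_kmers:
        return False  # read shorter than k, cannot be judged
    hits = sum(1 for kmer in read_kmers if kmer in index)
    return hits / len(read_kmers) >= min_fraction


def contamination_rate(reads, index, k):
    """Fraction of reads flagged as matching the reference."""
    if not reads:
        return 0.0
    flagged = sum(looks_like_reference(read, index, k) for read in reads)
    return flagged / len(reads)


if __name__ == "__main__":
    k = 8
    index = build_index(TOY_REFERENCE, k)
    reads = [
        TOY_REFERENCE[5:30],                        # forward-strand match
        reverse_complement(TOY_REFERENCE[10:40]),   # reverse-strand match
        "ACACACACACACACACACACACAC",                 # unrelated read
    ]
    print(f"flagged fraction: {contamination_rate(reads, index, k):.2f}")
```

Run over a sample of reads from each submitted library, a number like this would at least show whether the problem is big enough to worry about.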

Lots of groups have been looking at Metagenomics and we might learn how to resolve where contaminants are coming from by following some of their examples. Barcoding libraries is going to help but only if everyone, or the majority, start to prepare all samples this way. Is it worth it? Only finding out the scale of the problem will convince some people.

For people doing ChIP-Seq or Seq-Capture it is probably less of an issue as the library probably has some pretty unique attributes. But for those working on Cancer genomes, where comparisons are being made between Normal and Tumour samples and those Tumours are heterogeneous mixtures of cell types including normal cells, being certain that you do not have contamination is vital. You might be able to remove contamination by looking at allele calls across the genome but it has got to be seriously complicated. It is difficult for many labs to find space to really separate the steps for library prep but this might be the cheapest option in the long run.

I guess the first thing is to look and see if it is a problem, and Broad, Sanger, WashU, BGI or any other large centre could easily do this. The question is how long will they leave it before they take a look?

Another SOLiD genome

March 11, 2010

Here is a nice paper describing the sequence of a cancer cell line on SOLiD. U87MG Decoded: The Genomic Sequence of a Cytogenetically Aberrant Human Cancer Cell Line

The really interesting bit for me was the fact this group ran SOLiD for the genome and Illumina for the captured sequence analysis but don’t say much about that.

What is missing for me is any information on how the data compare between SOLiD and Illumina. This is another group that almost certainly could look and see if there is a significant difference in sequence quality between the two.

Perhaps someone else will download the data and do the analysis?

There are loads of places that have both SOLiD and Illumina, check out the Googlemap of installations, but not one of these places has published a true comparative analysis. When I talk to people in these places they are all very coy, lots of nods and winks whilst being very descriptive but no bold statements. I am pretty sure anyone thinking of buying a next-gen machine would love this info!

Illumina vs ABI SOLiD

March 11, 2010

There has been an awful lot of noise about SOLiD recently: Ignite buys 100 machines, SOLiD4 and HQ are coming and data is getting published.
For a long time ABI have been playing catch-up to Illumina after almost missing out on next-gen sequencing. I went to Agencourt Personal Genomics in the week after ABI purchased the SOLiD technology and it looked like a good bet back then (late 2006); however, it has taken rather longer than ABI would have liked to make users really sit up and take notice. AGBT seemed to generate an awfully large number of blogs and tweets (and other stuff I don’t know the terms for), many of which were positive. So I thought about where the battle lines are really going to be drawn and what chance there is of getting to a more competitive market. I have some conflicts to declare: I am an experienced Illumina user and I have never tried to run a SOLiD instrument. Please feel free to correct my inaccuracies and leave your comments.

Library Prep: Illumina have very nice, if currently far too expensive, library prep methods. Most of the people I know have switched to home-brew kits where they get enzymes from companies like Enzymatics and NEB. Some enterprising users are making their own adapters as well but this seems to be a step too far for the majority of users. However, the library prep methods are incredibly flexible and allow users to develop their own improvements and applications. Library prep is being automated by companies and Genome centres to run 384 library preps; of course, they have the capacity for this kind of throughput.
ABI have a more complex library prep reliant on emulsion PCR, which sounds like it is getting streamlined with their own automation, snappily named EZ-Bead. They also already supply kits which can be run in 96-well plate format, and users with large sample cohorts need access to this kind of throughput.
To my knowledge neither offer robust normalisation methods which will allow the holy grail of 1:1 mixing of 96 samples in a run.

Read-length: Both generate lots and lots of reads; we are now talking about billions rather than millions. And as you almost certainly know, these are what are classed as short-read platforms. Of course it kind of depends what your reference point is, but one PI’s 50bp headache is a post-doc’s heaven. Does this matter in the light of SOAPdenovo anyway?
Where there is a clear difference is in the maximum length each company can achieve. Illumina were doing 100 last year and 150 is probably at Sanger or Broad now. Of course it takes a long time to run this length and the quality on 100bp is not as good as I would like right now with a 2% error rate at 100bp, so not useful for SNP calling for instance. But the attractive bit is the cost of this on a per bp basis. Illumina sequencing gets cheaper and cheaper as read length gets longer.
ABI can only achieve 50bp with a possible maximum of 75bp. As I understand it, this is a limitation of the chemistry and means we are unlikely to see improvements at the rate of Illumina, for instance. Please correct me if I am wrong!

Sequence data: The third big difference, base space vs colour space! Biologists can align sequence reads on their desktops; I even know of one or two people that still print sequences out or align in Word. Sequences mean something to us that colour-space confuses. When I first heard about colour-space it took a few days to sink in, but the idea of such high-quality sequence data made the system incredibly attractive when thinking about SNP calling in heterogeneous tumours where there is also contaminating stromal tissue. For two to three years I have been waiting for a publication to showcase the increased quality you can achieve on SOLiD vs Illumina; it has not yet arrived. I cannot help but wonder why. ABI’s recent adverts keep making the claims about increased accuracy, but if it is so much better where is the data? Many centres now have both Illumina and SOLiD so lots of people have done this comparison. Some may also have made purchasing decisions based on it (however I think not, read on to find out why!)

Who wins?
I think it is Illumina but not necessarily for many of the reasons above. Quite simply they got there first* with a product that made an impact on biology that many of us are still grappling to understand. It was a true paradigm shift. The last three years have seen such a change in the kind of questions we can ask of biological systems. For a molecular biotechnologist, which is kind of how I would describe myself, it has been a fun rollercoaster to be on.
I really hope ABI can make a dent in Illumina’s market share of next-gen, and soon. We need better competition but perhaps we can’t afford to take advantage of it? The cost of change for any small or medium institute already invested in Illumina is very high, as it is not just instrument depreciation and consumables but sample handling pipelines and bioinformatics analysis that need to be reworked. Places like mine are unlikely to change.
The dent is probably going to come from groups that are investing now, they have a choice. If they are confident that 75bp is probably going to fit most of their needs (and I suspect it will) then ABI are offering cheaper genomes. This has to be balanced against the risk they eventually run out of cash, or the desire to keep putting more into this side of the business.

Of course both bets might be off if Complete Genomics get their business right. I would much rather pay $5k for everything than $3k for reagents and have to do all the work myself!

* It was actually Solexa that got there first, and without bridge-amplification on a flowcell they may not have made it. British brains and American muscle win the day again!

Hello world!

March 11, 2010

I have taken up blogging to improve my technical writing as I don’t do enough in the lab to keep writing papers! Is a blog something to put on your CV? I prefer to remain anonymous, hence Zorro’s genome. I guess it could have been any other masked vigilante but Zorro was one of the first.