capstone paper

Upload: john-mark-wiggins

Post on 14-Jan-2016

DESCRIPTION

A research endeavor into the topic of sample sizing in genetic analyses. Includes helpful sources and critical points on shortcomings in the field of genetic research today.

TRANSCRIPT

John Wiggins

Potentially New SNPs Associated with Familial Breast Cancer Susceptibility Loci

Abstract:

Modern Day Breast Cancer Susceptibility Background Info/Review:

Many breakthroughs have been made in recent history in the genetic approach to breast cancer, one of the most notable being the identification of the BRCA1 and BRCA2 genes in the early 1990s. Although mutations within these two genes account for roughly 5-10% of breast cancer cases, their discovery still left a massive amount of familial breast cancer risk unexplained. The association of these gene mutations with breast cancer has since been established, studied, and reviewed; however, an astounding 75% of familial breast cancer risk remains unexplained. This leads to the conclusion that many other genes, and their associated loci, are responsible for this heritable risk.

One of the more recent studies to investigate these unknown loci was the study, or rather conglomerate of multiple individual studies, Genome-wide Association Study Identifies Novel Breast Cancer Susceptibility Loci, which examined a multitude of individuals throughout Europe and parts of Asia. This study, conducted by a multitude of authors including Douglas Easton, set out to construct a possible genetic map, through identification of new SNPs and use of the HapMap, in an effort to find new genetic markers associated with familial breast cancer risk. It implemented a three-stage design, which began with a small number of participants and a large number of SNPs; each successive stage increased the number of participants and decreased the number of SNPs in order to narrow down the possible genetic region and help specify the loci at which the mutations could be found.
In order to increase the specificity and show quantitatively that the data in each stage were more significant than in the previous one, the researchers subjected the data to a Cochran-Armitage score test with one degree of freedom, and analyzed each stage separately so that no previous stage's results could influence a particular genetic locus's association with breast cancer in a later stage. Finally, upon completion of stage 3, five SNPs showed a statistically significant association with breast cancer. These five SNPs, identified by their rs numbers, are as follows: rs2981582, rs3803662, rs889312, rs13281615, and rs3817198. Their percentage association values with breast cancer risk identification are 97%, 71%, 25%, 3%, and 1%, respectively. The statistical relevance of these association values, as well as their derivation, is shown in Figure 1.
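The one-degree-of-freedom Cochran-Armitage test mentioned above can be illustrated with a minimal sketch. The function name, the additive weights (0, 1, 2 copies of the allele), and the genotype counts in the usage note are assumptions made for illustration; they are not taken from the study.

```python
import math


def cochran_armitage_trend(case_counts, control_counts, weights=(0, 1, 2)):
    """Return (chi2, p_value) for a 1-d.f. Cochran-Armitage trend test.

    case_counts and control_counts hold the numbers of individuals
    carrying 0, 1, and 2 copies of the allele being tested.
    The additive weights (0, 1, 2) are an assumption of this sketch.
    """
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n = sum(totals)
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)

    weighted_cases = sum(w * r for w, r in zip(weights, case_counts))
    weighted_all = sum(w * m for w, m in zip(weights, totals))
    weighted_sq_all = sum(w * w * m for w, m in zip(weights, totals))

    # Trend statistic: weighted case count minus its expectation (scaled by n).
    t_stat = n * weighted_cases - n_cases * weighted_all

    # Variance of the trend statistic under the null of no association.
    variance = n_cases * n_controls * (n * weighted_sq_all - weighted_all ** 2) / n
    if variance == 0:
        raise ValueError("no genotype variation; the test is undefined")

    chi2 = t_stat ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `cochran_armitage_trend([10, 20, 30], [30, 20, 10])` tests hypothetical counts in which cases carry more copies of the allele than controls; identical genotype distributions in both groups give a statistic of zero.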

FIGURE 1

To unpack the statistical significance of Figure 1, the axes and indicators within the graph must first be analyzed. Each of the five letters represents one of the five SNPs, with a being rs2981582, b being rs3803662, and the other three SNPs following in order of decreasing association value. The x-axis represents the per-allele odds ratio, and the y-axis lists the individual studies in which each SNP's association was measured. The per-allele odds ratio, also labeled OR, is a statistical measure of association between a particular exposure, in this case a particular SNP, and an outcome, in this case the risk of developing breast cancer. The odds within a group are calculated by dividing the proportion P of that group carrying the allele by one minus that proportion (P/[1-P]); the odds ratio is the odds among cases divided by the odds among controls, and it is interpreted by its relation to one. The conceptual implications of the OR value are imperative and are as follows: If OR = 1, then that particular SNP does not affect the individual's risk of breast cancer. If OR < 1, then that particular SNP is associated with a lower risk. If OR > 1, then that particular SNP is associated with a higher risk of breast cancer. All five of the SNPs studied had an OR slightly over one, and the confidence placed in each of these values, along with its practical application, depends upon the associated p-value.

It should be noted here that the SNPs were funneled and eliminated statistically via p-value interpretation. Stage 1 required a p-value of less than 0.05, the conventional threshold for concluding that the data differ from the null hypothesis, the null being that the SNP analyzed had no association with breast cancer risk. Stage 2 was more stringent and required a p-value of less than 2 x 10^-5, and stage 3, which is depicted above (Fig. 1), required a p-value of less than 10^-7.
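As a rough illustration of the odds ratio and the staged p-value thresholds described above, the following sketch computes an OR with a 95% confidence interval from allele counts and checks a p-value against each stage's cutoff. The function names, the log-scale (Woolf) interval, and any counts passed in are assumptions for illustration, not the study's own procedure.

```python
import math

# p-value thresholds quoted in the text for each stage.
STAGE_THRESHOLDS = {1: 0.05, 2: 2e-5, 3: 1e-7}


def odds(p):
    """Convert a proportion into odds, P / (1 - P)."""
    return p / (1.0 - p)


def per_allele_odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref, z=1.96):
    """Return (OR, ci_low, ci_high) from allele counts in cases and controls.

    The interval uses the standard error of log(OR), which is an
    assumption of this sketch (Woolf's method).
    """
    p_case = case_alt / (case_alt + case_ref)
    p_ctrl = ctrl_alt / (ctrl_alt + ctrl_ref)
    odds_ratio = odds(p_case) / odds(p_ctrl)

    se_log_or = math.sqrt(1 / case_alt + 1 / case_ref + 1 / ctrl_alt + 1 / ctrl_ref)
    ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, ci_low, ci_high


def passes_stage(p_value, stage):
    """True if a SNP's p-value is below the cutoff for the given stage."""
    return p_value < STAGE_THRESHOLDS[stage]
```

With hypothetical counts, `per_allele_odds_ratio(120, 80, 100, 100)` gives an OR above one, and `passes_stage(1e-8, 3)` shows a p-value small enough to survive stage 3.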
Now that the x-axis labeling and significance have been established, it becomes pertinent to define each data point, or entry, and classify its significance based upon the y-axis. Each point, or square, represents an OR value for that particular SNP (a, b, etc.), whereas the line extending on either side of the square marks that OR value's 95% confidence interval. Within the confines of this particular study, the confidence interval is the range expected to contain the SNP's true OR in 95% of repeated studies; if that range includes one, the SNP cannot be statistically differentiated from the null of no association with breast cancer. So, practically speaking, the most precise findings are those with the smallest confidence intervals, meaning that the OR estimate was subject to little variation. To add a little context to these individual OR values, the top two rows of the y-axis represent the OR values from stages one and two, whereas the following rows, such as MCCS or BCST, are the names of particular studies within that region. The diamonds in the rows above the lowest y-axis row, marked TOTAL, are probably the most significant pieces of data; they are the averaged ORs and confidence intervals of each SNP for a particular region of study, European or Asian. Ultimately, the diamonds in the TOTAL row, which are found by combining the Asian and European study results, are where the true practical implications are derived from. For your convenience, Figure 1 is again given below:

FIGURE 1
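The combined estimates drawn as diamonds in Figure 1 can be approximated with a short sketch. The text does not state how the study pooled its results, so the inverse-variance fixed-effect weighting used here is an assumption, as are the function name and the input values in the usage note.

```python
import math


def pool_fixed_effect(estimates, z=1.96):
    """Pool per-study odds ratios into one OR with a 95% interval.

    estimates is an iterable of (odds_ratio, ci_low, ci_high) tuples.
    Each study is weighted by the inverse variance of its log(OR);
    this weighting scheme is an assumption of the sketch.
    """
    weight_sum = 0.0
    weighted_log_or = 0.0
    for odds_ratio, ci_low, ci_high in estimates:
        # Recover the standard error of log(OR) from the interval width.
        se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
        weight = 1.0 / se ** 2
        weight_sum += weight
        weighted_log_or += weight * math.log(odds_ratio)

    pooled_log_or = weighted_log_or / weight_sum
    pooled_se = math.sqrt(1.0 / weight_sum)
    return (
        math.exp(pooled_log_or),
        math.exp(pooled_log_or - z * pooled_se),
        math.exp(pooled_log_or + z * pooled_se),
    )
```

Calling `pool_fixed_effect([(1.2, 1.0, 1.44), (1.2, 1.0, 1.44)])` on two hypothetical studies with the same estimate returns the same OR with a narrower interval, which is why the pooled diamonds are narrower than the individual study lines.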

The immediate takeaway from the total OR values is this: the higher the percentage association of the SNP with breast cancer, the higher the total OR value. Furthermore, the range over which these OR values fall is incredibly small, around 0.1 when approximated from the graph, especially considering that the highest SNP association is 97% and the lowest is 1%. Incidentally, this range is actually quite large in the context of SNP significance, and it makes sense given that there are millions upon millions of SNPs within the human genome, and potentially far more interactions between these SNPs leading to an ultimate phenotypic result such as breast cancer. In order to place Figure 1's results into a more genomic context, the gene locations of these rs numbers, and their associated SNPs, are listed at their respective chromosomal positions in the table below:

TABLE 1

All five SNPs, as well as some additional SNPs (the table shows data for multiple studies), can be seen in the table above along with their chromosomal locations. Furthermore, the MAF (minor allele frequency) value of each SNP location, which is the frequency at which the least common allele occurs within a given population, is above 0.05 in every case (meaning that each should be targeted by the HapMap project). These MAF values, along with the corresponding OR and p-value trend data, begin to illustrate just how elaborate and interacting the possible SNPs (gene mutations) that increase breast cancer risk truly are. This modern study, which was really a conglomeration of a multitude of smaller studies with a vast array of individual participants over a number of years, explained just 3.6% of the previously unknown 75% of familial breast cancer risk, and identified only five SNPs of the initial 205,586 SNPs examined in stage one. Although this is a vast achievement in helping to identify the genetic mutations that result in an increased risk of breast cancer, these results also illustrate how much more research needs to be done.

Potential Development for Identification of SNPs Involved in Breast Cancer Risk:

While the above study conglomerate broke incredible ground in identifying SNPs associated with familial breast cancer risk, it also exposed a multitude of areas that may be improved and elaborated in order to paint a more complete picture of specific SNP association with breast cancer, as well as what can be done to provide a better understanding of gene location on the chromosome and each particular gene's role within the cell.
First and foremost, it should be noted that none of the newly identified loci in the mentioned study (TABLE 1) featured genes or gene products linked with DNA repair, sex hormone synthesis, or metabolic pathways, all three of which were commonly thought to be the primary types of loci associated with the development of breast cancer. Furthermore, of the loci identified in Table 1, only the FGFR2 locus (the locus corresponding to the rs number of the SNP that had a 97% association value as an identifier of breast cancer) has a clear prior history of linkage to breast cancer. Additionally, the FGFR2 locus is commonly associated with cell growth and cell signaling; however, 3/5 of the loci examined within this study, as well as in the contributing studies that supplement Table 1, are also associated with cell growth and cell signaling. This is incredibly fascinating because it suggests that mutations of genes at loci associated with these common cell functions could potentially be a direct indicator of a higher risk of breast cancer. Furthermore, four of the five SNPs associated with this particular study (FIGURE 1) show a less than 75% association with breast cancer risk identification. It can be hypothesized from this that the four genes defined by their respective rs numbers, and their associated SNP mutations, do not act alone in raising breast cancer risk, but function in concert with other unknown genetic mutations at potentially vastly different loci on various chromosomes within the human genome.
Additionally, if the sample size of the study were increased in number and diversity, and followed over a longer period of time, new and different patterns of linkage disequilibrium could be observed, providing a more precise look at what additional genetic mutations could be functioning together with these four genes and giving a clearer and more descriptive identification of breast cancer risk.

Experimental Design to Better Understand Specific Loci Interaction:

In order to explain and elaborate upon these four loci and their potential association with breast cancer better than the previous study did, and thereby improve upon modern-day knowledge of these SNPs, the parameters of the experiment, such as subjects, statistical analysis, and overall guidelines, must be examined. First and foremost, the issue of selecting subjects must be addressed. Genetics is a very specific field in which a wide array of factors, such as environment, disease, and climate, shape and alter the way a human population, and its population genome, evolves over time. Practically speaking, this means that, depending upon a people's geographic location in the world, certain mutations and genes will have developed differently over multiple generations than those of peoples in different living conditions. In the context of subject selection for this study, this means that the subjects should come from a wide array of countries and climates. This will help neutralize potential instances of phenocopy, where an apparent phenotype within a population seems to be the product of a genetic mutation but is actually caused by an environmental factor, and will help generate LD (linkage disequilibrium) pattern recognition within the research so that SNP mutational findings may be more precise.
In order to best achieve this diversity, subject selection should be completed in multiple stages, beginning in one primary location and then branching out from there.

Stage One: Small Population, Large Number of SNPs

In order to eliminate SNPs from further association analysis more effectively than in the previous study, it is probably best to restrict initial research to specific major cities, rather than countries, so as not to categorize an entire country too broadly (for instance, it would not be fair to state that the living conditions for people in Wyoming are the same as those in New York). Furthermore, age must be a factor so that outside variables unrelated to this particular study do not affect the data. Next, each of the participants needs to have a recent, definitive familial case of breast cancer; this is generally accepted as two first-degree relatives (defined as a parent, sibling, or child). Lastly, since the four loci being studied have already been identified as having no link with DNA repair, sex hormone synthesis, or metabolic pathways, and in order to build upon the finding that 3/5 of the loci were associated with cell growth and cell signaling, excluding the previously cancer-linked FGFR2 locus, participants with mutations at BRCA1, BRCA2, or any other locus that has been extensively associated with DNA repair, sex hormone synthesis, or metabolic pathways should be excluded. This will help provide a more specific identification of what types of cell signaling and cell growth gene mutations might be working together with one of the four SNPs to increase the risk of inheriting breast cancer. Of course, as at all stages of this study, and as in any scientific study, participants should be selected entirely at random from subjects who meet the previously defined parameters to provide the least biased results possible, and the numbers of experimental and control subjects should be relatively equal.
Stage One Parameters: 50 experimental subjects and 50 control subjects
Experimental subject guidelines:
  Women under 40 with invasive breast cancer
  2 first-degree relatives with breast cancer
  No mutations in BRCA1, BRCA2, or other major genes associated with DNA repair, sex hormone synthesis, or metabolic pathways
Control subjects defined as: Women over 40 without cancer
Subjects randomly selected from (location): Most heavily populated cities within each state of the United States (50 experimental and 50 control per state)
Total number of experimental subjects: 2500

When advancing to stage 2, much like in the previous study, the number of SNPs examined should become more specific and the number and variance of the participants should increase. To do this, it becomes essential to relax criteria such as the specific age guidelines for the experimental group; however, the genetic restrictions must remain in place in order to maintain the integrity of the type of interaction being searched for (no DNA repair genes, metabolic pathway genes, or sex hormone synthesis genes).

Stage Two Parameters: roughly 20000 experimental, 20000 control per region
The control, experimental, and omission requirements are the same as in Stage One, except that the age guideline for the experimental group has been raised so that women under 50 with breast cancer may be considered. The control group requirements remain the same.
Subjects randomly selected from previous and very recent combined studies that encompass: East Coast of the United States, West Coast of the United States, Northern Area of the United States, South Area of the United States, Mexico, South America, and Canada (20000 control and experimental per region)
Total number of experimental subjects: 140000

It is very important to note that, since a defined parameter has been established at this point by stage one, it is much more financially prudent to draw on a multitude of previous studies within these areas, combining them and ultimately providing a clearer and more specific look at genetic loci mutations found within a multitude of individuals in the Western Hemisphere. Lastly, to paint the clearest and most helpful statistical picture of these four loci and their potential relation to other gene mutations, a vast array of subjects should be examined throughout the entirety of the genetic variation spectrum. Cases as recent as are available should be examined using the same criteria as in Stage Two; however, many more cases in many different areas need to be analyzed.
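The stage one and stage two parameters listed above can be tallied and checked with a small sketch. The class and function names are hypothetical, and treating "2 first-degree relatives" as "at least two" is an assumption of this sketch rather than something the design states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    cases_per_unit: int     # experimental subjects per state or region
    controls_per_unit: int  # control subjects per state or region
    units: int              # number of states (stage one) or regions (stage two)
    max_case_age: int       # experimental subjects must be younger than this

    @property
    def total_cases(self):
        return self.cases_per_unit * self.units


# Values taken from the stage descriptions in the text.
STAGE_ONE = Stage("one", 50, 50, 50, 40)
STAGE_TWO = Stage("two", 20000, 20000, 7, 50)


def eligible_case(stage, age, affected_first_degree_relatives, has_excluded_mutation):
    """Check one candidate against the experimental-group guidelines.

    has_excluded_mutation covers BRCA1, BRCA2, and the other excluded
    gene classes. Requiring *at least* two relatives is an assumption.
    """
    return (
        age < stage.max_case_age
        and affected_first_degree_relatives >= 2
        and not has_excluded_mutation
    )
```

Here `STAGE_ONE.total_cases` and `STAGE_TWO.total_cases` reproduce the stated totals of 2500 and 140000, and a 45-year-old candidate is excluded from stage one but admitted to stage two under the relaxed age guideline.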

Stage Three Parameters: 500000 subjects, 60 studies
All control, experimental, and omission guidelines are the same as in Stage Two
Subjects randomly selected from: North America, South America, Europe, Asia, Africa, and Australia (nearly 85,000 subjects per region)
Total experimental subjects: 500000

Obviously, this will be incredibly difficult to achieve. Much like Stage Three of the previous study, data from individuals at this stage will need to be taken exclusively from prior studies throughout each of these regions. To reach the desired number of subjects, roughly 60 studies will need to be combined and analyzed, result by result, to make substantial headway in identifying the association of gene mutation interactions with these four loci and their respective consequences for breast cancer. While this mass genotyping might seem incredibly cost-ineffective and time-consuming, it is nearly a necessity for the quickest possible analysis and will also provide insight into a multitude of other areas, such as HapMap identification advancement, many new loci mutation SNPs and their respective mathematical association with breast cancer risk, and an overall modernized continental breakdown of where each particular region of peoples stands genetically. The most advantageous part of this study when compared to its predecessor is effective SNP genotyping and, more importantly, genotypic association between seemingly unrelated SNP activity and the potential shared phenotypic consequence from the four initial SNPs that appeared to show association with breast cancer identification.

All genotypes studied within stage one should initially be tagged using HapMap phase II as a reference. In the previous study, just under 50% of tagged SNPs were found on the HapMap. Hopefully, the drastically increased sample size of this study will yield a higher number of tagged SNPs and a lower proportion of SNPs found on the HapMap; any tagged SNP not on the HapMap has the potential to be a new genetic mutation associated with the activity of the four known SNPs and their combined link to breast cancer. The final step of stage one is to establish which of the tagged SNPs not identified on the HapMap are surrogates, that is, SNPs that fall into perfect linkage disequilibrium based on genotyping of specific individuals. One of the biggest potential shortcomings of the previous study was that all individuals who served as surrogates with which to check these potentially new significant SNPs were Caucasian. While this might not be the largest factor in SNP analysis, it could lead to careless neglect of an SNP that would affect someone of a different ethnic background. It is critical that stage one of this study include individuals of multiple races when checking for LD patterns.
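The idea of a surrogate SNP in perfect linkage disequilibrium can be made concrete with a short sketch that computes the r-squared measure of LD between two SNPs. It assumes phased haplotypes with alleles coded 0/1, which is a simplification, since the text describes checking LD from the genotyping of specific individuals; the function names and the tolerance are hypothetical.

```python
def ld_r_squared(haplotypes):
    """Return r^2 between two SNPs from phased haplotypes.

    haplotypes is a list of (allele_at_snp1, allele_at_snp2) pairs,
    each allele coded 0 or 1. Phased input is an assumption of this sketch.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")

    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("a monomorphic SNP has no defined r^2")

    # D is the excess of the joint frequency over independence.
    d = p_ab - p_a * p_b
    return d * d / denominator


def is_surrogate(haplotypes, tol=1e-9):
    """True if the two SNPs are in (numerically) perfect LD, r^2 = 1."""
    return ld_r_squared(haplotypes) >= 1.0 - tol
```

Running `is_surrogate` separately on haplotype samples drawn from different populations is one way to check whether a surrogate relationship holds beyond a single ethnic group, which is the concern raised above.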

The statistical analysis of stages two and three, although complex in calculation, has a relatively simple conceptual meaning. Once the initial stage has concluded, and all tagged SNPs (MAF > 5%) that were not genotyped by the HapMap have passed surrogate testing by LD genotyping of select individuals, all that remains in the analysis of the three stages is the statistical separation of each mutated SNP from the null hypothesis that it has no association with breast cancer or with the activity of the four initial loci. This is done by requiring a lower p-value at each stage, and therefore stronger evidence that the SNP is involved with breast cancer or the activity of the four original loci, which separates the numerical value of the mutation from that of a non-mutation. Lastly, by applying a combined adjustment factor (this helps to prove difference from the null via chi-square analysis) along with the genomic control method (22) to those SNPs whose mutations have shown a significantly low p-value, even after stage three analysis, it can be reasonably confirmed which loci are interacting with the four initial loci, and which new loci have a potential association with breast cancer. Using the same p-value requirements for each stage as in the previous study, each individual locus mutation can be shown to carry an individual risk of breast cancer, and can be compared to each of the four initial SNPs to evaluate possible genetic linkage. One of the largest benefits of this study over the previous one, however, is the fact that the sheer sample size will help provide multiple loci that meet the p