CisEvolver v0.4
Written by Daniel Pollard.
Available at http://rana.lbl.gov/CisEvolver
This file is also available as a perldoc.

Copyright 2006 by Daniel Pollard.
This software is provided "as is" without warranty of any kind. The author
assumes no responsibility for the results it produces or conclusions based
thereupon. It is distributed free of charge for academic use only.
Permission to copy and use it is granted free of charge provided that no
fee is charged and this copyright notice is not removed.

*** Installation
____________

CisEvolver is written in Perl and has only one non-standard dependency:
Math::Random must be installed. There are three basic ways to install it.
You can use the CPAN shell interface (see perldoc CPAN for more
information). You can check whether your Linux distribution (or whatever
OS you use) already provides a package for Math::Random. Lastly, you can
download a tarball from CPAN
(http://search.cpan.org/~grommel/Math-Random-0.67/) and install it
yourself. Please contact me if you have any trouble with this dependency.

*** Description
___________

CisEvolver is a molecular evolution simulation program. It evolves
sequences down a mutation guide tree and returns an alignment of the leaf
sequences. CisEvolver is written to evolve either background genomic DNA
sequences or cis-regulatory sequences, which are made up of a mixture of
background genomic DNA and transcription factor binding sites.

Background genomic DNA is evolved using the Jukes Cantor (JC) or Hasegawa
Kishino Yano 1985 (HKY85) substitution model and an insertion/deletion
(indel) model with a Poisson rate set relative to the substitution rate
and a user-defined indel length distribution. Transcription factor binding
sites are evolved without indels according to the Halpern Bruno 1998
(HB98) model of position-specific substitutions, using a user-defined
matrix.

A user can either provide a sequence or CisEvolver can generate a random
sequence.
For evolving regulatory regions, the locations of binding sites can be
provided by the user, or the density of sites can be provided and the
sites will be randomly placed in the sequence.

*** Usage and arguments
-------------------

Running the script without arguments prints the usage. There are required
arguments and optional arguments.

The required arguments are:

--ancestor_fasta [] or (--generate_seq and --ancestor_length [])
    You must either provide an ancestral sequence in fasta format to be
    evolved or provide the length in base pairs of an ancestral sequence
    for the program to generate randomly.

--tree_file []
    You must provide a bifurcating mutation guide tree in Newick format
    with branch lengths in substitutions per site (not completely Newick
    compliant - see below).

--indel_lengths_file []
    You must provide a file with the probabilities for each length of
    indel from 1bp out to the longest indel you wish to model.

The optional arguments are:

--with_sites and --factor_directory [] and
(unless --generate_seq then --site_location_file) or
(if --generate_seq then either --site_location_file or --site_density_file)
    If you are simulating a cis-regulatory region, you need to specify
    that with the --with_sites argument. If you use --with_sites, you must
    also specify the directory containing the matrix files for the
    transcription factor binding sites. If you provided the ancestral
    sequence, you must also provide a binding site location file. If you
    had CisEvolver generate the ancestral sequence, you can provide either
    a binding site location file or a binding site density file.

--background_file [] (default is 60/40 AT/GC)
    You can provide a file with the background mononucleotide frequencies.
    If you do not, the default is 60% AT and 40% GC.

--model (JC or HKY, default is HKY)
    Selects the background substitution model. Specify JC to use the
    Jukes Cantor model instead of the default HKY model.

--kappa [] (for HKY model, default is 2)
    If you are using the HKY model, the kappa parameter can be set to
    something other than the default value of 2.

--indel_relative_rate [] (default is 0.1)
    The rate of indel events relative to substitution events can be
    adjusted from the default value of 1/10.

--print_with_ancestor or --print_with_all_nodes
(default is to print just the leaf sequences)
    By default the output alignment contains just the leaf sequences from
    the simulation. With these options the ancestor, or the ancestor and
    all the internal nodes, can be printed as well.

--print_seq_file [] (print the ungapped sequences in fasta format)
    This option prints the sequences of the output alignment, with the
    gaps removed, to the specified file.

--print_site_location_file [] (print the site locations to a file)
    Only use this with the --with_sites option. This option prints the
    locations of the binding sites in the evolved sequences to the
    specified file. If --print_with_ancestor or --print_with_all_nodes is
    used, site locations in those sequences are also printed.

--seeds [comma sep] (comma separated seeds for the random number generator)
    Two random numbers separated by a comma. This option is useful if you
    are running simultaneous simulations, because Perl's random number
    generator is seeded from the clock, which can lead to two simulations
    being identical.

--verbose
    This prints a large amount of internal detail that will likely not be
    interpretable to you, so you can probably ignore this option.

**** Details on file formats & examples
----------------------------------

Ancestral fasta file:
    This is standard fasta format. There is no line length limit. Only
    one sequence per file. A, C, G or T only. Sequence names are not used
    for anything.

    >name
    ACGTACGT

Tree file:
    This is essentially standard Newick format and must all be on one
    line of the input file. CisEvolver requires a bifurcating tree.
    Branch lengths must be specified and must be in decimal format.
    Even if the branch length is a whole number, a decimal point and
    trailing zero are required. Internal node names are not supported.

    ((seq0:0.1000,seq1:0.1000):0.1000,(seq2:0.1000,seq3:0.1000):0.1000);

Indel lengths file:
    This is a two column file with space or tab separation. The first
    column is the length and the second column is the probability of that
    length. The first line must be length 1 and each line must increment
    by 1 bp in length out to the max length. Probabilities should add up
    to 1; they can add up to less but cannot add up to more. Each
    probability must be between 0 and 1, inclusive.

    1 0.9
    2 0.05
    3 0.05

Factor directory:
    This directory must contain a matrix file for each of the
    transcription factors used in a simulation.

Frequency matrix files:
    These must be contained in the factor directory. The format is a four
    column, space separated file. Each line represents a position in the
    binding site and each column represents a base in the order A, C, G &
    T. This is a frequency matrix, not a counts matrix. For each position
    in the site, the frequencies of the four bases must add up to 1.
    Pseudocounts are recommended where bases are never observed at a
    given position.

    0.0250 0.1833 0.0167 0.7750
    0.9417 0.0167 0.0167 0.0250
    0.1917 0.1833 0.0167 0.6083
    0.0250 0.0167 0.5167 0.4417
    0.2750 0.6833 0.0167 0.0250
    0.9417 0.0167 0.0167 0.0250

Site location file:
    This is a three column file with space or tab separation. The first
    column is the zero-based position in the sequence. A site's position
    is the first base in the sequence that is covered by a base in the
    site. Thus, for sites on the reverse strand, the position is the base
    occupied by the last position in the site. The second column is the
    name of the frequency matrix file corresponding to each site. The
    third column is the strand on which the site is found, where 1 is the
    forward strand and 0 is the reverse strand.

    46 Kr.fm 1
    81 hb.fm 0
    89 Kr.fm 0
    139 bcd.fm 1

Site density file:
    This is a two column file with space or tab separation. The first
    column is the name of the frequency matrix file for each factor. The
    second column is the density of sites to be randomly placed by the
    simulation. Density units are in binding sites per base pair.

    bcd.fm 0.005
    hb.fm 0.005
    kni.fm 0.003
    Kr.fm 0.010
    tll.fm 0.003

Background file:
    This is a two column file with space or tab separation. The first
    column is the bases and the second column is their frequency.
    Frequencies must sum to 1.

    A 0.25
    C 0.25
    G 0.25
    T 0.25
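
Checking input files before a run:
    The format rules above for the indel lengths file, the frequency
    matrix files and the background file can be checked before starting a
    simulation. The following is a minimal Python sketch of such a check.
    It is not part of CisEvolver. The function names and the rounding
    tolerance are assumptions made for illustration, as is the requirement
    that the background file list exactly the bases A, C, G and T. How
    strictly CisEvolver itself enforces each rule is not stated here.

```python
# Hypothetical helpers for illustration; none of these names come from CisEvolver.
TOLERANCE = 1e-3  # assumed slack for values rounded to four decimal places


def rows(text):
    """Split the non-empty lines of a file's text into whitespace-separated fields."""
    return [line.split() for line in text.splitlines() if line.strip()]


def check_indel_lengths(text):
    """Return a list of problems found in the text of an indel lengths file."""
    problems = []
    total = 0.0
    for expected, fields in enumerate(rows(text), start=1):
        try:
            length_text, prob_text = fields  # exactly two columns
            length, prob = int(length_text), float(prob_text)
        except ValueError:
            problems.append(f"line {expected}: expected 'length probability'")
            continue
        if length != expected:  # lengths must run 1, 2, 3, ...
            problems.append(f"line {expected}: length is {length}, expected {expected}")
        if not 0.0 <= prob <= 1.0:
            problems.append(f"line {expected}: probability {prob} is outside [0, 1]")
        total += prob
    if total > 1.0 + TOLERANCE:  # a sum below 1 is allowed, above 1 is not
        problems.append(f"probabilities sum to {total:.4f}, which is more than 1")
    return problems


def check_frequency_matrix(text):
    """Return a list of problems found in the text of a frequency matrix file."""
    problems = []
    for number, fields in enumerate(rows(text), start=1):
        try:
            a, c, g, t = (float(field) for field in fields)  # columns A C G T
        except ValueError:
            problems.append(f"line {number}: expected four numeric columns")
            continue
        if abs(a + c + g + t - 1.0) > TOLERANCE:
            problems.append(f"line {number}: frequencies sum to {a + c + g + t:.4f}, not 1")
    return problems


def check_background(text):
    """Return a list of problems found in the text of a background file."""
    problems = []
    freqs = {}
    for number, fields in enumerate(rows(text), start=1):
        try:
            base, freq_text = fields  # exactly two columns
            freqs[base] = float(freq_text)
        except ValueError:
            problems.append(f"line {number}: expected 'base frequency'")
    if sorted(freqs) != ["A", "C", "G", "T"]:  # assumed: all four bases, upper case
        problems.append("expected exactly the bases A, C, G and T")
    if abs(sum(freqs.values()) - 1.0) > TOLERANCE:
        problems.append(f"frequencies sum to {sum(freqs.values()):.4f}, not 1")
    return problems
```

    Each function takes the full text of a file and returns a list of
    problem descriptions, so an empty list means that no violation of the
    rules above was found. The tolerance is there because frequencies
    written with four decimal places, as in the matrix example above, may
    not sum to exactly 1.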