The Gene Simulator


There is currently no user-friendly tool for modeling gene expression at the level of individual polymerases elongating along a gene. Many published models share common features, but they are generally inaccessible to newcomers. We have implemented a stochastic transcription simulator that is simple to use and customize.

We hope that this gene simulator will be useful to people who do not want to develop models from scratch, either analytically or computationally, but would still like to investigate how gene expression is affected by the characteristics of a genetic circuit, such as length, binding rate or elongation rate. It can also yield predictions for systems that are too complicated to treat analytically and whose dynamics are not intuitively apparent. You can get the Python source code here.


Modelling Framework


We are interested in developing a quantitative model for transcription, the first step in gene expression, in which a particular segment of DNA is read and transcribed into RNA by the enzyme RNA polymerase. The polymerase binds to the DNA and slides forward, reading the DNA sequence and synthesizing the corresponding RNA at every step.

Genes are modelled as one-dimensional lattices of length $L$, where each locus in the lattice corresponds to a stretch of DNA the size of the polymerase footprint. Each locus may be either empty or occupied by one polymerase. Every locus has five rates, each defining how many actions happen per unit of time, on average, at that locus. All actions are treated as simple biochemical reaction steps, so their waiting times are exponentially distributed.

Elongation: If the locus is occupied, the polymerase can take a step forward, freeing its location and occupying the next one.

Backtrack: If the locus is occupied, the polymerase can take a step back, freeing its location and occupying the previous one.

Binding: If the locus is free, a new polymerase may bind to it.

Termination: If the locus is occupied, it can become unoccupied, and an RNA transcript is produced.

Premature termination: If the locus is occupied, it can become unoccupied, but no RNA transcript is produced.
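The five actions above can be sketched as a small state-update routine. This is only an illustration, not the simulator's actual code: the names `LocusRates` and `apply_action`, the field names, and the list-of-booleans occupancy are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class LocusRates:
    """The five per-locus rates, in events per unit time (illustrative names)."""
    elongation: float = 0.0
    backtrack: float = 0.0
    binding: float = 0.0
    termination: float = 0.0
    premature_termination: float = 0.0


def apply_action(occupied, kind, i):
    """Return (new_occupancy, transcript_made) after action `kind` at locus i."""
    new = list(occupied)
    if kind == "binding":
        if new[i]:
            raise ValueError("binding needs a free locus")
        new[i] = True
        return new, False
    if not new[i]:
        raise ValueError(f"{kind} needs an occupied locus")
    new[i] = False  # every other action vacates the current locus
    if kind == "elongation":
        new[i + 1] = True  # step forward
    elif kind == "backtrack":
        if i == 0:
            raise ValueError("cannot backtrack from the first locus")
        new[i - 1] = True  # step back
    # Only a regular termination yields an RNA transcript.
    return new, kind == "termination"
```

For instance, `apply_action([False, False, True], "termination", 2)` empties the last locus and reports that a transcript was made, whereas `"premature_termination"` empties the locus without one.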


Example Gene Model


Figure: Example gene. You can try simulating this gene here.

Locus 1 is a binding element, where polymerase can bind with rate $B$, and elongate with rate $k$, with all other rates being zero.

On loci 2-3, 5-10 and 12-13, elongation rate is $k$, and backtrack rate is $r$.

Locus 4 is also an elongation element with backtrack rate $r$, but elongation rate $p$.

If $p < k$, locus 4 acts as a pausing element, where the polymerase takes longer than usual to advance.

On locus 11, in addition to the normal elongation rates, the polymerase may abort transcription (premature termination) with rate $A$.

On locus 14, the polymerase can either backtrack with rate $r$, or terminate with rate $T$ and produce an RNA transcript.
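As a rough sketch, the 14-locus example could be written down as a list of per-locus rate tables. The dictionary layout, the key names, and the function `make_example_gene` are assumptions for illustration, and the numeric values are placeholders, since the text gives only the symbols. Locus 11 is read here as having elongation rate $k$ and backtrack rate $r$ in addition to $A$, which is an interpretation of "normal elongation rates".

```python
# Placeholder values; the text defines only the symbols B, k, r, p, A and T.
B, k, r, p, A, T = 0.5, 10.0, 1.0, 2.0, 0.3, 5.0


def make_example_gene():
    """Return the example gene as a list of rate dictionaries (locus 1 first)."""
    gene = {}
    gene[1] = {"binding": B, "elongation": k}            # binding element
    for locus in (2, 3, 5, 6, 7, 8, 9, 10, 12, 13):      # ordinary elongation
        gene[locus] = {"elongation": k, "backtrack": r}
    gene[4] = {"elongation": p, "backtrack": r}          # pausing element if p < k
    # Interpretation: "normal" rates k and r, plus premature termination A.
    gene[11] = {"elongation": k, "backtrack": r, "premature_termination": A}
    gene[14] = {"backtrack": r, "termination": T}        # terminator
    return [gene[locus] for locus in range(1, 15)]


example_gene = make_example_gene()
```

Rates that are not listed for a locus are taken to be zero, so a missing key means the action cannot happen there.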



Algorithm


The simulation was implemented as a Python script using the Gillespie algorithm. After reading the user input and initializing an empty gene, the algorithm loops through the following steps until the specified end time is reached:


1) The current state of the gene is scanned, and the rates $r_i$ of all possible actions $A_i$ are summed into a value $Z = \sum_{i} r_i$.

2) One of the possible actions, $A_i$, is randomly chosen with probability $p_i = r_i/Z$, where $r_i$ is the rate of $A_i$.

3) The time $\Delta t$ until the chosen action happens is drawn from the exponential distribution $p(t) = Z e^{-Z t}$; in the Gillespie algorithm the waiting time is set by the total rate $Z$ rather than by $r_i$ alone.

4) The total simulation time is increased by $\Delta t$, and the gene state is updated according to the action taken.

5) If the total time is less than the specified end time, the loop restarts; otherwise the simulation ends and the output is displayed.
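The loop above can be sketched in a few lines of Python. This is a minimal illustration of the standard Gillespie direct method under stated assumptions, not the simulator's own source: the function names, the list-of-dictionaries gene layout, and the rule that a polymerase cannot move onto an occupied locus are all choices made for this example. It also assumes the waiting time is drawn with the total rate $Z$, as in the standard method.

```python
import random


def allowed_actions(gene, occupied):
    """List (rate, kind, position) for every action possible in this state."""
    actions = []
    n = len(gene)
    for i, rates in enumerate(gene):
        if not occupied[i]:
            if rates.get("binding", 0.0) > 0:
                actions.append((rates["binding"], "binding", i))
            continue
        # Assumed exclusion rule: a move needs an existing, empty target locus.
        if rates.get("elongation", 0.0) > 0 and i + 1 < n and not occupied[i + 1]:
            actions.append((rates["elongation"], "elongation", i))
        if rates.get("backtrack", 0.0) > 0 and i > 0 and not occupied[i - 1]:
            actions.append((rates["backtrack"], "backtrack", i))
        for kind in ("termination", "premature_termination"):
            if rates.get(kind, 0.0) > 0:
                actions.append((rates[kind], kind, i))
    return actions


def simulate(gene, end_time, seed=None):
    """Run the loop until end_time; return the times of completed transcripts."""
    rng = random.Random(seed)
    occupied = [False] * len(gene)
    t = 0.0
    transcript_times = []
    while t < end_time:
        # Step 1: sum the rates of all possible actions into Z.
        actions = allowed_actions(gene, occupied)
        Z = sum(rate for rate, _, _ in actions)
        if Z == 0:
            break  # nothing can happen any more
        # Step 2: pick one action with probability r_i / Z.
        _, kind, i = rng.choices(actions, weights=[a[0] for a in actions])[0]
        # Step 3: exponential waiting time with the total rate Z.
        t += rng.expovariate(Z)
        if t > end_time:
            break  # the next event would fall after the end time
        # Step 4: update the gene state.
        if kind == "binding":
            occupied[i] = True
        elif kind == "elongation":
            occupied[i], occupied[i + 1] = False, True
        elif kind == "backtrack":
            occupied[i], occupied[i - 1] = False, True
        else:
            occupied[i] = False
            if kind == "termination":
                transcript_times.append(t)
    return transcript_times
```

A short three-locus gene is enough to try it, for example `simulate([{"binding": 1.0, "elongation": 5.0}, {"elongation": 5.0, "backtrack": 0.5}, {"backtrack": 0.5, "termination": 3.0}], end_time=200.0, seed=1)`. Passing a seed makes a run reproducible, which is convenient when comparing parameter choices.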



Output


The simulator outputs four graphs. You can try it out here.

a) A gene occupancy plot: each line corresponds to a snapshot of the gene at one moment in time. White pixels are empty loci, and pixels representing polymerases are colored according to the order in which the polymerases bound.

b) A histogram of the times each polymerase took to transcribe the gene.

c) A histogram of the intervals between two consecutive terminations.

d) The number of terminated polymerases over time.
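The quantities behind graphs b) to d) can be derived from a simple record of each finished polymerase. The sketch below assumes a hypothetical record format, a list of `(bind_time, termination_time)` pairs, and takes the transcription time to run from binding to termination; neither detail is specified in the text, and plotting is left out.

```python
def summarize(records):
    """Compute the data for graphs b), c) and d).

    records: (bind_time, termination_time) pairs, one per polymerase that
    terminated (hypothetical format chosen for this example).
    """
    # b) time each polymerase spent on the gene
    transit_times = [end - start for start, end in records]
    # c) intervals between consecutive terminations
    ends = sorted(end for _, end in records)
    intervals = [later - earlier for earlier, later in zip(ends, ends[1:])]
    # d) number of terminated polymerases as a step function of time
    cumulative = [(t, count) for count, t in enumerate(ends, start=1)]
    return transit_times, intervals, cumulative


transit, gaps, counts = summarize([(0.0, 4.0), (1.0, 6.5), (2.0, 7.0)])
```

The two lists can be passed to any histogram routine, and `counts` can be drawn as a step plot of terminations over time.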