Data Overview¶
This page gives an overview of data output from DISHTINY experiments in two contexts:
data generated on your own machine using the DISHTINY executable, and
data generated on a compute cluster and uploaded to S3 using DISHTINY scripts.
You probably want to check when this page was last updated.
🛠️ 👷‍♂️ If anything is unclear or out of date, please open an issue or a pull request so we can fix it! Also, if you have project-specific questions or anything else you'd like to chat about, feel free to get in touch directly. I'd love to hear from you!
Data Files¶
This section covers running the dishtiny executable yourself to generate your own data.
To write data files, you'll need to use the --DATA_DUMP 1 command line flag.
Data file filenames look a little funny because they're named using a convention that allows metadata to be easily packed into and unpacked from the filename.
Usually, look for the a= field to get a description of what's actually in a file.
You can extract key-value metadata from these filenames easily in Python using the keyname package hosted on PyPI and in C++ using Empirical's header-only keyname utilities.
Each simulation thread dumps its own data files, which describe simulation state within its purview.
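To give a feel for how metadata is packed into a filename, here is a minimal standard-library sketch. It assumes fields are written as key=value pairs joined by + characters (the pattern visible in directory names like stage=...+what=tabulated_and_stitched). The example filename, its series, stint, and ext fields, and the unpack_filename helper are all hypothetical. In practice, the keyname package does this job for you.

```python
import os


def unpack_filename(path):
    """Split a `key=value+key=value` style filename into a dict.

    Assumes "+" separates fields and the first "=" in each field
    separates the key from the value.
    """
    name = os.path.basename(path)
    fields = {}
    for chunk in name.split("+"):
        key, sep, value = chunk.partition("=")
        if sep:  # ignore chunks that contain no "="
            fields[key] = value
    return fields


# hypothetical filename, for illustration only
example = "a=cell_census+series=1+stint=0+ext=.csv"
meta = unpack_filename(example)
print(meta["a"])  # cell_census
```

Note that every value comes back as a string, so numeric fields need to be converted before sorting or comparing them.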
Configuration Settings¶
a=asconfigured…: a dump of runtime configuration settings, useful for keeping a record of how a run was configured or for configuring another run identically.
Simulation State¶
a=birth_log…, a=death_log…, a=spawn_log…: raw records of all cell birth, death, and spawn events that took place in the last x updates before the simulation ended.
a=cell_census…: data describing each individual live cell's state, organized by cell index and update.
a=coalescence_result…: if the population was inoculated (i.e., -GENESIS "innoculate") with genome files, this file describes the relative fitness of each strain.
a=kin_conflict_statistics, a=kin_conflict_by_replev_statistics: summary of crowding conflict between cells with common versus different kin id.
a=demographic_phenotypic_phylogenetic_metrics…: miscellaneous summary statistics of population state, organized by update.
Serialized Genomes¶
a=genome…: human-readable serialization of a single genome.
a=genome_statistics…: various metrics describing the associated genome (each a=genome file has a corresponding a=genome_statistics file).
a=population…: serialization of all genomes present in the population at simulation end.
Visualizations¶
a=drawings…: PNG visualizations of different aspects of simulation state.
a=montage…: summary graphic combining several different visualizations.
Data File Types¶
*.bin files contain binary-format serialization data generated using the Cereal C++ library.
These aren't portable or human-readable.
*.csv files are comma-separated value tables.
You should be able to open these with your preferred scripting language or spreadsheet program.
*.gz files are gzipped.
To work with these files, you'll need to use gunzip to uncompress them.
*.xz files are xz-compressed.
To work with these files, you'll need to use xz --decompress to uncompress them.
*.json files are formatted using JavaScript Object Notation.
Your preferred scripting language should have some tools to open these.
*.meta files contain metadata corresponding to each file.
You can ignore them or archive them alongside their corresponding data file for a more complete record of your experiments.
*.tar files are tarballs.
To work with these files, you'll need to use tar to extract them.
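If you'd rather not decompress files by hand, most scripting languages can read compressed tables directly. The sketch below uses only the Python standard library to open a .csv, .csv.gz, or .csv.xz file based on its extension. The demo file and its Update and Value columns are made up for illustration and don't reflect the columns in actual DISHTINY output.

```python
import csv
import gzip
import lzma
import os
import tempfile


def open_text(path):
    """Open a plain, gzipped, or xz-compressed text file by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_rows(path):
    """Read a (possibly compressed) csv file into a list of dicts."""
    with open_text(path) as handle:
        return list(csv.DictReader(handle))


# write a tiny gzipped demo table, then read it back
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(demo_path, "wt", newline="") as handle:
        handle.write("Update,Value\n0,1.5\n1,2.5\n")
    rows = read_rows(demo_path)

print(rows[0]["Value"])  # 1.5
```

If you use pandas, read_csv can also infer gzip or xz compression from the file extension.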
Data Repository¶
This section covers using data generated from existing dishtiny experiments.
If you have interesting ideas for hypotheses to test or further experiments to run with this data, get in touch to collaborate or, if you prefer, take the data and run with it on your own! (I'd still love to hear about what you're up to and what you find.)
Hassle-free Pre-wrangled Data via Web GUI 🤠¶
We host a pre-configured Jupyter notebook for each of our consolidated datasets on https://mybinder.org. You can jump into them here.
Right now, the binder environment is stocked with a few Python data analysis and visualization tools. Binder also has other scripting languages like R and Julia on tap, though! If you don't see your preferred tools or packages in the binder, make a pull request, open an issue, or send me an email and we'll make sure they're there.
Spelunking in the OSF Data Repository¶
Our evolution experiments upload all data to the Open Science Framework as they run. This means that our data is available hot off the presses! ♨️ You can peruse and download it at https://osf.io/9vkd7/. The Center for Open Science also provides a Python tool called osfclient that allows programmatic file downloads.
Inside the repro/ folder, you'll find logs from our Slurm jobs. There's not much of scientific interest to analyze here, but the logs are useful for verifying how data was generated and troubleshooting technical issues.
At the top level, data is organized by endeavor.
An endeavor groups sets of evolutionary runs that are generated using the same software source and are part of a unified experimental design.
You probably only want to use data from one endeavor at a time.
We perform an evolutionary run as a chain of four-hour Slurm jobs. Except for the first (which generates a random population), each job downloads and opens the preceding job's population, performs four hours of evolutionary simulation, then saves and uploads data files and its population. Lastly, each job submits the next job in the chain. (This process is somewhat analogous to the daily flask transfer in the E. coli long-term evolution experiment.)
We concurrently run an array of entirely independent evolutionary chains. Currently, we run 40 independent chains. (These independent evolutionary chains are somewhat analogous to the twelve independent populations in the E. coli long-term evolution experiment.)
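The chain-of-jobs scheme can be sketched in a few lines of Python. This is a toy model and not the actual DISHTINY job scripts. Genomes are plain integers, simulate_stint is a hypothetical stand-in for four hours of simulation, a dict stands in for the remote data store, and a for loop stands in for each job submitting its successor. The (series, stint) keys mirror how output from independent chains and successive jobs is labeled.

```python
import random


def simulate_stint(population, rng):
    """Hypothetical stand-in for one four-hour block of evolution."""
    return [genome + rng.choice((-1, 0, 1)) for genome in population]


def run_chain(series, num_stints, storage, pop_size=4):
    """Run one evolutionary chain, saving each stint's population."""
    rng = random.Random(series)  # each chain gets its own seed
    for stint in range(num_stints):
        if stint == 0:
            # the first job generates a random population
            population = [rng.randint(0, 9) for _ in range(pop_size)]
        else:
            # later jobs "download" the preceding job's population
            population = storage[(series, stint - 1)]
        population = simulate_stint(population, rng)
        storage[(series, stint)] = population  # "upload" the result
        # a real job would now submit the next job in the chain;
        # here the for loop plays that role


storage = {}
for series in range(3):  # three independent chains, for brevity
    run_chain(series, num_stints=5, storage=storage)

print(sorted(storage)[:3])  # [(0, 0), (0, 1), (0, 2)]
```

Because each stint's population is saved under its own key, any stint from any series can be retrieved later for follow-on experiments.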
Here's an overview of what's within each endeavor folder.
control-competitions/ directory¶
Contains results from wild-type versus wild-type competitions, which are used as controls to determine if other competitions have a clear winner.
evolve/ directory¶
Contains data files directly generated by the evolutionary chains.
These data files are organized and labeled according to stint (the index within a successive daisy-chain of evolutionary jobs) and series (the index of the job's evolutionary chain among independent replicates).
genomes/ directory¶
Contains human-readable serialized genomes. Evolutionary chain jobs upload genomes (one arbitrarily chosen, one chosen according to prevalence) here. Genomes generated in follow-on analyses (for example, nopping out phenotypically-neutral sites or fitness-neutral sites) end up here, too.
monocultures/ directory¶
Contains data from monoculture experiments, where a single genome extracted from an evolutionary chain is grown in isolation with mutation disabled in order to study its phenotypic characteristics.
mutant-competitions/ directory¶
Contains data from competition experiments between a wild-type strain and a variant of that strain after the mutation operator has been applied. Used to characterize the effects of a single mutation event on fitness for particular strains.
mutant-phenotype-differentiation/ directory¶
Contains data from side-by-side comparison experiments to determine if any phenotypic difference exists between a wild-type strain and a variant of that strain after the mutation operator has been applied.
mutating-competitions/ directory¶
Contains data from competition experiments between a wild-type strain with ongoing mutations disabled and the same wild-type strain with ongoing mutations enabled. Used to characterize the effects of ongoing mutation on fitness for particular strains.
noncritical-nopout-competitions/ directory¶
Contains data from competition experiments between a wild-type strain and a variant of the strain with nopout of sites that do not detectably decrease fitness when individually nopped out. Used to assess potential functionality of noncritical sites.
noncritical-phenotypeneutral-nopinterpolation-competitions directory¶
A noncritical nopout variant refers to a genome with nopout of all sites that do not detectably decrease fitness when individually nopped out. A phenotype-neutral variant refers to a genome with nopout of as many sites as possible without affecting phenotype (kind of "whittled down").
In order to understand the role of noncritical sites that do affect the phenotype, we need to study intermediate genomes between these two poles. Because the noncritical nopout variant should have strictly more sites nopped out, we can generate intermediate genomes by nopping out sites that the phenotype neutral nopout variant contains but the noncritical nopout variant does not.
This directory contains competition experiments between a wild-type strain and variants that interpolate between the noncritical nopout variant and the phenotype neutral nopout variant.
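The interpolation idea can be illustrated with a toy example. In this sketch, a genome is a list of instruction names and nopping out a site means replacing it with "nop". The instruction names, the interpolate_nopouts helper, and the choice to nop out the extra sites in left-to-right order are all assumptions made for illustration. The actual experiments may select and order sites differently.

```python
def interpolate_nopouts(phenotype_neutral, noncritical, num_steps, nop="nop"):
    """Build genomes stepping from one nopout variant to the other.

    Starts at the phenotype-neutral variant and nops out progressively
    more of the sites it still contains but the noncritical nopout
    variant does not, ending at the noncritical nopout variant.
    """
    extra_sites = [
        idx
        for idx, (pn_inst, nc_inst) in enumerate(zip(phenotype_neutral, noncritical))
        if pn_inst != nop and nc_inst == nop
    ]
    variants = []
    for step in range(num_steps + 1):
        # number of extra sites to nop out at this interpolation step
        count = round(step * len(extra_sites) / num_steps)
        genome = list(phenotype_neutral)
        for idx in extra_sites[:count]:
            genome[idx] = nop
        variants.append(genome)
    return variants


# toy genomes with made-up instruction names
phenotype_neutral = ["add", "nop", "mul", "jmp", "send", "nop"]
noncritical = ["add", "nop", "nop", "jmp", "nop", "nop"]

variants = interpolate_nopouts(phenotype_neutral, noncritical, num_steps=2)
for genome in variants:
    print(genome)
```

The first and last genomes produced are the two endpoint variants themselves, and the genomes in between are the intermediates that would be competed against the wild type.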
perturbation-...-state-...-competitions directory¶
Contains data from competition experiments between a wild-type strain and variants of that strain where elements of virtual CPU input or output are disabled. Used to assess how many independent inputs and outputs a strain is relying on.
phenotype-neutral-nopout-competitions directory¶
Contains data from competition experiments between a wild-type strain and its phenotype-neutral nopout variant. Serves as a sanity check that phenotype-neutral nopout variants are indeed also fitness neutral.
phenotype-neutral-nopout-phenotype-differentiation-competitions directory¶
Contains data from side-by-side comparison experiments to determine if any phenotypic difference exists between a wild-type strain and its phenotype-neutral nopout variant. Serves as a sanity check that phenotype-neutral nopout variants are indeed phenotypically neutral.
predecessor-battles directory¶
Contains data from comparison experiments between a population and its ancestor population from 10 stints ago. Used to assess how fitness changes over evolutionary time.
selfsend-...-competitions directory¶
Contains data from competition experiments between a wild-type strain and variants of that strain where messages with certain tags are disabled by returning them to the sender. Used to assess how many distinct messages a strain is relying on.
stage=...+what=tabulated_and_stitched directory¶
Contains summary data that has been collated, tabulated, and stitched from across series for a single stint.
stage=...+what=collated directory¶
Contains summary data that has been collated across all series and all stints.
strain-competition/ directory¶
Contains data from competition experiments between genomes from different series.
variant-competition/ directory¶
Contains data from competition experiments between wild-type and corresponding single instruction nop-out variants.
wildtype-doubling-time/ directory¶
Contains measurements of how fast a wild-type strain grows from filling a quarter of a grid to filling half of the grid. Used, in addition to competitions, as another measure of strain fitness.
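As a rough illustration of this kind of measurement, the sketch below computes the number of updates that elapse between a strain first filling a quarter of the grid and first filling half of it. The doubling_time helper, the per-update cell counts, and the grid size are hypothetical. How DISHTINY actually records this measurement may differ.

```python
def doubling_time(occupancy, grid_size):
    """Updates elapsed between first filling 1/4 and 1/2 of the grid.

    `occupancy` holds the number of live cells at each update.
    Returns None if either threshold is never reached.
    """
    quarter_update = None
    for update, num_cells in enumerate(occupancy):
        if quarter_update is None and num_cells >= grid_size / 4:
            quarter_update = update
        if quarter_update is not None and num_cells >= grid_size / 2:
            return update - quarter_update
    return None


# made-up cell counts on a hypothetical 100-cell grid
counts = [10, 20, 25, 30, 40, 50, 60]
elapsed = doubling_time(counts, grid_size=100)
print(elapsed)  # 3
```

A smaller result means faster growth, which is why this measure can complement competition outcomes when comparing strains.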