Analyzing Multiple Datasets with the EnzoSimulation Class

written by Britton, on Jun 20, 2009 1:37:00 PM.

We generally analyze numerical simulations a single dataset at a time, but quite often what we really want to do is analyze the entire simulation all at once. This is particularly necessary when studying anything time dependent. This is usually accomplished by writing some disposable wrapper for the analysis routine that loops over similarly named, indexed datasets. If you’re particularly lazy, you may end up just calling your code over and over and changing the arguments by hand. When running a cosmology simulation, you will likely have datasets output at specific redshifts, in addition to data coming out at equally-spaced time intervals. This can make time-sorting of datasets a little trickier, or at least a little more tedious.

To make this a little easier, we have the EnzoSimulation class (yt/extensions/EnzoSimulation.py). To instantiate an EnzoSimulation object, you only need to provide the parameter file use to run the simulation.

>>> from yt.extensions.EnzoSimulation import *

>>> ES = EnzoSimulation("128Mpc256grid_SFFB.param")
The EnzoSimulation class doesn’t give you much, but what it does give you is nice. The EnzoSimulation object we just made has one main attribute, a list called allOutputs. This is a time-sorted list of all datasets in the entire simulation. Each item in the list is a dictionary containing entries for the path to the dataset parameter file, the time associated with that dataset, and if this is a cosmological simulation, the associated redshift.
>>> w.allOutputs[0]
{'filename': '/Users/britton/EnzoRuns/cool_core_unreasonable/RD0000/RD0000', 
 'time': 0.81631644849936602, 'redshift': 99.0}
Who wants to analyze a dataset at redshift 99, though? We can limit time interval over which datasets are kept in the list with keywords, initial_time, initial_redshift, final_time, and final_redshift, given when instantiating the EnzoSimulation object.
>>> ES = EnzoSimulation("128Mpc256grid_SFFB.param", 
                        initial_redshift=1,final_redshift=0)
Time and redshift keywords can also be mixed together. One other keyword can be specified. If links=True is thrown as instantiation, all items in the allOutputs list will also have entries, previous and next, that point to the previous and next items on the list. This is useful if you’re buried in a loop somewhere without index information and you need to know what the nearest datasets are. Now you’re disposable wrapper becomes:
from yt.extensions.EnzoSimulation import *

ES = EnzoSimulation("128Mpc256grid_SFFB.param",initial_redshift=5)

for output in ES.allOutputs:
    do_analysis_on_this_dataset(output['filename'])

You can use the EnzoSimulation class as a superclass for your own analysis subclasses. A good example of this is yt/extensions/SimulationHaloProfiler.py. This uses the EnzoSimulation class to run the HaloProfiler over multiple datasets. Here’s the entire source:

import yt.extensions.HaloProfiler as HP
from EnzoSimulation import *

class SimulationHaloProfiler(EnzoSimulation):
    def __init__(self,EnzoParameterFile,HaloProfilerParameterFile,**kwargs):
        EnzoSimulation.__init__(self,EnzoParameterFile,**kwargs)
        self.HaloProfilerParameterFile = HaloProfilerParameterFile

    def runHaloProfiler(self):
        for output in self.allOutputs:
            halo_profiler = HP.HaloProfiler(output['filename'],self.HaloProfilerParameterFile)
            halo_profiler.makeProfiles()
            halo_profiler.makeProjections()
            del halo_profiler
The call to EnzoSimulation.__init__ ensures that all of the EnzoSimulation object attributes, namely allOutputs, gets inherited by the SimulationHaloProfiler object. Without that call, all we would get are the EnzoSimulation object methods, which aren’t very useful on their own. Using this subclass, all that needs to be done to run the HaloProfiler over the entire simulation is this:
from yt.extensions.SimulationHaloProfiler import *
SHP = SimulationHaloProfiler("128Mpc256grid_SFFB.param",
                             "halo_profiler_parameters.par",
                             initial_redshift=5)
SHP.runHaloProfiler()
Simple as that.

yt Release Cycle

written by Matt, on May 29, 2009 9:54:00 PM.

A little bit over a week ago, I branched out a yt-1.5 branch in subversion. The svn:externals in the main Enzo repository have been updated, as well, to point to this new location. We’re putting the finishing touches on the 1.5 branch, which will be released as the first major release in about nine months or so. The development team has really picked up steam in that time, and we’re all really quite proud of what we’ve accomplished in the mean time.

The new features include things like pervasive and transparent parallelism, better debugging, more intuitive access to data objects, new halo finders, better clump finding, better access to plot callbacks, and some extensions like a light cone generator and a halo profiler. The command line interface has been overhauled, as well, and we’ve set up a pastebin as well as lots of fun debugging systems.

All of the code is in place – at this point we’re just trying our best to write the documentation. It’s a hard task, and it’s become kind of clear to me that the first version of the docs were good in some ways but bad in a bunch of ways, too. It’s tough, sometimes, to write a big project and then step back and try to see it from the outside. This time, Britton is helping with the documentation. We’re aiming for a full and proper release in the middle of June.

And then the next release will probably be just around the corner – we’re working on some simulated observations, some better parallelism, dieting the hierarchy and speeding up projections. That one will be 1.6, and it will be just a point release. The next gigantic upgrade – as big a step forward as the parallelism layer is for yt-1.5 versus yt-1.0 – is still a ways off. More on that another time.

Opening the Barn

written by Britton, on May 4, 2009 9:43:00 AM.

In their time working with Enzo, users have developed a host of useful tools, scripts, and snippets of code for performing any number of Enzo-related tasks. Some of these, such as yt, have grown into enormously useful, full-featured analysis toolkits that are now actively developed and used by a large community. Still, the Enzo landscape is dotted with with tiny niches that are best filled by small, standalone pieces of code.

For a while, I have wanted share some of the things I have created in in this vein, but found it tedious to continually copy my latest version of something from machine to machine and inform people in person of its whereabouts. I’m sure that over time this has been the case for many, the end result being frequent reinvention of many wheels.

Now we have something better. The EnzoTools Barn is now the place to share your creations with the community.

How it works: Each submission will be made into a Mercurial repository to which only the author will have write access. This will allow the author to make updates and changes to their code any time they like. An entry will be automatically generated on the Barn main page with a short description and screenshot. Each entry will provide links to downloadable tarballs and the code’s revision history. Further instructions cane be found here.

If you’ve written something you find useful, odds are that it will be useful to many others, so please share!

Plot Callbacks

written by Matt, on Apr 7, 2009 12:11:00 AM.

2
The idea of a ‘callback’ as a means of modifying a plot might need some justification. If you imagine a plot as being stateful, then a ‘callback’ is completely unnecessary – you just make modifications to the plot and they are reflected. Unfortunately, in yt, this is not how the plots work – because of the transition from the data-space to the image-space, the plots are destroyed and reconstructed every time the view limits change. (For instance, on every zoom command.) So to preserve any modification to a plot, following this destruction/reconstruction, a routine must be called that re-modifies it. In the next version of yt, the engine that drives this process will be examined. So, until very recently, to add a ‘callback’ for something as simple as overplotting velocity, you had to construct an object and manually add it to a plot:
p = pc.add_slice("Density", 0)
p.add_callback(QuiverCallback("y-velocity", "z-velocity", 16))
In this, the QuiverCallback object is instantiated, it is told which fields to use, and it then gets added to a SlicePlot object. This is clumsy and prone to error! So recently, a new mechanism was added, wherein each plot now possesses a simpler dictionary-like trait called ‘modify’. You can now do the same thing as above with the ‘velocity’ convenience routines:
p = pc.add_slice("Density", 0)
p.modify["velocity"]()
This handles all the field naming and just adds the velocity field over top. Hopefully in the future, we can find ways to simplify this a bit further!

Remote debugging

written by Matt, on Apr 5, 2009 10:32:00 PM.

When debugging a python script, the simplest possible way to do so is to run with the pdb module:
$ python2.6 -m pdb my_happy_script.py
This will provide a prompt to get going, as well as catching any uncaught exceptions and dumping them into an interactive pdb session. Unfortunately, this simply doesn’t work with parallel (mpi4py) run processes. We can no longer depend on the standard input/output mechanisms for debugging, and we need to be able to attach to a specific process. There are several solutions for remote debugging in Python - Komodo and winpdb are the two main solutions that come to mind. These two solutions are excellent, robust, and well-tested. They’re suitable for detailed analysis of running processes as well as post-mortem crash analysis. Unfortunately, while awesome, these two solutions are both kind of heavyweight. To solve this problem, I recently implemented a very simple, lightweight xmlrpc proxy to pdb. If you run with the command line option
--rpdb
, upon any uncaught exception the program will spawn an xmlrpc server that listens (without authentication) for connections. Any commands sent over that connection will be proxies to the pdb session, and then returned to the remote server. To connect to this xmlrpc server, the command
yt rpdb
has been created, which spawns a pdb-lookalike prompt that proxies your commands. Upon issuance of the
shutdown
command, the xmlrpc server shuts itself down, and when all tasks have been shutdown, the mpi4py process terminates. First off, this is a convenience function – it’s not meant to be robust, it’s not meant to work in all situations, and it can in fact be quite dangerous. There is no authentication, so anyone that just-so-happens to connect on the right port to your running process then controls that process and can do basically anything. You should only do this in extremely controlled environments, but, it can be very useful for reporting bugs, debugging and fixing bugs, and just even examining what’s going on in a process. Second of, it can be kind of crazy to issue shutdown to 32 processes, if you happen to be running that many! I am still thinking about the best way to issue a bunch of shutdown commands. And finally, right now it only binds to localhost. So if you’re not running locally, you would have to bind to
0.0.0.0
to open the port. This is probably unwise. Well, anyway, I hope that this can be of use. Use it with care!

Parameter File Store

written by Matt, on Mar 3, 2009 12:21:00 PM.

Often when doing analysis tasks, one wants to keep an object (that may have been painstakingly created by hand!) around between sessions. yt implements a pickle protocol to handle this. Unfortunately, it’s not quite as simple as just pickling – we want to ensure we have uniform retrieval of objects, too, and we want to ensure we’re getting them back in the right manner.

The biggest obstacle to retrieving an object is the affiliation of an object with a given parameter file. At their most base level, a parameter file can be described by a path. However, while this works for a single instantiation of a yt session, often between sessions data will be moved – between supercomputing centers, or across mounted external hard drives, or even within a given computing center to a different storage system. A means of addressing, or at least uniquely identifying, parameter files is necessary to ensure uniform access across instances of an analysis session. An absolute path, while unique, is not necessarily invariant. To this end, the basename (the final element in the absolute path), the simulation time, and the creation time of the simulation output (CurrentTimeIdentifier in Enzo) are used to identify a given static output. An MD5 hash is generated of these three items, which is then used as a key for the parameter file. By this means, collisions between different parameter files (rather than copies of a single parameter file) are made extremely unlikely.

Upon retrieval of the object, the key is handed to a parameter file storage object. This object keeps track of all instantiated parameter files; whenever a new parameter file object is instantiated, its hash is generated and compared against the set of existing parameter files. If a match is found, the current path is compared to the path being used during instantiation, and the path in the data store is updated as necessary. If the parameter file is new to the system, it is inserted. By this means, the locations of all known parameter files are kept as up to date as possible; this is by no means a foolproof system, but it works for most use cases.

Originally when implementing the ParameterFileStore, I wrote it in SQLite. This didn’t work out, because not all machines had SQLite installed, or even the python library affiliated with it. So, next up: shelve. Unfortunately, the shelve module relied on a variety of backends, and this didn’t end up working reliably on Kraken. So, finally, today I am (yt.enzotools.org/changeset/1196) the new flat file storage of parameter files. The root processor will handle the writing, and if necessary all of the information gets broadcast to the other processors. A downside of this is that every time a new parameter file is inserted, the entire csv file gets output. However, compared to things like simple IO, we’re still way in the clear, so it should not be a bottleneck.

And, as a bonus, subclasses of StaticOutput now self-register. So if you subclass StaticOutput, pickle an object affiliated with an instance of that subclass, and then try to unpickle it – yt knows, and it’ll return that subclass to you.

Welcome to the Enzotools blog

written by Matt, on Mar 3, 2009 11:59:00 AM.

Hi there! We’re just testing out some new means of communicating changes and creating a more narrative approach to the development. This won’t be a detailed, rambling blog, or even a blog about important announcements, but more of a description of why different design decisions were made, how they were implemented, and why the user base should care. Hopefully it’ll serve as a record of motivations and implementation information.