Expression Language Extensions
Here are some notes on potential extensions and improvements to the language for creating new variables.
Documentation
- Explain in the manual that there are actually two levels of evaluation going on: first, evaluate an expression to return an Opus variable; and second, compute the value of a variable for a dataset.
- Look at reorganizing the manual - there is material on expressions in two chapters, and perhaps this should be merged.
- Write additional tutorial material (either as a tutorial or in the manual) about techniques for using expressions, based on the experience so far.
Improvements to Syntax and Semantics
Deprecate fully-qualified variable names
Instead, use dataset-qualified names. Datasets should define inheritance relations, so that for example psrc_parcel should inherit from urbansim_parcel. If a variable name isn't found in the given dataset, search up the inheritance hierarchy for it. For backwards compatibility, continue to allow fully-qualified names. (Note: check on 'package_order' - do we need to do something with that here?)
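The lookup described above could be sketched roughly as follows. This is only an illustration: DatasetSpec, resolve, and the variable names population and land_value are hypothetical and not part of the existing Opus code.

```python
# Hypothetical sketch; none of these names come from the Opus code base.
class DatasetSpec:
    """A dataset name, an optional parent dataset, and its own variables."""

    def __init__(self, name, parent=None, variables=()):
        self.name = name
        self.parent = parent
        self.variables = set(variables)


def resolve(dataset, variable_name):
    """Return the name of the nearest dataset in the inheritance chain
    that defines variable_name; raise LookupError if none does."""
    current = dataset
    while current is not None:
        if variable_name in current.variables:
            return current.name
        current = current.parent
    raise LookupError(
        "%s not found in %s or its ancestors" % (variable_name, dataset.name))


# Assumed example data: psrc_parcel overrides land_value and inherits the rest.
urbansim_parcel = DatasetSpec(
    "urbansim_parcel", variables={"population", "land_value"})
psrc_parcel = DatasetSpec(
    "psrc_parcel", parent=urbansim_parcel, variables={"land_value"})
```

With this, resolve(psrc_parcel, "population") finds the definition in urbansim_parcel, while resolve(psrc_parcel, "land_value") stops at psrc_parcel itself.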
Eliminate need for 'intermediates'
Eliminate the need for the 'intermediates' parameter in aggregate and disaggregate calls. This implies that datasets should know about containment relations. For an aggregation, if the dataset of the variable being aggregated isn't immediately contained in the aggregating dataset, search the containment DAG for a path between them, and if found, use that to get the intermediate datasets. If there is more than one way to do the aggregation, should we pick the first one, or give an error? (Probably pick the first one -- if the user cares, he/she can use an explicit list of intermediates. At most give a warning.)
For example, consider the expression
neighborhood.aggregate(gridcell.population, intermediates=[zone,faz])
If we have the containment relations gridcell in faz, faz in zone, and zone in neighborhood, then we can just write this as
neighborhood.aggregate(gridcell.population)
When computing the value of the variable, we'd search to see what gridcell is contained in, and find the path to neighborhood via faz and zone.
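A minimal sketch of that search, assuming the containment relations are available as a plain dictionary mapping each dataset name to the datasets that directly contain it. The function name find_intermediates and the dictionary layout are assumptions for illustration. Breadth-first search returns the first (shortest) path found, which matches the "pick the first one" option above.

```python
from collections import deque


def find_intermediates(contained_in, source, target):
    """Search the containment DAG from source up to target.

    Returns the datasets strictly between source and target, innermost
    first, or None if target does not (transitively) contain source.
    """
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path[1:-1]  # drop the two endpoints
        for container in contained_in.get(path[-1], []):
            if container not in visited:
                visited.add(container)
                queue.append(path + [container])
    return None


# Containment relations from the example: gridcell in faz, faz in zone,
# zone in neighborhood.
containment = {
    "gridcell": ["faz"],
    "faz": ["zone"],
    "zone": ["neighborhood"],
}
```

Here find_intermediates(containment, "gridcell", "neighborhood") yields the list of faz and zone, ordered from the innermost dataset outward. Whether that is the order the 'intermediates' parameter expects is not assumed here.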
Use standard syntax for parameters
The current syntax (using SSS and DDD inside variable names) is nonstandard. This should be replaced with method parameters, using keywords for readability as needed.
For example, on the calling side, is_near_SSS_if_threshold_is_DDD would become
is_near(feature, threshold)
and a use like is_near_highway_if_threshold_is_4 would become
is_near(feature='highway', threshold=4)
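To make the correspondence between the two forms concrete, here is a sketch of how an old-style name might be matched against its SSS/DDD pattern to recover the argument values. The helper parse_old_name is hypothetical, and the rule that SSS matches word characters while DDD matches digits is a simplifying assumption about the current convention.

```python
import re


def parse_old_name(pattern, name):
    """Match name against an old-style pattern containing SSS and DDD.

    Returns a tuple of argument values (str for SSS, int for DDD),
    or None if the name does not fit the pattern.
    """
    kinds = re.findall(r"SSS|DDD", pattern)
    regex = "".join(
        r"(\w+?)" if piece == "SSS"
        else r"(\d+)" if piece == "DDD"
        else re.escape(piece)
        for piece in re.split(r"(SSS|DDD)", pattern))
    match = re.fullmatch(regex, name)
    if match is None:
        return None
    return tuple(
        int(value) if kind == "DDD" else value
        for kind, value in zip(kinds, match.groups()))
```

For example, matching is_near_highway_if_threshold_is_4 against is_near_SSS_if_threshold_is_DDD gives the string highway and the integer 4, which are the values passed as feature and threshold in the keyword form.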
We also need to support this in definitions. Variables defined using Python classes might work without any changes (except for using clearer names for the class and the arguments to the __init__ method). For example the __init__ method for is_near_SSS_if_threshold_is_DDD is:
def __init__(self, location, number):
    self.location = location
    self.number = number
    Variable.__init__(self)
which seems to be what we want.
At least for simple cases, we also want to allow parameters in the definitions in the aliases.py file. Perhaps syntactically this could look like a function definition:
def plan_type(i):
    return parcel.plan_type_id==i
replacing aliases such as
plan_type_1 = parcel.plan_type_id==1
plan_type_2 = parcel.plan_type_id==2
...
If these are still in an aliases.py file, as far as the Python editor is concerned are we still just editing a string? If so we don't get any formatting help from the editor. Alternatives:
- just put up with this. If the GUI supports aliases and expressions, maybe we can bring up a Python code editor in that, which will manipulate the definition with formatting help.
- use real def statements to define parameterized aliases, but strings for ordinary ones (inconsistent though)
- use real def statements for parameterized aliases, and real (non-string) statements for ordinary aliases as well (then we can't read in the file as ordinary Python though since we'd get errors)
Probably the first option is the right one.
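Under the first option, a parameterized alias would still be a string, but it could be parsed (without being evaluated) to pull out its name and parameters. A rough sketch using the standard ast module follows; the function parse_parameterized_alias and the idea of storing the def as a string are assumptions for illustration only.

```python
import ast

# The alias is kept as a string, as in the current aliases.py convention.
alias_source = "def plan_type(i):\n    return parcel.plan_type_id==i"


def parse_parameterized_alias(source):
    """Parse a string holding one def with a single return statement.

    Returns (alias_name, parameter_names). Nothing is evaluated, so
    undefined names such as parcel in the body cause no errors.
    """
    tree = ast.parse(source)
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.FunctionDef):
        raise ValueError("expected exactly one def statement")
    func = tree.body[0]
    if len(func.body) != 1 or not isinstance(func.body[0], ast.Return):
        raise ValueError("expected a body consisting of one return statement")
    return func.name, [arg.arg for arg in func.args.args]
```

Applied to alias_source, this returns the name plan_type with the single parameter i.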
Deprecate number_of_agents?
Should the number_of_agents method be deprecated in favor of a count function supplied to aggregate and disaggregate?
Additional functions
Should we import all of numpy (from numpy import *)? What about additional user-defined functions? Should we get rid of the nonstandard function definitions in the current system, e.g. for sqrt?
What about within_walking_distance? From the old wiki: [Liming] This is like aggregate except that the geography is always fixed. We probably have various functions associated with within_walking_distance, for example count (number_of_household_within_walking_distance), mean (percent_residential_within_walking_distance), sum (residential_units_within_walking_distance), and even binary operations (residential_value_per_housing_unit_within_walking_distance).
[Liming] I'm in favor of having 'and' and 'or'; many of the variables are actually results of these logical operators.
Optional function parameter for disaggregate method?
Right now aggregate has an optional function parameter, but disaggregate doesn't. Should it? From the old wiki: [Joel] Specifying a particular disaggregation function might be appropriate. I can imagine at least two cases:
- A member entity directly inherits an attribute value from the collection entity it belongs to, e.g. a Person inherits the Household Income attribute of the Household to which he/she belongs
- A member entity is allocated only a portion of the collection entity's attribute value, in proportion to some attribute of the member entity, e.g. a Traffic Analysis Zone (TAZ) is allocated a Population from its Forecast Analysis Zone (FAZ) Population, in proportion to the TAZ's Area relative to the FAZ's Area.
  - Depending on the allocation variable, this may require first aggregating a sum up to the Collection entity from all of the Member entities, e.g. summing TAZ areas up to determine the FAZ's total area
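The second case (proportional allocation, including the preliminary aggregation step) could look roughly like the sketch below. It uses plain dictionaries rather than Opus datasets, and the function name and argument layout are hypothetical.

```python
def disaggregate_proportionally(parent_values, member_parent, member_weights):
    """Allocate each parent's value to its members in proportion to a weight.

    parent_values:  parent id -> value to allocate (e.g. FAZ population)
    member_parent:  member id -> parent id (e.g. TAZ -> FAZ)
    member_weights: member id -> weight (e.g. TAZ area)
    Assumes every parent has a nonzero total weight.
    """
    # First aggregate the weights up to each parent (e.g. total FAZ area).
    totals = {}
    for member, parent in member_parent.items():
        totals[parent] = totals.get(parent, 0.0) + member_weights[member]
    # Then give each member its share of the parent's value.
    return {
        member: parent_values[parent] * member_weights[member] / totals[parent]
        for member, parent in member_parent.items()
    }


# Assumed example data: one FAZ containing two TAZs of unequal area.
faz_population = {"faz1": 1000.0}
taz_faz = {"taz1": "faz1", "taz2": "faz1"}
taz_area = {"taz1": 1.0, "taz2": 3.0}
taz_population = disaggregate_proportionally(faz_population, taz_faz, taz_area)
```

In this example taz1 holds a quarter of the FAZ's area and so receives a quarter of its population. The first case (direct inheritance) is the degenerate version where each member simply copies the parent's value.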
Automatic definition of aggregated/disaggregated variables
If we have a variable defined in some dataset, e.g. gridcell.population, should we automatically provide it for containing datasets using automatically generated aggregated variables, e.g. automatically define zone.population? This is interesting from a programming languages point of view -- we'd have traditional inheritance (e.g. psrc_parcel inherits from urbansim_parcel), and also a kind of containment-based inheritance, which however does an interesting transformation of the inherited information.
Presumably you could also do this in the other direction (disaggregation). What should happen if you can compute the value in several ways, e.g. by aggregating or disaggregating, or aggregating over different numbers of levels?
Code Cleanup
For computed attributes, always store under the short name (even if it's an autogen name) -- handle aliases separately, in a dictionary of aliases. This would allow aliases for primary attributes and multiple aliases for the same expressions.
When flushing computed attributes, also flush values of autogen variables.
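One possible shape for this, as a sketch only: values are keyed by short name, aliases live in a separate dictionary, and flushing clears every computed value, autogen ones included. The class ComputedAttributes and its method names are hypothetical and do not reflect the current dataset implementation.

```python
class ComputedAttributes:
    """Store computed values by short name, with aliases kept separately."""

    def __init__(self):
        self._values = {}   # short name (possibly autogen) -> value
        self._aliases = {}  # alias -> short name

    def store(self, short_name, value):
        self._values[short_name] = value

    def add_alias(self, alias, short_name):
        # Several aliases may point at the same short name.
        self._aliases[alias] = short_name

    def get(self, name):
        # Resolve an alias if there is one, else treat name as a short name.
        return self._values[self._aliases.get(name, name)]

    def flush(self):
        # Drops all computed values, including autogen variables;
        # aliases are definitions rather than values, so they are kept.
        self._values.clear()
```

Because aliases are only a name-to-name mapping, the same mechanism would work for aliases of primary attributes and for multiple aliases of one expression.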
Indicator Documentation
If indicators can be defined using expressions as well as Opus variables, what happens with the indicator documentation?
Proposal: an Opus variable can optionally have indicator documentation associated with it. So if you want to have documentation for an indicator, define a variable. (This doesn't seem too onerous ... if the indicator is sufficiently significant that it ought to be documented, you can define a variable for it.)
