ParallelRegression API

This page is automatically generated from the source code using Sphinx. Source code documentation follows Numpy style guidelines, not all of which are supported by Sphinx, even when using the napoleon extension.

Functions

ParallelRegression.FStatistic(X, u, coefs, R, r, vcType='White1980')[source]

Computes an F statistic that by default is heteroskedasticity-robust.

Parameters:
  • X (2-dimensional array) – Matrix of all regressors including the intercept.
  • u (vector array) – Vector of all residuals.
  • coefs (vector array) – All coefficients from the model that is being tested, including the intercept and untested parameters.
  • R (2-dimensional array) – Linear restrictions in matrix form.
  • r (vector array) – Linear restriction values in vector form.
  • vcType ({'White1980', 'Classical'}, optional) – Type of variance-covariance matrix requested. Keep the default setting for a heteroskedasticity-robust result. See vCovMatrix function for details.
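
The page does not state the formula behind FStatistic. A minimal sketch of one common Wald-type form, F = (R b - r)' inv(R V R') (R b - r) / q, is given here, assuming V is a coefficient variance-covariance matrix such as the one vCovMatrix describes and q is the number of restrictions. The function name wald_f_sketch and its signature are hypothetical and are not part of ParallelRegression.

```python
import numpy as np

def wald_f_sketch(coefs, R, r, V):
    """Wald-type F statistic for H0: R @ coefs == r (assumed form)."""
    coefs = np.asarray(coefs, dtype=float)
    R = np.atleast_2d(np.asarray(R, dtype=float))
    r = np.asarray(r, dtype=float)
    diff = R @ coefs - r                   # distance from the restrictions
    q = R.shape[0]                         # number of restrictions tested
    middle = np.linalg.inv(R @ V @ R.T)    # inverse covariance of R @ coefs
    return float(diff @ middle @ diff) / q

# One restriction: the second coefficient equals zero.
V = np.array([[0.04, 0.0], [0.0, 0.25]])   # illustrative covariance matrix
F = wald_f_sketch(coefs=[1.0, 2.0], R=[[0.0, 1.0]], r=[0.0], V=V)
```

With a heteroskedasticity-robust V the statistic is robust in the same sense; with a classical V it reduces to the textbook Wald form.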
ParallelRegression.despace(string)[source]

Removes inconsequential spaces without removing spaces that might impact the interpretation of a formula or line of code.

ParallelRegression.formulas_match(formA, formB)[source]

Determines whether or not two formula strings are likely to be the same formula despite differences in the order of terms and/or spacing.

ParallelRegression.has_term(formula, term)[source]

Returns True if formula either starts with term followed by one of [ )+-~*:], or contains term followed by one of those characters and preceded by one of [ (+-~*:].
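
The rule above can be read literally as a single regular expression. The following is a sketch under that reading only; has_term_sketch is a hypothetical name, and the library's own implementation may differ.

```python
import re

def has_term_sketch(formula, term):
    """Literal reading of the documented rule (hypothetical re-implementation)."""
    before = r'(?:^|[ (+\-~*:])'   # start of string, or one of " (+-~*:"
    after = r'[ )+\-~*:]'          # one of " )+-~*:"
    return re.search(before + re.escape(term) + after, formula) is not None

found_plain = has_term_sketch('y ~ x + z', 'x')      # delimited by spaces
found_nested = has_term_sketch('ln(x) + z', 'x')     # delimited by brackets
found_partial = has_term_sketch('y ~ xx + z', 'x')   # only part of a longer name
```

Under this literal reading, a term that ends the formula with nothing after it is not reported, because no delimiter character follows it.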

ParallelRegression.mask_brackets(string)[source]

Mask anything inside brackets, including nested brackets, by replacing the brackets and their contents with a same-length string of repeated underscores.
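
A minimal sketch of the masking idea, assuming that "brackets" covers round, square, and curly brackets and that the input is well balanced. mask_brackets_sketch is a hypothetical name, not the library function.

```python
def mask_brackets_sketch(string, opening='([{', closing=')]}'):
    """Replace bracketed spans, including nested ones, with underscores."""
    depth = 0
    out = []
    for ch in string:
        if ch in opening:
            depth += 1              # entering a (possibly nested) bracket
            out.append('_')
        elif ch in closing and depth > 0:
            depth -= 1              # leaving a bracket
            out.append('_')
        elif depth > 0:
            out.append('_')         # character inside a bracket
        else:
            out.append(ch)          # character outside all brackets
    return ''.join(out)

original = 'y ~ ln(x + f(z)) + w'
masked = mask_brackets_sketch(original)
```

Because the mask has the same length as the original, any index found in the mask can be used directly on the original string.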

ParallelRegression.masked_dict(string, mobj)[source]

Recovers the corresponding contents from the original string based on a regular expression match object produced using a masked string. Compare to mobj.groupdict( ).

Parameters:
  • string (string) – The unmasked string from which content is to be recovered.
  • mobj (regular expression match object) – The match object resulting from a regular expression pattern matched to a masked version of string.
Returns:

Dictionary containing the substrings of string corresponding to the named subgroups of the match, keyed by the subgroup name.

Return type:

dict
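
The recovery described above can be sketched with the span of each named group, since the masked and unmasked strings share the same indices. masked_dict_sketch is a hypothetical name; mapping unmatched groups to None is an assumption.

```python
import re

def masked_dict_sketch(string, mobj):
    """Map each named group to the slice of `string` at the matched span."""
    result = {}
    for name in mobj.re.groupindex:
        start, end = mobj.span(name)
        # span() is (-1, -1) for a group that did not take part in the match
        result[name] = None if start == -1 else string[start:end]
    return result

original = 'ln(x + 1) ~ z'
masked = 'ln' + '_' * 7 + ' ~ z'      # same length, bracket contents hidden
mobj = re.match(r'(?P<lhs>.+) ~ (?P<rhs>.+)', masked)
recovered = masked_dict_sketch(original, mobj)
```

Here mobj.groupdict( ) would return the underscores for lhs, while the sketch returns the original text.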

ParallelRegression.masked_iter(string, mobj_iter)[source]

Recovers the corresponding contents from the original string based on regular expression match objects produced by an iterable returned from re.finditer( ) or from a pattern object’s .finditer( ) method.

Parameters:
  • string (string) – The unmasked string from which content is to be recovered.
  • mobj_iter (iterable of regular expression match objects) – The iterable of regular expression match objects resulting from a regular expression pattern matched to a masked version of string.
Returns:

List containing the substrings of string corresponding to the substring matched by each match object produced by the iterable.

Return type:

list

ParallelRegression.masked_split(string, mask, split)[source]

Splits string based on the location(s) at which split is located in mask. Compare to str.split( ).

Parameters:
  • string (string) – The unmasked string from which content is to be recovered.
  • mask (string) – The masked version of string to be used to determine the location(s) at which to split string.
  • split (string) – The string identifying the location(s) at which to split string.
Returns:

List of substrings resulting from splitting string based on the presence of split in mask.

Return type:

list
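
A sketch of the splitting behavior described above, assuming mask and string have equal length. masked_split_sketch is a hypothetical name, and rejecting an empty separator mirrors str.split( ) rather than documented library behavior.

```python
def masked_split_sketch(string, mask, split):
    """Split `string` wherever `split` occurs in the same-length `mask`."""
    if len(string) != len(mask):
        raise ValueError('string and mask must have the same length')
    if not split:
        raise ValueError('empty separator')
    parts = []
    start = 0
    pos = mask.find(split)
    while pos != -1:
        parts.append(string[start:pos])     # slice the original, not the mask
        start = pos + len(split)
        pos = mask.find(split, start)
    parts.append(string[start:])
    return parts

original = 'ln(a + b) + c'
masked = 'ln' + '_' * 7 + ' + c'            # bracket contents hidden
terms = masked_split_sketch(original, masked, ' + ')
```

A plain original.split(' + ') would also split inside the brackets, which is what the mask prevents.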

ParallelRegression.syncText(strA, strB, addA, addB, pre='')[source]

Adds necessary spacing to align simultaneous additions to two strings.

Parameters:
  • strA (string) – The first of the two strings.
  • strB (string) – The second of the two strings.
  • addA (string) – The string to be appended to the first string.
  • addB (string) – The string to be appended to the second string.
  • pre (string, optional) – This string is appended to strA and strB immediately before addA and addB.
Returns:

  • strA (string) – The first string, with addA appended.
  • strB (string) – The second string, with addB appended, starting at the same index as addA in the first string.

Example

>>> upper, lower = ('John', 'Proper Noun')
>>> for a, b in [('ate', 'Verb'), ('an', 'Article'), ('apple.', 'Noun')]:
...     upper, lower = syncText( upper, lower, a, b, ' ' )
>>> print( upper, lower, sep='\n' )
John        ate  an      apple.
Proper Noun Verb Article Noun
ParallelRegression.termString(formula, termList)[source]

Returns the subset of terms in termList that occur in formula.

ParallelRegression.terms_in(formula)[source]

Generator that yields individual terms in formula.

ParallelRegression.vCovMatrix(X, u, vcType='White1980')[source]

Computes a variance-covariance matrix.

Parameters:
  • X (2-dimensional array) – Matrix of X values.
  • u (vector array) – Vector of residuals.
  • vcType ({'White1980', 'Classical'}, optional) – Type of variance-covariance matrix requested. ‘Classical’ for the classical statistics formula, or ‘White1980’ for the heteroskedasticity-robust formula originally proposed by Halbert White in his 1980 paper, ‘A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity’.

Notes

The heteroskedasticity-robust formula supported is the formula explained in the documentation to the “car” R package’s “hccm” function:

“The classical White-corrected coefficient covariance matrix (“hc0”) (for an unweighted model) is

“V(b) = inv(X’X) X’ diag(e^2) X inv(X’X)

“where e^2 are the squared residuals, and X is the model matrix.”

This is the same formula proposed by White in 1980. However, the car package documentation is substantially clearer and more concise than either the original paper or most textbook discussions.
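
The quoted "hc0" formula translates almost directly into NumPy. This is a sketch of that formula only; hc0_sketch is a hypothetical name, and it is not claimed to match the internals of vCovMatrix (for example, any degrees-of-freedom handling).

```python
import numpy as np

def hc0_sketch(X, u):
    """V(b) = inv(X'X) X' diag(e^2) X inv(X'X), with e the residuals u."""
    X = np.asarray(X, dtype=float)
    u = np.asarray(u, dtype=float)
    bread = np.linalg.inv(X.T @ X)
    # Scaling each row of X by e_i^2 gives X' diag(e^2) X without
    # building the n-by-n diagonal matrix explicitly.
    meat = X.T @ (X * (u ** 2)[:, None])
    return bread @ meat @ bread

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # intercept + x
u = np.array([0.5, -1.0, 1.0, -0.5])                            # residuals
V = hc0_sketch(X, u)
```

The square roots of the diagonal of V are the heteroskedasticity-robust standard errors of the coefficients.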

ParallelRegression.val_if_present(obj, attr=None, alt=None)[source]

Returns the requested value if it is set and is not None. Otherwise alt is returned.

Avoids errors if the requested value does not exist, while handling the presence of a None value and a default value in a different manner than getattr( ).

Parameters:
  • obj – An object from which to attempt to retrieve an attribute value.
  • attr (string, optional) – The name of the attribute to be retrieved. If attr is not set, then the value of obj will be returned unless it is equal to None. If ‘.’ is present in this string, child objects will be retrieved recursively. See example below.
  • alt (optional) – The value to be returned if the requested attribute does not exist, or is equal to None, or if the object’s value is requested but is equal to None. If this is not set, then None will be returned in these scenarios.
Returns:

If the requested value is set and is not equal to None, then the requested value is returned. Otherwise, alt is returned. No error is raised if obj does not have an attribute named attr.

Return type:

obj

Examples

>>> class testFixture(object):
...     NoneVal = None
...     number = 123
>>> testFix = testFixture( )
>>> val_if_present( testFix, 'NoneVal', 'ABCdef' ) == 'ABCdef'
True
>>> testFix.fixtureTwo = testFixture( )
>>> testFix.fixtureTwo.dict_object = {'a_key': 'has a value.'}
>>> val_if_present( testFix, 'fixtureTwo.dict_object.a_key' ) \
...               == 'has a value.'
True

laggedAccessor( )

class ParallelRegression.laggedAccessor(data, max_lag=None)[source]

Bases: object

Allows lagged values to be retrieved from a dict( ) or pandas.dataframe( ) object using the same L#@column_name notation used by other mathDict classes.

__getitem__(index)[source]

Returns a row, column, or slice of a column from the linked collection of columns.

Data is always retrieved via a call to .get_column( ), so “row” has the meaning it has to that method.

Supported Notation

int -> returns a dict( ) of column values for one row

str -> returns the requested column

(int or slice, str) tuple -> returns the specified row(s) from the requested column

__init__(data, max_lag=None)[source]

Initializes the laggedAccessor( ).

Parameters:
  • data (dict( ) or pandas.dataframe( )) – The collection of columns from which values are to be retrieved.
  • max_lag (int, optional) – The maximum number of lags to be provided. The first max_lag number of rows will be hidden from each column retrieved through the laggedAccessor so that retrieving column and retrieving L1@column results in columns of the same length. This number must either be set explicitly or by using one of the .findMaxLag( ) or .rewrite( ) methods.
findMaxLag(formula_string, in_place=False)[source]

Determines the maximum lag requested for any column in the provided formula string(s).

Parameters:
  • formula_string (string or Sequence) – The formula(s) to be searched for lags.
  • in_place (bool, optional) – If True, then the max_lag attribute of this laggedAccessor( ) will be updated based on the maximum lag in the provided formula(s). Otherwise, a new laggedAccessor( ) instance linked to the same data object will be created, and its max_lag attribute will be set.
Returns:

Return type:

laggedAccessor( )

get_column(column_name, lag=0, row=None)[source]

Returns the requested column or slice thereof.

Parameters:
  • column_name (string) – The name of the column in the linked collection of columns from which to retrieve data.
  • lag (int, optional) – The column representing this many lags of column_name will be returned. A lag of zero refers to the column that is the subset of rows of column_name in the linked collection starting at the index location equal to the max_lag value that this laggedAccessor( ) is configured to support.
  • row (int or slice, optional) – The row or slice to be retrieved. This refers to row numbers in the column identified by the combination of the column_name and lag parameters and the configured max_lag value.
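
The alignment rule described for lag and max_lag can be sketched with plain slicing. lagged_column_sketch is a hypothetical helper that operates on a list; it only illustrates why every lag of a column has the same length.

```python
def lagged_column_sketch(column, max_lag, lag=0):
    """Return the rows of `column` that represent `lag` lags.

    Lag 0 starts at index max_lag; each further lag shifts the window
    one row toward the start of the original column.
    """
    if not 0 <= lag <= max_lag:
        raise ValueError('lag must be between 0 and max_lag')
    start = max_lag - lag
    return column[start:len(column) - lag]

x = [10, 11, 12, 13, 14]
current = lagged_column_sketch(x, max_lag=2)            # the column itself
lag_one = lagged_column_sketch(x, max_lag=2, lag=1)     # L1@x in lag notation
lag_two = lagged_column_sketch(x, max_lag=2, lag=2)     # L2@x in lag notation
```

Row i of lag_one holds the value that precedes row i of current, so the columns can be used side by side in a regression.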
rewrite(formula_string)[source]

Rewrites a formula string that uses mathDict’s lag notation to instead use Python variable name aliases for the lagged columns, so that the rewritten formula string can be processed by, e.g., patsy.

Returns: The rewritten formula string followed by a new laggedAccessor( ) instance linked to the same data object containing the lagged column aliases. The new laggedAccessor( ) instance’s max_lag attribute is the greater of this laggedAccessor’s max_lag and the maximum lag used in the formula_string.
Return type: (formula_string, laggedAccessor( )) tuple

setList( )

class ParallelRegression.setList(values=None)[source]

Bases: collections.UserList

List that eliminates redundant list items and implements set comparison methods.

Attributes

lastSIOutcome : bool
True if the most recent call to .__setitem__( ) resulted in adding a value, or False if the call would have resulted in a duplicate value.

Methods (see set & list)

  • add (alias for .append( ))
  • append (returns True if it results in adding a value, or False otherwise)
  • difference
  • discard (returns True if it resulted in removing a value, or False otherwise)
  • extend (alias for .update( ))
  • intersection
  • issubset
  • issuperset
  • pop
  • symmetric_difference
  • union
  • update (returns the number of new values added to the setList( ))
  • and list methods inherited from UserList( )
setList.as_fsets

set( ) of frozenset( )s of the items in each setList( ) member. The use of frozenset( )s enables the set( ) to contain otherwise non-hashable objects. Useful for order-insensitive equality testing.
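
The core behavior listed above (an ordered list without duplicates, with append and update reporting what was added) can be sketched as follows. SetListSketch is a hypothetical, much reduced stand-in and omits the set comparison methods and as_fsets.

```python
from collections import UserList

class SetListSketch(UserList):
    """Ordered list that silently refuses duplicate items."""

    def __init__(self, values=None):
        super().__init__()
        if values is not None:
            self.update(values)

    def append(self, item):
        """Return True if the item was added, False if already present."""
        if item in self.data:
            return False
        self.data.append(item)
        return True

    def update(self, items):
        """Add each new item and return how many were actually added."""
        return sum(self.append(item) for item in items)

s = SetListSketch(['b', 'a', 'b'])      # the second 'b' is dropped
added_again = s.append('a')             # already present
added_count = s.update(['a', 'c', 'd']) # only 'c' and 'd' are new
```

Insertion order is preserved, which is the main difference from a built-in set( ).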

typedDict( )

class ParallelRegression.typedDict(typeRequirement, writeOnce=False, default=None)[source]

Bases: dict

dict( ) that is restricted to entries consisting of values of a specified type.

typedDict also supports default values whereby new entries are created by deepcopy( )ing an object as opposed to creating a new instance of a class, and supports a write-once mode in which keys that have a value associated with them cannot be changed, but values that are mutable objects may still mutate.

Each item has an integer key, and may also have a string key associated with it, but a string key is not required. I.e., there is a (zero-or-one)-to-one relationship between string keys and dictionary entries, as well as a one-to-one relationship between integer keys and dictionary entries.

Integer keys are not preserved when typedDict( ) is copied.

__init__(typeRequirement, writeOnce=False, default=None)[source]

Creates a typedDict( ) instance.

Parameters:
  • typeRequirement (type) – Dictionary entries will only be accepted if they satisfy isinstance( obj, typeRequirement ).
  • writeOnce (bool) – If True, then once a dictionary entry has been created for a key, the dictionary entry cannot be changed. If the entry consists of a mutable object, the object may still mutate.
  • default (object of type typeRequirement, optional) – If set, then attempting to access a dictionary entry that does not yet exist will result in a deepcopy of this object being used to create an entry for the requested key.
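
The three constructor options interact as sketched here. TypedDictSketch is a hypothetical reduction that ignores the integer/string dual keys, and the exception types it raises are assumptions rather than documented behavior.

```python
from copy import deepcopy

class TypedDictSketch(dict):
    """dict restricted to values of one type, with optional write-once mode."""

    def __init__(self, typeRequirement, writeOnce=False, default=None):
        super().__init__()
        self.typeRequirement = typeRequirement
        self.writeOnce = writeOnce
        self.default = default

    def __setitem__(self, key, value):
        if not isinstance(value, self.typeRequirement):
            raise TypeError('value is not of the required type')   # assumed
        if self.writeOnce and key in self:
            raise KeyError('entry already exists (write-once mode)')  # assumed
        super().__setitem__(key, value)

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key has no entry yet.
        if self.default is None:
            raise KeyError(key)
        self[key] = deepcopy(self.default)   # fresh copy, never shared
        return self[key]

d = TypedDictSketch(set, writeOnce=True, default=set())
d['a'].add(1)          # entry created from the default, then mutated in place
empty = d['b']         # a second, independent copy of the default
```

Write-once mode blocks rebinding d['a'] but, as the description notes, does not stop the stored set from mutating.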
itemLength(key)[source]

Only checks the length of an entry if the entry already exists, otherwise returns 0.

Parameters: key (int or string) – A key for a dictionary entry that need not exist.
Returns: If there already exists a dictionary entry with the requested key, the length of the entry is returned. If there is no dictionary entry already existing with the requested key, then 0 will be returned without creating an entry for the key.
Return type: int
keys(key_type=None)[source]

Returns a setList( ) of keys for which there currently exists an entry.

Parameters: key_type ({None, 'integer', 'string', 'union'}, optional) – The type of keys to return. If key_type is None and at least one entry has a string key, then only string keys will be returned. Otherwise, if key_type is None, integer keys will be returned. If key_type is 'union', then a setList( ) consisting of both integer and string keys will be returned.
Returns:
Return type: setList( )
pop(key)[source]

Returns and removes the entry associated with the specified key. Accepts both integer and string keys.

union_update(other)[source]

Similar to dict( ).update( other ) except that for keys with which an entry is associated in both this typedDict( ) and other, the new entry will be this typedDict[key].union( other[key] ).

update(other)[source]

Copies entries in other into this typedDict( ), replacing existing entries that use the same key.

categorizedSetDict( )

class ParallelRegression.categorizedSetDict[source]

Bases: typedDict

Ordered sets stored in a dict( ) in which each set and each set member potentially belongs to one or more categories.

Sets and set members can be retrieved by category. Categories designated as mutually exclusive restrict category membership to sets and set members without conflicting categories, when category membership is established via the .set_category( ) method.

Attributes

mutually_exclusive : set
Set of frozensets of categories, where the categories within each frozenset are considered mutually exclusive of one another.

Notes

Categories are assumed to be identified by strings. Code is only tested using string-identified categories. However, this is not strictly enforced.

categorizedSetDict.__init__(singular_category=None)[source]

Creates a categorizedSetDict( ) instance.

Parameters: singular_category (string, optional) – When a set is instantiated with a single item (as opposed to a sequence of one member), this category is assigned to the set.
categorizedSetDict.__setitem__(key, value)[source]

Sets dict( ) entries.

Parameters:
  • key (str) – String identifying a set.
  • value (str or tuple (list-like : set members[, list-like : set member categories][, set : set categories])) – list( ), set( ), or setList( ) of set members associated with the key, alone or combined with categories that apply to individual set members and/or categories that apply to the whole set.
Raises:

CategoryError – If an attempt is made to associate a set or set member with a category that is already associated with a mutually-exclusive category. In this scenario, the entire .__setitem__( ) operation will fail and the categorizedSetDict( ) will not be altered.

Examples

>>> c = categorizedSetDict( )
>>> c['vocab'] = ['apple', 'bee', 'cabin']
>>> c['vocab']
['apple', 'bee', 'cabin']
>>> c.get_categories( key='vocab', value='apple' )
{None}
>>> c['vocab'] = (['apple', 'bee', 'cabin'],
...               {'words'})
>>> c.get_categories( key='vocab', value='apple' )
{'words'}
>>> c['vocab'] = (['apple', 'bee', 'cabin'],
...               [{'food'}, {'animal'}, {'building'}])
>>> c.get_categories( key='vocab', value='apple' )
{'food'}
>>> c['vocab'] = (['apple', 'bee', 'cabin'],
...               [{'food'}, {'animal'}, {'building'}],
...               {'words'})
>>> c.get_categories( key='vocab', value='apple' )
{'food', 'words'}
categorizedSetDict.del_category(category, *, key=None, value=None, keys=None, items=None)[source]

Disassociates the specified category from the specified key(s) and/or key/value pairs. See .set_category( ) documentation for usage.

If there is no existing association between category and a specified key and/or key/value pair, no Exception is raised.

Raises:
  • CategoryError – If an attempt is made to disassociate an individual set member with a category that is associated with the whole set.
  • KeyError – If a specified key/value pair does not identify an existing set member.
categorizedSetDict.get_categories(key, value=None)[source]

Returns the set of categories associated with the whole set or a particular member thereof.

Parameters:
  • key (str) – The key associated with the set about which the inquiry is being made.
  • value (optional) – If specified, the categories specifically associated with the specified set member are returned along with the categories associated with the whole set.
Returns:

  • categories (set) – The set of all categories associated with the specified key or with the specified key/value pair. If there is a set identified by the key key, then the set of categories associated with it will be returned regardless of whether or not value is in the set key.
  • None – If there is no entry with key’s value as its key.

categorizedSetDict.is_a(category, key, value=None)[source]

Returns True if the specified key or key/value pair is associated with category, or False otherwise, so long as there is an entry with key’s value as its key. If not, None is returned. The set identified by key need not contain value.

categorizedSetDict.is_None(key, value=None)[source]

Returns True if the specified key or key/value pair is not associated with any category, or False otherwise. See .is_a( ) for handling of nonexistent values.

categorizedSetDict.items_categorized(category)[source]

Returns the key/value pairs consisting of keys associated with the specified category with all of their set members, as well as set members that are individually associated with the specified category (along with their keys).

categorizedSetDict.keys_categorized(category)[source]

Returns the keys that are associated with the specified category.

categorizedSetDict.make_mutually_exclusive(categories)[source]

Designates the specified set of categories as mutually exclusive.

Parameters: categories (set or Sequence) – The set of categories to be considered mutually exclusive. If this is a superset of an existing set of mutually exclusive categories, it will replace the existing subset. If an existing set is a superset of this one, then no action is taken.
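
The superset and subset rules above can be sketched on a plain set of frozensets, the structure the mutually_exclusive attribute is described as holding. add_exclusive_group_sketch is a hypothetical helper that returns a new set instead of modifying an instance.

```python
def add_exclusive_group_sketch(groups, categories):
    """Return `groups` updated with a new mutually exclusive group."""
    new = frozenset(categories)
    if any(new <= existing for existing in groups):
        return set(groups)                 # already covered: no action
    # Drop every existing group that the new group strictly contains.
    kept = {g for g in groups if not g < new}
    kept.add(new)
    return kept

groups = add_exclusive_group_sketch(set(), {'Y', 'required_X'})
groups = add_exclusive_group_sketch(groups, {'Y', 'required_X', 'T'})  # replaces
groups = add_exclusive_group_sketch(groups, {'Y', 'T'})                # no action
groups = add_exclusive_group_sketch(groups, ['dummy', 'real'])         # new group
```

The category name 'real' is used purely for illustration and is not one of the documented termSet( ) categories.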
categorizedSetDict.pop(key)[source]

Removes and returns the setList( ) associated with the provided key and deletes category information.

categorizedSetDict.set_categories(*categories, key=None, value=None, keys=None, items=None)[source]

Associates key(s) and/or key/value pairs with each of the specified categories.

.set_categories( ) makes a separate call to .set_category( ) for each category. As such, calls to .set_categories( ) are not atomic. For atomic behavior, use .__setitem__( ) instead of .set_categories( ).

Parameters:
  • *categories (positional arguments) – Categories to associate with the key(s) and/or key/value pairs.
  • key (str, optional) – If key but not value is specified, then this key will be associated with the specified category.
  • value (optional) – If key and value are both specified, then this key/value pair will be associated with the specified category.
  • keys (Sequence) – The keys in this sequence will be associated with the specified category.
  • items (dict) – The key/value pairs in this dict( ) will be associated with the specified category. Each value in this dict( ) can be either a single set member or a sequence of set members to associate with the specified category.
Raises:
  • CategoryError – If the category specified and one or more categories already associated with one or more of the specified key(s) or key/value pairs are considered mutually exclusive.
  • KeyError – If a specified key/value pair does not identify an existing set member. Note that assigning a category to a key for which there is no pre-existing entry results in the creation of a default entry instead of a KeyError.
categorizedSetDict.set_category(category, *, key=None, value=None, keys=None, items=None)[source]

Associates key(s) and/or key/value pairs with the specified category.

Parameters:
  • category (string) – The category to associate with the key(s) and/or key/value pairs.
  • key (str, optional) – If key but not value is specified, then this key will be associated with the specified category.
  • value (optional) – If key and value are both specified, then this key/value pair will be associated with the specified category.
  • keys (Sequence) – The keys in this sequence will be associated with the specified category.
  • items (dict) – The key/value pairs in this dict( ) will be associated with the specified category. Each value in this dict( ) can be either a single set member or a sequence of set members to associate with the specified category.
Raises:
  • CategoryError – If the category specified and one or more categories already associated with one or more of the specified key(s) or key/value pairs are considered mutually exclusive.
  • KeyError – If a specified key/value pair does not identify an existing set member. Note that assigning a category to a key for which there is no pre-existing entry results in the creation of a default entry instead of a KeyError.
categorizedSetDict.values_categorized(category)[source]

Returns set members from all sets where either the set member or the set is associated with the specified category.

termSet( )

class ParallelRegression.termSet[source]

Bases: categorizedSetDict

Manages a set of terms, each of which might have multiple representations.

Categories

dummy
“Dummy” terms or representations of terms in which there are only two values.
Y
Terms that are used on the LHS of formulas instead of the RHS. Terms that are sometimes used on the LHS and sometimes used on the RHS are not supported.
required_X
Terms that must be included on the RHS of all formulas derived from this termSet( ).
T
RHS terms representing time/trend.

Read-Only Attributes

The following read-only attributes return setList( )s with the respective terms:

  • W_term_set (all terms)
  • W_termRep_set
  • Y_term_set
  • X_required_set
  • all_X_terms_set
  • dummy_term_set
  • dummy_termRep_set
  • real_term_set (terms not categorized as dummy terms)
  • real_termRep_set
  • other_terms (terms neither categorized as Y terms nor as required X terms)
termSet.__init__(terms, dterms, T=None)[source]

Creates a termSet( ) instance.

Parameters:
  • terms (dict) – Dictionary in which each term is represented by a key for which the value is a sequence of forms in which the term might occur. Ex: {'X': ['X', 'ln(X)']}. Entries for which the value is a single string, e.g. {'d': 'd'}, will be treated as dummy terms. To avoid this when a term has only one form, enclose the string in a list, e.g. {'X': ['X']}.
  • T (string) – String identifying a single term that represents time/trend.
termSet.__init__(formulas)[source]

Creates a termSet( ) instance.

Parameters: formulas (iterable) – Iterable of formula strings from which to extract the terms and their forms. Ex: ['y ~ x**2 + x', 'ln(y) ~ ln(x)']
termSet.changeT(T)[source]

Disassociates any term(s) currently associated with the category ‘T’ and associates T with the category ‘T’.

termSet.require(*args, make=True)[source]

Associates keys listed in *args with the category ‘required_X’ if make is True, or disassociates them if make is False.

termSet.Y(key, value=None, make=True)[source]

Associates the specified term with the category ‘Y’ if make is True, or disassociates it if make is False.

termSet.dummy(key, value=None, make=True)[source]

Associates the specified term with the category ‘dummy’ if make is True, or disassociates it if make is False.

mathDataStore( )

class ParallelRegression.mathDataStore(mathDict=None)[source]

Bases: dict

__setitem__(key, value)[source]

See mathDictMaker( ) documentation.

interactions_matrix(binary_columns, interact_with=None)[source]

(for internal use): calculates the columns representing the interactions between one-or-more dummy terms, alone or with one non-dummy term.

This method returns nothing. The results are stored as columns in the data store. If this mathDataStore( ) is linked to a mathDict( ), then it informs the mathDict( ) that the interaction matrix for this set of columns has been calculated and stored locally by appending the colon-separated list of terms as a single string to mathDict( ).local_calculated.

Parameters:
  • binary_columns (sequence of strings) – Strings identifying the columns to be used as dummy terms.
  • interact_with (string) – String identifying the column for the dummy terms to interact with.
itemsize

Returns the bytes-per-cell. Currently hard-coded at eight bytes.

key_list

Returns the list of columns currently in this data store in sorted list form.

padding_list

Lists the amount of padding associated with each column in the same order as the columns are listed in mathDataStore( ).key_list.

rows

Returns the number of rows that the matrix of original columns has or will have, as determined by the first column provided to mathDictMaker( ). All subsequent columns must have the same number of rows. Returns None before the first column is stored.

savePaddedColumn(key, value, padding=None)[source]

Stores a column that is shorter than the number of rows of other columns in this mathDataStore( ).

Zeros are prepended to the column before it is stored so that it is the same length as the other columns. The number of prepended zeros is stored in self.padding so that those zeros can be treated as NAs when the column is retrieved, and yet new columns can be calculated by multiplying padded and full-length columns together without special handling of the padding zeros.

Padding zeros are replaced with None when a specific row is retrieved via the mathDataStore( ).__getitem__( ) method.

Parameters:
  • key (string) – Name of the column to store.
  • value – The column data.
  • padding (int, optional) – If specified, then this method will skip calculating the appropriate amount of padding and pad the column by this many zeros. mathDataStore.__setitem__( ) will raise a ValueError if the resulting padded column is not the correct length.
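
The padding scheme described above can be sketched with plain lists. pad_column_sketch and read_row_sketch are hypothetical helpers; they only illustrate why zero padding lets element-wise products work unchanged while individual padded rows still read as missing.

```python
def pad_column_sketch(values, rows, padding=None):
    """Prepend zeros so that a short column has `rows` cells."""
    values = list(values)
    if padding is None:
        padding = rows - len(values)       # work out the padding needed
    padded = [0] * padding + values
    if len(padded) != rows:
        raise ValueError('padded column has the wrong length')
    return padded, padding

def read_row_sketch(padded, padding, row):
    """Return None for a padding cell, or the stored value otherwise."""
    return None if row < padding else padded[row]

full = [1, 2, 3, 4, 5]
short, pad = pad_column_sketch([5.0, 6.0, 7.0], rows=len(full))
product = [a * b for a, b in zip(full, short)]   # padding rows stay zero
first = read_row_sketch(short, pad, 1)           # a padding row
third = read_row_sketch(short, pad, 2)           # the first real value
```

The product column inherits the same padding count, so its leading zeros can be treated as missing in the same way.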

mathDictMaker( )

class ParallelRegression.mathDictMaker(mathDict=None)[source]

Bases: ParallelRegression.mathDataStore

To provide a column for inclusion in the matrix of original columns, simply add it as if adding to a dict( ).

The following describes setting a column via mathDictMaker[key] = value.

Error checking is relatively thorough because this is intended for use while setting up a large batch of calculations that might take a long time to perform, and mistakes might not otherwise be caught until after the fact.

Parameters:
  • key (str) – The key is used as the column name, and must either be a valid Python variable name, or be a valid Python variable name immediately followed by brackets. The string enclosed in the brackets is not restricted.
  • value (sequence of numerical values) – The sequence represents the column cells. The data type for the column is determined by the first value in the sequence, so if the first value is a float then all values will be treated as floats. If the first value is an integer, then all values must be integers.
Raises:
  • TypeError – If the value is not a sequence, or if the value is a string. If the first item in the sequence is an integer but one or more subsequent values is not. If the first item in the sequence is a float but one or more subsequent values is neither an integer nor a float. If the first item in the sequence is neither a float nor an integer.
  • KeyError – If the key is not a string. If the key is a string but is not a valid Python variable name.
  • ValueError – If the length of the sequence does not match the length of the existing column(s).
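
The key and value rules listed above can be sketched as two small checks. check_key_sketch and column_dtype_sketch are hypothetical names; rejecting Python keywords, requiring a closing bracket, and returning 'i8' or 'f8' are assumptions, and empty sequences and bool values are not handled specially.

```python
import keyword
from collections.abc import Sequence

def check_key_sketch(key):
    """Accept a variable name, optionally followed by a bracketed string."""
    if not isinstance(key, str):
        raise KeyError('column name must be a string')
    name, bracket, rest = key.partition('[')
    if not name.isidentifier() or keyword.iskeyword(name):
        raise KeyError('column name must start with a valid variable name')
    if bracket and not rest.endswith(']'):
        raise KeyError('opening bracket without a closing bracket')
    return key

def column_dtype_sketch(value):
    """Infer the column type from the first cell and check the rest."""
    if isinstance(value, str) or not isinstance(value, Sequence):
        raise TypeError('column must be a non-string sequence')
    first = value[0]
    if isinstance(first, int):
        if not all(isinstance(v, int) for v in value):
            raise TypeError('integer column contains a non-integer')
        return 'i8'
    if isinstance(first, float):
        if not all(isinstance(v, (int, float)) for v in value):
            raise TypeError('float column contains a non-number')
        return 'f8'
    raise TypeError('first value must be an int or a float')

name = check_key_sketch('price[USD, 2010]')   # bracket contents unrestricted
kind = column_dtype_sketch([1.5, 2, 3])       # ints are allowed after a float
```

Note that the order matters: [1.5, 2] is a float column, while [1, 2.5] is rejected because the first value fixes the type as integer.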
static fromMatrix(matrix, integer=False)[source]

Creates a SharedDataArray and mathDict( ) representation from an existing two-dimensional matrix.

Parameters:
  • matrix (2-dimensional array) – An existing Numpy matrix or 2-dimensional Numpy array.
  • integer (bool) – If True, the matrix in the SharedDataArray will consist of integers.
Returns:

  • SharedDataArray (multiprocessing.sharedctypes.RawArray) – The shared data array in which the matrix will be stored.
  • mathDict( ) (mathDict( )) – The mathDict( ) representation of a matrix.

make(cache_crossproducts=False, cache_powers=1)[source]

Assembles the shared data array and mathDict( ) matrix representation.

Parameters:
  • cache_crossproducts (boolean, optional) – If True, then the crossproducts of all combinations of columns (without replacement) will be pre-calculated and stored along with the matrix of original columns. To pre-calculate the product of a column and itself, set cache_powers to a number greater than or equal to two.
  • cache_powers (int, optional) – If an integer greater than one, then powers of all columns from two to this number will be pre-calculated and stored with the matrix of original columns. Numbers less than or equal to one will be ignored.
Returns:

  • RA (multiprocessing.sharedctypes.RawArray) – The shared data array in which the matrix of original columns and any pre-calculated columns will be stored.
  • MD (mathDict( )) – The mathDict( ) representation of a matrix initially consisting of the same columns as the matrix of original columns. The mathDict( ) object can then be used to mask some columns and/or append calculated columns to the matrix represented by the mathDict( ). Local columns can then be appended to copies of the mathDict( ) object in other processes.
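
To make the two caching options concrete, this sketch enumerates which columns would be pre-calculated for a given set of original columns. It assumes that "combinations" means pairs of distinct columns, and the 'a:b' and 'a**n' labels are illustrative naming only. cached_column_names_sketch is a hypothetical helper and computes no data.

```python
from itertools import combinations

def cached_column_names_sketch(column_names, cache_crossproducts=False,
                               cache_powers=1):
    """List original columns, then cached crossproducts, then cached powers."""
    column_names = list(column_names)
    names = list(column_names)
    if cache_crossproducts:
        # Pairs without replacement: a column is never paired with itself.
        names += [a + ':' + b for a, b in combinations(column_names, 2)]
    # Values of cache_powers less than or equal to one add nothing.
    for power in range(2, cache_powers + 1):
        names += [name + '**' + str(power) for name in column_names]
    return names

layout = cached_column_names_sketch(['x', 'y', 'z'],
                                    cache_crossproducts=True, cache_powers=3)
```

For k original columns this gives k(k-1)/2 crossproduct columns and k(cache_powers - 1) power columns, which is worth estimating before caching a wide matrix.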

mathDict( ) and additional supporting classes

class ParallelRegression.mathDict[source]

Bases: object

Attributes

hypothesis : See mathDictHypothesis.

mathDict.__init__(SharedDataArray, items, column_names, mask=None, dtypes=None, calculated_columns=None, cache_crossproducts=False, cache_powers=1, terms=None, padding=None)[source]

dict( )-like interface to a shared-memory, two-dimensional array of heterogeneous numeric data-typed columns that builds linear constraints/hypotheses and pre-calculates powers and cross-products of columns.

Parameters:
  • SharedDataArray (multiprocessing.sharedctypes.RawArray or comparable) – The shared-memory byte array in which the columns and pre-calculated manipulations thereof are stored. See the note regarding array length.
  • items (int) – The number of cells in the matrix of original columns, not including the intercept column or calculated columns. All columns must be supplied to mathDict( ) with the same number of rows.
  • column_names (sequence of strings) – The names of the original columns, not including the intercept column or calculated columns. There must be exactly one string in column_names for each column in the matrix of original columns.
  • mask (sequence of booleans, optional) – Columns with an associated mask boolean of True are hidden. The first boolean in the sequence is associated with the intercept column, resulting in a column of ones at index 0 if set to False. After that, the value at mask[1] corresponds to column_names[0], and so on. If the mask sequence is shorter than the sequence of column_names, then columns without an associated mask value are unmasked.
  • dtypes (string or Sequence, optional) – If a sequence, there must be one value for each column, representing the numpy data type of the cells in that column. If a single string, then all columns must consist of cells of that data type. Currently only 8-byte-per-item data types (‘i8’, ‘f8’) are supported.
  • calculated_columns (sequence of strings) – Each string represents a column that extends the matrix represented by the mathDict( ) beyond the matrix of original columns with a column calculated therefrom using operations supported by mathDict( ). Currently limited to crossproducts, powers, and lags.
  • cache_crossproducts (boolean, optional) – If True, then the crossproducts of each combination (without replacement) of columns in the matrix of original columns have been pre-calculated and appended to the shared data array after the original columns. Use mathDictMaker( ) to pre-calculate these values.
  • cache_powers (int, optional) – If set to an integer greater than 1, then the powers of each original column ranging from 2 through this value (inclusive) have been pre-calculated and appended to the shared data array after the original columns and cached crossproducts (if present). Use mathDictMaker( ) to pre-calculate these values.
  • terms (termSet( ), optional) – The termSet( ) to be used by mathDict.hypothesis.add( ) in determining whether or not adding a column string to the hypothesis should result in a RankError.
  • padding (list of ints, optional) – A list in which each entry corresponds to the column at the same index in column_names, indicating the number of rows of padding at the beginning of column. This is the list produced by mathDictMaker.padding_list. If provided, this must be the same length as column_names, with zeros for each column that has no padding.
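The mask rules described for the mask parameter can be illustrated with a small stand-alone sketch. The helper visible_columns is hypothetical and is not part of ParallelRegression; treating mask=None as "nothing masked" is an assumption, since the text above only covers a mask that is shorter than column_names.

```python
def visible_columns(column_names, mask=None):
    """List the visible column names under the documented mask rules.

    mask[0] belongs to the Intercept column and mask[i + 1] belongs to
    column_names[i]. True hides a column. Columns without a mask entry
    are treated as unmasked. (Hypothetical helper, not library code.)
    """
    mask = list(mask or [])  # assumption: None means nothing is masked
    names = ['Intercept'] + list(column_names)
    # Pad a short mask with False so every column has an entry.
    mask += [False] * (len(names) - len(mask))
    return [name for name, hidden in zip(names, mask) if not hidden]


# Intercept hidden, 'a' visible, 'b' hidden, 'c' has no entry so it is visible.
print(visible_columns(['a', 'b', 'c'], mask=[True, False, True]))
```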

Notes

Size of SharedDataArray: The shared-memory array identified by the shared data array parameter stores each column in the matrix of original columns, in column-major order, followed by each pre-calculated crossproduct column (if present) and pre-calculated power column (if present). With 8-byte-per-item data types, the matrix of original columns alone requires 8*[row count]*[column count] bytes.
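As a rough sizing aid, the arithmetic in this note can be sketched as follows. The function shared_array_bytes is a hypothetical helper, not part of ParallelRegression. It assumes that every cached column occupies the same space as an original column, that crossproducts cover each pair of distinct columns once, and that no additional space is needed for padding or the intercept; none of these details are confirmed beyond the description above.

```python
from math import comb

BYTES_PER_ITEM = 8  # only 'i8' and 'f8' are currently supported


def shared_array_bytes(rows, columns, cache_crossproducts=False, cache_powers=1):
    """Estimate the byte length of the shared array (hypothetical helper)."""
    total_columns = columns
    if cache_crossproducts:
        # One cached column per pair of distinct original columns.
        total_columns += comb(columns, 2)
    if cache_powers > 1:
        # Powers 2..cache_powers (inclusive) of every original column.
        total_columns += columns * (cache_powers - 1)
    return BYTES_PER_ITEM * rows * total_columns


print(shared_array_bytes(rows=6, columns=4))
print(shared_array_bytes(rows=6, columns=4, cache_crossproducts=True, cache_powers=3))
```

With 6 rows and 4 columns, the original matrix alone needs 8 * 6 * 4 = 192 bytes; adding 6 crossproduct columns and 8 power columns brings the estimate to 8 * 6 * 18 = 864 bytes.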

mathDict.__getitem__(index)[source]

Returns the matrix represented by the mathDict( ) or a portion thereof.

Supported Notation

str -> returns one vertical array
A string can be the name of one column or the formula for one column that can be calculated from the original columns. No check is performed to ensure that referenced columns are unmasked.
int -> returns one vertical array
A single integer will return the corresponding column in the matrix represented by the mathDict( ) object.
int(m):int(n) slice -> returns a matrix with n - m columns.
A slice with start and/or stop specified will return the corresponding columns in the matrix represented by the mathDict( ). A slice with neither specified will return the entire matrix represented by the mathDict( ). This includes local columns and the columns listed in the calculated_columns attribute, but does not include masked columns.

Examples

str : ‘a’
Returns column ‘a’.
str : ‘a * b’
Returns a column in which each row i is ‘a’[i] * ‘b’[i].
str : ‘L2@a’
Returns the second lag of column ‘a’. (Case sensitive.)
int : 0
If the first value in the mask sequence is False, this will return a column of ones. The data type of cells in the Intercept column will match the first column in the matrix of original columns.
slice : [:]
This will return the Intercept column of ones unless masked, unmasked columns in the matrix of original columns, unmasked local columns, and finally calculated columns.
mathDict.add(column_string)[source]

Adds column_string to the matrix represented by the mathDict( ), and returns the mathDict( ) if successful.

Raises: UnsupportedColumn(Warning) – If column_string is neither the name of a shared or local column, nor something that mathDict( ) can calculate therefrom. .args[1] == .columns is a list of the unsupported column(s).
mathDict.add_from_RHS(formula, defer_calculations=False)[source]

Adds columns to the matrix represented by the mathDict( ) based on terms in the right hand side of a string representation of a formula.

If there are one or more ‘~’ characters in the formula string, everything to the left of the first ‘~’ character will be ignored. Then, the string will be divided into `column_string`s by splitting on ‘+’ and ‘-’ characters that are not enclosed within brackets, and .add( column_string ) will be attempted for each column string.

Note: Subtracting in lieu of adding negatives is not currently supported, but there is no error checking for this. Formula strings split on the minus sign in anticipation of a subsequent release supporting subtraction directly.

Returns:
  • self (mathDict) – If there were zero ‘~’ characters in the original string and every .add( ) attempt was successful.
  • string – If there were one or more ‘~’ characters in the original string, then the substring to the left of the first ‘~’ character, stripped of padding whitespaces, is returned if/when every .add( ) attempt is successful.
Raises: UnsupportedColumn(Warning) – If one or more column strings is unsupported. .args[1] == .columns is a list of the unsupported column(s). .LHS contains the string to the left of the first ~, if any, stripped of padding whitespaces. All supported columns are still added even when an unsupported column occurs mid-formula.
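The splitting rule described for add_from_RHS( ) can be approximated with the stand-alone sketch below. split_rhs is a hypothetical helper, not the library's implementation. It assumes that "brackets" means (), [] and {}, and it does not check whether a term is something mathDict( ) could actually calculate; the term g(d + e) in the usage line is only a placeholder for a bracketed expression.

```python
def split_rhs(formula):
    """Split a formula string into (LHS, list of column strings).

    Everything left of the first '~' becomes the LHS (or None if there is
    no '~'). The remainder is split on '+' and '-' characters that are not
    enclosed in brackets. (Hypothetical helper, not library code.)
    """
    lhs = None
    if '~' in formula:
        lhs, formula = formula.split('~', 1)
        lhs = lhs.strip()
    terms, current, depth = [], [], 0
    for char in formula:
        if char in '([{':
            depth += 1
        elif char in ')]}':
            depth -= 1
        if char in '+-' and depth == 0:
            # A top-level separator ends the current column string.
            terms.append(''.join(current).strip())
            current = []
        else:
            current.append(char)
    terms.append(''.join(current).strip())
    return lhs, [term for term in terms if term]


print(split_rhs('y ~ a + b * c + g(d + e)'))
```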
mathDict.columns()

list of strings: Lists the name of each column in the matrix represented by the mathDict( ), starting with the Intercept column unless masked, followed by columns in the matrix of original columns that are not masked, by local columns, and finally by calculated columns.

mathDict.config_to_dict(save_terms=True)[source]

Returns a dict( ) subclass object containing the configuration of this mathDict( ) matrix that has a .rebuild( SharedDataArray=REQUIRED ) method to recreate the mathDict( ) in a different process.

NOTE: Hypothesis information is not stored.

mathDict.crosspower(column_a, power_a, column_b, power_b, lag_a=None, lag_b=None, vector='column')[source]

Returns the product of two columns, each raised to the specified power. Makes use of precalculated columns if applicable.

mathDict.crossproduct(column_a, column_b, lag_a=0, lag_b=0, vector='column')[source]

Returns a column in which each row i is column_a[i] * column_b[i]. Checks to see if crossproducts have been pre-calculated and returns the pre-calculated column if present. If column_a and column_b identify the same column, the request is transferred to the .power( ) method.
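The behavior described for .crossproduct( ) can be pictured with plain Python lists. This is only a sketch: the functions and the dict-based cache are hypothetical, whereas mathDict( ) keeps its pre-calculated columns in the shared data array, and lags and the vector argument are ignored here.

```python
def power_sketch(column, exponent):
    """Raise every cell of a column to the given power."""
    return [value ** exponent for value in column]


def crossproduct_sketch(columns, name_a, name_b, cache=None):
    """Row-wise product of two named columns (hypothetical helper).

    A request for the product of a column with itself is handed to the
    power function. Otherwise a cached column is returned when available.
    """
    if name_a == name_b:
        return power_sketch(columns[name_a], 2)
    key = frozenset((name_a, name_b))  # order of the pair does not matter
    if cache is not None and key in cache:
        return cache[key]
    return [x * y for x, y in zip(columns[name_a], columns[name_b])]


columns = {'a': [1, 2, 3], 'b': [4, 5, 6]}
print(crossproduct_sketch(columns, 'a', 'b'))
print(crossproduct_sketch(columns, 'a', 'a'))
```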

mathDict.get_column(column_name, lag=0, vector='column')[source]

Returns the column with the specified name from either the matrix of original columns or the local column store.

No check is performed to ensure that the requested column is not masked, and calculated columns are not returned.

mathDict.iter_map(arg_iterable, func, placement=0, process_count=None, use_kwargs=False, number_results=False)[source]

Comparable to .map( ) except that it returns an unsorted iterable instead of a list( ).

Parameters:
  • arg_iterable (iterable of tuples) – tuple of positional arguments to be passed into func. The final value in the tuple can be a dict( ) of keyword arguments if use_kwargs is set to True.
  • func (function) – The function to be called. It must be pickleable.
  • placement (int or string, optional) – If an integer, then the matrix will be inserted as a positional argument at this location. If a string, then the matrix will be passed in as a keyword argument using this keyword.
  • process_count (int, optional) – The number of child processes to launch. If this is not set, then Python will try to determine a suitable number, using a minimum of two processes. That detection is unreliable, so it is always better to provide this argument.
  • use_kwargs (bool) – If True, then the final value in each tuple of arguments will be treated as a dict( ) of keyword arguments for func.
  • number_results (bool) – If True, each result will be provided in the form of a tuple in which the first value is the position of the argument tuple in arg_iterable from which the result was computed, and the second value is the result itself.
Returns:

Results in unsorted, iterable form.

Return type:

Iterable
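The unsorted results and the number_results option can be illustrated with the standard library alone. Note the substitution: the sketch below uses threads from concurrent.futures in place of the child processes and shared memory that .iter_map( ) uses, and it always passes the matrix as the first positional argument. iter_map_sketch is a hypothetical name, not library code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def iter_map_sketch(matrix, arg_iterable, func, process_count=2, number_results=False):
    """Yield results in completion order (thread-based stand-in)."""
    with ThreadPoolExecutor(max_workers=process_count) as pool:
        # Remember the position of each argument tuple in arg_iterable.
        futures = {
            pool.submit(func, matrix, *args): position
            for position, args in enumerate(arg_iterable)
        }
        for future in as_completed(futures):
            result = future.result()
            yield (futures[future], result) if number_results else result


def sum_row(matrix, row):
    return sum(matrix[row])


matrix = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
numbered = list(iter_map_sketch(matrix, [(i,) for i in range(3)], sum_row,
                                number_results=True))
# Results may arrive in any order, so sort by position before printing.
print(sorted(numbered))
```

Numbering the results lets the caller restore the original order, or match each result to its argument tuple, without waiting for every task to finish.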

mathDict.map(arg_iterable, func, placement=0, process_count=None, use_kwargs=False, ordered=False)[source]

Uses parallel processes and shared memory to call func with each tuple of arguments, also passing in the matrix as an argument.

Parameters:
  • arg_iterable (iterable of tuples) – tuple of positional arguments to be passed into func. The final value in the tuple can be a dict( ) of keyword arguments if use_kwargs is set to True.
  • func (function) – The function to be called. It must be pickleable.
  • placement (int or string, optional) – If an integer, then the matrix will be inserted as a positional argument at this location. If a string, then the matrix will be passed in as a keyword argument using this keyword.
  • process_count (int, optional) – The number of child processes to launch. If this is not set, then Python will try to determine a suitable number, using a minimum of two processes. That detection is unreliable, so it is always better to provide this argument.
  • use_kwargs (bool) – If True, then the final value in each tuple of arguments will be treated as a dict( ) of keyword arguments for func.
  • ordered (bool) – If True, then the results will be listed in the order of the argument tuples. Otherwise, results may be in any order. Note: argument tuples are processed asynchronously (out-of-sequence) either way. This option sorts the results after they have been computed.
Returns:

Results in list( ) form.

Return type:

list

Example

>>> import numpy as np
>>> from ParallelRegression import mathDictMaker
>>> def sum_row( matrix, row ):
...     # Put this in a_file.py and import it if you receive pickle-
...     # related errors.
...     return sum( matrix[row, :] )
>>> matrix = np.array( [i for i in range( 24 )] ).reshape( (6, 4) )
>>> RA, MD = mathDictMaker.fromMatrix( matrix, integer=True )
>>> res = MD.map( [(i,) for i in range( 6 )], sum_row, ordered=True )
>>> print( res )
[6, 22, 38, 54, 70, 86]
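How the placement and use_kwargs parameters combine into a single call to func can be sketched as follows. call_with_matrix is a hypothetical helper, not the library's code. It assumes that an integer placement is counted after the trailing dict( ) of keyword arguments has been removed from the tuple.

```python
def call_with_matrix(func, matrix, args, placement=0, use_kwargs=False):
    """Call func with the matrix inserted per placement (hypothetical)."""
    args = list(args)
    # With use_kwargs, the final value in the tuple is a dict of keywords.
    kwargs = dict(args.pop()) if use_kwargs else {}
    if isinstance(placement, str):
        kwargs[placement] = matrix     # pass the matrix as a keyword argument
    else:
        args.insert(placement, matrix)  # pass the matrix positionally
    return func(*args, **kwargs)


def describe(label, data, scale=1):
    return label, [value * scale for value in data]


print(call_with_matrix(describe, [1, 2], ('x', {'scale': 2}), placement=1, use_kwargs=True))
print(call_with_matrix(describe, [1, 2], ('x',), placement='data'))
```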
mathDict.mask_all(except_intercept=False, clear_calculated=True)[source]

Masks the Intercept column (by default), every column in the matrix of original columns, and every local column, leaving calculated columns unaffected (by default).

Parameters:
  • except_intercept (boolean, optional) – If True, then the Intercept will be unmasked regardless of its state prior to this method call.
  • clear_calculated (boolean, optional) – If True, then the list of calculated columns will be deleted.
mathDict.power(column_name, power, lag=0, vector='column')[source]

Returns the requested column with each cell raised to the requested power. Checks to see if the power has been pre-calculated and returns the pre-calculated column if present.

mathDict.rows()

The number of rows in the matrix of original columns.

mathDict.set_mask(column_name, mask=True)[source]

Sets the mask of the specified column to the specified value, extending the length of the mask sequence if necessary. If column_name == ‘Intercept’, then it sets the mask of the Intercept column instead.

mathDict.shape()

tuple(effective row count, column count): The effective row count will be listed even if all columns are masked, resulting in an (n, 0) tuple.

Effective row count: The number of rows in the matrix of original columns, less the maximum lag that the mathDict( ) has been configured to support.
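The relationship between lags and the effective row count can be sketched with plain lists. Only the subtraction is taken from the definition above; the alignment shown in aligned( ) (dropping leading rows so that lagged and unlagged columns line up) is an assumption about how such a matrix would typically be assembled, and both function names are hypothetical.

```python
def effective_row_count(rows, max_lag):
    """Rows in the original columns, less the maximum supported lag."""
    return rows - max_lag


def aligned(column, lag, max_lag):
    """Return the rows of `column`, lagged by `lag`, that line up with the
    other columns when the largest supported lag is `max_lag` (assumed
    alignment, not confirmed by the source)."""
    start = max_lag - lag
    return column[start:start + len(column) - max_lag]


a = [10, 11, 12, 13, 14, 15]
print(effective_row_count(len(a), max_lag=2))
print(aligned(a, lag=0, max_lag=2))  # unlagged values
print(aligned(a, lag=2, max_lag=2))  # second lag: values from two rows earlier
```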

mathDict.unmask_all()[source]

Unmasks the Intercept column and every column in the matrix of original columns.

class ParallelRegression.mathDictConfig[source]

Bases: dict

rebuild(SharedDataArray, terms=None)[source]

Recreates the mathDict( ) object whose configuration is stored in this dict( ). Requires that the shared data array is provided as a parameter.

class ParallelRegression.mathDictHypothesis(mathDict)[source]

Bases: object

Generates testable hypotheses about a mathDict( ) matrix in the form of linear constraints (for use in regression analysis).

An ‘X’ matrix representing the regressors (RHS, or ‘independent’ variables) for a linear model for testing the hypothesis is generated. The columns in the matrix will be the union of all columns in the matrix represented by the mathDict( ) object and all columns in the hypothesis, including calculated columns.

An ‘R’ matrix with the same number of columns as the ‘X’ matrix and one row for each column in the hypothesis, as well as an ‘r’ column vector/vertical array with one row/cell for each column in the hypothesis will also be generated. These can then be used for either an F or a Wald test.
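The shapes of the ‘R’ matrix and ‘r’ vector can be illustrated with a short sketch of the linear restriction R·β = r. The helper make_restrictions is hypothetical; it assumes that each hypothesis restricts a single coefficient to a given value (zero by default), which matches the shapes described above but is not confirmed as the library's construction.

```python
def make_restrictions(x_columns, hypotheses):
    """Build R (one row per hypothesis) and r (one cell per hypothesis).

    x_columns  : names of the columns in the X matrix, in order.
    hypotheses : dict mapping a column name to its hypothesized coefficient.
    (Hypothetical helper, not library code.)
    """
    R, r = [], []
    for name, value in hypotheses.items():
        row = [0.0] * len(x_columns)
        row[x_columns.index(name)] = 1.0  # select this column's coefficient
        R.append(row)
        r.append([float(value)])          # r is a column vector
    return R, r


x_columns = ['Intercept', 'a', 'b', 'a * b']
R, r = make_restrictions(x_columns, {'b': 0, 'a * b': 0})
print(R)
print(r)
```

Here R has two rows and four columns, and r has two rows, so the pair expresses the joint hypothesis that the coefficients on ‘b’ and ‘a * b’ are both zero.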

add(column, hypothesis=0)[source]

Adds a column to the hypothesis, and to the resulting X matrix if not included in the matrix represented by the mathDict( ).

Raises: RankError – If an attempt to square a dummy variable is made.
make()[source]

Returns a tuple consisting of the X matrix, R matrix, and r column vector/vertical array, each in the form of a two-dimensional numpy array.

Exceptions and Warnings

class ParallelRegression.CategoryError[source]

Bases: Exception

Raised by categorizedSetDict( ) when an error results from an invalid category as opposed to an invalid key or value that would raise a KeyError or ValueError.

class ParallelRegression.UnsupportedColumn(*args, LHS=None)[source]

Bases: Warning

Raised by mathDict( ) when .add( ) or .add_from_RHS( ) is used in an attempt to add a string as a column that is not understood as a column by mathDict( ).

msg, args[0]

string

Error message.

columns, args[1]

list

Lists the column or columns that are not understood by mathDict( ).

LHS

string

The left-hand-side of a formula string provided to .add_from_RHS( ) when the formula string contained at least one tilde (‘~’) character.
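The attribute layout described above can be mimicked with a small Warning subclass. UnsupportedColumnSketch is a hypothetical stand-in written for illustration; it follows the documented signature (*args, LHS=None) and attribute names but is not the library's class.

```python
class UnsupportedColumnSketch(Warning):
    """Stand-in showing msg == args[0], columns == args[1], and LHS."""

    def __init__(self, *args, LHS=None):
        super().__init__(*args)
        self.LHS = LHS

    @property
    def msg(self):
        return self.args[0]

    @property
    def columns(self):
        return self.args[1]


try:
    raise UnsupportedColumnSketch('Unsupported column(s).', ['a ^ b'], LHS='y')
except UnsupportedColumnSketch as warning:
    caught = (warning.msg, warning.columns, warning.LHS)

print(caught)
```

Because the class derives from Warning, which is itself an Exception subclass, it can be raised and caught like any other exception, and a handler can read .columns to learn which column strings failed.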

class ParallelRegression.mathDictKeyError[source]

Bases: KeyError

Subclass of KeyError used in error handling to distinguish between calling mathDict( ).__getitem__( ) with an invalid key/index, resulting in mathDictKeyError, or a facially-valid key/index the handling of which causes a KeyError for some other reason.
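The distinction this subclass makes possible can be shown with a generic pattern. All names below are hypothetical; the sketch uses a plain dict( ) and a simple type check in place of mathDict( ).__getitem__( ).

```python
class InvalidIndexError(KeyError):
    """Raised when the key itself is not of a supported kind."""


def lookup(table, key):
    if not isinstance(key, (str, int)):
        raise InvalidIndexError(key)
    return table[key]  # may raise a plain KeyError for a missing key


def classify(table, key):
    try:
        lookup(table, key)
        return 'found'
    except InvalidIndexError:   # the subclass must be caught first
        return 'invalid key'
    except KeyError:
        return 'missing'


print(classify({'a': 1}, 'a'), classify({'a': 1}, 1.5), classify({'a': 1}, 'b'))
```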

class ParallelRegression.RankError[source]

Bases: Warning

Raised by mathDictHypothesis( ) when mathDictHypothesis( ).add( ) is able to determine that adding a hypothesis about the specified column would result in a matrix of insufficient rank for computing an F statistic evaluating the hypothesis.

To enable mathDict’s ability to anticipate matrices of less than full rank, use the .terms termSet( ) attribute of mathDict( ) to inform mathDictHypothesis( ) which terms in which forms are dummy variables. This has been shown in profiles to be substantially more efficient than mathematically determining the impact of the additional column on the rank of the matrix.
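Why squaring a dummy variable leads to a rank problem can be checked by hand: a column of zeros and ones is unchanged by squaring, so the squared column duplicates the original. The sketch below is a plain-Python illustration of that arithmetic and is not how mathDictHypothesis( ) detects the problem.

```python
dummy = [0, 1, 1, 0, 1]
squared = [value ** 2 for value in dummy]

# 0**2 == 0 and 1**2 == 1, so the squared column equals the original.
print(squared == dummy)

# With two identical columns, the 2x2 matrix X'X is singular.
s_dd = sum(d * d for d in dummy)
s_ds = sum(d * s for d, s in zip(dummy, squared))
s_ss = sum(s * s for s in squared)
determinant = s_dd * s_ss - s_ds * s_ds
print(determinant)
```

A zero determinant means the two columns are perfectly collinear, so a regression containing both cannot identify separate coefficients for them.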

Example

>>> RA, MD = mathDictMaker( [details omitted] ).make( )
>>> # Link the mathDict( ) to an existing termSet( ) using simple attribute
>>> # assignment.
>>> MD.terms = termSet( [details omitted] )
>>> MD.hypothesis.add( [details omitted] )