Module Importing Is Broken

Copyright © Graham Dumpleton

This article describes various problems that exist with the module importing system which is included with mod_python. Its purpose is to highlight in one spot the problems so that users are aware of them, but also to serve as a basis for discussion for moving forward and fixing any bugs or generally improving the module importing mechanism.

This document will always be a work in progress and will be updated with more problems and issues over time. Feedback on issues not identified here or corrections to anything presented here is appreciated.

Configuration Of Logging

[ISSUE 1] Because the only way to enable logging when using the import_module() function explicitly is by supplying the log argument to the function itself, there is no effective way of enabling logging of imports globally in one place.

This is because, although the req object is available within the context of a handler and can be queried for the state of the PythonDebug setting, the same cannot be done at global scope in a module. Either way, the user is left to make an explicit check to determine whether logging should be enabled.

    from mod_python import apache
    import os

    directory = os.path.dirname(__file__)

    # No way to determine from the Apache configuration at
    # global scope what the "log" argument should be set to.

    module1 = apache.import_module("module1", path=[directory])

    def handler(req):
        # The configuration table holds strings, so convert to an
        # integer; the string "0" would otherwise be treated as true.
        log = int(req.get_config().get("PythonDebug", "0"))
        module2 = apache.import_module("module2", log=log, path=[directory])
        ...

[ISSUE 2] Even in a production environment where PythonDebug may be disabled to prevent internal details of Python related errors being returned to a client making a request, being able to get information about when module imports are occurring is still useful. Because the logging of this information is bound to the PythonDebug directive, it cannot be separately enabled.

[ISSUE 3] During debugging, to know when a module is being imported for the first time, as opposed to it being reloaded at a later time, can be important. At present the messages which are logged do not in general distinguish between the two cases of an initial import and a subsequent reimport of the same module, nor do they indicate which Apache child process it is occurring in.

<!> Note that [ISSUE 1] has been addressed in mod_python 3.3 when the new importer is being used. Specifically, in mod_python 3.3 the import_module() function is itself able to directly access at any time the value of the PythonDebug directive from the request or main server configuration object as appropriate. The log argument is therefore redundant and would only need to be provided in special cases where the value inherited from the configuration needed to be overridden. Because of this change, the default value of the log argument has changed to be None, indicating that the value inherited from configuration should be used.
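
The "None means inherit from configuration" default described here can be sketched in plain Python. The function name resolve_option and the dictionary standing in for the Apache configuration are hypothetical, and directive values are assumed to be stored as strings; this is an illustration of the idea, not mod_python's implementation.

```python
def resolve_option(explicit, config, name, default=0):
    """Return the effective on/off value for an importer option.

    explicit -- value passed by the caller, or None to inherit
    config   -- mapping standing in for the Apache configuration
    name     -- directive name, e.g. "PythonDebug"
    """
    if explicit is not None:
        # An explicit argument always overrides the configuration.
        return bool(explicit)
    # Assumption: directive values are strings such as "0" or "1".
    return bool(int(config.get(name, default)))


config = {"PythonDebug": "1"}
print(resolve_option(None, config, "PythonDebug"))   # True: inherited
print(resolve_option(0, config, "PythonDebug"))      # False: overridden
```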

<!> Note that [ISSUE 3] has been addressed in mod_python 3.3 when the new importer is being used. Specifically, in mod_python 3.3 there are distinct messages logged indicating when a module is being imported the first time versus a subsequent reload. Further, the messages logged also indicate the Apache process ID and Python interpreter name so that it is clear for which process and interpreter the import is occurring.

Cache Access Not Exclusive

[ISSUE 4] Whether on the first import or a subsequent reload, if a multithreaded MPM is being used and multiple threads call into the import_module() function at the same time for the same module, the module can be loaded more than once even though it isn't required.

This is because of a lack of thread locks on the module cache. Multiple threads can at the same time determine that the module needs to be reloaded. Although this occurs, the locks implicit within the underlying Python module importing system prevent the threads from importing the actual module at the same time. Each thread will however still import the module in turn, with all imports except the first being redundant.
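
The check-then-load race can be reproduced outside Apache with a small simulation. All names below (cache, load, import_unlocked, import_locked) are made up for the sketch, and the sleep exists only to widen the race window; it shows the general pattern and a coarse grained lock as one possible remedy, not the actual mod_python code.

```python
import threading
import time

cache = {}
load_count = 0
cache_lock = threading.Lock()


def load(name):
    """Stand-in for an expensive module load."""
    global load_count
    time.sleep(0.01)          # widen the race window
    load_count += 1
    return object()


def import_unlocked(name):
    # Check-then-act without a lock: several threads can all see the
    # cache miss and each perform a redundant load.
    if name not in cache:
        cache[name] = load(name)
    return cache[name]


def import_locked(name):
    # A single coarse grained lock makes the check and the load atomic.
    with cache_lock:
        if name not in cache:
            cache[name] = load(name)
        return cache[name]


def run(importer, threads=8):
    """Call importer from several threads and return the number of loads."""
    global load_count
    cache.clear()
    load_count = 0
    workers = [threading.Thread(target=importer, args=("module1",))
               for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return load_count


print("loads with lock:", run(import_locked))
print("loads without lock:", run(import_unlocked))
```

With the lock the module is loaded exactly once. Without it the count depends on thread timing and is usually greater than one.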

<!> Note that [ISSUE 4] has been addressed in mod_python 3.2. A coarse grained lock is used, however, rather than a multi level locking solution, so performance may be affected minimally where high volumes of requests are being handled. This performance issue is not relevant to mod_python.publisher in mod_python 3.2, as it uses its own distinct module importing system which implements the preferred multi level locking solution. The fact that it uses its own module importing system does, however, introduce other problems, as described later. In mod_python 3.3, any potential performance issues in this respect are addressed through the new module importer also using a multi level locking solution. In mod_python 3.3, mod_python.publisher will use the new module importer rather than its own, eliminating the new problems introduced in 3.2 due to it having its own module importer.

Restoring Of Backup Files

[ISSUE 5] When ascertaining if a module has been changed on disk since the last time it was loaded, mod_python will only consider the file as having changed if the modification time is newer than that which it was previously. This means you cannot restore an older file from a backup with a modification time earlier than the current file. You would need to physically touch the file to make the modification time newer than that of the file being replaced in order to force it to be reloaded. This issue is further described in JIRA as MODPYTHON-7.

<!> Note that [ISSUE 5] has been addressed in mod_python 3.2. Specifically, any change to the modification time of the file will cause it to be reloaded.
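
The difference between the two checks comes down to a single comparison operator. The function names and timestamps below are illustrative only.

```python
def changed_newer_only(cached_mtime, current_mtime):
    # Behaviour described for [ISSUE 5]: only a newer file counts.
    return current_mtime > cached_mtime


def changed_any_difference(cached_mtime, current_mtime):
    # Behaviour described for the fix: any difference counts.
    return current_mtime != cached_mtime


cached = 1_000_000_200    # mtime recorded when the module was loaded
restored = 1_000_000_100  # older file restored from a backup

print(changed_newer_only(cached, restored))      # False: restore is missed
print(changed_any_difference(cached, restored))  # True: restore is detected
```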

Packages Loaded Wrongly

[ISSUE 6] The import_module() function can be used to load either a standalone file based module or a package. It can also be used to load a sub module/package from within a package. When loading a sub module/package which wasn't implicitly loaded when the package root was imported, the import_module() function does not insert a reference to the sub module/package into the appropriate parent module within the package. This issue is further described in JIRA as MODPYTHON-12.

The problem means that if mod_python.publisher or mod_python.psp are loaded as the PythonHandler and at a later point a handler attempts to use the Python import statement on these sub modules of the mod_python package, they will not be able to be found. This is because the reference to the sub modules is not registered in the mod_python module as "publisher" and "psp" appropriately.
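
The missing step can be shown with modules built in memory. The package name demo_pkg is hypothetical; the point is only that an entry in sys.modules does not by itself give the parent module a reference to the sub module.

```python
import sys
import types

# Hypothetical package and sub module, built in memory for illustration.
pkg = types.ModuleType("demo_pkg")
pkg.__path__ = []                      # marks the module as a package
child = types.ModuleType("demo_pkg.child")

# Registering both in sys.modules is not enough on its own ...
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.child"] = child
print(hasattr(pkg, "child"))           # False: parent has no reference

# ... the sub module must also be bound as an attribute of its parent.
setattr(pkg, "child", child)
print(pkg.child is sys.modules["demo_pkg.child"])   # True
```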

<!> Note that [ISSUE 6] has been addressed in mod_python 3.2.

Disabling Of AutoReload

[ISSUE 7] Similar to the problem of enabling/disabling logging of module reloads, the autoreload feature for modules cannot be turned off globally in one place. This is because, although the PythonAutoReload directive can be used to disable reloading for top level handler imports, it does not control whether autoreload is enabled for explicit use of the import_module() function. In this case it is up to the user to somehow determine whether the autoreload feature should be disabled and pass an explicit argument to the import_module() function call to disable it for that specific call. As shown for logging, the request object is not accessible when code at global scope within a module is being executed and so the configuration cannot be consulted.

[ISSUE 23] This is made worse by the fact that different parts of the document tree can have the autoreload feature enabled at the same time it is disabled elsewhere. The problem that arises here is that a module may get reloaded when not expected resulting in a mismatch in versions of code being used. Being able to have the autoreload feature selectively enabled/disabled in distinct parts of the document tree is useful, however, there should probably also be a way of globally disabling the autoreload feature for that whole interpreter from that point onwards until a restart. This would override any attempt to selectively disable/enable the feature in selected parts of the document tree.

<!> Note that [ISSUE 7] has been addressed in mod_python 3.3 when the new importer is being used. Specifically, in mod_python 3.3 the import_module() function is itself able to directly access at any time the value of the PythonAutoReload directive from the request or main server configuration object as appropriate. The autoreload argument is therefore redundant and would only need to be provided in special cases where the value inherited from the configuration needed to be overridden. Because of this change, the default value of the autoreload argument has changed to be None, indicating that the value inherited from configuration should be used.

<!> Note that [ISSUE 23] has in part been addressed in mod_python 3.3, whereby the new importer provides the apache.freeze_modules() function. Once this function has been called, any automatic reloading of modules from that point on for that specific child process is disabled until that child process is shut down. To disable automatic module reloading and ensure that it cannot be re-enabled using the PythonAutoReload directive, a module which calls apache.freeze_modules() could be imported using the PythonImport directive. Further control might be possible in the future if aspects of MODPYTHON-183 are implemented.

Children Are Not Consulted

[ISSUE 8] Where the Python code file for a module which is referenced in a Python*Handler directive is changed and the autoreload feature is enabled, that Python code file will be reloaded before the handler is executed. If at global scope within this module the import_module() function had been explicitly used to import a child module and rather than the top level module code file being modified, the code file for the child module is changed, nothing is reloaded.

The only way to currently force the reloading of the child module is to manually touch the top level module to change its modification time as well, thereby triggering a reload of the parent and the child. Note that if there are multiple levels of imports, one in practice has to manually touch all ancestor modules of the module which was changed, right back up to the root module. If the child module was imported from multiple places, all ancestors through all parents should be touched to ensure that all parents are reloaded as necessary.

<!> Note that [ISSUE 8] has been addressed in mod_python 3.3 when the new importer is being used. Specifically, in mod_python 3.3 the module importer keeps track of the relationships between modules that have been imported and it will automatically reload a module even if not changed directly, but some child module had been changed.
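
One way to picture this kind of dependency tracking is a recursive check over a record of which module imported which. The data structures and the needs_reload function are assumptions made for this sketch and say nothing about how the new importer stores its information.

```python
def needs_reload(name, children, loaded_mtime, disk_mtime, _seen=None):
    """Return True if `name` or any module it imports has changed on disk.

    children     -- maps a module name to the names it imported
    loaded_mtime -- modification times recorded at load time
    disk_mtime   -- modification times currently on disk
    """
    seen = set() if _seen is None else _seen
    if name in seen:
        return False        # already visited; guards against import cycles
    seen.add(name)
    if disk_mtime[name] != loaded_mtime[name]:
        return True
    return any(needs_reload(child, children, loaded_mtime, disk_mtime, seen)
               for child in children.get(name, ()))


children = {"handler": ["views"], "views": ["models"]}
loaded = {"handler": 100, "views": 100, "models": 100}
on_disk = {"handler": 100, "views": 100, "models": 150}   # only the leaf changed

print(needs_reload("handler", children, loaded, on_disk))  # True
```

No manual touching of ancestors is needed in this model, because the change in the leaf module is discovered by walking down from the top level module.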

Redundant Module Loading

[ISSUE 9] Because the Python module search path is extended with the directory in the document tree where the Python*Handler directive is defined, it is possible for a module contained within the document tree to be imported from different places using the import statement and the import_module() function. This can cause redundant module loading to occur as well as being the basis for other problems.

When the import statement is first used to import a module and that module is later requested to be imported using the import_module() function, that later request will result in a redundant reload of the module. This is because the import_module() function places a __mtime__ attribute in the module as part of the scheme to determine when to reload a module. Where import was first used to import the module, that attribute will not exist and its value will default to 0, causing import_module() to think it is out of date and reload it.
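
The effect of the missing attribute can be sketched as follows. The is_stale function is a simplified stand-in for the real check, and the timestamp is arbitrary.

```python
import types


def is_stale(module, file_mtime):
    # A module loaded by the plain import statement has no __mtime__,
    # so the recorded time falls back to 0 and the module looks stale.
    return file_mtime > getattr(module, "__mtime__", 0)


file_mtime = 1_159_663_739                    # arbitrary example timestamp

via_import = types.ModuleType("module1")      # no __mtime__ attribute
via_import_module = types.ModuleType("module1")
via_import_module.__mtime__ = file_mtime      # recorded by the importer

print(is_stale(via_import, file_mtime))          # True: redundant reload
print(is_stale(via_import_module, file_mtime))   # False
```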

For basic modules, triggering this redundant loading would require user code to have explicitly used the two different import mechanisms from two different places. In the case of packages, it can happen through no fault of the user's code. Specifically, the redundant reload will occur when the import_module() function is used to import a sub module/package of a package and the parent within the package had already imported that sub module/package using the "import" statement.

This ability to import a module through two different means is also in effect the underlying trigger for [ISSUE 6] although that issue is a distinct problem in its own right because of how packages are incorrectly handled.

[ISSUE 24] A further form of redundant module reloading will occur where import_module() is used to import multiple sub modules of a package. The underlying problem here is that when loading a sub module of a package, the importer will always load all the __init__.py module files for the directory the sub module is contained in, plus those back up through the directory hierarchy which makes up the package. This occurs even if those __init__.py files have previously been loaded and no changes have been made to those files.

<!> Note that [ISSUE 9] has been addressed in mod_python 3.3 when the new importer is being used. This is in part achieved by the new module importer trying to keep distinct modules appearing on sys.path from those which exist within the document tree served by Apache. It is also in part solved by the import statement using the new module importer internally when the module exists within the document tree served by Apache.

<!> Note that [ISSUE 24] has been addressed in mod_python 3.3 in as much as the new module importer will ignore packages. As such, packages must reside on sys.path and packages will not be candidates for automatic module reloading. If reloading of packages specific to the web application is desired, they will need to be restructured so as to use a style of pseudo packages supported by the new module importer.

Caching Of Attributes

[ISSUE 10] The ability to mix the import statement with import_module() also becomes a problem when the from statement is used in conjunction with the import statement. A module may use from to get access to specific attributes from within another module. If that other module is also imported using import_module() it could at some time be reloaded, which may result in the attributes obtained using from becoming disconnected from the original module with them not reflecting the new values which have just been loaded.

This can also occur when just the import statement is used and the parent module explicitly copies an attribute from the child module into itself. Other variations of the problem include the parent module calculating and caching some results based on attributes obtained from the child through direct access or as returned as the result of calling some function within the child.
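
The disconnect can be reproduced with an in-memory module. Executing new source into the existing module object stands in for a reload; the names child and LIMIT are illustrative.

```python
import types

child = types.ModuleType("child")
exec("LIMIT = 10", child.__dict__)

# Equivalent in effect to "from child import LIMIT" in a parent module:
# the parent now holds its own reference to the object.
LIMIT = child.LIMIT

# Simulate a reload of the child into the same module object.
exec("LIMIT = 20", child.__dict__)

print(child.LIMIT)   # 20: access through the module sees the reload
print(LIMIT)         # 10: the copied name still refers to the old value
```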

<!> Note that [ISSUE 10] has been addressed in mod_python 3.3. This is in part solved by the import statement using the new module importer internally when the module exists within the document tree served by Apache.

Reloading Of Packages

[ISSUE 11] When the import_module() function is used to import a sub module/package of a package, it will correctly first load any ancestors of the sub module/package. As already described however, packages do suffer the problems detailed for [ISSUE 6], [ISSUE 9] and [ISSUE 10]. These problems are further complicated because a sub component of a package may reference back through the root of the package using the import statement to get access to a module or package appearing within the sibling of a parent. This results in cycles within the dependency graph of imports within the package itself. All in all, the current implementation of automatic module reloading is in general inadequate for dealing with packages.

<!> Note that [ISSUE 11] has been addressed in mod_python 3.3 in as much as the new module importer will ignore packages. As such, packages must reside on sys.path and packages will not be candidates for automatic module reloading. If reloading of packages specific to the web application is desired, they will need to be restructured so as to use a style of pseudo packages supported by the new module importer.

Overwriting Global Data

[ISSUE 12] When a module is reloaded, it is loaded into the same object space that the existing module is using. This means that any existing data is overwritten. In general this is what you want, however it can cause big problems in a multithreaded system. This is because data can be overwritten at the same time that a handler in the same module or another module, executing in a separate thread, is trying to access it.

Loading on top of existing data can also be a problem where the data is a shared resource such as a database pool. The existing instance of the database pool could be replaced and become inaccessible without any resources it has acquired being released properly. If multiple reloads occur, this can result in resources being exhausted.

Note that the problem with shared resources strictly speaking can't be classified as a bug. Code simply should be structured so as to deal with it correctly, but most wouldn't even be aware of the need to do it. The best approach is simply not to place such shared resources within modules that can be automatically reloaded.

<!> Note that [ISSUE 12] is addressed in mod_python 3.3 in as much as the new module importer will not reload a module on top of an existing instance of the same module. This does mean however that if it is required that data held within the existing instance of the module be preserved, that a special hook function be supplied for migrating such data from the old module instance to the new.

Removal Of Existing Data

[ISSUE 13] Arising from the fact that a module is reloaded into the same object space, is the issue that if an attribute, whether it be data or a function, is removed from the code file on disk, when the module is reloaded that item is not removed from the copy of the module cached within the Python process.

In other words, reloading is always additive and nothing is ever removed. The only way to remove something from a module is to restart Apache. This issue is further described in JIRA as MODPYTHON-116.
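
The additive behaviour follows directly from executing new source into an existing namespace, which can be shown without Apache. The module and function names are illustrative.

```python
import types

module = types.ModuleType("example")

# First version of the code file defines two functions.
exec("def keep(): return 1\ndef obsolete(): return 2", module.__dict__)

# The file is edited so that obsolete() is removed, then the new source
# is executed into the SAME namespace, as a reload in place does.
exec("def keep(): return 10", module.__dict__)

print(module.keep())                  # 10: redefined
print(hasattr(module, "obsolete"))    # True: never removed

# Loading into a fresh module object avoids the leftover attribute.
fresh = types.ModuleType("example")
exec("def keep(): return 10", fresh.__dict__)
print(hasattr(fresh, "obsolete"))     # False
```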

<!> Note that [ISSUE 13] does not exist with mod_python.publisher in mod_python 3.2. This is because the module importer used in that version creates a completely new module into which the code is loaded each time. Although the problem has been addressed for mod_python.publisher in mod_python 3.2, it has not been addressed for modules imported as a consequence of the top level Python*Handler directives or when the import_module() function is used explicitly.

<!> Note that [ISSUE 13] has been addressed fully in mod_python 3.3 when the new importer is used.

Using Same Module Name

[ISSUE 14] The import_module() function is a thin wrapper over the standard Python module importing system. This means that modules are still stored in sys.modules. As modules in sys.modules are keyed by their module name, this in turn means that there can only be one active instance of a module for a specific name.

The import_module() function tries to work around this by checking the path name of the location of a module against that being requested and if it is different will reload the correct module. This check of the path though only occurs when the path argument is actually supplied to the import_module() function. The path is only supplied in this way when mod_python.publisher makes use of the import_module() function, it is not supplied when the Python*Handler directives are used because in that circumstance a module may actually be a system module and supplying path would prevent it from being found.

Even though mod_python.publisher supplies the path argument to the import_module() function, the check of the path has bugs, with modules possibly becoming inaccessible as documented in JIRA as MODPYTHON-9.

The check by mod_python of the path name to the actual code file for a module to determine if it should be reloaded, can also cause a continual cycle of module reloading even though the modules on disk may not have changed. This will occur when successive requests alternate between URLs related to the distinct modules having the same name. This cyclic reloading is documented in JIRA as MODPYTHON-10.
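
The cyclic reloading can be modelled with a cache keyed only by module name that remembers the path each module was last loaded from. The paths and the simplified import_module function below are hypothetical.

```python
cache = {}      # module name -> path it was last loaded from
reloads = 0


def import_module(name, path):
    """Simplified model: reload whenever the cached path differs."""
    global reloads
    if cache.get(name) != path:
        cache[name] = path
        reloads += 1


# Requests alternate between two URLs served by same-named modules.
for _ in range(3):
    import_module("index", "/srv/site/a/index.py")
    import_module("index", "/srv/site/b/index.py")

print(reloads)   # 6: every request triggers a reload, though no file changed
```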

That a module is reloaded into the same object space as the existing module when two modules of the same name are in different locations can also cause namespace pollution and security issues if one location for the module was public and the other private. This cross contamination of modules is documented in JIRA as MODPYTHON-11.

In respect of the Python*Handler directives where the path argument was never supplied to the import_module() function, the result would be that the first module loaded under the specified name would be used. Thus, any subsequent module of the same name referred to by a Python*Handler directive found in a different directory but within the same interpreter would in effect be ignored.

A caveat to this though is that such a Python*Handler directive would result in that handler's directory being inserted at the head of sys.path. If the first instance of the module loaded under that name were at some point modified, the module would be automatically reloaded, but it would load the version from the different directory. This issue as it applies to the Python*Handler directives is separately recorded in JIRA as MODPYTHON-115.

<!> Note that [ISSUE 14] does not exist with mod_python.publisher in mod_python 3.2. This is because the module importer used in that version properly distinguishes modules of the same name located in different directories. In doing this though, the modules loaded by mod_python.publisher are not stored in sys.modules. This fact causes other issues at present because of the mix of different module importing mechanisms that can occur. Although the problem has been addressed for mod_python.publisher in mod_python 3.2, it has not been addressed for modules imported as a consequence of the top level Python*Handler directives or when the import_module() function is used explicitly.

<!> Note that [ISSUE 14] has been addressed fully in mod_python 3.3 when the new importer is used.

Multiple Use Of PythonPath

The PythonPath setting can be used to define the value of the Python sys.path variable. It is this variable which defines the list of directories that Python will search in when looking for a module to be imported.

[ISSUE 15] Although the actual reassignment of sys.path by mod_python does not in itself present a problem due to assignment in Python being thread safe by definition, the context in which the assignment occurs is not thread safe and a race condition exists. This exists as the top level mod_python dispatcher will consult the existing value of sys.path and the last value for the PythonPath setting encountered before then making a decision to modify sys.path. If multiple requests are being serviced in distinct threads within the context of the same interpreter instance, and each at the same time decide they want to modify the value of sys.path, only one might ultimately succeed in setting it to the value it wants and any modification required by the other may be lost.

In the worst case scenario, this can result in the importation of any subsequent modules within that request failing due to a required directory not being present in sys.path. It is possible that this situation may resolve itself and go away on a subsequent request, but due to how mod_python caches the last value of PythonPath in a global variable this will be dependent on what other requests arrive.

[ISSUE 16] At the least, for mod_python to resolve the problem itself would require a request to arrive in the interim which targeted the URL which was not the last to cache its raw setting for PythonPath. This only works though due to a further issue whereby alternate requests against URLs with different PythonPath settings will cause sys.path to be extended every time if the PythonPath setting references sys.path. This results in sys.path continually growing over time due to directories being added multiple times.
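
The growth of sys.path can be modelled with plain lists. The apply_setting function stands in for evaluating a PythonPath setting of the form "sys.path + [...]" against the current path, and apply_setting_once shows one possible alternative that always starts from the original path. Both functions and all directory names are hypothetical.

```python
def apply_setting(current_path, extra_dirs):
    """Model of a setting of the form "sys.path + [...]"."""
    return current_path + extra_dirs


def apply_setting_once(base_path, extra_dirs):
    """Alternative: always extend the ORIGINAL path, skipping duplicates."""
    result = list(base_path)
    for directory in extra_dirs:
        if directory not in result:
            result.append(directory)
    return result


base = ["/usr/lib/python"]
site_a, site_b = ["/srv/a"], ["/srv/b"]

path = list(base)
for _ in range(3):                     # requests alternate between two URLs
    path = apply_setting(path, site_a)
    path = apply_setting(path, site_b)
print(len(path))                       # 7: one entry added per request

print(apply_setting_once(base, site_a))   # ['/usr/lib/python', '/srv/a']
```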

These two issues are further described in JIRA as MODPYTHON-114.

<!> Note that [ISSUE 15] and [ISSUE 16] have both been addressed fully in mod_python 3.3.

The PythonImport Directive

[ISSUE 17] When the PythonImport directive is used, mod_python uses the PyImport_ImportModule() function to import the specified module. If that same module is later imported using the import_module() function, it will be reloaded a second time even though the Python module file hadn't changed and even if PythonAutoReload is set to Off. This issue is further described in JIRA as MODPYTHON-113.

This issue is similar to [ISSUE 9], except that the C API is used instead of the import statement. The PythonImport directive should directly, or possibly indirectly through a new method of the apache.CallBack object, use the import_module() function to import such modules.

[ISSUE 22] If the PythonPath directive is defined at global scope within the Apache configuration, so as to apply to modules imported using the PythonImport directive, then directories listed in the PythonPath directive will end up being added to sys.path multiple times. This issue is further described in JIRA as MODPYTHON-147.

<!> Note that [ISSUE 17] and [ISSUE 22] have both been addressed fully in mod_python 3.3.

PythonAutoReload Fixed On

Automatic module reloading enables Python code files to be reloaded from disk when they have changed. In production systems, it is recommended that this feature be turned off to avoid unexpected problems and to enable a small performance gain. To turn the feature off, the PythonAutoReload directive should be used in an appropriate Apache configuration file. The directive should specify a value of Off to disable the feature.

[ISSUE 18] Although this ability to turn off automatic module reloading worked in mod_python 2.7.11, it no longer works. Specifically, setting it to Off has no effect and it remains On. This issue is further described in JIRA as MODPYTHON-106.

<!> Note that [ISSUE 18] has been fixed in mod_python 3.2.

Transfer Of Module Data

In mod_python 3.2, mod_python.publisher has been modified to use its own module importing system in order to get around some of the previously described issues. A major difference with this module importing system is that when reloading modules it loads new instances of a module into a new object space, and not into the same object space as the existing module. This change in behaviour is needed to eliminate problems connected with [ISSUE 12] and [ISSUE 13].

[ISSUE 19] With this change of behaviour though, there is no longer a way, when using mod_python.publisher in mod_python 3.2, of properly destroying data in the old module or transferring data from the old instance of the module to the new. This may result in similar problems to those connected to [ISSUE 12], namely that resources may not be released correctly, with various possible consequences.

The only workaround at present for this issue is to ensure that such resources are not stored in modules that can be reloaded. In the long term, an ability to provide hook functions which can be called to purge data or mediate transfer of data from the old module to the new is required.
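
As a rough illustration of what such a hook might look like, the sketch below loads new code into a fresh module and then gives it a chance to take over data from the old instance. The hook name __migrate__ and the reload_with_hook function are invented for this example and do not correspond to any mod_python API.

```python
import types


def reload_with_hook(old_module, source):
    """Load `source` into a fresh module, then offer it the old instance.

    The hook name __migrate__ is made up for this sketch.
    """
    new_module = types.ModuleType(old_module.__name__)
    exec(source, new_module.__dict__)
    hook = getattr(new_module, "__migrate__", None)
    if hook is not None:
        hook(old_module)        # new code decides what to carry over
    return new_module


SOURCE = '''
pool = []

def __migrate__(old):
    # Reuse the existing pool instead of abandoning its resources.
    global pool
    pool = old.pool
'''

old = types.ModuleType("db")
exec(SOURCE, old.__dict__)
old.pool.append("connection-1")

new = reload_with_hook(old, SOURCE)
print(new is old)          # False: a distinct module object
print(new.pool)            # ['connection-1']: the shared resource survived
```

A complementary hook called on the old module could release resources that are not carried over; that variation is left out to keep the sketch short.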

<!> Note that [ISSUE 19] is addressed in mod_python 3.3, although it does require a special hook function to be provided to enable data to be migrated from the old instance of a module to the new.

Multiple Module Instances

[ISSUE 20] Because mod_python.publisher in mod_python 3.2 uses its own module importing system which does not store modules in sys.modules, if a particular module is loaded by mod_python.publisher and then later loaded using the import statement or the import_module() function, the latter instance will actually be a distinct copy, separate from the first.

This issue will cause problems where common data is stored in modules also containing published functions. Other modules containing published functions which try and import the module with the common data, will not get the original module but a distinct version and thus will not see changes made in the first by the published functions. Such common data will thus have to be moved into separate modules that are only imported using a single mechanism such as the import_module() function.

<!> Note that [ISSUE 20] has been addressed in mod_python 3.3 when the new module importer is used.

Per Request Module Cache

[ISSUE 12] highlights the problems of modules being reloaded into the same object space as an existing module. Namely that in a multithreaded MPM, such reloading of data into the existing object space may occur at the same time as a request handler in a distinct thread is trying to access the data.

[ISSUE 21] Similar to this issue is that within the context of a single request handler where a module is requested to be imported multiple times using the import_module() function, if the file on disk is modified between the time of two such calls, the latter will result in the module being reloaded. This means that within the context of a single request handler, code executing at different points will be operating on what are different versions of the same module.

In mod_python.publisher in mod_python 3.2 where module imports are done into distinct modules, this will mean that the two different bits of code will actually be holding onto different copies of the same module.

To avoid this issue, a per request module cache needs to be implemented such that if there are multiple requests to load the same module within the context of a single request, the first loaded is always used even if it were detected that the version on disk had changed in the interim.
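
A per request cache of this kind could be as simple as a dictionary that lives for the duration of one request. The class name RequestModuleCache and the stand-in loader are assumptions made for the sketch.

```python
class RequestModuleCache:
    """Pins the first version of each module seen during one request."""

    def __init__(self, loader):
        self._loader = loader     # callable performing the real import
        self._pinned = {}

    def import_module(self, name):
        if name not in self._pinned:
            self._pinned[name] = self._loader(name)
        return self._pinned[name]


versions = iter(["version-1", "version-2"])


def loader(name):
    # Stand-in importer: returns a newer "module" on every call, as if
    # the file changed on disk between calls.
    return next(versions)


request_cache = RequestModuleCache(loader)
first = request_cache.import_module("module1")
second = request_cache.import_module("module1")
print(first, second)      # version-1 version-1
```

A new cache object would be created for each request, so the next request still picks up the changed file.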

<!> Note that [ISSUE 21] has been addressed in mod_python 3.3 when the new module importer is used.



ModPython/Articles/ModuleImportingIsBroken (last edited 2006-10-01 00:48:59 by GrahamDumpleton)