Jon Parise's Blog, page 2
June 12, 2011
Android Text Links Using Linkify
The Android framework provides an easy way to automatically convert text
patterns into clickable links. By default, Android knows how to recognize web
URLs, email addresses, map addresses, and phone numbers, and it also includes
a flexible mechanism for recognizing and converting additional, custom text
patterns.
The Android Developers Blog has an article entitled Linkify your
Text! that provides a nice overview of the system. It discusses how the
Linkify class can be used to enable the default link patterns and
then continues with a more advanced WikiWords example that demonstrates
custom links. That article is a fine introduction to the system, so the rest
of this article will primarily focus on details not covered therein.
All of the examples in this article are based on the TextView widget. The
Linkify class can also be used to add links to Spannable text, but those
use cases won't be covered here because their usage is nearly identical to the
TextView cases.
TextView AutoLinking
The TextView widget features an android:autoLink attribute that
controls the types of text patterns that are automatically recognized and
converted to clickable links. This attribute is a convenient way to enable
one or more of the default link patterns because it can be configured directly
from a layout without involving any additional code.
However, for those cases where programmatically setting this value is useful,
the setAutoLinkMask() function exists.
There is one important caveat to using this "auto-linking" functionality,
however. It appears that when "auto-linking" is enabled, all additional
Linkify operations are ignored. It's unclear whether this behavior is
intentional or inadvertent, so it's possible things could change in future
releases of the Android SDK. Consider disabling "auto-linking" before using
any of the Linkify operations discussed below.
// Disable the text view's auto-linking behavior
textView.setAutoLinkMask(0);
Default Link Patterns
Enabling support for one of Android's default link patterns is very easy.
Simply use the addLinks(TextView text, int mask) function and
specify a mask that describes the desired link types.
import android.text.util.Linkify;
// Recognize phone numbers and web URLs
Linkify.addLinks(text, Linkify.PHONE_NUMBERS | Linkify.WEB_URLS);
// Recognize all of the default link text patterns
Linkify.addLinks(text, Linkify.ALL);
// Disable all default link detection
Linkify.addLinks(text, 0);
Custom Link Patterns
Detecting additional types of link patterns is easy, too. The
addLinks(TextView text, Pattern pattern, String scheme)
function detects links based on a regular expression pattern.
import java.util.regex.Pattern;
import android.text.util.Linkify;
// Detect US postal ZIP codes and link to a lookup service
Pattern pattern = Pattern.compile("\\d{5}([\\-]\\d{4})?");
String scheme = "http://zipinfo.com/cgi-local/zipsrch...";
Linkify.addLinks(text, pattern, scheme);
The text is scanned for pattern matches. Matches are converted to links that
are generated by appending the matched text to the provided URL scheme base.
Note that the scheme doesn't have to be an external web-like URL. It could
also be an Android Content URI that can be used in conjunction with a content
provider to reference application resources, for example.
Match Filters
Regular expressions are a very powerful way to match text patterns, but
sometimes a bit more flexibility is needed. The MatchFilter class
provides this capability by giving user code a chance to evaluate the link
worthiness of some matched text.
import java.util.regex.Pattern;
import android.text.util.Linkify;
import android.text.util.Linkify.MatchFilter;

// A match filter that only accepts odd numbers.
MatchFilter oddFilter = new MatchFilter() {
    public final boolean acceptMatch(CharSequence s, int start, int end) {
        int n = Character.digit(s.charAt(end - 1), 10);
        return (n & 1) == 1;
    }
};

// Match all digits in the pattern but restrict links to only odd
// numbers using the filter.
Pattern pattern = Pattern.compile("[0-9]+");
Linkify.addLinks(text, pattern, "http://...", oddFilter, null);
A more complex (but useful!) example would involve matching valid dates. The
regular expression could be generous enough to match strings like "2010-02-30"
(February 30, 2010), but a match filter could provide the logic to reject
bogus calendar dates.
Transform Filters
Up to this point, the final link has always been generated from the
exact matched text. There are many cases where that is not desirable,
however. For example, it's common to mention a username using the @username
syntax, but the resulting link should only include the username portion of
the text. The TransformFilter class provides a solution.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import android.text.util.Linkify;
import android.text.util.Linkify.TransformFilter;

// A transform filter that simply returns just the text captured by the
// first regular expression group.
TransformFilter mentionFilter = new TransformFilter() {
    public final String transformUrl(final Matcher match, String url) {
        return match.group(1);
    }
};

// Match @mentions and capture just the username portion of the text.
Pattern pattern = Pattern.compile("@([A-Za-z0-9_-]+)");
String scheme = "http://twitter.com/";
Linkify.addLinks(text, pattern, scheme, null, mentionFilter);
This approach uses the regular expression's capture syntax to extract just the
username portion of the pattern as a uniquely addressable match group.
Alternatively, the transform filter could just return all of the matched text
after the first character (@), but the above approach is nice because it
keeps all of the pattern's details within the regular expression.
Of course, transform filters can be combined with match filters for ultimate
flexibility. The Android SDK uses this approach to detect a wide range of
phone number formats (many of which include various parentheses and dashes)
while always generating a simplified link containing only digits.
Further Reading
For more information about the specific implementation details of Android's
link generation system, the best reference is actually the source code itself.
In addition to being a good resource for understanding the system, it's also
the best way to track down potential bugs or misunderstandings about how the
system is intended to be used.
Linkify.java - The Linkify class itself, including the MatchFilter
and TransformFilter implementations for the standard link types.
Regex.java - A collection of regular expressions and utility functions
used by Linkify to work with the standard link types.
December 14, 2010
Protocol Buffer Polymorphism
Protocol Buffers provide an efficient way to encode structured data for
serialization. The language's basic organizational type is a message, which
can be thought of as a C-style structure. It is named and contains some
number of fields. Messages can also be extended, but the method by which this
is accomplished differs from familiar C++ or Java-style inheritance. Instead,
message extension is implemented by reserving some number of field indices
in the base message for use by the extending messages.
message BaseType
{
    // Reserve field numbers 100 to 199 for extensions.
    extensions 100 to 199;

    // All other field numbers are available for use here.
    required string name = 1;
    optional uint32 quantity = 2;
}

extend BaseType
{
    // This extension can only use field numbers 100 to 199.
    optional float price = 100;
}
But can protocol buffers model more flexible inheritance and polymorphism
hierarchies?
Optional Message Fields
One approach to implementing polymorphism involves the use of optional message
fields defined in either the base message or in an extension. Each "subclass"
is defined as an independent message type that can optionally be included in
the top-level message.
message Cat
{
    optional bool declawed = 1;
}

message Dog
{
    optional uint32 bones_buried = 1;
}

message Animal
{
    required float weight = 1;

    optional Dog dog = 2;
    optional Cat cat = 3;
}
This composition scheme is simple but has the property of allowing multiple
nested message types to be filled out at the same time. While desirable in
some contexts, this means that the deserialization code must test for the
availability of each and every optional message type (unless an additional
type hint field is added). It could also be considered less than "pure"
as a result.
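In code, that per-field testing looks something like the following Python
sketch, assuming these definitions are compiled into an animals_pb2 module
(the same module name used in the later examples):

from animals_pb2 import Animal

# Build and serialize an Animal carrying a Cat submessage.
animal = Animal()
animal.weight = 4.2
animal.cat.declawed = True
data = animal.SerializeToString()

# On the receiving side, every optional submessage must be tested.
animal = Animal()
animal.ParseFromString(data)
if animal.HasField('cat'):
    print('cat, declawed =', animal.cat.declawed)
elif animal.HasField('dog'):
    print('dog, bones_buried =', animal.dog.bones_buried)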
Embedded Serialized Messages
A second approach simply "embeds" the subclass's serialized contents within a
bytes field in the parent message. An explicit type field informs the
deserialization code of the embedded message's type.
message Animal
{
    enum Type
    {
        Cat = 1;
        Dog = 2;
    }

    required Type type = 1;
    required bytes subclass = 2;
}
This brute-force approach to the problem, while effective, is both inelegant
and inefficient. It can be useful where it is desirable to defer the
deserialization of the embedded message, however, such as in routing systems
that are only interested in decoding the outer enveloping message's fields.
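A minimal sketch of both sides of that exchange using the Python API, again
assuming generated animals_pb2 definitions (and assuming the standalone Cat
message from the previous section is also available):

from animals_pb2 import Animal, Cat

# Pack: serialize the subclass into the envelope's bytes field.
cat = Cat()
cat.declawed = True

animal = Animal()
animal.type = Animal.Cat
animal.subclass = cat.SerializeToString()
data = animal.SerializeToString()

# Unpack: decode the envelope now; the payload can be decoded later.
animal = Animal()
animal.ParseFromString(data)
if animal.type == Animal.Cat:
    cat = Cat()
    cat.ParseFromString(animal.subclass)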
Nested Extensions
The final (and most generally recommended) approach uses a combination of
nested messages and extensions to implement polymorphic message types.
This works by having each subclass reference itself from within a nested
This works by having each subclass reference itself from within a nested
extension of the base type. An explicit type field is still required to guide
the deserialization process.
message Animal
{
    extensions 100 to max;

    enum Type
    {
        Cat = 1;
        Dog = 2;
    }

    required Type type = 1;
}

message Cat
{
    extend Animal
    {
        required Cat animal = 100; // Unique Animal extension number
    }

    // These fields can use the full number range.
    optional bool declawed = 1;
}

message Dog
{
    extend Animal
    {
        required Dog animal = 101; // Unique Animal extension number
    }

    // These fields can use the full number range.
    optional uint32 bones_buried = 1;
}
It may not be immediately obvious how to work with this message structure, so
here's an example using the Python API:
from animals_pb2 import *

# Construct the polymorphic base message type.
animal = Animal()
animal.type = Animal.Cat

# Create the subclass type by referencing the appropriate extension type.
# Note that this uses the self-referential field (Cat.animal) from within
# the nested message extension.
cat = animal.Extensions[Cat.animal]
cat.declawed = True

# Serialize the complete message contents to a string.  It will end up
# looking roughly like this: [ type [ declawed ] ]
data = animal.SerializeToString()

# ---

# Unpack the serialized bytes.
animal = Animal()
animal.ParseFromString(data)

# Determine the appropriate extension type to use.
extension_map = { Animal.Cat: Cat.animal, Animal.Dog: Dog.animal }
extension = animal.Extensions[extension_map[animal.type]]
This approach tends to be the most efficient and robust implementation, albeit
a bit non-obvious to protocol buffer newcomers.
Conclusion
As shown above, there are multiple ways to implement polymorphic message
behavior using protocol buffers. Like most things in software, the approaches
differ in complexity and efficiency.
It's a little unfortunate that there isn't a more obvious language syntax for
expressing inheritance. For example, many programmers would likely find these
definitions familiar:
message Animal { ... }
message Cat : Animal { ... }
message Dog : Animal { ... }
There are of course open questions as to how this syntax would map to the
underlying concepts of messages and extensions, but it shouldn't be too
difficult to enforce things like extension number ranges, etc. during the
compilation process. Perhaps something along these lines will be undertaken
by a motivated developer or an ambitious Google Summer of Code student.
October 23, 2010
Reloading Python Modules
Being able to reload code modules is one of the many nice features of
Python. This allows developers to modify parts of a Python application
while the interpreter is running. In general, all that needs to be done is
pass a module object to the imp.reload() function (or
just reload() in Python 2.x), and the module will be reloaded
from its source file.
There are a few potential complications, however.
If any other code references symbols exported by the reloaded module, they may
still be bound to the original code. For example, imagine if module A
contains the constant INTERVAL = 5, and module B imports that constant
into its namespace (from A import INTERVAL). If we change the constant to
INTERVAL = 10 and just reload module A, any values in module B that were
based on INTERVAL won't be updated to reflect its new value.
The solution to this problem is to also reload module B. But it's important
to only reload module B after module A has been reloaded. Otherwise, it
won't pick up the updated symbols.
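To make that ordering requirement concrete, here's a minimal sketch (the
module names a and b are hypothetical):

# a.py
INTERVAL = 5

# b.py
from a import INTERVAL
TIMEOUT = INTERVAL * 2

# interactive session, after editing a.py to read INTERVAL = 10
import imp
import a, b

imp.reload(a)       # a.INTERVAL is now 10 ...
print(b.TIMEOUT)    # ... but b.TIMEOUT is still 10 (based on the old value)
imp.reload(b)       # reloading b after a rebinds its copy of INTERVAL
print(b.TIMEOUT)    # 20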
PyUnit deals with a variation of this problem by introducing a rollback
importer. That approach "rolls back" the set of imported modules
to some previous state by overriding Python's global __import__ hook.
PyUnit's solution is effective at restoring the interpreter's state to
pre-test conditions, but it's not a general solution for live code reloading
because the unloaded modules aren't automatically reloaded.
The following describes a general module reloading solution which aims to make
the process automatic, transparent, and reliable.
Recording Module Dependencies
It is important to understand the dependencies between loaded modules so that
they can be reloaded in the correct order. The ideal solution is to build a
dependency graph as the modules are loaded. This can be accomplished by
installing a custom import hook that is called as part of the regular module
import machinery.
import builtins

_baseimport = builtins.__import__
_dependencies = dict()
_parent = None

def _import(name, globals=None, locals=None, fromlist=None, level=0):
    # Track our current parent module.  This is used to find our current
    # place in the dependency graph.
    global _parent
    parent = _parent
    _parent = name

    # Perform the actual import using the base import function.
    m = _baseimport(name, globals, locals, fromlist, level)

    # If we have a parent (i.e. this is a nested import) and this is a
    # reloadable (source-based) module, we append ourself to our parent's
    # dependency list.
    if parent is not None and hasattr(m, '__file__'):
        l = _dependencies.setdefault(parent, [])
        l.append(m)

    # Lastly, we always restore our global _parent pointer.
    _parent = parent

    return m

builtins.__import__ = _import
This code chains the built-in __import__ hook (stored in _baseimport). It
also tracks the current "parent" module, which is the module that is
performing the import operation. Top-level modules won't have a parent.
After a module has been successfully imported, it is added to its parent's
dependency list. Note that this code is only interested in file-based
modules; built-in extensions are ignored because they can't be reloaded.
This results in a complete set of per-module dependencies for all modules that
are imported after this custom import hook has been installed. These
dependencies can be easily queried at runtime:
def get_dependencies(m):
    """Get the dependency list for the given imported module."""
    return _dependencies.get(m.__name__, None)
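With the hook installed, those recorded dependencies are easy to inspect; a
quick sketch (assuming xml.dom.minidom hadn't already been imported before
the hook was installed):

import xml.dom.minidom   # triggers several nested imports

for dep in get_dependencies(xml.dom.minidom) or []:
    print(dep.__name__)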
Reloading Modules
The next step is to build a dependency-aware reload() routine.
import imp

def _reload(m, visited):
    """Internal module reloading routine."""
    name = m.__name__

    # Start by adding this module to our set of visited modules.  We use
    # this set to avoid running into infinite recursion while walking the
    # module dependency graph.
    visited.add(m)

    # Start by reloading all of our dependencies in reverse order.  Note
    # that we recursively call ourself to perform the nested reloads.
    deps = _dependencies.get(name, None)
    if deps is not None:
        for dep in reversed(deps):
            if dep not in visited:
                _reload(dep, visited)

    # Clear this module's list of dependencies.  Some import statements
    # may have been removed.  We'll rebuild the dependency list as part
    # of the reload operation below.
    try:
        del _dependencies[name]
    except KeyError:
        pass

    # Because we're triggering a reload and not an import, the module
    # itself won't run through our _import hook.  In order for this
    # module's dependencies (which will pass through the _import hook) to
    # be associated with this module, we need to set our parent pointer
    # beforehand.
    global _parent
    _parent = name

    # Perform the reload operation.
    imp.reload(m)

    # Reset our parent pointer.
    _parent = None

def reload(m):
    """Reload an existing module.

    Any known dependencies of the module will also be reloaded."""
    _reload(m, set())
This reload() implementation uses recursion to reload all of the requested
module's dependencies in reverse order before reloading the module itself. It
uses the visited set to avoid infinite recursion should individual modules'
dependencies cross-reference one another. It also rebuilds the modules'
dependency lists from scratch to ensure that they accurately reflect the
updated state of the modules.
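Usage then mirrors the built-in function; a sketch, assuming this code lives
in a module named reloader (as in the later examples):

import reloader
import mymodule             # a hypothetical application module

# ... mymodule.py and its imports change on disk ...
reloader.reload(mymodule)   # dependencies reload first, then mymodule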
Custom Reloading Behavior
The module being reloaded may wish to implement some custom reloading logic
as well. For example, it may be useful to reapply some pre-reload state to
the reloaded module. To support this, the reloader looks for a module-level
function named __reload__(). If present, this function is called after a
successful reload with a copy of the module's previous (pre-reload)
dictionary.
Instead of simply calling imp.reload(), the code expands to:
# If the module has a __reload__(d) function, we'll call it with a
# copy of the original module's dictionary after it's been reloaded.
callback = getattr(m, '__reload__', None)
if callback is not None:
    d = _deepcopy_module_dict(m)
    imp.reload(m)
    callback(d)
else:
    imp.reload(m)
The _deepcopy_module_dict() helper routine exists to avoid deepcopy()-ing
unsupported or unnecessary data.
def _deepcopy_module_dict(m):
    """Make a deep copy of a module's dictionary."""
    import copy

    # We can't deepcopy() everything in the module's dictionary because
    # some items, such as '__builtins__', aren't deepcopy()-able.
    # To work around that, we start by making a shallow copy of the
    # dictionary, giving us a way to remove keys before performing the
    # deep copy.
    d = vars(m).copy()
    del d['__builtins__']
    return copy.deepcopy(d)
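For example, a module could use __reload__() to carry expensive state across
reloads; a sketch (the cache variable is illustrative):

# mymodule.py
_cache = {}

def __reload__(old):
    # Preserve the previously-populated cache instead of starting empty.
    global _cache
    _cache = old.get('_cache', _cache)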
Monitoring Module Changes
A nice feature of a reloading system is automatic detection of module changes.
There are many ways to monitor the file system for source file changes. The
approach implemented here uses a background thread and the stat()
system call to watch each file's last modification time. When an updated
source file is detected, its filename is added to a thread-safe queue.
import os, sys, time
import queue, threading

_win32 = (sys.platform == 'win32')

class ModuleMonitor(threading.Thread):
    """Monitor module source file changes"""

    def __init__(self, interval=1):
        threading.Thread.__init__(self)
        self.daemon = True
        self.mtimes = {}
        self.queue = queue.Queue()
        self.interval = interval

    def run(self):
        while True:
            self._scan()
            time.sleep(self.interval)

    def _scan(self):
        # We're only interested in file-based modules (not C extensions).
        # (list() copies the values because other threads may import
        # modules while we're scanning.)
        modules = [m.__file__ for m in list(sys.modules.values())
                   if '__file__' in m.__dict__]

        for filename in modules:
            # We're only interested in the source .py files.
            if filename.endswith('.pyc') or filename.endswith('.pyo'):
                filename = filename[:-1]

            # stat() the file.  This might fail if the module is part
            # of a bundle (.egg).  We simply skip those modules because
            # they're not really reloadable anyway.
            try:
                stat = os.stat(filename)
            except OSError:
                continue

            # Check the modification time.  We need to adjust on Windows.
            mtime = stat.st_mtime
            if _win32:
                mtime -= stat.st_ctime

            # Check if we've seen this file before.  We don't need to do
            # anything for new files.
            if filename in self.mtimes:
                # If this file's mtime has changed, queue it for reload.
                if mtime != self.mtimes[filename]:
                    self.queue.put(filename)

            # Record this filename's current mtime.
            self.mtimes[filename] = mtime
An alternative approach could use a native operating system file monitoring
facility, such as the Win32 Directory Change Notification system.
The Reloader object polls for source file changes and reloads modules as
necessary.
import queue, sys
import reloader

class Reloader(object):

    def __init__(self):
        self.monitor = ModuleMonitor()
        self.monitor.start()

    def poll(self):
        filenames = set()
        while not self.monitor.queue.empty():
            try:
                filenames.add(self.monitor.queue.get_nowait())
            except queue.Empty:
                break
        if filenames:
            self._reload(filenames)

    def _reload(self, filenames):
        modules = [m for m in sys.modules.values()
                   if getattr(m, '__file__', None) in filenames]
        for mod in modules:
            reloader.reload(mod)
In this model, the reloader needs to be polled periodically for it to react to
changes. The simplest example would look like this:
r = Reloader()

while True:
    r.poll()
    time.sleep(1)
The complete source code is on GitHub. The package distribution is
available as reloader on the Python Package Index.
May 15, 2010
Twisted Python and Bonjour
Bonjour (formerly Rendezvous) is Apple's service discovery protocol. It
operates over local networks via multicast DNS. Server processes announce
their availability by broadcasting service records and their associated ports.
Clients browse the network in search of specific service types, potentially
connecting to the service on the advertised port using the appropriate network
protocol for that service.
A common example of Bonjour in action is iTunes' music library sharing feature. iTunes...
Classless in-addr.arpa. Delegation
Classless in-addr.arpa. delegation allows network administrators to provide
authoritative reverse DNS on subnets that don't fall on octet boundaries.
This is especially useful for subnets with fewer than eight bits in the
host portion of the address (i.e. smaller than a class C).
There are two important things to remember: first, we're dealing with classless subnets, meaning they don't align themselves neatly with IPv4's octet boundaries (like a class A, B, C, D, or E network); and...
Virtual Ethernet Tunneling
This paper discusses the implementation of virtual Ethernet tunnels using
OpenBSD. The current release of OpenBSD at the time of writing (2001) was
version 2.9, so some of the material may be fairly dated. I haven't revisited
the details since then.
Without going too deep into the technical details, a virtual Ethernet tunnel uses packet encapsulation, Ethernet bridging, and IPSec encryption to tunnel a subnet from one host to another host over a public network (generally, the...
Vim Color Schemes
The Vim text editor supports highly configurable color schemes which build
upon the editor's rich syntax highlighting system. The stock Vim distribution
includes a number of color schemes, and many more are available from the Vim
Scripts repository.
Color scheme definitions are simply normal Vim scripts that live in the
colors/ directory of the Vim runtime hierarchy (see :help runtimepath).
Color schemes are loaded using the :colorscheme command. The scheme's name is determined by the...