Aquifers Feed Community Development Beyond the Garden
Healthy civic engagement is often based around an emergent collective influence, a momentum from the coalescence of many individuals’ activity. As developers of civic software this is a concept we think of often and try to embrace, especially within the context of a dense urban environment. Such environments are made up of complex systems of networked activity, multitudes of input and output. Like the internet, these systems are really networks of networks. In fact, one might view a city as a macrocosm of its own biological systems – the nested ecologies and individuals that comprise it. Viewed from such a framework, a biological system as advanced as the human body provides an excellent example of emergence in the way our senses are brought together into consciousness and perception of the world. Bringing the metaphor full circle, one might ask: What if we could bring consciousness to a city?
Now I don’t mean to suggest collective consciousness, but instead by analogy I’m alluding to a pragmatic unification of information about the activity within a community. This of course has been the aspiration of many for quite some time and it’s approached in various contexts at different scales: Everyblock/Outside.in, The Facebook Platform, The Open Stack for the Web, The Semantic web, the Noosphere, Kurzweilian Singularity, Lolcats, and so on. So how exactly does one go about collecting all the data of activity in a city and turn it into a cohesive and meaningful stream of intelligence? Get all of the data and filter it. Trivial? Not quite, but bear with me: What turns consciousness and human perception into intelligence is the careful filtering of an overwhelming abundance of sensory input and cognitive activity, finding meaning in the data. With the amount of information noise that’s been introduced in the world today, robust filtering must be a fundamental function of the social web and wired communities in general. Otherwise we will be mired in the fog of infoglut, salad sentences of schizophasia, silty streams of consciousness, mixed metaphors, overwhelming alliteration, annoying hyperbole, and the list goes on. This, I believe, is where a friendly little app called Melkjug has potential to change the world.
Melkjug is a news feed reader like Google Reader, but focused on filters. Plus, it’s open source! Melkjug lets you tune your reading experience with a wide array of filters, avoiding Google Reader’s inbox model so you don’t have to stress about those 2493 unread articles. It also lets you collaborate with the Melkjug community by using the filters and filtered content produced by others. In the near future Melkjug will also employ true collaborative filtering where the preferences of others in the community can seamlessly act as a filter for you.
Since Melkjug is built around the Atom and RSS standards it can be used with anything publishing a feed, not just the feeds that we typically think of as news feeds. Another type of feed is an activity stream, a feed of a person’s online activity, as you would see aggregated by applications like the Facebook News Feed or FriendFeed. As you can see in the sidebar, this model has already been employed on this Wordpress-powered blog by simply using a Wordpress plugin that imports a Melkjug feed which is filtering and aggregating a variety of relevant activity sources – Twitter, Delicious, MediaWiki, SVN, Trac, etc. Activity streams (examples) are currently undergoing a standardization process where the grammar of each action is clearly defined. In addition to increased interoperability across the web, a standard with articulate grammar will make it even easier to filter activity streams with high fidelity.
What’s exciting about using Melkjug with activity streams is that it functions as both a muxer and a demuxer: it both combines and separates. This property and the way that Melkjug allows one to tweak their signals and EQ to get some cool feedback effects made me realize that the technology could be even more useful if used for both pre-processing and post-processing of feeds. For example, instead of advertising just the feed Flickr provides of your photos, you could provide some custom preset feeds; then, when you import those into Facebook, you could home in further, filtering as needed. I see several advantages to distributed filters: they could simplify the user interfaces for filtering by keeping the appropriate UI in the most relevant context, they might help load balancing or caching on servers, and they might even help develop standard UI conventions for content filtering throughout the web. I don’t mean to suggest feeds should always be pre-filtered. Once privacy has been taken into account, raw data feeds should be available, but often it’s nice to have some well tuned presets to choose from. A distributed input/output filtering model might also help us to consider the nebulous relevance issues and the sensitive privacy issues within the same framework. Bring on the OAuth.
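To make the muxer/demuxer idea a little more concrete, here is a minimal Python sketch of combining feeds, splitting them back apart, and filtering at both the publishing and the reading end. It is not Melkjug's code or API; the Entry type, the function names, and the sample entries are all invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable


@dataclass(frozen=True)
class Entry:
    """One item in a feed (hypothetical shape, not Melkjug's data model)."""
    source: str
    title: str
    published: datetime


Filter = Callable[[Entry], bool]


def mux(*feeds: Iterable[Entry]) -> list[Entry]:
    """Combine several feeds into one stream, newest first."""
    merged = [entry for feed in feeds for entry in feed]
    return sorted(merged, key=lambda e: e.published, reverse=True)


def demux(stream: Iterable[Entry]) -> dict[str, list[Entry]]:
    """Split one stream back into per-source streams."""
    by_source: dict[str, list[Entry]] = {}
    for entry in stream:
        by_source.setdefault(entry.source, []).append(entry)
    return by_source


def apply_filters(stream: Iterable[Entry], *filters: Filter) -> list[Entry]:
    """Keep only the entries that pass every filter."""
    return [e for e in stream if all(f(e) for f in filters)]


def from_sources(*names: str) -> Filter:
    allowed = set(names)
    return lambda entry: entry.source in allowed


def mentions(keyword: str) -> Filter:
    needle = keyword.lower()
    return lambda entry: needle in entry.title.lower()


# Made-up sample entries, purely for demonstration.
twitter = [
    Entry("twitter", "Lunch downtown", datetime(2009, 5, 1, 12, 0)),
    Entry("twitter", "Transit board meeting tonight", datetime(2009, 5, 1, 17, 0)),
]
trac = [Entry("trac", "Ticket closed: transit map bug", datetime(2009, 5, 1, 15, 0))]

combined = mux(twitter, trac)
# Pre-processing: the publisher offers a preset feed.
preset = apply_filters(combined, from_sources("twitter", "trac"))
# Post-processing: the reader tunes that preset further.
tuned = apply_filters(preset, mentions("transit"))
```

Because each filter is just a function of one entry, the same filter can run on either side of the feed, which is the property the distributed filtering model relies on.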
At this point in time, any discussion about activity streams is incomplete without referring to Facebook, as it is perhaps most responsible for creating and disseminating the concept of activity streams as we know them today. Activity streams are also an important part of Facebook’s success. The Facebook News Feed has significantly evolved in a short period of time and it’s been through several inspiration-innovation feedback loops with services like FriendFeed, Twitter, Tumblr, etc. But the most recent Facebook “redesign” fundamentally re-engineered many of the activity stream concepts Facebook had pioneered. These concepts include publishing more ambient/implicit activity, using a relevance algorithm filter (with an equalizer fader UI for filtering similar to Melkjug) and, some might argue, a well defined layout and visual hierarchy that made a lot of information reasonably easy to digest. Since the default News Feed now contains every item from every contact, it can be somewhat difficult to get a grasp of what has happened in the past day – a single screen of the feed might only show a few hours of activity, whereas before the relevance algorithm typically filtered it down to about a day. The filtering options that we are now left with are somewhat subtle, fragmented throughout the interface, and few overall. There are also a number of options that Facebook entirely removed with their redesign. These include the feed source equalizer UI, which gave more flexible control than the current linear ordering of feed sources. I’m really not sure why this was so hidden before and why it’s now been entirely removed; it came across as a compelling feature. Also gone is the ability to edit the prominence of certain items in your feed – there used to be a summary versus detail option. Until recently Facebook had even taken out some of the implicit activity such as people tagged in photos.
The filtering options that remain can be summarized as 1) creating groups of (whitelisted) contacts 2) filtering by a particular app/feed source 3) ordering these groups/sources and 4) creating a blacklist of contacts. For anyone who is subscribed to multiple mailing lists and manages a high volume of email traffic, it’s easy to understand the potential frustration of being limited by such a small set of filtering options. After all, filtering, whether by some well tuned algorithm like PageRank, personalized settings, recommendations, or editorial control, is the only way we’ve ever been able to make meaningful sense of the nearly infinite information available on the web.
After critiquing the current Facebook News Feed, I should acknowledge an important non-tech point regarding its development. Initially, the feed was perceived as being forced upon a community that really wasn’t asking for it, including many who vehemently opposed its introduction, claiming that it was akin to a tool for stalkers. Yet quickly the complaints fizzled out and the News Feed ended up creating a dramatically broader culture of engagement and openness on the site. Society as seen through Facebook has stepped up to a new level of transparency and open dialogue. Yet even while many technologists and early adopters attempt to share absolutely everything online, it must be acknowledged that Facebook has largely been allowed to succeed because of its rigorous privacy settings. Some people have no qualms with radical transparency, but many more live within multiple communities that don’t always mix well (e.g. family/friends/colleagues). So again, the importance of balancing the privacy/relevant-openness spectrum speaks to the need for a unifying framework to filter both the input and output of our social data.
What’s also compelling about all the filtering metaphors (equalizers, faders, feedback loops, signal-to-noise ratios, etc) in regard to society is that they allude to the sounds of many mixed down into one and the rip-and-remix ethos of free culture. It could be argued that activity streams demonstrate an increasing openness of ideas in our culture throughout multiple mediums.
Similarly, with the advent of organizations like the Sunlight Foundation and people like President Obama, Aneesh Chopra, and Vivek Kundra in the White House, the open flow of civic information will be quick to fill the airwaves, if not flood the activity streams. New fire hoses such as these again present a challenge: I’m really only interested in participating in the most relevant interactions with our civic system – citing problems and suggesting solutions for my place in this democratic society. Without fine tuning the filters on my neighborhood activity streams with a tool like Melkjug, I don’t think even hyperlocal sources like Everyblock or Outside.in will be very relevant to me. In fact, without smart tools like Melkjug I’ll likely filter everything out.
Note: Neither Melkjug nor the activity streams standard has reached its 1.0 release; both are still in development. Melkjug does not currently provide any features that specifically target activity streams as that was not its original intent, but that didn’t stop us from creating a jug of our feeds from other sites for the activity stream you see on this site.
I’ve been having some discussions with people at the Chicago Open Government group, talking about data openness. One common complaint all around is about data exported as PDFs. The particular topic we were discussing was TIFs. TIFs (Tax Increment Financing) are something a city can use to try to improve a neighborhood, and fund the improvements with the increased tax revenue from rising property values in those neighborhoods. These are used in many cities, and seem generally surrounded by an air of controversy.
Chicago in particular recently passed a law to open up TIF information. How the data is opened up isn’t specified that closely, and it probably will be via published PDFs, along with some shape files to define the neighborhoods. (Shape files seem relatively easy to obtain, probably because they are already most easily managed electronically.)
There was some vague talk about opening up the data as XML… but what would that even mean? To be fair to the city, the TIFs are actually defined by documents, and a PDF is a relatively accurate representation.
In general this idea of “XML” is confusing. XML is just a syntax for holding structured data. But there’s no particular structure that this data should conform to. There is MathML for talking about mathematical equations. There is KML for geographical information. But there’s no TIFML, no PolicyML, no GovernmentML. Though, somewhat surprisingly to me, there is a government sponsored StrategyML and what appears to be an aborted attempt at PlanningML. I have reservations about any markup language, which I’ll discuss below, but if people want these documents in StrategyML then that would mean something; “XML” on its own is not that meaningful.
What is the purpose of opening up TIF data? Maybe:
There is some budgeting data that would be an excellent candidate for a structured presentation. But a substantial portion of the information is not structured. The charter has no structure, it is a narrative document. It is also essential context to understanding anything else. You can’t say that the budget is too big or that any one item is wasteful, except in relation to the purpose of the TIF, and that charter defines the purpose. A TIF zone set up to encourage tourism should be managed much differently than a zone where they are fighting urban blight, or encouraging light industry, or pursuing transit-oriented development.
Also there is the simple question of fact. A TIF is a political entity, set up by politicians, and it is a formal agreement. All the people involved work with documents. They do not write markup. The document means what was on paper. Extracting underlying semantics is not true to the process itself. (In this I am quite influenced by the principles of Microformats.)
So, what to do? The answer I see is one of annotation, not structure. The document should be posted in as accessible a manner as can also be accurate. HTML, preferably as simple as possible, is an excellent candidate, nearly as representative as PDF but more accessible (though PDF allows you to guard against OCR errors by keeping the original scan more present). From there portions of the document should be tagged. If there is a commitment from the city, tag it as such. If there is an expected outcome, tag that. Make the document easy to reference in granular pieces, so people can discuss the details.
At some point there’s either a story worth telling with the data, or there isn’t. The story may be one of success, or one of corruption, or simply one that puts TIF financing in context with a city budget. But there’s no one answer about what you will get out of this information. You can’t dump all this data into a computer and tell it how well things are going. Structure involves a rebuilding of the data, but when we don’t know why we want to rebuild the data, when we don’t know what we want to know, I believe the more distributed notion of annotation is a better fit.
Here at TOPP I’ve been working on ways to better present and manipulate New York City bus scheduling data.
Google Transit has the MTA’s bus timing data so that they can do trip planning. But the MTA for some reason won’t let them release it. I filed a FOIL request with the MTA for that data. They didn’t send me GTFS data (the format Google uses to describe schedules), which I assume Google generates itself. But they did send me the complete (I hope) schedule and route data for New York City Transit (I’m still waiting for MTA Bus Company data).
The data is in a weird format — a text file with fixed-width undelimited fields. This required some work to parse. And we’ve still got a lot of unknown fields. I suspect that many of the unknown fields are for MTA internal use, but every time I think I understand everything we need to know, I discover that there’s something I missed.
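For anyone who hasn't dealt with fixed-width files before, the parsing boils down to slicing each line at known character offsets. The Python sketch below shows the general technique only: the field names, offsets, widths, and the sample line are all hypothetical and are not the MTA's actual layout.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    start: int  # zero-based offset of the field's first character
    width: int  # number of characters in the field


# Hypothetical layout for illustration; the real file's columns differ.
LAYOUT = [
    Field("route", 0, 4),
    Field("stop_id", 4, 6),
    Field("departure", 10, 4),  # assumed HHMM
    Field("unknown_1", 14, 3),  # unknown fields are kept rather than dropped
]


def parse_line(line: str, layout=LAYOUT) -> dict:
    """Slice one fixed-width line into named fields, trimming the padding."""
    return {
        f.name: line[f.start:f.start + f.width].strip()
        for f in layout
    }


# Made-up sample line matching the hypothetical layout above.
sample = "M14 " + "400123" + "0715" + "X9 "
record = parse_line(sample)
```

Keeping the unknown fields in the layout, even under placeholder names, makes it easier to revisit them later when one of them turns out to matter.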
The one thing I know I don’t know is how to figure out what the names are on routes that have more than one path. For example, the M14 [PDF] can go along Ave A or along Ave D, and the headsign on the real bus says M14A or M14D, depending on what path it’s on. The data just says M14 A-C-D (where C is an obsolete route, according to Wikipedia). It does have a list of headsigns (that’s the sign at the top of the bus), but they don’t specify a letter. Google can’t figure this out either — if you ask Google for a public transit route from (say) 14 St at 7th Ave to 14th St at Ave A, it will give you a bus route called M14AD.
Want to help? Here are the tools. You’ll need the data from the MTA. We would love to give you the whole thing, but we’re not sure the MTA would approve (even though it would be noninfringing). But here is a data file for the M14.
We could also just manually enter path letters for headsign ids. This would be a hassle, but if you’re bored, this would be a useful task — don’t worry about what format it’s in, just send us a spreadsheet or a patch or something. Even if you know nothing about computers, you can probably do this. Just make some sort of mapping from route id to path letter.
A few months ago, while a solution for the MTA budget shortfall was being debated by the New York State Senate, The Open Planning Project helped parse MTA budget data into a machine-searchable format. The MTA originally published the budget as a PDF. To extract the data I used a utility called pdftohtml to convert it into an XML document. I then used the Python library lxml to convert the document into a set of CSV files. The results of this labor can be seen on TOPP’s data site.
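A rough sketch of the XML-to-CSV step might look like the following. It is not the original script. It uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra dependencies, it assumes the page/text layout with top and left attributes that pdftohtml writes in XML mode, and the sample document and its values are invented.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample in the general shape of pdftohtml's XML output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="80" height="12" font="0">Item</text>
<text top="100" left="300" width="40" height="12" font="0">Amount</text>
<text top="120" left="50" width="80" height="12" font="0"><b>Example line</b></text>
<text top="120" left="300" width="40" height="12" font="0">1,234</text>
</page>
</pdf2xml>"""


def rows_from_page(page):
    """Group text fragments that share a 'top' value into one row, ordered by 'left'."""
    by_top = defaultdict(list)
    for text in page.iter("text"):
        content = "".join(text.itertext()).strip()  # also picks up text inside <b>, <i>
        if content:
            by_top[int(text.get("top"))].append((int(text.get("left")), content))
    return [[cell for _, cell in sorted(cells)] for _, cells in sorted(by_top.items())]


def xml_to_csv(xml_string):
    """Convert every page of the XML document into CSV rows."""
    root = ET.fromstring(xml_string)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for page in root.iter("page"):
        writer.writerows(rows_from_page(page))
    return out.getvalue()


csv_text = xml_to_csv(SAMPLE_XML)
```

Real PDFs are messier than this: fragments on the same visual line can differ by a pixel or two in their top value, so a tolerance when grouping rows is usually needed.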
Soon after I published this data I was told by a number of people that the data would be more useful if presented in another format. At first I just started creating a bunch of command line python scripts that would suck in these csv files and spit them out in different formats. I quickly realized that I could accumulate these scripts and create a quick and dirty web application.
Over a few train rides I created an application called DataIO, and this week I finally got a chance to upload it to Google App Engine. Specifically I received three requests for data in different formats. I’ll give examples using the data set containing the MTA’s annual labor expenses.
http://www.dataio.org/data/Wfb?format=flot&base_column=0&base_row=0
The “base_column” query string parameter represents the column in the CSV file that will be used for the legend of the graph. The “base_row” represents the row in the CSV file that contains the values for the x-axis of the graph.
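To illustrate how those two parameters could map a CSV file onto flot's series format (a list of objects, each with a label and a list of [x, y] pairs), here is a small Python sketch. It is an assumption about the general approach rather than DataIO's actual code, and the sample CSV is made up.

```python
import csv
import io
import json


def csv_to_flot(csv_text, base_column=0, base_row=0):
    """Turn CSV text into a JSON string of flot series.

    base_column: column whose cells become the series labels (the legend).
    base_row: row whose cells supply the x-axis values.
    Assumes every cell outside base_column is numeric.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    x_values = [float(v) for i, v in enumerate(rows[base_row]) if i != base_column]
    series = []
    for r, row in enumerate(rows):
        if r == base_row:
            continue  # the x-axis row is not a data series itself
        y_values = [float(v) for i, v in enumerate(row) if i != base_column]
        series.append({
            "label": row[base_column],
            "data": [[x, y] for x, y in zip(x_values, y_values)],
        })
    return json.dumps(series)


# Made-up sample data.
SAMPLE_CSV = "Category,2008,2009\nExample A,1,2\nExample B,3,4\n"
flot_json = csv_to_flot(SAMPLE_CSV, base_column=0, base_row=0)
```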
It’s not obvious how that JSON will display, so DataIO allows you to preview the graph by adding a “preview” query string argument:
http://www.dataio.org/data/Wfb?format=flot&base_column=0&base_row=0&preview=true
http://www.dataio.org/data/Wfb?format=gchart_line&base_column=0&base_row=0
which returns the URL for the following image:
http://www.dataio.org/data/Wfb?format=html&multiplication_factor=1000000&multiplication_start_row=1
or in millions of Euros:
http://www.dataio.org/data/Wfb?format=html&multiplication_factor=0.734&multiplication_start_row=1
The number to multiply by is sent in via the multiplication_factor argument and the multiplication_start_row tells DataIO not to multiply the first row by the factor.
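The scaling itself is simple enough to sketch in a few lines of Python. This is a guess at the behavior described above, not DataIO's implementation: it assumes rows before multiplication_start_row are passed through untouched and that non-numeric cells (such as labels) are left alone. The sample rows are invented.

```python
def scale_rows(rows, multiplication_factor, multiplication_start_row=0):
    """Multiply numeric cells by a factor, starting at the given row index."""

    def scale(cell):
        try:
            return float(cell) * multiplication_factor
        except ValueError:
            return cell  # labels and other non-numeric cells pass through

    return [
        [scale(cell) for cell in row] if i >= multiplication_start_row else list(row)
        for i, row in enumerate(rows)
    ]


# Made-up sample: a header row followed by one data row in millions.
rows = [["Category", "2008"], ["Example", "1.5"]]
scaled = scale_rows(rows, multiplication_factor=1000000, multiplication_start_row=1)
```

Starting at row 1 is what keeps a header of years such as "2008" from being multiplied along with the data.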
A complete list of query string arguments that can be used to interact with DataIO is located on its front page. The code for this application is hosted at bitbucket.