One of my bestest friends from college, Megan Wilde, just had a piece published in Salon:
Congratulations Meg!!!
Posted in general.
– July 1, 2009
My Mom and I recently visited the Frisco, TX City Hall, because she said they had a lot of art on display. A lot of it was the cheesy folksy stuff that one might expect in a medium-size city’s City Hall, with a smattering of interesting photography and well-executed paintings here and there.
This piece by Elizabeth Schroeder (1956-2008) caught my attention.
Posted in general.
– June 13, 2009
In September 2007, I sat down with Karl Fogel to talk about the history of Subversion, democratic open-source projects and why they are easy to run, and his thoughts on distributed version control systems (he likes them!).
After many months in limbo, the venue which had previously expressed interest in publishing the piece informed me this week that it no longer fit into their new format, so I present it to you here, almost two years later.
Many thanks to Karl for the interview.
Why don’t you give us some background on the Subversion project, what your involvement was initially and what it continues to be.
Subversion actually had one of the most clear and unambiguous creation stories behind it of any project I know. There’s a company called CollabNet who provides services to clients that include things like hosted version control.
They provide hosted project management to their clients — things like a bug tracker, discussion forums, mailing lists, version control. Well for the version control component, they were using CVS. And of course, nobody knows better than CVS’s users and indeed its own developers, what a bear it can be. And they said we really need to offer something better — let’s go start an open source project to replace CVS, which is a pretty bold thing for a startup to do.
They had a sound business reason for wanting that, which was: having Subversion be open source, would mean that it could spread in the wild, and then when a customer came to them to buy their hosted service, the customer’s engineers would already know Subversion and be familiar with it, and that would be a selling point, rather than a “we’re locking ourselves into this thing that we don’t know, we’re going to have to learn it, and then we’ll never be able to get off it.”
Why did they want to do it open source from the getgo?
So they actually said that it /has/ to be open source, and there was never — as far as I know — any serious dissent from the investors or the board or the other management. You know it was, very clear that that was the way to go.
So they looked around, and at that time I had just written a book called “Open Source Development With CVS”, and was sort of known as a CVS developer and writer on CVS. So they came to me and said, do you want to start this new project. And I laughed and said, what a coincidence: I was just sitting around with my friend Jim Blandy, we were designing a new version control system, and Jim’s got this great design but even better he has a great name for it — he wants to call it “Subversion”. But you know, nothing will ever happen, because we’re just doing it in our spare time… unless, you wanna hire us to do it? And they said sure. They tried to hire Jim, but he stayed at Red Hat and Red Hat donated him to the Subversion project for about nine months.
So he worked on that full time, gave us the design, and got the repository layer mostly written. And then Red Hat pretty much needed him back and he just sort of, observed and occasionally chimed in with advice on the project. But we quickly got, like, almost immediately, we had 10 to 15 very highly qualified volunteers — people who had been looking for a version control system to replace CVS, and they just sort of said, “okay, it looks like this is the one, let’s get on board”.
And I think really within a year — I could go back and check but I think we had 30 people and now we’ve got about, I’d say, 50 some global committers, of whom at any given time 20 to 25 are active. And we get lots of patch contributions and we have a lot of partial committers that maintain like, side scripts and language bindings and things like that.
That’s something that kind of amazed me when I was looking through the CollabNet visualization of projects — well first of all it was interesting that you have far and away the most commits of anyone.
Oh you mean me personally, in the project? Yeah well maybe that’s because I keep reverting my mistakes. [chuckles]
[laughs] I also noticed that there were more than 100 committers.
Oh yeah — if you count the partial committers as well. We’ve got this division in the project, pretty much the only formal division we observe, which is, there are people who have global write access — you can commit anywhere. And then there are people who, you know we know that they can maintain the Finnish translations, or the Python language bindings, or something, but we haven’t seen enough patches to know whether it’s safe to have them committing, say to the repository code, where if something goes wrong, you know, that’s people’s data, we have to be careful there.
There’s no technical enforcement of this — if one of those people did commit to the repository code, we wouldn’t revert it necessarily, we would just say, you should ask first, get approval on the patch, but it looks good so leave it in. So it’s totally socially enforced. But yeah if you count all of those committers it’s like 100-some people. And they’re very active, the partial committers. And many of them become full committers after they spend some time getting to know the code.
Yeah. Well that’s just fascinating. I mean, who — who manages 100 people?
Oh well, we’ve evolved some systems for self-management… that is, we — one of the things we use in fact when evaluating whether to give someone commit access at all and whether to invite them to be a global committer, is, how well do they interact with the rest of the project and use the automated systems like the issue tracker and the mailing list archives and stuff — how well do they make it so that they don’t need to be managed by another committer? And then, for certain tasks that kind of require a human to do them, we have people filling these roles, like patch manager. Basically there’s somebody watching the development mailing list all the time, for people posting patches. And that person tracks if a patch gets applied, or if there’s a discussion thread that follows and eventually rejects it and says “well, that patch isn’t quite right, or that idea isn’t good so we’re not going to do that”, or, whether people basically seem to approve, but nobody gets around to committing the patch. So in that case the patch manager is supposed to take the patch, find the right archive URL for that whole thread, and make an issue in the issue tracker saying “here’s the patch, here’s the thread, this should really get applied, there doesn’t seem to be any objection to it, but nobody’s done it”. So, he keeps things from falling through the cracks.
So, we try to pick things where it’s possible to have one person manage that space. There aren’t spheres of responsibility, it’s not like one person is in charge of the repository code, another person is in charge of the working copy code, or something like that. Everybody’s equally responsible, but also there are people who are known to be experts in say, the merge tracking code. And if you are about to commit there, and you have questions, you would just go ask them, on the list of course, but you would address it to that person.
So, people manage themselves, and when there’s a need for centralized management we try to spot it and ask a volunteer to do it.
Do you think that the Subversion project is unique from other open source projects in its social organization?
It’s not unique, it is unusual in the extreme degree to which it has codified its conventions. But there are other projects that have also written down how they work and have sort of, guiding documents that they keep up to date as to how they operate. I’ve sort of come to think that, sure Subversion is unique, but almost every project is unique. I haven’t really seen two that work in exactly the same way. And it’s usually some combination of their different personal dynamics, because of the people involved. There can be different dynamics with the corporate sponsors — not every corporate sponsor is as enlightened and hands-off as CollabNet has been — they contribute a lot of development time, but they don’t make demands on the community. They don’t say, like, “you need to have this feature in by this date, or you’re all in trouble”. They don’t operate that way at all.
CollabNet gets it.
CollabNet totally gets it. But there are other companies that don’t get it to the same degree and yet, their projects still work –
It just takes twice as many programmers. [self-righteous laughter]
Well, you know, I feel that it is more that, there are people who don’t get as involved as they otherwise would be, because they feel like they won’t get as much influence as they would deserve, even if they did put in the time. So it’s really hard to spot that kind of disadvantage, how can you tell when someone hasn’t posted a mail to your list, right? You can’t. But those projects still work, they just have a slightly different dynamic.
And then you know there are the dictatorships, like the Linux kernel. There are the total democracies — Subversion is essentially a total democracy, there is no person in charge. There are things in between where you have module owners for different parts of the code. Everybody’s unique.
Are there any features that you- when CVS was dominant, and everyone knew its weaknesses — in fact it seems almost like, the first thing you learn about CVS is how to use the basic verbs, and the second thing you learn about CVS is what you can’t do with it. [self-satisfied chuckle]
[chuckles] Yeah that’s actually, that’s the best description I’ve heard of life with CVS– you learn its limitations very early.
And it’s funny because, I think the reason that people are so sort of, sharply aware of CVS’s limitations, is because, there is this vision in software engineering of the uses of a source control system, in terms of branches and merging — people know these models, of a branch, and a merge… a release branch. But CVS didn’t support certain aspects of that– for years. It’s kind of amazing how something like Subversion didn’t come along earlier.
I am also shocked actually, yeah.
And when it did come along it took relatively hefty sponsorship.
Yeah, although in fairness, it should be pointed out that more recent version control systems have not required the same level of investment, in terms of sponsored developers, and time. Like Git got up and running in what, 4 weeks, or something like that. I’m sure it was buggy and had all sorts of weird user interface stuff going on at that point, but it didn’t take as long to get up and running– and neither did Mercurial, for example, which does have some corporate sponsorship, but it didn’t take as long.
But part of the reason that Subversion… so Subversion has a weird dynamic. I don’t know why it took so long for something to come along and replace CVS, but when something did — the first such something being Subversion — we decided to preserve as much of CVS’s model as we could, which meant that we had- it was good in one sense because we had a design document in the form of an existing implementation. We said “we don’t want to be worse than this, and we want to support all the major features that this thing supports”. But it also means that you can’t say you’re done, you can’t call it 1.0, until you’ve matched a certain set of features. Whereas other projects were free to go decide where their 1.0 is. Or at least decide when it’s usable, and what they can claim. We had set the bar at CVS very publicly, and so, it took us four years to get to 1.0.
Are there features that you envisioned Subversion having, that it still does not have today?
Oh yeah. The new release — this is going to sound like advertising, but it is the answer to your question — the next release coming out is 1.5, and it’s going to have merge tracking, which is something that everyone’s been wanting for a long time. It’s gonna make maintenance of release branches and experimental branches and stuff much easier. Instead of having to remember all these URLs and remember what you merged in the past, you just sort of like do `svn merge -g [branch]`, or you know, you’ll be in your working copy, so you’ll just do `svn merge -g` and the right thing will happen — um, I think I’m summarizing the feature correctly but you know, maybe there’s some more stuff there.
So, like, that’s huge. And that’s something that we wish we’d had in 1.0 but, it wasn’t part of CVS, so we were like, do we do this or do we ship it? Let’s ship it, we can work on it later. There’s another feature, something that’s similar to CVS’s modules, the ability to do selective filtered checkout where you only get some of the subdirectories, that’s going to be in 1.5. So, yeah there are still features. And there are things we get asked for on the users list all the time, like the ability to give a human-readable label to particular revision numbers. People want that all the time.
What are you going to call it? A label?
They call it labels, I guess we would do some searching.
That’s very similar to what CVS does, as opposed to the convention, with Subversion, of the tag methodology.
Subversion’s tag and branch methodology, with a couple of exceptions, is functionally equivalent to what CVS does, it’s just more efficient, on the server side. It takes up less space — it’s very complex to explain what the tradeoffs are, but basically it’s the same thing, inverted along a different axis. But the labels thing is different because, CVS doesn’t have the concept of atomic commits or numbered- individually identified trees. So there’s really no- the thing that people are proposing be labeled in Subversion doesn’t exist in CVS. It wouldn’t be their label [tag?], because you can’t say “the revision tree rooted at revision 2614”- that isn’t there in CVS. In Subversion that tree does exist, it’s just a question of if you want to call it revision 2614 or “My Special Golden Release”.
Well in CVS’s labeling– it is called label isn’t it?
It’s called branches and tags.
It’s called tags, yeah– once you tag something, that tag can’t be modified, which is different from the Subversion-
Ah- that’s actually not true.
I mean the code in the tag– the code that the tag refers to can’t be modified.
Well, it is true in both systems that code once checked in, cannot be modified. But the tag - that is, what the tag refers to in CVS, /can/ be modified. And you can’t tell whether it’s been done or not.
[exasperated chuckle] Okay.
So the difference is that in Subversion– in both CVS and Subversion, you can modify what a tag refers to, but in Subversion you can tell that someone’s done that. And in CVS you can’t.
That’s interesting. I only used– I did not use CVS very much before I started using Subversion. So actually I’m glad this came up. Where did the convention — the branches and tags copied from trunk convention — come from in Subversion? Was that something that you envisioned as being, you know, a sort of obvious implementation of that software engineering concept, or did that kind of emerge out of something else.
That was the original insight that Jim Blandy had about — or I think he was talking with someone whose name I can’t remember now, about CVS and its problems. And, basically the conclusion that they came to — this was in like 1998 or 99 or something was — the problem with CVS is that it’s indexed along the wrong axis, which is — it’s supposed to be taking these snapshots of things that change over time — so you’ve got a tree of files, you make a change, now you’ve got a new tree — but it’s indexed along the files, and so you’ve got lots of little changes and you have to work really hard to collect them into some coherent change at a new version of the tree. And, what tags and branches in CVS do is they attach the same label individually to a particular revision, which could be different, in each of thousands of files in a repository. And that means that finding out what a tag represents, or even indeed creating the tag, is an O(n) operation on the number of files. And it’s related to how many different revisions are in that file and how big are they.
So his essential insight for Subversion was, what’s the difference between a tag and a branch anyway? Nothing. You can’t tell a tag from a branch, until you start committing on a branch. And then you know, “this isn’t just a tag, because I can commit on it”. So why not just call them all copies? That is, what you want to do is, you’ve got a tree of files and directories, and you want to make a new copy of it, and then you want to make changes in the copy. And, if you don’t make changes and you just make the copy, that’s a tag.
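A toy model may help picture the "wrong axis" point and the tags-as-copies idea. This is only a sketch under loud assumptions: the class names `PerFileStore` and `TreeStore` are made up for illustration, and neither resembles the real CVS or Subversion storage code. It shows just the shape of the work: a per-file store has to write one tag record per file, while a tree store records one new reference to an existing tree.

```python
class PerFileStore:
    """CVS-like toy: history and tags are indexed per file (hypothetical model)."""

    def __init__(self, files):
        self.head = dict(files)               # path -> current revision of that file
        self.tags = {p: {} for p in files}    # path -> {tag name: file revision}

    def tag(self, name):
        # The label is attached to every file individually: O(n) in the file count.
        for path, rev in self.head.items():
            self.tags[path][name] = rev
        return len(self.head)                 # number of per-file records written


class TreeStore:
    """Subversion-like toy: a tag or branch is a copy of a whole tree (hypothetical model)."""

    def __init__(self, trunk_files):
        # Trees are treated as immutable; a commit installs a new tree.
        self.dirs = {"trunk": dict(trunk_files)}

    def copy(self, src, dst):
        # One new reference to the existing tree, regardless of its size.
        self.dirs[dst] = self.dirs[src]
        return 1

    def commit(self, top, path, content):
        # Build a new tree for `top`; copies made earlier keep the old tree.
        # (This toy rebuilds the whole dict; a real system would share unchanged nodes.)
        self.dirs[top] = {**self.dirs[top], path: content}
```

In this toy, `copy("trunk", "tags/1.0")` followed by no commits is a tag, and the same copy followed by `commit("tags/1.0", ...)` is a branch, which is the distinction described above.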
Unless you need to update the documentation because you forgot to. [nervous laughter] Which I’ve done many a time.
Oh well if you didn’t branch the documentation too [laughs].
No like if I forgot to, for example, change the version number, in the documentation, in my branch, before I tagged it. So I’ll just modify the tag, I’ll cheat.
Oh yeah yeah, I’ve done that too. Don’t look too carefully at the history of the Subversion repository. [laughs] But the thing is, if you wanted to protect tags against that kind of playing around, you could. It’s very easy to make a pre-commit hook script that watches the tags directory and doesn’t allow anything but copies of trees from trunk. It’s just that people don’t generally bother to do that because, you can always go look and see whether your tags directory has been played around in.
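For the curious, here is a minimal sketch of what such a pre-commit hook might look like in Python. It is not the Subversion project's own hook. It assumes a top-level trunk/ and tags/ layout, and it assumes the output layout of `svnlook changed --copy-info` (three status columns, a space, the path, and an indented "(from PATH:rN)" line after each copy) as shown in the Subversion book. The names `tag_violations` and `main` are invented for this example.

```python
import subprocess
import sys

TAGS_PREFIX = "tags/"   # assumption: conventional trunk/branches/tags layout


def tag_violations(changed_output):
    """Return changed paths under tags/ that are not a new tags/NAME/ copied from trunk."""
    lines = changed_output.splitlines()
    bad = []
    for i, line in enumerate(lines):
        # Skip blank lines and the indented copy-source detail lines.
        if len(line) < 5 or line.startswith("    (from "):
            continue
        status, copied, path = line[0], line[2] == "+", line[4:]
        if not path.startswith(TAGS_PREFIX) or path == TAGS_PREFIX:
            continue
        name = path[len(TAGS_PREFIX):]
        creates_tag = name.endswith("/") and name.count("/") == 1
        source = lines[i + 1].strip() if copied and i + 1 < len(lines) else ""
        from_trunk = source.startswith("(from trunk/:")
        if not (status == "A" and creates_tag and from_trunk):
            bad.append(path)
    return bad


def main(repos, txn):
    """Inspect the pending transaction; a non-zero return rejects the commit."""
    out = subprocess.run(
        ["svnlook", "changed", "--copy-info", "-t", txn, repos],
        check=True, capture_output=True, text=True,
    ).stdout
    bad = tag_violations(out)
    if bad:
        sys.stderr.write("Tags may only be created as copies of trunk:\n")
        for path in bad:
            sys.stderr.write("  " + path + "\n")
        return 1
    return 0


if __name__ == "__main__" and len(sys.argv) == 3:
    # Subversion invokes pre-commit hooks with the repository path and transaction name.
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

The parsing is kept separate from the `svnlook` call so the policy can be tried on sample text without a repository.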
And that seems to be an interesting aspect of working with Subversion, and something you touched on previously– this sort of social aspect of managing code. There’s such flexibility and confidence in the ability to manage the code exactly how you want, that you know, you don’t care if you give someone commit access and they might not behave great, because you’re looking at every commit–
I think there’s kind of a deep lesson there, actually. The reason authorization protections are necessary, ever, is when you can’t trust somebody to modify something, and can’t be certain that you’ll find out about it later. If you /can/ be certain that you’ll find out about it later, then that auditing ability is enough, and it doesn’t really matter, whether you can trust a person, either for their competence or for their intentions. Because auditability and revertability mean you’re safe. So a version control system is kind of a way to implement social controls. It makes further control unnecessary.
Subversion has great… if you know what you’re looking for, it’s easy to find the information you’re looking for with the log command. But the log command is also a little clumsy. But there are tools which have come out, to kind of compensate for that, and I think those tools, are easy to make, because the API to Subversion, is very, sort of, complete. So for example, Trac. Trac is this like, everybody’s favorite sister application with Subversion, and some people, myself included, could not imagine– I always set up a Trac system, to watch all my Subversion repositories, because it’s just such a nice visualization of the timeline, viewing changesets, the history of different files–
I love Trac, I also have used it on some projects, or have participated in projects where other people were using it. If I can gloat for a second– I feel like that example, that phenomenon in general, are kind of a complete vindication of the emphasis we placed on APIs. And part of the reason we wrote it in C even though C… makes you do more work for the same results, as compared to like Python or something, was we wanted to have binding surfaces to lots of other languages. And so people came along and they used those APIs and they exposed them to Perl, to Python, to Java… there’s somebody doing Scheme now. And you get programs like Trac which, can add features using Subversion as a substrate, that we would never have thought of. And that was exactly the goal, in making those APIs. And it totally happened, exactly as everyone dreamed it would happen.
Is there a uh, is there kind of a layer of visualization— for yourself, when you use, for example, the log command, are you ever wishing for more interactivity? Is it something where you think “yeah, Subversion provides this flexible API and I’m glad this ecosystem of tools are there” or are you ever like, “uh, you know, like, there should be a way to take a different path through this information, with Subversion itself”.
Well, I think I might have that sort of developer’s weakness of knowing my tool too well, and so I think of– when I’m thinking of a problem, solutions come in the form of weird ways to use the tool, that you kind of have to be very familiar with it to think of. So I don’t find myself craving those other ways of looking at log data, or visualization tools. I’m also not a particularly visual person. But I know that most users are, and it’s important to support them.
When I said “visualize”, I really just meant– I guess I meant filter, actually. To get to the data that you’re looking for.
Oh– yeah, I’ve even written a filter that takes a stream of log data, and a regular expression, and just returns the log entries that contain that regular expression, that kind of thing.
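A filter along those lines might look something like the sketch below. This is not Karl's script, just an illustration: it assumes the default plain-text `svn log` output, where entries are delimited by a line of 72 hyphens, and the function name `filter_log` is made up here.

```python
import re

SEPARATOR = "-" * 72   # the line svn log prints between entries


def filter_log(log_text, pattern):
    """Return the log entries (header plus message) that match the regular expression."""
    regex = re.compile(pattern)
    entries, current = [], []
    for line in log_text.splitlines():
        if line == SEPARATOR:
            # A separator closes the entry collected so far, if any.
            if current:
                entries.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:                       # output that lacks a trailing separator
        entries.append("\n".join(current))
    return [entry for entry in entries if regex.search(entry)]
```

To use it on a stream, one could pass `sys.stdin.read()` as `log_text` and pipe `svn log` into the script. A log message that itself contains a 72-hyphen line would be split in two, which a sturdier version could avoid by parsing `svn log --xml` instead.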
That wouldn’t happen to be for, seeing what you’ve merged into another branch, would it be? [chuckles]
Actually no, but, well, you’ll need that less with merge tracking [in 1.5].
Right, and that’s what I was going to say, maybe that’s another reason that you weren’t too concerned about this is because, I think that’s one of my primary frustrations, is when you have to do svn log, stop on copy, and then peer or filter through things to determine what has been committed to the branch.
You know what– now that you mention that, I realize that– so the way I run svn log normally, either I have the file of all the log data sitting around or I run it, in a shell, but my shell is run in an emacs buffer, and so I might get 20k lines of log output but then I can do incremental search and regex search, backwards in the buffer, because it’s all inside Emacs, and so actually I am doing all of my logs, inside an extension tool, I just never consciously realized it before. And I’m doing stuff like what you’re saying. So, I’m using Emacs to compensate for a lack of a feature in Subversion. I just never knew it.
So, let’s talk more about the new features in 1.5. I guess let’s start with merge, which seems to be the primary feature.
So we had this kind of very vague handwavy vision for years and years that we would one day keep track of what changes had been merged from one line of development to another, like trunk to branch, branch to another branch, branch to trunk, whatever. We would do that using Subversion’s properties, which are metadata — key-value pairs — that you can attach to any versioned object. So, files can have properties, directories can have properties. And so we’d have this one called svn:mergeinfo.
Like for example if you merged changes, revisions 2000 through 2010 from trunk into the branch, then the mergeinfo on the branch would record that fact. And then if you’d asked trunk “merge me everything you don’t have, into the branch”, it would not do those revisions, because you had already done them. And then the idea would be you could ask the branch — you could write tools that would ask the branch — “has the bug fix for issue #37 been addressed?”, and your tool presumably could go to the issue tracker, or ask some other database, “which commits in the repository addressed issue #37?”. Okay that’s revisions 2000, 2003, and 2009, and then it would go ask the branch “do you have revisions 2000, 2003, and 2009 merged into you from trunk?” and the branch would say yes or no. And then you would just be able to answer with the press of a button “yes, that bug is fixed on these following release branches:…”.
Not all of that chain of steps is in Subversion, but the asking a branch which revisions have merged into it, that is now in Subversion 1.5. And so the rest of that is for tools like Trac to implement, and they now have the APIs to do that. So I expect to see it very quickly.
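The "ask the branch" step can be sketched in a few lines of Python. This assumes the textual form of the svn:mergeinfo property value, one line per merge source such as `/trunk:2000-2010,2015`, with a trailing `*` marking a non-inheritable range. The function names `parse_mergeinfo` and `has_all` are invented for this example and are not part of any Subversion API.

```python
def parse_mergeinfo(value):
    """Map each merge source path to the set of revisions recorded as merged from it."""
    merged = {}
    for line in value.splitlines():
        if not line.strip():
            continue
        # Split on the last colon so a path containing ':' stays intact.
        source, _, rangelist = line.rpartition(":")
        revs = merged.setdefault(source, set())
        for item in rangelist.split(","):
            item = item.strip().rstrip("*")      # drop the non-inheritable marker
            start, _, end = item.partition("-")
            revs.update(range(int(start), int(end or start) + 1))
    return merged


def has_all(value, source, wanted):
    """True if every revision in `wanted` is recorded as merged from `source`."""
    return set(wanted) <= parse_mergeinfo(value).get(source, set())
```

A tool could obtain the property value with `svn propget svn:mergeinfo` on the branch and then call, say, `has_all(value, "/trunk", [2000, 2003, 2009])` to answer the issue #37 question from the example above.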
So I imagine that this merge tracker will be able to skip over, sort of, spots of things that have been merged into something, and then merge in the rest. That’s the idea.
Right. So one thing it does, and it’s very important that it be able to do this, is you can cherry pick merges from various lines and bring them into some line, and then you can just say “okay, merge everything you don’t have”, and Subversion will figure out — I’m gonna get killed if I’m summarizing this wrong, because I’m not actually working on this part of the release, but I /think/ this is correct — Subversion will go find all the stuff that hasn’t been merged, make the discontinuities where necessary in the merge ranges, to compensate for stuff that’s already been merged, and successively merge the remaining stuff.
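The "discontinuities in the merge ranges" idea can be shown with a small worked example. This is a sketch of the arithmetic only, not Subversion's merge algorithm, which also has to consider which revisions actually touched the source. The function name `unmerged_ranges` is hypothetical.

```python
def unmerged_ranges(first, last, already_merged):
    """Return inclusive (start, end) revision ranges in first..last not yet merged."""
    ranges, start = [], None
    for rev in range(first, last + 1):
        if rev in already_merged:
            # A cherry-picked revision closes the range that was open, if any.
            if start is not None:
                ranges.append((start, rev - 1))
                start = None
        elif start is None:
            start = rev
    if start is not None:
        ranges.append((start, last))
    return ranges
```

For instance, if revisions 2003, 2004 and 2009 were cherry-picked earlier, asking for everything from 2000 through 2010 leaves three ranges to merge in order: 2000 to 2002, 2005 to 2008, and 2010 on its own.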
That’s great.
Yeah. I mean it’s exactly what you want. We actually had a really thorough requirements solicitation process around merge tracking. CollabNet sponsored several meetings with a bunch of customers of CollabNet’s who use Subversion at an enterprise scale. And they described in great detail what they needed, they made use cases available to us, and then we took the results of that to the public lists, and said alright folks what do you think of these, and add some more cases. And spent a long time polishing it up into a functional specification. And that functional spec has guided all of the Subversion developers, in developing this.
So, I think it’s, that to me is a classic example of how a free software project and a for-profit corporation that’s sponsoring it can really take advantage of the fact that the corporation has customers, and access to information that the project might not otherwise have. So, I don’t think for example that, say the Git project — which is doing great things, and I’d love to get a chance to try it out myself, but they have one highly unusual customer, in a sense, which is the Linux kernel community, and as far as I know there isn’t a lot of input from corporations that have thousands of developers, not all of whom are extremely familiar with their tools, telling it here’s what we need from say merge tracking, or something else. So it’s a much more ad-hoc requirement solicitation process. Subversion tried to have a very thorough, formal, well-described process, and I think the merge tracking we’re getting is going to be a result of that.
Something that, well, my primary project is downstream from another open source project. And I actually use SVN’s merge tools for that.
Like the svn load-dirs script? Or the svnmerge.py?
Nope, I just use svn merge.
Wow, okay…
REDACTED - tedious 10-minute tech support session.
[chuckles] The Subversion project discussion turns into the Subversion tech support discussion — yeah, it always happens.
[chuckles] My more general question was, are there any features that you’d like to see, with, sort of smart interactions between different repositories. And I guess you kind of answered that in saying that the metadata is there, and it’s awaiting use.
What I’d really like to see, and what some projects are starting to do is, version control systems being able to talk to the repositories of other version control systems. And I think like, there’s a Mercurial thing that can talk to Subversion repositories now, and, Git might have that, and, in some sense we’re not doing our part, because I don’t think we can extract easily from Git or Mercurial repositories. But there might be some sort of in-between thing that can sit there and translate.
[smart alecy] Oh you’re saying your API is better. Yeah, I hear you.
[not getting my awesome joke] Well, you know our APIs are great, but they’re Subversion specific.
But it would be nice if… if the semantics of the different systems line up well enough, it would be nice if which one you use becomes a matter of personal taste about clients, instead of this choice that has to be made for the whole project. But we’re not at that level of abstraction yet.
Yeah– although– I think all of these SCMs kind of, endeavor to be semantics only, right? To a certain extent. It’s kind of like — that is the fundamental thing that differs between the SCMs, is the verbs — the actions. So like, I don’t think that, that you can be seen as kind of a syntactic layer.
Well, it’s true that those are the differences. But even with that they’re still very close to each other. Here’s a perfect example: Subversion is considering taking the Mercurial revlog format and using it as our repository format, in a future release of Subversion. So, that wouldn’t be possible if the semantics weren’t basically the same.
Very true, yeah. What’s your take on distributed version control systems… where they have a place, where they don’t have a place.
Uhh, complicated. I really like them… well, I should say, I love the idea, I think it’s, it’s the way I would want to work. Although for historical reasons I’m using Subversion most places and will continue to I think. But if you’d shown me a distributed system like 10 years ago I probably would have started using it right away. But, I don’t think they’re the answer to everything, even though it is formally true that they are a superset of centralized systems. There’s nothing a centralized system like Subversion can do that a distributed system couldn’t do. But the difference is whether you’re working with the grain or against the grain of the system, in a sense.
Well so, Ben Collins-Sussman, one of the other founding developers of Subversion, has a great quote that he asks about the distributed systems, which he also is interested in and you know, thinks that there’s some good stuff there. He also is in a position where he’s been talking to a lot of enterprise users of version control over the years, and when it comes to one of these new distributed systems he always says “Looks great. How’s the corporate rollout going?”. Meaning, that’s great but, if you try to get 10,000 developers of some huge company worldwide to wrap their heads around this idea of like, everybody’s their own repository, and you push and pull changes randomly, and you can all decide to publish over to this central repository for a while and then you can merge that into another one, it’s not gonna fly. It’s too complicated.
And just because we as revision control experts who spend our lives on this stuff can find it interesting and very usable, doesn’t mean that users are gonna find it that way. And it’s funny– the emergence of these systems has made us realize in Subversion like, what are our real strengths? Simplicity. Comprehensibility for users who may not even be programmers, let alone revision control experts. And centralization, paradoxically.
So for example, What are the most important features in Subversion — or the ones that have gotten the best feedback from the user community? Well, locking. That is, the ability to tell the repository “this following file is locked”, and then when somebody else tries to edit it, they get a little bounce-back notice saying “so and so has a lock, you better talk to them first.” It’s the total opposite of a distributed system. You know, it’s the ultimate centralized feature. But it’s one that everybody wanted.
Another example would be — that I hope we’re gonna get it, not in 1.5 but in 1.6, is log message templates. You know, formatted things that you can use to guide your committers as to how to write their log messages for each commit. That again requires everybody to be consulting a central repository regularly, to get updated templates. So, I think we’re starting to see that the centralization — it is in some ways a weakness, but it is also Subversion’s strength. It’s the thing that is most attractive and usable for corporate version control users, and this is kind of an edgy thing to say, but I think open source software that is useful for corporate developers has better survivability characteristics than software which doesn’t attract them so much. Because we get funding. Subversion is never going to have to worry about resources. And CollabNet is not the only company that is spending money on Subversion development now.
So, I kinda feel like, having features that put your long term survivability on a firmer footing — they’re sort of — they’re innately valuable, somehow. In an evolutionary sense.
Okay, I think that’s all I’ve got. Thanks Karl.
Thank you.
Posted in interviews.
– June 11, 2009