Bug 883 - File/dir renames consume extra space in repository
Assigned To: Bugzilla
Reported: 2007-12-19 20:11 UTC by Jesse Glick
Modified: 2013-02-12 02:45 UTC

Description Jesse Glick 2007-12-19 20:11:32 UTC
When files or dirs are renamed in Hg, repository size is increased, I guess by
about the compressed size of those files:

$ hg init
$ cp /boot/vmlinuz-2.6.22-14-generic f
$ hg add f
$ hg ci -m 1
$ du --si
1.8M	./.hg/store/data
1.8M	./.hg/store
1.8M	./.hg
3.6M	.
$ hg ren f g
$ hg ci -m 2
$ du --si
3.5M	./.hg/store/data
3.5M	./.hg/store
3.5M	./.hg
5.3M	.
$ ls -Rl .hg/store/data
total 3328
-rw-r--r-- 1 jglick jglick 1692145 2007-11-09 05:42 f.d
-rw-r--r-- 1 jglick jglick      64 2007-11-09 05:42 f.i
-rw-r--r-- 1 jglick jglick 1692204 2007-11-09 05:42 g.d
-rw-r--r-- 1 jglick jglick      64 2007-11-09 05:42 g.i

For a repository which is already hundreds of megabytes, doing major source
reorganizations is out of the question for this reason. This is a serious
drawback compared to Subversion; or even arguably to CVS, where moving a dir
means you only pay a penalty in history, not future usage.

mpm has written regarding implementation:

"Currently fixing the renaming issue would present a layering
violation. That is, individual revlogs have no knowledge of any other
revlog. So when we ask a revlog to retrieve version <x> of some file,
it has to have all the data internally."
Comment 1 Thomas Arendsen Hein 2007-12-20 15:15:42 UTC
Generally, having support for referencing other revlogs could allow for other
usages, too, e.g. splitting revlogs if they grow too big, either to circumvent fs
or backup limitations, or to prevent new changes breaking hard links for already
huge revlogs.
Comment 2 Vadim Lebedev 2008-01-31 15:38:31 UTC
In response to mpm's:
"Currently fixing the renaming issue would present a layering
violation. That is, individual revlogs have no knowledge of any other
revlog. So when we ask a revlog to retrieve version <x> of some file,
it has to have all the data internally."

Actually, we could store in the revlog a reference to a generic external object,
identified by some kind of "url" and (maybe) a hash.
Initially it could be used to implement renames and copies, but it could evolve
into some kind of super svn:external facility later (like an hg repo which
retrieves files directly from an external svn or git).
Comment 3 Matt Mackall 2008-03-09 20:48:17 UTC
Ok, here's a proposed fix and the problems that subsequently crawl out from
under the rock:

In filelog, override revlog.revision. Add metadata that says "the revision
returned by revlog is not a full revision as promised but a revision of file
x@rev + the body here treated as a delta." Then filelog.revision can instantiate
a temporary filelog object for x, get the specified revision, and apply the
delta. Do the appropriate steps in filelog.add to make this work.

Now with a little luck, getting the -next- revision from the filelog will just
work. Otherwise, we'll need to hack revlog.revision to call itself (and thereby
filelog.revision) to grab the base revision.

So now we've got a scheme that mostly does away with the layering violations as
revlog doesn't have to have any special knowledge about other revlogs (it's all
in the filelog class, which already knows how to find and open revlog from a
pathname). It even gets the case where c@z is a copy of b@y which is a copy of
a@x right automatically. 
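
The lookup described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Mercurial's filelog: `ToyFilelog`, `Entry`, the in-memory `store` dict, and the `(start, end, data)` hunk format are all made up for illustration. It shows only the read path: an entry either holds a full text or names a source `path@rev` plus a delta, and chained copies resolve by recursion.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Hunk = Tuple[int, int, bytes]  # replace base[start:end] with data


def apply_delta(base: bytes, hunks: List[Hunk]) -> bytes:
    """Apply non-overlapping hunks, sorted by start offset, to base."""
    out, pos = [], 0
    for start, end, data in hunks:
        out.append(base[pos:start])
        out.append(data)
        pos = end
    out.append(base[pos:])
    return b"".join(out)


@dataclass
class Entry:
    body: object                               # bytes (full text) or list of hunks
    copy_of: Optional[Tuple[str, int]] = None  # (source path, source rev)


class ToyFilelog:
    """Hypothetical stand-in for a filelog; 'store' maps path -> entries."""

    def __init__(self, store, path):
        self.store = store
        self.entries = store.setdefault(path, [])

    def add(self, body, copy_of=None):
        self.entries.append(Entry(body, copy_of))
        return len(self.entries) - 1

    def revision(self, rev):
        entry = self.entries[rev]
        if entry.copy_of is None:
            return entry.body  # ordinary full revision
        # Open a temporary filelog for the source and apply the delta.
        src_path, src_rev = entry.copy_of
        base = ToyFilelog(self.store, src_path).revision(src_rev)
        return apply_delta(base, entry.body)


store = {}
a = ToyFilelog(store, "a")
a.add(b"hello world\n")
b = ToyFilelog(store, "b")
b.add([], copy_of=("a", 0))                  # pure rename: empty delta
c = ToyFilelog(store, "c")
c.add([(0, 5, b"HELLO")], copy_of=("b", 0))  # copy of a copy, with an edit
result = c.revision(0)
```

Here `c@0` is resolved through `b@0` back to `a@0` without either copy storing the full text again. The sketch says nothing about hashes or the wire protocol, which are the hard parts discussed next.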

But we've also got a huge compatibility problem. An old client can't just pull
this data and expect it to work. Instead, we've got to add a new version of the
wire protocol that allows us to send these sorts of deltas to new clients, but
sends full revisions to old clients. And a new client would like to take old
client data and deltify the copies, which may not be possible at pull time (for
instance, if the destination revlog is sent before the source revlog). Also,
hashes at the revlog layer and at the filelog layer no longer agree. Ouch.

In short: not an easy problem.

Marking deferred.
Comment 4 Jred 2008-03-27 19:21:03 UTC
@mpm: I would argue that the two problems -- revlog index cross-references, and
the wire protocol -- could be viewed as 2 completely separate problems. 

One of the main problems right now in Mercurial seems to be a lack of viable
cross-path-rev referencing method, in the revlog index scheme. If the index
scheme was allowed to reference URI's from other paths (internal or external),
instead of just revlog data with a matching name, that would be a simple fix for
a whole list of issues. 

This reminds me of the discussion in the mailing list about combining
HistoryTrimming, PartialCloning, Overlays, and Obliterate methods. An in-place
replacement of revlog data with its hash value, and a "reason for missing data"
that includes a URI for a third-party data source, could be a combined fix for
all of these features/issues. That "third-party data source URI" could just as
easily reference paths and revs inside the same repository, as external
repository URI's.

Now, separating the wire protocol, so that older clients get what they expect,
rather than what data is actually held locally by the revlog, is not necessarily
easy. It is possible, provided all the requested data is online *somewhere*.
Attempts to push-pull revlog data that isn't available online could be a defined
failure condition, for the "old client" wire protocol. So I would say that
internal repository reference URI's are probably the easiest, to interpret into
this "old client" wire protocol. 

Does Mercurial already have any way of signaling current repository version,
and/or available extensions, on each end of a push-pull connection? That would
be an easy way of signaling which wire protocol can be used optimally, in any
given transfer. If it doesn't already exist, maybe a push/pull flag or attribute
could be added, like a "wire protocol version specifier"?
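
One way such signaling could work is a capability list exchanged at the start of a connection. The sketch below assumes each side advertises a space-separated capability string; the `copydelta` token and the two format names are hypothetical, invented for this example.

```python
def parse_caps(advertised: str) -> set:
    """Split 'name' and 'name=value' tokens and keep only the names."""
    return {token.split("=", 1)[0] for token in advertised.split()}


def choose_wire_format(local_caps: str, remote_caps: str) -> str:
    """Use the new encoding only when both ends advertise it."""
    shared = parse_caps(local_caps) & parse_caps(remote_caps)
    # "copydelta" is a hypothetical capability name.
    return "copy-delta" if "copydelta" in shared else "full-revision"


new_to_new = choose_wire_format("lookup copydelta", "lookup unbundle=HG10GZ copydelta")
new_to_old = choose_wire_format("lookup copydelta", "lookup unbundle=HG10GZ")
```

With this shape an old peer never sees the new encoding, at the cost of the newer side having to expand copies into full revisions for it.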
Comment 5 Matt Mackall 2008-03-27 22:13:05 UTC
Incompatibility with old clients is a non-starter, so viewing it as two problems
is as well.

Current clients have file revision hashes that include the current metadata for
the copy info. If we change what we store, we break the hash -> old clients
break. So we've either got to fake the contents (and destroy the concept of
revlog id = hash of contents) or break compatibility.
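
The hash problem can be made concrete with a small sketch. It assumes the node id is the SHA-1 of the two sorted parent ids followed by the stored text, and that copy info is a `copy:`/`copyrev:` header framed by `\x01\n` markers in front of the file contents. Treat both layouts as assumptions for illustration rather than a format specification; the file name and source revision id are placeholders.

```python
import hashlib

NULLID = b"\0" * 20


def with_copy_meta(text: bytes, copy: bytes, copyrev: bytes) -> bytes:
    """Prefix text with a copy header (layout assumed, see lead-in)."""
    meta = b"copy: " + copy + b"\ncopyrev: " + copyrev + b"\n"
    return b"\x01\n" + meta + b"\x01\n" + text


def node_id(stored: bytes, p1: bytes = NULLID, p2: bytes = NULLID) -> bytes:
    """SHA-1 over the sorted parents plus whatever bytes are stored."""
    lo, hi = sorted((p1, p2))
    return hashlib.sha1(lo + hi + stored).digest()


contents = b"kernel image bytes\n"
source_rev = b"a" * 40  # placeholder hex id of the copy source

full_form = with_copy_meta(contents, b"f", source_rev)  # what is stored today
delta_form = with_copy_meta(b"", b"f", source_rev)      # "copy + empty delta"

old_client_id = node_id(full_form)
new_client_id = node_id(delta_form)
```

Because the id is computed over the stored bytes, `old_client_id` and `new_client_id` differ even though both describe the same file contents copied from the same source.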
Comment 6 Eric Hopper 2008-04-01 14:18:12 UTC
My feeling is that it's possible to make this happen without changing the
essential meaning of either the index or data files.

One rather unsubtle and probably bad idea would be to allow index files to
reference other data files via a combination of numerical linkrev (referencing a
changeset) and filerev hash (referencing a manifest entry in that changeset). 
If the filerev hash were null then the information would be ignored.  If not
they would be taken as a 'base' on which to build the current file image, along
with the delta range stuff from the main data file that's already there.

Keeping the wire protocol unaffected after doing so will be tricky but I
definitely think it's doable.  If the wire protocol is unchanged though,
divining the need for the new way of storing references to other data files for
incoming changesets is going to be a pain.  Incoming changes will have to be
scanned for copies.
Comment 7 Matt Mackall 2008-04-01 17:59:54 UTC
You're missing the first conceptual hurdle: if we change what we're storing in
the revlog, we change the hashes. Revlog is a self-contained black box. You hand
it "data", it hands you back an identifier hash. If we change our data from
"copy + full revision" to "copy + delta", revlog will hand us back a different
identifier. Thus, old and new clients will disagree about the hash for "file x
containing X, copied from y@z".

To get past this, we would need to hoist both the hash calculation and checking
up out of revlog into filelog (and changelog, and manifest). Then when we
checked in a copy, we'd have to first calculate the hash for "copy + full
revision", then calculate the delta, then tell revlog "please store 'copy +
delta' but with the hash for 'copy + full revision'". 

To recover a revision, we'd have to get "copy + delta", look up the copy,
reconstruct that revision, apply the delta to get the full revision, then
calculate the hash of "copy + full revision" and compare it with the identifier
we were asked to retrieve.

On pull over the existing wire protocol, we'd have to do the above, and then
take our reconstructed "copy + full revision" and turn it into a delta (usually,
but not always, against an empty file).
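
A minimal sketch of that store/recover cycle, with the hash hoisted out of the storage layer. Everything here is a simplification: `BlackBoxLog`, the shared-prefix "delta", and hashing only the full text (no parents, no copy metadata) are assumptions made to keep the example short.

```python
import hashlib


def node_of(full_text: bytes) -> bytes:
    """Identifier computed by the caller, always over the full revision."""
    return hashlib.sha1(full_text).digest()


class BlackBoxLog:
    """Stores a payload under whatever id the caller supplies."""

    def __init__(self):
        self.blobs = {}

    def store(self, node, payload):
        self.blobs[node] = payload

    def load(self, node):
        return self.blobs[node]


def prefix_delta(base: bytes, new: bytes):
    """Toy delta: length of the shared prefix, then the differing tail."""
    n = 0
    while n < min(len(base), len(new)) and base[n] == new[n]:
        n += 1
    return n, new[n:]


def add_copy(log, source_text, new_text):
    node = node_of(new_text)                     # hash of the full revision
    log.store(node, prefix_delta(source_text, new_text))  # but store the delta
    return node


def read_copy(log, node, source_text):
    n, tail = log.load(node)
    full = source_text[:n] + tail                # rebuild the full revision
    if node_of(full) != node:                    # verify against requested id
        raise ValueError("integrity check failed")
    return full


log = BlackBoxLog()
source = b"the quick brown fox\n"
renamed = b"the quick red fox\n"
node = add_copy(log, source, renamed)
recovered = read_copy(log, node, source)
```

The identifier stays what an old client would compute, while the log holds only the tail that differs from the copy source. Reading with the wrong source text fails the final check instead of returning bad data.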
Comment 8 Eric Hopper 2008-04-02 12:22:46 UTC
I understand that.  Perhaps instead of moving that much work up, the revlog could
be given an external data handler when you ask it for data.  And for the write
side you could give it an optional argument with the data for revlog to use as
the base for whatever diffing algorithm it might choose to use.

The contract would be that the external data handler you passed on read would be
able to retrieve that base for any revision in which you passed on such a base
on write.
Comment 9 Eric Hopper 2008-04-02 12:26:59 UTC
Oh, better idea for write...

Pass in an optional external data handler on write.  If there is one it should
be able to provide the data for the base of the revision for diff purposes, and
it should be able to provide a cookie that will be given to the external data
handler for read.

That way the external data handler doesn't have to remember any associations
between the revision and the data.  It will be able to rely on the revlog to hand
it the cookie which will allow it to fetch those.
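
The handler-and-cookie contract might look roughly like this. The class and method names (`CopySourceHandler`, `base_for_write`, `fetch`, `ToyRevlog`) and the `path@rev` cookie format are hypothetical, and the shared-prefix delta is a placeholder for a real diff.

```python
def shared_prefix(a: bytes, b: bytes) -> int:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


class CopySourceHandler:
    """Hypothetical handler that knows how to find copy sources."""

    def __init__(self, texts, source=None):
        self.texts = texts    # {(path, rev): full text}
        self.source = source  # (path, rev) to offer as base on write

    def base_for_write(self):
        """Return (base text, cookie) for the revision being written."""
        path, rev = self.source
        return self.texts[self.source], "%s@%d" % (path, rev)

    def fetch(self, cookie):
        """Turn a cookie handed back by the revlog into the base text."""
        path, rev = cookie.rsplit("@", 1)
        return self.texts[(path, int(rev))]


class ToyRevlog:
    def __init__(self):
        self.entries = []  # (cookie or None, body)

    def add(self, text, handler=None):
        if handler is None:
            self.entries.append((None, text))
        else:
            base, cookie = handler.base_for_write()
            n = shared_prefix(base, text)
            self.entries.append((cookie, (n, text[n:])))  # cookie + delta
        return len(self.entries) - 1

    def revision(self, rev, handler=None):
        cookie, body = self.entries[rev]
        if cookie is None:
            return body
        if handler is None:
            raise LookupError("revision %d needs an external data handler" % rev)
        n, tail = body
        return handler.fetch(cookie)[:n] + tail


texts = {("f", 0): b"line one\nline two\n"}
g = ToyRevlog()
rev = g.add(b"line one\nline 2\n", CopySourceHandler(texts, ("f", 0)))
reader = CopySourceHandler(texts)  # remembers nothing about g's revisions
restored = g.revision(rev, reader)
```

The read-side handler is constructed without any knowledge of which revision used which base; the revlog returns the cookie it stored at write time.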
Comment 10 Kirill Smelkov 2008-06-29 09:47:14 UTC
Guys, I understand there are technical challenges in this issue, but maybe

  Something Could Be Done?

I think this issue should be on the list of major ones -- people usually convert
their svn repos with hg and git and compare sizes to see which DVCS to use.

And you know, because of this issue hg often loses.
Comment 11 Dirkjan Ochtman 2009-02-27 08:44:08 UTC
So here's an idea: discard the idea of redoing historical renames. People who
want to do efficient renames for their history will have to do a full hg-to-hg
conversion and work from there. Only future renames are supported. Would that be
acceptable? It would open up a whole host of options.
Comment 12 Matt Mackall 2009-02-27 11:56:21 UTC
djc: It would be the first format change that older versions would be unable to
pull from.

That means a MAJOR flag day. Keep in mind that we're regularly hearing from
people running 0.9.5 and there are operating systems that have just been
released containing 1.0.1. We really don't ever want to break old versions.

And that still leaves us with the large conceptual hurdle: cross-revlog hash
Comment 13 Matt Mackall 2009-02-27 13:08:24 UTC
I've summarized the plan I've outlined here at:

Comment 14 Jesse Glick 2009-02-27 14:35:21 UTC
Agreed with mpm. In my case, we have a multi-100MB repo with >100k revs in
active use across at least a dozen public clones, by dozens of developers on
several continents using an unknown mixture of Hg client versions, with rev
hashes referred to in numerous public documents and issue reports. We would be
unlikely to ever undertake a repo conversion unless moving to another SCM or (in
the absence of shallow/narrow clones) splitting the repo into smaller pieces.

Ideally, upon release of an Hg version supporting cheap renames, we would
convert the server clones over a few maintenance hours during the weekend
sometime, using whatever tool was recommended; and then recommend to developers
that at their leisure they get the new version and either make fresh clones or
convert their local clones. Wire compatibility is not absolutely essential if
hashes are preserved - we could wait to change format for, say, a year after the
version of Hg with cheap rename support was released, so that everyone gets a
chance to upgrade - but certainly desirable.
Comment 15 Arne Babenhauserheide 2009-03-02 07:18:34 UTC
When renames can use pointers, could similar pointers also be used for shallow
copies, so past revisions can be loaded lazily? 

Actually a shallow copy only needs the data since the last snapshot, so
requesting earlier revisions could trigger a second request for the data,
similar to cheap renames, but in that case for downloading it and then
rechecking the missing changeset.
Comment 16 Dirkjan Ochtman 2009-03-02 07:38:29 UTC
ArneBab, no, that doesn't make sense.

And, it's entirely off-topic for this issue.
Comment 17 Vsevolod Solovyov 2009-06-21 20:14:15 UTC
I'm working on it as it's my GSoC project.
Comment 18 Matt Mackall 2009-07-01 19:37:13 UTC
Degrading to bug.
Comment 19 Martin Geisler 2009-07-11 21:26:10 UTC
Progress report:


(for those who are not following the mailing list)
Comment 20 Matt Mackall 2010-01-01 18:24:26 UTC
No longer in progress
Comment 21 Jose Miguel Hernandez Miramontes 2010-01-09 06:21:25 UTC

I apologize if this was discussed before.

The extra space is used because a new file is created in the
repo to track the future history of the renamed file. It repeats the
data to preserve history, more or less, if I understood correctly.

I read here http://mercurial.selenic.com/wiki/RenameSpaceSavingPlan
that there are a lot of complications in handling revlogs as deltas (they
are no longer self-contained).

Is there a way to work around this with another approach? For example:

Instead of creating deltas, the space could be saved by
compressing/packing the related revlogs together and keeping them compressed.
It could be a new operation of the filelog (the filelog tracks the
renames?) to decompress/compress the revlogs.

As all the history is in one revlog, maybe only one of them would
need to be uncompressed at a specific time.

I don't know if this makes sense, but I was thinking that it might be
easier to implement while keeping backward compatibility, as the revlogs'
content will not change, so the hashes do not need to change. It is just
one extra layer.

What do you think?
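
Whether packing related revlogs together actually recovers the space depends on the compressor being able to see the duplicate. A rough experiment, using Python's standard `zlib` and `lzma` modules on synthetic incompressible data (a stand-in for something like a kernel image, not a real revlog), suggests that deflate's 32 KiB window is too small to match a second copy of a large file, while a compressor with a multi-megabyte dictionary can:

```python
import lzma
import random
import zlib

SIZE = 200_000  # bytes; arbitrary, but well beyond deflate's 32 KiB window

# Deterministic pseudo-random bytes so the run is repeatable.
rng = random.Random(0)
blob = rng.getrandbits(8 * SIZE).to_bytes(SIZE, "big")
packed = blob + blob  # "f.d" and "g.d" holding the same content

zlib_single = len(zlib.compress(blob))
zlib_packed = len(zlib.compress(packed))
lzma_single = len(lzma.compress(blob))
lzma_packed = len(lzma.compress(packed))

print("zlib: one copy %d bytes, two copies %d bytes" % (zlib_single, zlib_packed))
print("lzma: one copy %d bytes, two copies %d bytes" % (lzma_single, lzma_packed))
```

So a packing layer along these lines would probably need a different compressor, or explicit deduplication, to help with large renamed files.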

Comment 22 Eugene 2010-07-10 11:01:00 UTC
My 2c: Version compatibility would not be possible anyway when repos are on
a shared folder or when they're copied as files. Both of these methods are
appealing due to ease/simplicity.

Also, the impact of preserving compatibility on Mercurial's relatively clean
code/design should be considered. IMHO it's quite valuable. Plus, I'm sure
some admins would prefer to force an upgrade than have increased io/cpu
usage on their servers.

If there are optional default-off optimisations as part of, perhaps,
"Mercurial 2" people can choose if and when to force an upgrade.

I myself need this for a proposed project which would store mp3 podcasts in
hg to version-control their ID3 tags and filenames. Mercurial is great
except for the renames part.
Comment 23 gidyn 2012-05-01 06:00:25 UTC
The usual response when this is brought up on a mailing list is that a
compressed text file doesn't take up much space. This obviously doesn't
apply to binary resources, but doesn't always apply to text files either. If
someone does a lot of refactoring, it is indeed possible that rename/move
copying will take up more space in the repository than actual changesets.
Comment 24 Bugzilla 2012-05-12 12:46:14 UTC

--- Bug imported by bugzilla@serpentine.com 2012-05-12 08:46 EDT  ---

This bug was previously known as _bug_ 883 at http://mercurial.selenic.com/bts/issue883
Comment 25 Martin F 2013-01-29 09:44:37 UTC
Are there any plans for fixing this issue?
Comment 26 Matt Mackall 2013-01-29 10:01:58 UTC
There are plans:


But plans do not translate to development resources or timetables.