In general, a Rope is a heavyweight string; whereas a string is a simple array of contiguous storage, a Rope is much more complicated and is typically a tree whose leaves are contiguous strings.
Boost.Text has three rope types: segmented_vector
, unencoded_rope
,
and rope
.
segmented_vector
is not a proper rope, in that it is intended for non-character element types.
rope
is part of the Boost.Text's text
layer, and so deals with
Unicode quite a bit. To keep things simple, the discussion below sticks to
unencoded_rope
.
Everything described there applies to segmented_vector
and rope
too, of course.
The user's view of Boost.Text's unencoded_rope
is that it is
a string that is inexpensive to insert into at any point -- even the middle
or front -- no matter how large it is. In addition, substrings within the
unencoded_rope
have copy-on-write semantics. This means that often a copy of a very large
unencoded_rope
is nearly free, and that a mutated copy of an unencoded_rope
often shares much
of its data with the unencoded_rope
from which it
was copied.
unencoded_rope
's
implementation uses a tree structure similar to a B-tree, and each subtree
is copy-on-write. The tree consists of interior nodes that contain structural
information, and leaf nodes which contain the data. unencoded_rope
only contains
a single pointer to a root node. Here is one of the simplest nonempty unencoded_ropes
you can have:
This unencoded_rope
is just a single leaf node, containing the super-useful string "text"
. In this case, the leaf was
a std::string
node, labeled here with s
.
There can also be nodes that are references to std::string
nodes, labeled as ref
. More
on that in a bit.
If we copy the unencoded_rope
above, we get something like this:
No copying was done, nor allocations performed. Copying an entire tree only
requires copying a pointer and incrementing a reference count. The string in
this std::string
leaf is only four elements, but let's suppose it was megabytes instead. Inserting
even a small sequence near the beginning of the string would be costly, since
we'd have to copy all the elements after the insertion point. And since the
std::string
leaf may be shared, we would have to copy everything before
the insertion, too! For large std::string
leaves, insertions into them result in something like this:
Note | |
---|---|
This is not what actually happens to small
|
Here, we see that we've made two new reference nodes, each of which refers
to part of the original std::string
,
and we've put the inserted string in between, forming "text
text"
, the best string.
And now we see our first interior node. Each interior node has a key and a
pointer for each child, and a fixed maximum number of children. Each key Ki
is the cumulative size of all children
in the range [0, i]
.
To make the images smaller, each interior node in these diagrams has a maximum of four children (the actual number is an implementation detail, but it's a lot higher).
Now let's take a look at a more complicated unencoded_rope
:
In this tree, we have three interior nodes: the root and its two children. We also have at least one of each type of leaf node.
Copying even this unencoded_rope
is very cheap,
requiring only copying a pointer and incriminating a reference count:
Let's say we wanted to erase part of this unencoded_rope
. We'll erase a
substring that matches a whole leaf exactly, to keep things simpler. Let's
erase the ref
node on the left:
Simple, right? In this case, we did the erasure by creating a copy of each
node on the path from the root to the erased ref
leaf, and just referring to all the other nodes that did not change. Again,
most of the string data and even interior nodes are shared among the three
ropes in the diagram. This same principle applies to insert()
, erase()
, and replace()
.
Note | |
---|---|
We don't make new nodes every time we need to do an insert, erase, or other
mutation. If the reference counts on the root-to-leaf path are all |
Besides the segmented nature of unencoded_rope
, there are some
other properties you need to know about. An unencoded_rope
as a whole is
copy-on-write; it is also therefore thread-safe for reading and writing; it
is not null-terminated.
Insertions, erasures, and indexing each happen in O(log_B(N))
time,
where B is the branching factor (maximum number of children per interior node).
As mentioned before, B is an implementation detail, and thus subject to change,
but it is unlikely to be less than 16. That means very little pointer chasing
is required to access an arbitrary element of an unencoded_rope
, because its tree
is very shallow.
In fact, with a branching factor of 16, the maximum depth of the tree is also
16, since the max_size()
of the tree is PTRDIFF_MAX
.
This means that O(log_B(N))
time has a fixed upper bound, making it
O(1)
!
As mentioned previously, mutations to an unencoded_rope
are done in-place
if the mutations do not affect nodes that are shared with other unencoded_ropes
.
If a mutation can be done in place, and the point of mutation falls within
a std::string
leaf, the std::string
object is directly mutated if: the inserted string fits within the capacity()
of the std::string
,
or
std::string
object is not too large (there is a maximum size that is an implementation
detail, but is likely to be hundreds or thousands, not tens or tens of
thousands), and
std::string
).
Important | |
---|---|
One more thing. |
unencoded_rope
contains logic like the pseudocode below to determine when to create a partial
copy of an unencoded_rope
's
tree, or when to mutate the tree in-place when performing inserts or erases
on an unencoded_rope
:
path = find_path_from_root_to_mutation_point() in_place = true for (node : path) if (node.references() == 1) in_place = false
Let's call the unencoded_rope
R
, and the thread on which
the operation is being performed T
.
Also, assume that all references to R
's
root node exist in copies of R
,
and no reference (C++ reference/pointer) to R
or any of its copies exists.
With this assumption, we know at the end of the pseudocode for
loop above, that there is still exactly one reference to each node in path
. For this not to be true, we would need
to either:
T
.
Clearly we cannot, since we're currently executing insert()
or erase()
instead.
T
. We can't do this either, since we
have no reference to R
that can be used by another thread, and we know that there are no other
copies on any other threads either — since each node's reference
count is 1
.
Were we to write references/pointers to R
into our code, a thread other than T
could create a copy of R
between
when we read the reference count of one of its nodes, and when we decided to
mutate R
in place or not.
Important | |
---|---|
Therefore, if you care about using |
Note | |
---|---|
A major use case for |