Design¶
Given any two tree-like data structures, let’s call them oldtree
and newtree
,
that consist of dict-like nodes with node_id
-like and content_id
-like attributes.
Tree diff format¶
A tree diff is a dictionary that describes four types of edits:
nodes_deleted
: list of nodes that were present (bynode_id
) inoldtree
and absent in thenewtree
(bynode_id
)nodes_added
: list of nodes added to the tree (includes both nodes ADD and COPY actions)nodes_moved
: nodes in thenewtree
that have the samecontent_id
as a node in theoldtree
by a newnode_id
has changed. If multiple nodes satisfy this criterion, the only the first node is retuned (in tree-order).nodes_modified
: to be included in this list, a node in thenewtree
must have the samecontent_id
as a node in theoldtree
and modified attributes.
Note a node may appear in more than one of the above lists. For example a node that was moved from location X in the tree to location Y in the tree, and whose title was edited will appear in all four lists:
- in the
nodes_added
list because it now appears in location Y - in the
nodes_deleted
list because it was removed from location X - in the
nodes_moved
list since we can detect the node has the samecontent_id
- in the
nodes_modified
list because of the title edit
For more info, see the detailed diff examples in Tree diffs.
Technical notes¶
Invariant: the information in a tree diff treediff(oldtree, newtree)
, when
applied as a “patch” to the oldtree
should produce the newtree
.
Data format: each of the “diff items” in the four lists includes additional
structural annotations like parent_id
and all node attributes like title,
description, files, assessment items, tags, etc., even if they haven’t changed
to allow for easy display of diffs and post-processing tasks.
Diff limitations¶
- Will not recognize assessment items that are moved between exercises
- Nodes from the
oldtree
can be duplicated in several places in thenewtree
. The first of these uses will be recognized as a move, while all others will be recognized as if the nodes were added. A more appropriate way to do this would be to recognize them asnodes_copied
(a special type of added, where a node with the same content_id in theoldtree
exists). - The logic modified and moved logic assumes
content_id
is available for all nodes (topic nodes and content nodes). Need to watch out to make sure this assumption is valid in all cases where we want to use this (ricecooker, studio and Koilbri, and if necessary relax this in future).
External API (High Level)¶
The API for this library is to call the treediff
function which has signature:
treediff(oldtree, newtree, preset=None, format="simplified", # HL
attrs=None, exclude_attrs=[], mapA={}, mapB={}, # LL API
assessment_items_key='assessment_items', setlike_attrs=['tags']) # LL API
The line tagged with HL
is the “high level” API for the library, where users
just need to specify the two trees they need diffed (e.g. ricecooker
, studio
,
or kolibri
), which will set the appropriate values for the low-level API kwargs.
The raw
format returns the diff in the “internal representation” described above
that includes duplication (primarily used for debugging), the simplified
format
(default) removed moved nodes from the added and deleted lists but keeps all results
as flat lists. The restructured
format tries to return the diff as “chunks” of
tree (i.e. indicate an addition of topic + 3 children as one addition, instead
of four separate additions).
Internal API (Low Level)¶
The functions the treediff
module are named diff_
and accept the following
standard set of keyword arguments, which we’ll call the “low level” API:
LL = dict(
attrs=None, exclude_attrs=[], mapA={}, mapB={}, # LL API
assessment_items_key='assessment_items', setlike_attrs=['tags']) # LL API
)
attrs
(list(str)): if specified, only these attributes will be check for differences the default value is set toNone
which means all values will be checked.exclude_attrs
(list(str)): do not check these attributes, e.g. for cases where we expect them to be different and we want to avoid false positives.mapA
(dict): map from the common diff attributes to attributes nodes intreeA
.mapB
(dict): map from the common diff attributes to attributes nodes intreeB
.assessment_items_key
(str): specify where to look for assessment items, in Studio and Kolibri leave the default value. In ricecooker, set toquestions
.setlike_attrs
(list(str)): treat these attributes as sets (i.e. order doesn’t matter)
Note: in general users should not need to set the low level API params and can
instead rely on one of the presets, which will set the appropriate keyword args,
e.g. using preset="studio"
is equivalent to call with **studio_preset_kwargs
.
Attribute maps¶
The arguments mapA
and mapB
are used to make the same diff logic work for
trees that have different attributes names. The internal diff logic always works
will look for the following standard attributes (derived from Studio data model):
- Nodes:
node_id
: an unique identifier that represents the node’s position within the treeparent_id
: the node ID of the parent nodecontent_id
: a persistent identifier for the content of the node (allows us to detect moves)sort_order
: position within parent
- Assessment items:
assessment_id
: like content_id for assessment items (allows us to detect moves)order
: sort order of the question within the exercise
Example attribute map used to map the “standard attributes” to the attributes used by ricecooker JSON trees:
ricecooker_map = {
# channel attrs
"root.node_id": "id", # a.k.a. channel_id
"root.content_id": "source_id", # unique identifier within source_domain
#
# node attrs
"license_name": "license.license_id",
"license_description": "license.description",
"copyright_holder": "license.copyright_holder",
"role_visibility": "role",
}
Note the special root.*
attributes which are used when processing the root node.
Note also the values in the ricecooker_map
use .
-accessor patterns, which
tell to look use the value of the nested dict object.
List of functions that use the low level API¶
If we call the keyword args of the low level API **LL
we can write the entire
functions of the treediffer
library as follows:
# PHASE 1: diffing
def diff_subtree(parent_idA, nodeA, parent_idB, nodeB, **LL):
"Compute the changes between the node `nodeA` in the old tree and the..."
def diff_attributes(nodeA, nodeB, **LL):
"Compute the diff between the attributes of `nodeA` and `nodeB`.
def diff_files(listA, listB):
"Compute the diff of two lists for files, treating them as set-like.""
def diff_assessment_items(listA, listB, **LL):
"Compute the diff between the lists of assessment items `listA` and `listB`,
def diff_children(parent_idA, childrenA, parent_idB, childrenB, **LL):
"Compute the diff between the nodes in `childrenA` and the nodes in `childrenB`."
# PHASE 2: move detection
def detect_moves(nodes_deleted, nodes_added):
"Look for nodes with the same `content_id` that appear in both lists, ..."
# PHASE 3: simplification
def simplify_diff(raw_diff):
# PHASE 4: restructuring
def restructure_diff(simplified_diff):
The “main” work happens in diff_subtree
which computes the diff of node attributes
and recursively calls diff_subtree on all children down the tree (PHASE 1).
Node moves are detected are detected as a post-processing step (PHASE 2),
and so are the format conversions for presentation needs (PHASE 3 and PHASE 4).