Design¶
Given any two tree-like data structures, let’s call them oldtree and newtree,
that consist of dict-like nodes with node_id-like and content_id-like attributes.
Tree diff format¶
A tree diff is a dictionary that describes four types of edits:
nodes_deleted: list of nodes that were present (bynode_id) inoldtreeand absent in thenewtree(bynode_id)nodes_added: list of nodes added to the tree (includes both nodes ADD and COPY actions)nodes_moved: nodes in thenewtreethat have the samecontent_idas a node in theoldtreeby a newnode_idhas changed. If multiple nodes satisfy this criterion, the only the first node is retuned (in tree-order).nodes_modified: to be included in this list, a node in thenewtreemust have the samecontent_idas a node in theoldtreeand modified attributes.
Note a node may appear in more than one of the above lists. For example a node that was moved from location X in the tree to location Y in the tree, and whose title was edited will appear in all four lists:
- in the
nodes_addedlist because it now appears in location Y - in the
nodes_deletedlist because it was removed from location X - in the
nodes_movedlist since we can detect the node has the samecontent_id - in the
nodes_modifiedlist because of the title edit
For more info, see the detailed diff examples in Tree diffs.
Technical notes¶
Invariant: the information in a tree diff treediff(oldtree, newtree), when
applied as a “patch” to the oldtree should produce the newtree.
Data format: each of the “diff items” in the four lists includes additional
structural annotations like parent_id and all node attributes like title,
description, files, assessment items, tags, etc., even if they haven’t changed
to allow for easy display of diffs and post-processing tasks.
Diff limitations¶
- Will not recognize assessment items that are moved between exercises
- Nodes from the
oldtreecan be duplicated in several places in thenewtree. The first of these uses will be recognized as a move, while all others will be recognized as if the nodes were added. A more appropriate way to do this would be to recognize them asnodes_copied(a special type of added, where a node with the same content_id in theoldtreeexists). - The logic modified and moved logic assumes
content_idis available for all nodes (topic nodes and content nodes). Need to watch out to make sure this assumption is valid in all cases where we want to use this (ricecooker, studio and Koilbri, and if necessary relax this in future).
External API (High Level)¶
The API for this library is to call the treediff function which has signature:
treediff(oldtree, newtree, preset=None, format="simplified", # HL
attrs=None, exclude_attrs=[], mapA={}, mapB={}, # LL API
assessment_items_key='assessment_items', setlike_attrs=['tags']) # LL API
The line tagged with HL is the “high level” API for the library, where users
just need to specify the two trees they need diffed (e.g. ricecooker, studio,
or kolibri), which will set the appropriate values for the low-level API kwargs.
The raw format returns the diff in the “internal representation” described above
that includes duplication (primarily used for debugging), the simplified format
(default) removed moved nodes from the added and deleted lists but keeps all results
as flat lists. The restructured format tries to return the diff as “chunks” of
tree (i.e. indicate an addition of topic + 3 children as one addition, instead
of four separate additions).
Internal API (Low Level)¶
The functions the treediff module are named diff_ and accept the following
standard set of keyword arguments, which we’ll call the “low level” API:
LL = dict(
attrs=None, exclude_attrs=[], mapA={}, mapB={}, # LL API
assessment_items_key='assessment_items', setlike_attrs=['tags']) # LL API
)
attrs(list(str)): if specified, only these attributes will be check for differences the default value is set toNonewhich means all values will be checked.exclude_attrs(list(str)): do not check these attributes, e.g. for cases where we expect them to be different and we want to avoid false positives.mapA(dict): map from the common diff attributes to attributes nodes intreeA.mapB(dict): map from the common diff attributes to attributes nodes intreeB.assessment_items_key(str): specify where to look for assessment items, in Studio and Kolibri leave the default value. In ricecooker, set toquestions.setlike_attrs(list(str)): treat these attributes as sets (i.e. order doesn’t matter)
Note: in general users should not need to set the low level API params and can
instead rely on one of the presets, which will set the appropriate keyword args,
e.g. using preset="studio" is equivalent to call with **studio_preset_kwargs.
Attribute maps¶
The arguments mapA and mapB are used to make the same diff logic work for
trees that have different attributes names. The internal diff logic always works
will look for the following standard attributes (derived from Studio data model):
- Nodes:
node_id: an unique identifier that represents the node’s position within the treeparent_id: the node ID of the parent nodecontent_id: a persistent identifier for the content of the node (allows us to detect moves)sort_order: position within parent
- Assessment items:
assessment_id: like content_id for assessment items (allows us to detect moves)order: sort order of the question within the exercise
Example attribute map used to map the “standard attributes” to the attributes used by ricecooker JSON trees:
ricecooker_map = {
# channel attrs
"root.node_id": "id", # a.k.a. channel_id
"root.content_id": "source_id", # unique identifier within source_domain
#
# node attrs
"license_name": "license.license_id",
"license_description": "license.description",
"copyright_holder": "license.copyright_holder",
"role_visibility": "role",
}
Note the special root.* attributes which are used when processing the root node.
Note also the values in the ricecooker_map use .-accessor patterns, which
tell to look use the value of the nested dict object.
List of functions that use the low level API¶
If we call the keyword args of the low level API **LL we can write the entire
functions of the treediffer library as follows:
# PHASE 1: diffing
def diff_subtree(parent_idA, nodeA, parent_idB, nodeB, **LL):
"Compute the changes between the node `nodeA` in the old tree and the..."
def diff_attributes(nodeA, nodeB, **LL):
"Compute the diff between the attributes of `nodeA` and `nodeB`.
def diff_files(listA, listB):
"Compute the diff of two lists for files, treating them as set-like.""
def diff_assessment_items(listA, listB, **LL):
"Compute the diff between the lists of assessment items `listA` and `listB`,
def diff_children(parent_idA, childrenA, parent_idB, childrenB, **LL):
"Compute the diff between the nodes in `childrenA` and the nodes in `childrenB`."
# PHASE 2: move detection
def detect_moves(nodes_deleted, nodes_added):
"Look for nodes with the same `content_id` that appear in both lists, ..."
# PHASE 3: simplification
def simplify_diff(raw_diff):
# PHASE 4: restructuring
def restructure_diff(simplified_diff):
The “main” work happens in diff_subtree which computes the diff of node attributes
and recursively calls diff_subtree on all children down the tree (PHASE 1).
Node moves are detected are detected as a post-processing step (PHASE 2),
and so are the format conversions for presentation needs (PHASE 3 and PHASE 4).