Contents¶
Diff all the things¶
Diffs are very useful when working with content, especially at scale.
Studio librarians, admins, and LE staff need diffs to keep track of content changes. Curators need diffs to see what new materials are added when content channels are updated. Teachers and learners operating in the “we’re running out of materials” mode can watch the diffs to see a list of only the new things added in the current update.
JSON trees¶
Tree structures are ubiquitous in the data model throughout the Kolibri platform, and each of these tree structures can be serialized in the form of a JSON tree:
- T0: Ricecooker web archives format? (to know if source website has changed and we need to re-run the chef)
- T1: Ricecooker json input format (debug cheffing code, record input paths before processing)
- T2: Ricecooker->Studio wire format (archive chef runs, see detailed channel diff before uploading to studio)
- T3: Studio json tree format (compute diff(main tree,staged tree) before DEPLOY click)
- T4: Kolibri DB wire format sqlite3 files (show detailed diff info on UPDATE CHANNEL screen)
- T5: Kolibri tree format (show detailed diff info on IMPORT MORE screen)
JSON trees for diffs¶
Main focus of this project is T2 diffs in ricecooker, and T3 diffs for reviewing changes between staging tree and main tree in Studio.
JSON trees for PUBLISH (a.k.a. EXPORT)¶
The Studio channel export task needs the information about the studio main tree that is being published which can be “serialized” as a tree T3, with no further need for the export code to “touch” the Studio DB (but may need interaction with GCP content bucket)
The T3 tree format has several uses:
- A1 = representation format for doing diffs [see diff(main tree,staged tree) above]
- A2 = archival (save studio json tree as “snaphots” or “staged versions”, archive full studio tree data on PUBLISH version, could put in git)
- A3 = share export code between ricecooker and studio (single path for all the Kolibri backward compatibility code)
- A4 = publish staging tree (a Studio->Kolibri side-channel for importing a special “draft” channels exported from the staging tree, e.g. In Kolibri import from https://drafts.studio.leq.org/content/databases/{ch_id}.sqlite3 == PUBLISH({ch_id}.staging_tree)
Specs¶
Business Requirements Specification¶
- Know what has changed between two channel “states” (e.g. diff(staged,main) studio trees)
- Desirable changes to content channels propagate quickly through the content pipeline
- Unnecessary and undesirable changes are stopped and corrected at an early state of the pipeline
- Is able to process large trees like Khan Academy (think 500MB of uncompressed JSON)
Software Requirements Specification¶
Can be subdivided into essential functionality, optional, and stretch goals.
The essential objectives are:
- Produce a “summary diff” that contains only the counts of added/removed/moved/modified
- Given any two trees (ricecooker, studio, kolibri) produce the detailed diff information about the differences (added/removed/moved/modified) between the two trees
Optional (depending on frontend needs):
- Remove redundacy in diff format (e.g. show only node move instead of node add, node delete, and node move)
- Post-process the diff format to make it most convenient for presenting to users
Stretch goals:
- Post-process the diff to make a minimum description length and avoid information overload (e.g. instead of showing 30 content nodes added, display the diff as the action of adding a single topic node)
- Support kinds of trees:
- JSON: ricecooker, studio wire format, studio trees
- Django ORM: Studio, Kolibri
- Basic ORM: standalone script processing of kolibri sqlite3 DB files
Design¶
Given any two tree-like data structures, let’s call them oldtree
and newtree
,
that consist of dict-like nodes with node_id
-like and content_id
-like attributes.
Tree diff format¶
A tree diff is a dictionary that describes four types of edits:
nodes_deleted
: list of nodes that were present (bynode_id
) inoldtree
and absent in thenewtree
(bynode_id
)nodes_added
: list of nodes added to the tree (includes both nodes ADD and COPY actions)nodes_moved
: nodes in thenewtree
that have the samecontent_id
as a node in theoldtree
by a newnode_id
has changed. If multiple nodes satisfy this criterion, the only the first node is retuned (in tree-order).nodes_modified
: to be included in this list, a node in thenewtree
must have the samecontent_id
as a node in theoldtree
and modified attributes.
Note a node may appear in more than one of the above lists. For example a node that was moved from location X in the tree to location Y in the tree, and whose title was edited will appear in all four lists:
- in the
nodes_added
list because it now appears in location Y - in the
nodes_deleted
list because it was removed from location X - in the
nodes_moved
list since we can detect the node has the samecontent_id
- in the
nodes_modified
list because of the title edit
For more info, see the detailed diff examples in Tree diffs.
Technical notes¶
Invariant: the information in a tree diff treediff(oldtree, newtree)
, when
applied as a “patch” to the oldtree
should produce the newtree
.
Data format: each of the “diff items” in the four lists includes additional
structural annotations like parent_id
and all node attributes like title,
description, files, assessment items, tags, etc., even if they haven’t changed
to allow for easy display of diffs and post-processing tasks.
Diff limitations¶
- Will not recognize assessment items that are moved between exercises
- Nodes from the
oldtree
can be duplicated in several places in thenewtree
. The first of these uses will be recognized as a move, while all others will be recognized as if the nodes were added. A more appropriate way to do this would be to recognize them asnodes_copied
(a special type of added, where a node with the same content_id in theoldtree
exists). - The logic modified and moved logic assumes
content_id
is available for all nodes (topic nodes and content nodes). Need to watch out to make sure this assumption is valid in all cases where we want to use this (ricecooker, studio and Koilbri, and if necessary relax this in future).
External API (High Level)¶
The API for this library is to call the treediff
function which has signature:
treediff(oldtree, newtree, preset=None, format="simplified", # HL
attrs=None, exclude_attrs=[], mapA={}, mapB={}, # LL API
assessment_items_key='assessment_items', setlike_attrs=['tags']) # LL API
The line tagged with HL
is the “high level” API for the library, where users
just need to specify the two trees they need diffed (e.g. ricecooker
, studio
,
or kolibri
), which will set the appropriate values for the low-level API kwargs.
The raw
format returns the diff in the “internal representation” described above
that includes duplication (primarily used for debugging), the simplified
format
(default) removed moved nodes from the added and deleted lists but keeps all results
as flat lists. The restructured
format tries to return the diff as “chunks” of
tree (i.e. indicate an addition of topic + 3 children as one addition, instead
of four separate additions).
Internal API (Low Level)¶
The functions the treediff
module are named diff_
and accept the following
standard set of keyword arguments, which we’ll call the “low level” API:
LL = dict(
attrs=None, exclude_attrs=[], mapA={}, mapB={}, # LL API
assessment_items_key='assessment_items', setlike_attrs=['tags']) # LL API
)
attrs
(list(str)): if specified, only these attributes will be check for differences the default value is set toNone
which means all values will be checked.exclude_attrs
(list(str)): do not check these attributes, e.g. for cases where we expect them to be different and we want to avoid false positives.mapA
(dict): map from the common diff attributes to attributes nodes intreeA
.mapB
(dict): map from the common diff attributes to attributes nodes intreeB
.assessment_items_key
(str): specify where to look for assessment items, in Studio and Kolibri leave the default value. In ricecooker, set toquestions
.setlike_attrs
(list(str)): treat these attributes as sets (i.e. order doesn’t matter)
Note: in general users should not need to set the low level API params and can
instead rely on one of the presets, which will set the appropriate keyword args,
e.g. using preset="studio"
is equivalent to call with **studio_preset_kwargs
.
Attribute maps¶
The arguments mapA
and mapB
are used to make the same diff logic work for
trees that have different attributes names. The internal diff logic always works
will look for the following standard attributes (derived from Studio data model):
- Nodes:
node_id
: an unique identifier that represents the node’s position within the treeparent_id
: the node ID of the parent nodecontent_id
: a persistent identifier for the content of the node (allows us to detect moves)sort_order
: position within parent
- Assessment items:
assessment_id
: like content_id for assessment items (allows us to detect moves)order
: sort order of the question within the exercise
Example attribute map used to map the “standard attributes” to the attributes used by ricecooker JSON trees:
ricecooker_map = {
# channel attrs
"root.node_id": "id", # a.k.a. channel_id
"root.content_id": "source_id", # unique identifier within source_domain
#
# node attrs
"license_name": "license.license_id",
"license_description": "license.description",
"copyright_holder": "license.copyright_holder",
"role_visibility": "role",
}
Note the special root.*
attributes which are used when processing the root node.
Note also the values in the ricecooker_map
use .
-accessor patterns, which
tell to look use the value of the nested dict object.
List of functions that use the low level API¶
If we call the keyword args of the low level API **LL
we can write the entire
functions of the treediffer
library as follows:
# PHASE 1: diffing
def diff_subtree(parent_idA, nodeA, parent_idB, nodeB, **LL):
"Compute the changes between the node `nodeA` in the old tree and the..."
def diff_attributes(nodeA, nodeB, **LL):
"Compute the diff between the attributes of `nodeA` and `nodeB`.
def diff_files(listA, listB):
"Compute the diff of two lists for files, treating them as set-like.""
def diff_assessment_items(listA, listB, **LL):
"Compute the diff between the lists of assessment items `listA` and `listB`,
def diff_children(parent_idA, childrenA, parent_idB, childrenB, **LL):
"Compute the diff between the nodes in `childrenA` and the nodes in `childrenB`."
# PHASE 2: move detection
def detect_moves(nodes_deleted, nodes_added):
"Look for nodes with the same `content_id` that appear in both lists, ..."
# PHASE 3: simplification
def simplify_diff(raw_diff):
# PHASE 4: restructuring
def restructure_diff(simplified_diff):
The “main” work happens in diff_subtree
which computes the diff of node attributes
and recursively calls diff_subtree on all children down the tree (PHASE 1).
Node moves are detected are detected as a post-processing step (PHASE 2),
and so are the format conversions for presentation needs (PHASE 3 and PHASE 4).
Tree diff formats¶
The functions in treediffer
are used to find differences between two trees.
The resulting diff contains both structural information about the nodes’ position
in the tree as well as the nodes’ attributes.
Example of node deleted¶
A node deletion appears under the key nodes_deleted.{node_id}
and contains the
structural info where the deletion occurred, and the attributes of the node that was deleted:
nodes_deleted = [
{
"old_node_id": (str),
"old_parent_id": (str),
"old_sort_order": (float),
"content_id": (content_id),
"attributes": {
"title": {"value": "The title of the deleted node"},
"description": {"value": "The description of the node that was deleted"},
"content_id": {"value": (str)},
"sort_order": {"value": (float)},
... }
},
{},
...,
{},
]
To apply this deletion patch, find content node
c = ContenNode.filter(tree_id=...).get(node_id=deleted['node_id'])
, and its parent
parent_node = ContenNode.filter(tree_id=...).get(node_id=deleted['old_parent_id'])
,
then call parent_node.children.remove(c)
. The attributes
dict is provided
for information purposes only (e.g. to display info about the node being deleted in the UI).
Example of node added¶
A node added to the tree is appears under the key nodes_added.{node_id}
and is
represented by the following dict:
nodes_added = [
{
"parent_id": (node_id),
"node_id": (node_id),
"sort_order": (float),
"content_id": (content_id),
"attributes": {
"title": {"value": "The title of the new node"},
"description": {"value": "The description of the new node"},
"content_id": {"value": (str)},
"sort_order": {"value": (float)},
... }
},
{},
...,
{},
]
To apply this “patch,” create a new c = ContentNode(**map(..., attributes))
,
find the parent node where the new node is to be added,
parent_node = ContenNode.filter(tree_id=...).get(node_id=added['parent_id'])
,
then call parent_node.children.add(c)
.
Note: when applying a patch with multiple node additions, watch out for sort order
in systems that don’t know how to handle sort_order
(ricecooker).
Example of node moved¶
Nodes in the newtree
that have the same content_id
as a node in the oldtree
but whose node_id
has changed can be interpreted as moves. If multiple nodes
satisfy this criterion, the only the first node is treated as a move (in tree-order)
while others “node clones” of the same content_id
are recorded as additions.
nodes_moved = [
{
"content_id": (content_id),
"node_id": (node_id),
"old_node_id": (node_id),
"parent_id": (node_id),
"old_parent_id": (node_id),
"sort_order": (float),
"old_sort_order": (float),
"attributes": {
"title": {"value": "The title of the new node"},
"description": {"value": "The description of the new node"},
"content_id": {"value": (str)},
"sort_order": {"value": (float), "old_value": (== old_sort_order)},
...},
},
{},
...,
{},
]
Example of node modified¶
When a node appears in both the “before tree” and the “after tree” in the same
node_id
(position within the tree) but who have different attributes.
These “attribute diffs” are stored under the keys old_value
, value
= new value,
with special handling for set-like attributes (files and tags), and list-like
attributes (assessment_items).
Example data structure:
nodes_modified = [
{
"node_id": (node_id),
"parent_id": (node_id),
"content_id": (content_id),
# intentionally not tracking changes in sort_order since those are defined as moves
"changed": ["title", "sort_order", "tags", "assessment_items"],
"attributes": {
"title":{
"old_value": "Old title",
"value": "The new title for the node"
},
"description": {"value": "The description of the new node"},
"content_id": {"value": (str)},
"sort_order": {"value": (float), "old_value": (== old_sort_order)},
"tags": { # set-like
"old_value": ['oldtag1', 'oldtag2'],
"value": ['oldtag1', 'newtag3', 'newtag4'],
"tags_removed": ['oldtag2'],
"tags_added": ['newtag3', 'newtag4'],
},
"assessment_items": { # list-like
"old_value": [
{"id": "aiid1", "assessment_id": 'q1', ... },
{"id": "aiid2", "assessment_id": 'q2', ... },
{"id": "aiid3", "assessment_id": 'q3', ... }
],
"value": [
{"id": "aiid1", "assessment_id": 'q1', ... },
{"id": "aiid4", "assessment_id": 'q4', ... },
{"id": "aiid3", "assessment_id": 'q3', ... },
],
"deleted": [
{"id": "aiid2", "assessment_id": 'q2', ... },
]
"added": [
{"id": "aiid4", "assessment_id": 'q4', ... },
]
"moved": [],
"modified": [],
},
},
},
{},
...,
{},
]
Node structural annotations¶
old_parent_id |
parent_id |
old_sort_order |
sort_order |
|
---|---|---|---|---|
nodes_deleted |
x | x | ||
nodes_added |
x | x | ||
nodes_moved |
x | x | x | x |
nodes_modified |
Flat diffs¶
Suppose a topic node T1 is added which has two children N1 and N2. Depending on
the use case for the diff, we want to represent this change in different ways.
The raw
and simplified
diff formats correspond to flat lists, so this change
will appear as three separate items in the nodes_added
list:
nodes_added = [
...,
{
"parent_id": (node_id(parentT1)),
"node_id": (node_id(T1)),
"sort_order": (float),
"content_id": (content_id(T1)),
"attributes": {
"title": {"value": "T1"},
"description": {"value": "The description of the new topic node"},
"content_id": {"value": (str)},
"sort_order": {"value": (float)},
... }
},
{
"parent_id": (node_id(T1)),
"node_id": (node_id(N1)),
"sort_order": 1.0,
"content_id": (content_id(N1)),
"attributes": {
"title": {"value": "N1"},
"description": {"value": "The description of the first new node"},
"content_id": {"value": (str)},
"sort_order": {"value": 1.0},
... }
},
{
"parent_id": (node_id(T1)),
"node_id": (node_id(N2)),
"sort_order": 2.0,
"content_id": (content_id(N2)),
"attributes": {
"title": {"value": "N2"},
"description": {"value": "The description of the second new node"},
"content_id": {"value": (str)},
"sort_order": {"value": 2.0},
... }
},
...,
]
This is appropriate for counting number of nodes added/deleted/moved/modified.
Restructured diffs¶
In the restructured
diff format we’ll combine these additions into a single
logical addition of the topic, and indicate the presence of the child notes as
children to the top-level topic addition:
nodes_added = [
...,
{
"parent_id": (node_id(parentT1)),
"node_id": (node_id(T1)),
"sort_order": (float),
"content_id": (content_id(T1)),
"attributes": {
"title": {"value": "T1"},
"description": {"value": "The description of the new topic node"},
"content_id": {"value": (str)},
"sort_order": {"value": (float)},
# no children # <<< Note: children treated outside of attributes
... },
"children": [
{
"parent_id": (node_id(T1)),
"node_id": (node_id(N1)),
"sort_order": 1.0,
"content_id": (content_id(N1)),
"attributes": {
"title": {"value": "N1"},
"description": {"value": "The description of the first new node"},
"content_id": {"value": (str)},
"sort_order": {"value": 1.0},
... }
},
{
"parent_id": (node_id(T1)),
"node_id": (node_id(N2)),
"sort_order": 2.0,
"content_id": (content_id(N2)),
"attributes": {
"title": {"value": "N2"},
"description": {"value": "The description of the second new node"},
"content_id": {"value": (str)},
"sort_order": {"value": 2.0},
... }
},
]
},
...,
]
This format would is better suited for displaying the changes in a UI in a compact logical manner to avoid overwhelming users with long lists.