Content Analysis
Author Cell Type annotations more granular than CL
Numbers of author annotations (in loaded datasets) with no direct mapping to a CL term
MATCH (s2:Cell_cluster) where NOT (s2)-[:composed_primarily_of]->(:Cell)
return count (distinct s2)
// = 21953
;
Numbers of author annotations (in loaded datasets) with mapping to CL broader than 1:1
MATCH p=(c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*..]-(s2)
where NOT (s2)-[:composed_primarily_of]->(:Cell) return count (distinct s2)
;
Example - Elementaite - Gut
MATCH (s1:Cell_cluster)-[:has_source]->(ds)
WHERE ds.publication = ['https://doi.org/10.1038/s41586-021-03852-1']
MATCH p=(c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*..]-(s2)
WHERE NOT (s2)-[:composed_primarily_of]->(:Cell)
RETURN p
;
All new or unannotated cell types
MATCH (c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*..]-(s2:Cell_cluster)
WHERE NOT (s2)-[:composed_primarily_of]->(:Cell)
AND NOT (toLower(c.label_rdfs[0])=toLower(s2.label))
AND NOT (s2.label STARTS WITH 'ns_') // BUG workaround
AND NOT (s2.label =~ ".+[0-9]$") // ignore anything ending in a number
AND NOT (toLower(s2.label) =~ ".*unknown.*") // ignore anything with unknown in name
AND NOT (toLower(s2.label) =~ ".*unclassified.*") // ignore anything with unclassified in name
AND NOT (toLower(s2.label) =~ ".*other.*") // ignore anything with other in name
MATCH (s2:Cell_cluster)-[t:tissue]->(:Class) //-[:part_of|SUBCLASSOF*0..]->(anat:Class) // Optional restrict to system
WHERE toFloat(t.percentage[0]) > 5 // and anat.label_rdfs[0] = 'intestine' // Optional restrict to system
MATCH (s2)-[d:disease]->(normal:Disease { label: 'normal'}) where toFloat(d.percentage[0]) > 5
MATCH (s2)-[:has_source]-(ds:Dataset)
RETURN DISTINCT c.label_rdfs[0] as broad_CL_ann, s2.label as unmappped_author_ann, collect(distinct (anat.label)) as source_tissues, toFloat(d.percentage[0]) as percent_normal, ds.title as dataset_title, ds.citation as dataset_details
ORDER BY ds.title, broad_CL_ann
;
=> 3978
AMICA auto-annotator can be used to make a first pass mapping & flag potential missing cell types.
Attemped break down by system:
MATCH (c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*1..]-(s2:Cell_cluster)
WHERE NOT (s2)-[:composed_primarily_of]->(:Cell)
AND NOT (toLower(c.label_rdfs[0])=toLower(s2.label))
AND NOT (s2.label STARTS WITH 'ns_') // BUG workaround
AND NOT (s2.label =~ ".+[0-9]$") // ignore anything ending in a number - assume there are T-types
AND NOT (toLower(s2.label) =~ ".*unknown.*") // ignore anything with unknown in name
AND NOT (toLower(s2.label) =~ ".*unclassified.*") // ignore anything with unclassified in name
AND NOT (toLower(s2.label) =~ ".*other.*") // ignore anything with other in name
MATCH (s2:Cell_cluster)-[t:tissue]->(anat:Class)-[:part_of|SUBCLASSOF*0..]->(sys:Class)-[:SUBCLASSOF]->(as:Class { label: 'anatomical system'})// AND anat.label_rdfs[0] = 'intestine' // Optional restrict to by anatomy
WHERE toFloat(t.percentage[0]) > 5
MATCH (s2)-[d:disease]->(normal:Disease { label: 'normal'}) where toFloat(d.percentage[0]) > 5
MATCH (s2)-[:has_source]-(ds:Dataset)
RETURN count (distinct s2.label), sys.label
;
Not perfect but gives a rough idea of systems with large numbers of new or unannotated cell types
- 1367 "nervous system"
- 44 "ventricular system of central nervous system"
- 112 "respiratory system"
- 209 "digestive system"
- 48 "cardiovascular system"
Marker validation with GO
Few cases of GO validation markers for NS-Forest
MATCH p=(g:Gene)<-[:has_part]-(ms:Class)<-[:has_characterizing_marker_set|has_marker_set]
-(:Cell_cluster)-[]->(c:Cell)-[rg]-(go:Class)<-[]-(:Protein)<-[]-(g)
WHERE 'Biological_process' in labels(go) OR 'Cellular_component' in labels(go)
return c.label as cell_type, g.label as gene, rg.label as cell_go_rel, go.label, ms.label,
CASE
WHEN id(c) = id(startNode(rg)) THEN 'cell -> go'
ELSE 'go -> cell'
END AS direction
;
Much larger number if valildations from other marker sources:
MATCH p=(g:Gene)<-[]->(c:Cell)-[rg]-(go:Class)<-[]-(:Protein)<-[]-(g)
WHERE 'Biological_process' in labels(go) OR 'Cellular_component' in labels(go)
return c.label as cell_type, g.label as gene, rg.label as cell_go_rel, go.label,
CASE
WHEN id(c) = id(startNode(rg)) THEN 'cell -> go'
ELSE 'go -> cell'
END AS direction
ORDER BY cell_type, gene
;
General marker queries
NS-Forest to cell types
MATCH p=(g:Gene)<-[:has_part]-(:Class)<-[hm:has_characterizing_marker_set|has_marker_set]-(c:Cell)
return count (distinct c) as cell_types, count (distinct g) as genes, count (distinct hm) as marker_sets
;
NS-Forest to clusters:
MATCH p=(g:Gene)<-[:has_part]-(:Class)<-[hm:has_characterizing_marker_set|has_marker_set]-(c:Cell_cluster)
return count (distinct c) as cell_types, count (distinct g) as genes, count (distinct hm) as marker_sets
;
Single markers by source:
MATCH p=(g:Gene)<-[hm:has_marker]->(c:Cell)
return count (distinct c) as cell_types, count (distinct g) as genes,
count (distinct hm) as marker_assertions
;