Content Analysis

Author Cell Type annotations more granular than CL

Numbers of author annotations (in loaded datasets) with no direct mapping to a CL term

MATCH (s2:Cell_cluster) where NOT (s2)-[:composed_primarily_of]->(:Cell) 
return count (distinct s2)
// = 21953
;

Numbers of author annotations (in loaded datasets) with no direct CL mapping whose only link to CL is broader than 1:1 (inherited from a parent cluster)

MATCH p=(c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*..]-(s2)  
where NOT (s2)-[:composed_primarily_of]->(:Cell) return count (distinct s2)
;
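
Sketch of a per-dataset breakdown of the same count. It reuses only the has_source relationship and the ds.title property that appear in later queries; the aliases dataset_title and unmapped_subclusters are illustrative, and the variable-length bound is written as *1.. (assumed to be the intent of *..).

MATCH (c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*1..]-(s2:Cell_cluster)
WHERE NOT (s2)-[:composed_primarily_of]->(:Cell)
MATCH (s2)-[:has_source]-(ds:Dataset)
RETURN ds.title AS dataset_title, count(DISTINCT s2) AS unmapped_subclusters
ORDER BY unmapped_subclusters DESC
;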

Example - Elmentaite - Gut

MATCH (s1:Cell_cluster)-[:has_source]->(ds) 
WHERE ds.publication = ['https://doi.org/10.1038/s41586-021-03852-1']
MATCH p=(c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*..]-(s2) 
WHERE NOT (s2)-[:composed_primarily_of]->(:Cell)
RETURN p
;

All new or unannotated cell types

MATCH (c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*..]-(s2:Cell_cluster) 
WHERE NOT (s2)-[:composed_primarily_of]->(:Cell) 
AND NOT (toLower(c.label_rdfs[0])=toLower(s2.label)) 
AND NOT (s2.label STARTS WITH 'ns_') // BUG workaround
AND NOT (s2.label =~ ".+[0-9]$") // ignore anything ending in a number
AND NOT (toLower(s2.label) =~ ".*unknown.*") // ignore anything with unknown in name
AND NOT (toLower(s2.label) =~ ".*unclassified.*") // ignore anything with unclassified in name
AND NOT (toLower(s2.label) =~ ".*other.*") // ignore anything with other in name
MATCH (s2)-[t:tissue]->(anat:Class) //-[:part_of|SUBCLASSOF*0..]->(region:Class) // Optional restrict to system; anat must be bound because it is returned below
WHERE toFloat(t.percentage[0]) > 5 // and region.label_rdfs[0] = 'intestine' // Optional restrict to system
MATCH (s2)-[d:disease]->(normal:Disease { label: 'normal'}) where toFloat(d.percentage[0]) > 5
MATCH (s2)-[:has_source]-(ds:Dataset)
RETURN DISTINCT c.label_rdfs[0] as broad_CL_ann, s2.label as unmapped_author_ann, collect(distinct (anat.label)) as source_tissues, toFloat(d.percentage[0]) as percent_normal, ds.title as dataset_title, ds.citation as dataset_details
ORDER BY ds.title, broad_CL_ann
;

=> 3978
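
Sketch of a more compact form of the label filters, assuming a Neo4j backend where =~ uses Java regular expressions, so the inline flag (?i) makes the match case-insensitive. The three name filters collapse into one alternation; this should be equivalent to the toLower() versions for ASCII labels, but the count has not been re-checked against the figure above. The alias candidate_labels is illustrative.

MATCH (c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*1..]-(s2:Cell_cluster)
WHERE NOT (s2)-[:composed_primarily_of]->(:Cell)
AND NOT (toLower(c.label_rdfs[0]) = toLower(s2.label))
AND NOT (s2.label STARTS WITH 'ns_') // BUG workaround
AND NOT (s2.label =~ '.+[0-9]$') // ignore anything ending in a number
AND NOT (s2.label =~ '(?i).*(unknown|unclassified|other).*') // ignore unknown / unclassified / other
RETURN count(DISTINCT s2.label) AS candidate_labels
;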

The AMICA auto-annotator can be used to make a first-pass mapping and to flag potentially missing cell types.

Attempted breakdown by system:

MATCH (c:Class:Cell)<-[:composed_primarily_of]-(s1)<-[:subcluster_of*1..]-(s2:Cell_cluster) 
WHERE NOT (s2)-[:composed_primarily_of]->(:Cell) 
AND NOT (toLower(c.label_rdfs[0])=toLower(s2.label)) 
AND NOT (s2.label STARTS WITH 'ns_') // BUG workaround
AND NOT (s2.label =~ ".+[0-9]$") // ignore anything ending in a number - assume these are T-types
AND NOT (toLower(s2.label) =~ ".*unknown.*") // ignore anything with unknown in name
AND NOT (toLower(s2.label) =~ ".*unclassified.*") // ignore anything with unclassified in name
AND NOT (toLower(s2.label) =~ ".*other.*") // ignore anything with other in name
MATCH (s2:Cell_cluster)-[t:tissue]->(anat:Class)-[:part_of|SUBCLASSOF*0..]->(sys:Class)-[:SUBCLASSOF]->(:Class { label: 'anatomical system'}) // Optionally restrict by anatomy: add "AND anat.label_rdfs[0] = 'intestine'" to the WHERE below
WHERE toFloat(t.percentage[0]) > 5 
MATCH (s2)-[d:disease]->(normal:Disease { label: 'normal'}) where toFloat(d.percentage[0]) > 5
MATCH (s2)-[:has_source]-(ds:Dataset)
RETURN  count (distinct s2.label), sys.label
;

Not perfect, but gives a rough idea of which systems have large numbers of new or unannotated cell types:

1367 "nervous system"
44 "ventricular system of central nervous system"
112 "respiratory system"
209 "digestive system"
48 "cardiovascular system"

Marker validation with GO

Few cases of GO validation for NS-Forest markers:

MATCH p=(g:Gene)<-[:has_part]-(ms:Class)<-[:has_characterizing_marker_set|has_marker_set]
-(:Cell_cluster)-[]->(c:Cell)-[rg]-(go:Class)<-[]-(:Protein)<-[]-(g) 
WHERE 'Biological_process' in labels(go) OR 'Cellular_component' in labels(go)
return c.label as cell_type, g.label as gene, rg.label as cell_go_rel, go.label, ms.label,
       CASE
         WHEN id(c) = id(startNode(rg)) THEN 'cell -> go'
         ELSE 'go -> cell'
       END AS direction
;

Much larger number of validations from other marker sources:

MATCH p=(g:Gene)-[]-(c:Cell)-[rg]-(go:Class)<-[]-(:Protein)<-[]-(g) 
WHERE 'Biological_process' in labels(go) OR 'Cellular_component' in labels(go)
return c.label as cell_type, g.label as gene, rg.label as cell_go_rel, go.label,
       CASE
         WHEN id(c) = id(startNode(rg)) THEN 'cell -> go'
         ELSE 'go -> cell'
       END AS direction
ORDER BY cell_type, gene
;
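
Sketch of a per-cell-type summary of the same pattern, useful for seeing which cell types carry most of the GO-supported markers. It uses only the labels and properties from the query above; the aliases validated_genes and go_terms are illustrative, and the gene-to-cell link is written undirected on the assumption that marker edges may point either way.

MATCH (g:Gene)-[]-(c:Cell)-[rg]-(go:Class)<-[]-(:Protein)<-[]-(g)
WHERE 'Biological_process' IN labels(go) OR 'Cellular_component' IN labels(go)
RETURN c.label AS cell_type, count(DISTINCT g) AS validated_genes, count(DISTINCT go) AS go_terms
ORDER BY validated_genes DESC
;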

General marker queries

NS-Forest to cell types:

MATCH p=(g:Gene)<-[:has_part]-(:Class)<-[hm:has_characterizing_marker_set|has_marker_set]-(c:Cell)
return count (distinct c) as cell_types, count (distinct g) as genes, count (distinct hm) as marker_sets
;

NS-Forest to clusters:

MATCH p=(g:Gene)<-[:has_part]-(:Class)<-[hm:has_characterizing_marker_set|has_marker_set]-(c:Cell_cluster)
return count (distinct c) as clusters, count (distinct g) as genes, count (distinct hm) as marker_sets
;

Single markers by source:

MATCH p=(g:Gene)-[hm:has_marker]-(c:Cell)
return count (distinct c) as cell_types, count (distinct g) as genes, 
count (distinct hm) as marker_assertions
;