Skip to content

Latest commit

 

History

History

repair

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Atomspace deduplication repair scripts
--------------------------------------

Due to bugs, the SQL backend can end up with multiple copies of
atoms. This script will find them, merge them, and sum the counts
on the associated count truth values.  Its up to you to recompute
anything else.

The scripts are semi-manual. A lot of manual fiddling is required.

First check for multiple ANY atoms:
  select * from typecodes where typename='LinkGrammarRelationshipNode';

reports that type=89 is LinkGrammarRelationshipNode)
Then see how many:
   select * from atoms where type=89 and name='ANY';

If more than one, start with ./any-merge.scm

Next, edit common.scm and set do-update to #f --
  -- this will avoid clobbering your data.
  -- then run ./dedupe-pairs.scm and see what happens..
  -- dedupe-pairs looks for duplicate ListLinks.

Next, edit common.scm and set do-update to #t --
  -- this is the actual run.

Next run ./eval-dedupe.scm

Then review contents of and run word-merge.scm
then delete the duplicate WordNodes by hand.

Then run dedupe-pairs.scm a second time ...

Finally, fix the indexes, manually:
\di
DROP INDEX linkidx;
CREATE UNIQUE INDEX linkidx ON Atoms(type, outgoing);
DROP INDEX nodeidx;
CREATE UNIQUE INDEX nodeidx ON Atoms(type, name);

ALTER TABLE atoms ADD UNIQUE (type, outgoing);
ALTER TABLE atoms ADD UNIQUE (type, name);


and then remove duplicates again with ./eval-dedup.scm

After scrubbing, we may end up with Links that specify children that
are not in the atomspace. Check for these with ./check-oset.scm

Notes
-----
ssh -L 5555:localhost:5432  example.org

Working on Rohit's dataset.
psql simple_pairs
SELECT Count(*) from atoms; gives 9 230 537 -- that is 9 million or so
Of these, there are --
   3 636 569 ListLinks          type=8
   5 453 952 EvaluationLinks    type=27
     139 884 WordNodes          type=45
select count(*) from atoms where type=8;

----
After first dupe elimination:
   3 636 568 ListLinks        (one less)
   3 601 161 EvaluationLinks  (1852791 less) (about right ...)
---

After word-dedupe:
   3 389 455 ListLinks
   3 354 048 EvalLinks
     139 839 WordNodes
----

After list dedupe:
   3 354 804 ListLinks
   3 328 118 EvalLinks


Find the ANY node:
61 | LinkGrammarRelationshipNode
select * from atoms where type=61;

arghhh. There are 6 of these...
(map count-evlinks (list 57 139 140 186 190 270))

These are being handled at the rate of 100 per second.
so 5.5M 55K seconds == 15 hours!!! ouch.

Relabel any uuid 139 to 57
Changed uuid count was 833348
Changed 140 to 57
Changed uuid count was 623716
Changed 186 to 57
Changed uuid count was 1769698
Changed 190 to 57
Changed uuid count was 492974
Changed 270 to 57
Changed uuid count was 573087

real	349m46.696s
user	46m10.172s
sys	22m34.208s

argh do it again:
Number of bad evals: 588291
Relabel ANY uuid 139 to 57
Relabeled uuid count was 101662
Relabel ANY uuid 140 to 57
Relabeled uuid count was 311786
Relabel ANY uuid 186 to 57
Relabeled uuid count was 39714
elabel ANY uuid 190 to 57
Relabeled uuid count was 62779
Relabel ANY uuid 270 to 57
Relabeled uuid count was 72350






select uuid,stv_count from atoms where type=61;
 uuid | stv_count
------+-----------
  186 | 221385260
   57 |  64865342
  139 |  64780936
  270 |  38383732
  140 |  47758052
  190 |  20589354
(6 rows)

select sum(stv_count) from atoms where type=61;
    sum
-----------
 457762676

UPDATE atoms SET stv_count=457762676 WHERE uuid=57;
DELETE FROM atoms WHERE uuid=186;
DELETE FROM atoms WHERE uuid=139;
DELETE FROM atoms WHERE uuid=270;
DELETE FROM atoms WHERE uuid=140;
DELETE FROM atoms WHERE uuid=190;


./dedupe says only one!! duplicate EvalLink!?
select * from atoms where outgoing='{125,131}';
15138
15945
DELETE FROM atoms WHERE uuid=15945;

select uuid,type,stv_count,outgoing from atoms where outgoing='{57,15138}';
So -- only one ListLink, but lots of Duplicated Evals for that
ListLink...
So, run ./eval-dedup.scm
The duplicate eval list size: 1817384

argh.
 about 415805 in re1, re2 and 1094637 in re3

----------
Now run ./word-merge.scm

-- sum the word counts too.
Done.

Finally, run dedupe-pairs.scm