This repository contains code for running label propagation algorithms on author graph data.
See requirements/requirements.txt for the required packages.
You need to convert the anonymized graph data from Elsevier into DataFrames that contain the mappings auid -> eids and eid -> auids. Put these DataFrames into the data directory.
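As a rough illustration of the two mappings, here is a minimal sketch using pandas. It assumes the raw data can be loaded as one row per (author, paper) pair; the column names auid and eid, the toy values, and the list-valued output columns are assumptions for illustration, not the format the repository requires. Saving the results into the data directory is left out because the expected file names are not stated here.

```python
import pandas as pd

# Hypothetical authorship records: one row per (author, paper) pair.
# The column names "auid" and "eid" are assumptions for illustration.
pairs = pd.DataFrame(
    {
        "auid": [1, 1, 2, 3, 3],
        "eid": [10, 11, 10, 11, 12],
    }
)

# auid -> eids: every paper written by each author.
auid_to_eids = pairs.groupby("auid")["eid"].apply(list).reset_index(name="eids")

# eid -> auids: every author of each paper.
eid_to_auids = pairs.groupby("eid")["auid"].apply(list).reset_index(name="auids")
```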
Then, to run the algorithm with the sciserver runtime:
python -m src.run_algo --runtime sciserver
For other options, see the help message.
python -m src.run_algo --help
To run the algorithm with the elsevier runtime:
python -m src.run_algo --runtime elsevier
The algorithm works generally in three phases:
- Get the prior data for that year.
- Run the label propagation algorithm.
- Update the posterior data for that year.
A backend then needs to implement the first and third phases.
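The three phases can be sketched as a small driver loop. This is only an illustration: run_year and propagate are hypothetical names, the propagation step is a generic neighbour-averaging scheme rather than the algorithm this repository implements, and the prior is assumed to be a 1-D array with one score per auid.

```python
import logging

import numpy as np


def propagate(adjacency, prior, alpha=0.5, n_iter=20):
    """Generic label propagation stand-in (not the repository's algorithm)."""
    # Node degrees, used to average over neighbours; isolated nodes get 1.
    degree = np.asarray(adjacency.sum(axis=1)).ravel()
    degree[degree == 0] = 1.0
    y = prior.astype(float).copy()
    for _ in range(n_iter):
        neighbour_mean = np.asarray(adjacency @ y).ravel() / degree
        # Blend the fixed prior with the neighbours' current estimate.
        y = alpha * prior + (1 - alpha) * neighbour_mean
    return y


def run_year(year, get_data, update_posterior, logger):
    # Phase 1: the backend yields one (adjacency, auids, prior) tuple per piece.
    for adjacency, auids, prior in get_data(year, logger):
        # Phase 2: propagate labels within this piece of the graph.
        posterior = propagate(adjacency, prior)
        # Phase 3: the backend stores the posterior for these auids.
        update_posterior(auids, posterior, year, logger)
```

The backend functions are passed in as arguments here only to keep the sketch self-contained; how the repository selects a backend for a given --runtime is not shown.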
You need to implement the following functions. The first is:
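import logging
from typing import Iterable, Tuple, Union
import numpy as np
import scipy.sparse as sp
MaybeSparseMatrix = Union[np.ndarray, sp.spmatrix]
def get_data(
    year: int,
    logger: logging.Logger,
) -> Iterable[Tuple[MaybeSparseMatrix, np.ndarray, np.ndarray]]:
    ...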
This function accepts a year and a logger and returns an iterable of tuples, each containing the following:
- The adjacency matrix
- The auids
- The prior for the auids
The tuples are wrapped in an iterable because the graph may be disconnected and you may want to parse it in pieces. However, you could also parse the entire graph at once, in which case the iterable would contain only one element.
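A minimal sketch of a get_data that yields one tuple per connected component might look like the following. The in-memory graph, auids, and prior are made-up placeholders standing in for a real data source, and splitting with scipy.sparse.csgraph.connected_components is one possible way to parse the graph in pieces, not necessarily what the existing backends do.

```python
import logging
from typing import Iterable, Tuple, Union

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

MaybeSparseMatrix = Union[np.ndarray, sp.spmatrix]

# Hypothetical in-memory graph with two disconnected pairs of authors.
_ADJACENCY = sp.csr_matrix(
    np.array(
        [
            [0, 1, 0, 0],
            [1, 0, 0, 0],
            [0, 0, 0, 1],
            [0, 0, 1, 0],
        ]
    )
)
_AUIDS = np.array([101, 102, 103, 104])
_PRIOR = np.array([1.0, 0.0, 0.0, 1.0])


def get_data(
    year: int,
    logger: logging.Logger,
) -> Iterable[Tuple[MaybeSparseMatrix, np.ndarray, np.ndarray]]:
    # A real backend would load the graph for `year`; this sketch ignores it.
    n_components, labels = connected_components(_ADJACENCY, directed=False)
    logger.info("Year %d: %d connected components", year, n_components)
    for component in range(n_components):
        idx = np.flatnonzero(labels == component)
        # Slice rows and columns so each piece is a self-contained subgraph.
        yield _ADJACENCY[idx][:, idx], _AUIDS[idx], _PRIOR[idx]
```

Returning a single-element list containing the whole graph would satisfy the same signature.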
The second function you need to implement is
def update_posterior(
    auids: np.ndarray,
    posterior_y_value: np.ndarray,
    year: int,
    logger: logging.Logger,
) -> None:
    ...
This function accepts the auids, the posterior_y_value, the year, and a logger, and updates the posterior values for that year. Note that if you parse the graph in pieces (disconnected sets), this function is called once per piece and therefore updates the same file multiple times.
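Because the same file may be updated once per piece, an implementation has to merge into what is already stored rather than overwrite it. The sketch below shows one way to do that with a JSON file per year; the posterior_sketch directory, the file name, and the JSON format are all hypothetical choices for illustration, and posterior_y_value is assumed to be a 1-D array aligned with auids.

```python
import json
import logging
from pathlib import Path

import numpy as np

# Hypothetical output location; the repository's actual file layout may differ.
OUTPUT_DIR = Path("posterior_sketch")


def update_posterior(
    auids: np.ndarray,
    posterior_y_value: np.ndarray,
    year: int,
    logger: logging.Logger,
) -> None:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    path = OUTPUT_DIR / f"posterior_{year}.json"
    # Load what earlier calls wrote so this call merges rather than overwrites.
    existing = json.loads(path.read_text()) if path.exists() else {}
    for auid, value in zip(auids.tolist(), posterior_y_value.tolist()):
        existing[str(auid)] = value
    path.write_text(json.dumps(existing))
    logger.info("Year %d: stored %d posterior values", year, len(existing))
```

Calling it twice for the same year with different auids leaves both sets of values in the file.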
- Finish tests for sciserver.py
- Add SocNL