Mapping files

A mapping file is a series of specifications that declare which index
and measure dimensions a nanocube index should have, based on the
columns of the input file (e.g. .csv, .psv).

    index_dimension(NAME, INPUT, INDEX_SPEC)
        NAME        is the name of the nanocube dimension.
        INPUT       identifies the .csv column names that will be used
                    as input by the INDEX_SPEC rule to generate the info for
                    the nanocube index dimension

	            input()              # no input columns needed
		    input('lat','lon')   # two input values coming
                                         # from columns 'lat' and
                                         # 'lon'
        INDEX_SPEC  identifies the (possibly parameterized) index dimension
                    encoding (flat-tree, binary tree, quad-tree, k-ary tree)
                    and resolution (number of levels), and how to map input
                    values into 'bins' in this dimension.

		    categorical(B,L)
		    categorical(B,L,S)
                        flat-tree with B bits of resolution (e.g. 8) in L
                        levels. Expects one input column; every distinct
                        value that appears in the .csv input column is mapped
                        to a unique number with L digits in {0,1,...,2^(B)-1}.
                        Numbers are assigned automatically in order of
                        appearance in the .csv file.

			S is a string encoding an alias table. In its simplest
			form (A), we just name nodes on the deepest level of the
			hierarchy. In its more elaborate form (B) we allow
			naming intermediate nodes.
			(A)
				in_lbl_1 <nl>
				in_lbl_2 <nl>
				...
				in_lbl_n <nl>
			or
				in_lbl_1 <tab> out_lbl_1 <nl>
				in_lbl_2 <nl>
				...
				in_lbl_n <tab> out_lbl_1 <nl>
			or
				in_lbl_1_1 <tab> ... <tab> in_lbl_1_k <tab> out_lbl_1 <nl>
				in_lbl_2 <nl>
				...
				in_lbl_n <tab> out_lbl_n <nl>
			(B)
				@hierarchy
				alias_root
				<tab> alias_level_1
				<tab> <tab> alias_level_2
				<tab> <tab> <tab> input_text (== alias_level_3)
				<tab> <tab> <tab> input_text_1 <tab> input_text_2 <tab> alias_leaf_level
				<tab> <tab> <tab> input_text (== alias_level_3)
				<tab> alias_level_1
				<tab> <tab> alias_level_2
				<tab> <tab> <tab> input_text (== alias_level_3)
				<tab> <tab> <tab> input_text (== alias_level_3)
			or
				@hierarchy
				<tab> alias_level_1
				<tab> <tab> <tab> input_text (== alias_level_3)
				<tab> <tab> <tab> input_text_1 <tab> input_text_2 <tab> alias_leaf_level
				<tab> <tab> <tab> input_text (== alias_level_3)
				<tab> alias_level_1
				<tab> <tab> <tab> input_text (== alias_level_3)
				<tab> <tab> <tab> input_text (== alias_level_3)

		    latlon(L)
                        creates a quad-tree with L levels using the Mercator
                        projection. Expects two input columns with
                        floating-point numbers for latitude and longitude.

		    time(L,BASE,WIDTH_SECS,OFFSET_SECS)
		        creates a binary-tree with L levels. Expects one input
			column with a string that is convertible to a timestamp.
			Some accepted formats are:
		           '2000-01-01T00:00:00-06:00.125'
		           '2000-01-01T00:00:00-06:00'
		           '2000-01-01T00:00-06:00'
		           '2000-01-01T00:00'
		           '2000-01-01T00'
		           '2000-01-01'
                        It uses the timestamp BASE (also in one of the formats
                        above) as the alignment point for temporal bins. The
                        bins have a width given in seconds (WIDTH_SECS), and a
                        conversion (OFFSET_SECS) can be applied to handle cases
                        where, for instance, the data comes in local time and
                        we want to correct it to UTC.

		    unixtime(L,BASE,WIDTH_SECS,OFFSET_SECS)
                        Analogous to time, but instead of expecting a date
                        and time string, it expects a string with the number
                        of seconds since the Unix epoch (1970-01-01 UTC).

		    ip(L)
                        creates a quad-tree with L levels, mapping IPv4 addresses
                        (e.g. 123.122.122.98) to a corresponding entry using the
                        Hilbert space-filling curve convention.

    measure_dimension(NAME, INPUT, MEASURE_SPEC)
        NAME and INPUT are the same as in index_dimension.
        MEASURE_SPEC
            either a primitive scalar type (signed integer, unsigned
            integer, or floating-point number with 32 or 64 bits) or a
            predefined function that converts the input columns into a
            measure value

	    u32, u64
	    f32, f64
	    row_bitset()

    file(F)
        reads the content of file F as a string. Can be used together
        with categorical index dimensions to point to alias mapping
        descriptions (see categorical).
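
The categorical rule described above assigns numbers to distinct values in order of appearance and expresses each number as L digits in {0,...,2^(B)-1}. The Python sketch below illustrates that idea only; it is not nanocube's implementation. The function names are hypothetical, and it assumes that numbering starts at 0 and that digits are listed most significant first.

```python
def assign_categorical_ids(values, bits, levels=1):
    """Map each distinct value to an integer, in order of first appearance.

    Assumption: ids start at 0 and at most (2**bits)**levels distinct
    values fit in the dimension.
    """
    capacity = (2 ** bits) ** levels
    ids = {}
    for value in values:
        if value not in ids:
            if len(ids) >= capacity:
                raise ValueError("more distinct values than the dimension can hold")
            ids[value] = len(ids)
    return ids


def to_path(number, bits, levels):
    """Split an id into `levels` digits, each in {0, ..., 2**bits - 1}.

    Assumption: digits are returned most significant first.
    """
    base = 2 ** bits
    digits = []
    for _ in range(levels):
        digits.append(number % base)
        number //= base
    return digits[::-1]


if __name__ == "__main__":
    column = ["cash", "card", "cash", "no charge"]
    ids = assign_categorical_ids(column, bits=8)
    print(ids)
    print(to_path(ids["no charge"], bits=8, levels=1))
```

With categorical(8,1) this layout leaves room for 256 distinct values; a second level would raise that to 65536 under the same assumptions.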

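The time rule aligns bins at BASE and gives them a width of WIDTH_SECS. A minimal Python sketch of that arithmetic follows. It assumes the offset is added to the input timestamp before binning and that bin indices are obtained by flooring; the text above does not spell out either detail, so treat both as assumptions. The function name `time_bin` is hypothetical.

```python
from datetime import datetime


def time_bin(timestamp, base, width_secs, offset_secs=0):
    """Return the index of the temporal bin that `timestamp` falls into.

    Both arguments are ISO 8601 strings carrying a UTC offset.
    Assumption: the bin index is floor((timestamp + offset - base) / width).
    """
    t = datetime.fromisoformat(timestamp)
    b = datetime.fromisoformat(base)
    elapsed = (t - b).total_seconds() + offset_secs
    return int(elapsed // width_secs)


if __name__ == "__main__":
    base = "2009-01-01T00:00:00-05:00"
    # Three and a half hours after BASE, with hourly bins.
    print(time_bin("2009-01-01T03:30:00-05:00", base, 3600))
    # The same instant expressed in UTC lands in the same bin.
    print(time_bin("2009-01-01T08:30:00+00:00", base, 3600))
```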
Here is an example of a MAPPING file (it accepts line comments using #) based
on New York City taxi datasets (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).

    # I1. quadtree index dimension for the pickup location
    index_dimension('pickup_location',input('pickup_latitude','pickup_longitude'),latlon(25));
    # I2. quadtree index dimension for the dropoff location
    index_dimension('dropoff_location',input('dropoff_latitude','dropoff_longitude'),latlon(25));
    # I3. binary tree for the time dimension: hourly bins
    index_dimension('pickup_time', input('tpep_pickup_datetime'), time(17,'2009-01-01T00:00:00-05:00',3600,5*60));
    # I4. weekday of pickup
    index_dimension('weekday', input('tpep_pickup_datetime'), weekday());
    # I5. hour of pickup
    index_dimension('hour', input('tpep_pickup_datetime'), hour());

    # M1. measure dimension for counting (u32 means a non-negative integer
    #     with max value 2^32 - 1). If no input .csv column is used in a
    #     measure dimension, it functions as a count (i.e. one per input record)
    measure_dimension('count',input(),u32);
    # M2. duration
    measure_dimension('duration',input('tpep_pickup_datetime','tpep_dropoff_datetime') ,duration(f32,60));
    # M3. duration squared (for stddev computations)
    measure_dimension('duration2',input('tpep_pickup_datetime','tpep_dropoff_datetime') ,duration2(f32,60));
    # M4. Map fare_amount into a 32-bit floating point value
    measure_dimension('fare',input('fare_amount'),f32);
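
To get a feel for the resolutions chosen in the example, the following Python sketch works out what time(17,...,3600,...) and latlon(25) imply. It assumes a binary tree with L levels addresses 2^L bins and a quad-tree with L levels splits the projected world into 2^L by 2^L cells. The helper names and the equatorial circumference constant are not part of the mapping format.

```python
from datetime import datetime, timedelta

EQUATOR_METERS = 40_075_017  # approximate equatorial circumference of the Earth


def time_coverage(levels, base, width_secs):
    """Return (number of bins, end of the covered range) for a time dimension.

    Assumption: a binary tree with `levels` levels addresses 2**levels bins.
    """
    bins = 2 ** levels
    start = datetime.fromisoformat(base)
    return bins, start + timedelta(seconds=bins * width_secs)


def latlon_cell_meters(levels):
    """Approximate east-west cell size at the equator for a quad-tree.

    Assumption: the world is split into 2**levels cells per side.
    """
    return EQUATOR_METERS / (2 ** levels)


if __name__ == "__main__":
    bins, end = time_coverage(17, "2009-01-01T00:00:00-05:00", 3600)
    print(bins, "hourly bins, covering up to", end.isoformat())
    print("cell size at the equator: %.2f m" % latlon_cell_meters(25))
```

Under these assumptions, 17 levels of hourly bins span roughly 15 years from the base timestamp, and 25 quad-tree levels give cells a little over one meter wide at the equator.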