This repository has been archived by the owner on Oct 29, 2019. It is now read-only.

Interfaces to write warc.gz / CDX files #13

Open
riking opened this issue Feb 26, 2018 · 3 comments

@riking
Contributor

riking commented Feb 26, 2018

The package should provide facilities to write warc.gz and CDX file pairs, and to append to existing WARC/CDX pairs (see wpull --warc-append). It should also support uncompressed WARC files, with the CDX recording uncompressed sizes and offsets.

This issue is to discuss interface requirements.

Identified requirements:

  • Either the CDX headers need to be provided by the caller, or an existing file has to be opened as a ReadWriteSeeker (ugly, prefer option 1)
  • Output needs to be a WriteSeeker to grab file offsets (S/V fields in CDX, named CDXCompressedSize / CDXCompressedOffset in PR #12, "Add helper methods to create request/response records")

WriteRecord() would go something like: write record to *writer, Flush the *writer, grab the file offsets and save into CDX

@riking
Contributor Author

riking commented Feb 27, 2018

I think that CDX functionality doesn't really fit well in this package, so I designed a different interface. How does this look?

type flusher interface {
	Flush() error
}

// Writer provides functionality for writing WARC files in compressed and
// uncompressed formats.
//
// To construct a Writer, call NewWriterCompressed or NewWriterRaw.
type Writer struct {
	seekW io.WriteSeeker
	w     io.Writer

	// RecordCallback will be called after each record is written to the file.
	// If a WriteSeeker was not provided, the provided positions will be
	// invalid.
	RecordCallback func(r *Record, startPos, endPos int64)
}

// NewWriterCompressed initializes a WARC Writer writing to a compressed
// stream.  The first parameter should be the "backing stream" of the
// compression.  The second parameter must implement interface{Flush() error},
// which should establish a "checkpoint" in the compressed stream - a place
// where decompression can be resumed partway through, so individual records
// can be retrieved from the compressed file.
//
// Seek will only be called with whence == io.SeekCurrent and offset == 0.
//
// See also CountWriter() if you need a "fake" Seek implementation.
func NewWriterCompressed(rawFile io.WriteSeeker, cmprsWriter io.Writer) (*Writer, error) {}

// NewWriterRaw initializes a WARC Writer writing to an uncompressed stream.
// If the provided Writer implements io.Seeker, the RecordCallback will be
// available.  If the provided Writer implements interface{Flush() error}, it
// will be flushed after every written Record.
func NewWriterRaw(w io.Writer) (*Writer, error) {}

And a CountWriter utility for e.g. writing to a net.Conn:

type countWriter struct {
	count int64
	w     io.Writer
}

// CountWriter implements a limited version of io.Seeker around the provided
// Writer.  It only supports offset == 0 and whence == io.SeekCurrent or
// io.SeekEnd, and returns the current number of written bytes in both cases.
func CountWriter(w io.Writer) io.WriteSeeker {
	return &countWriter{count: 0, w: w}
}

// implements io.Writer
func (c *countWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	if n > 0 {
		c.count += int64(n)
	}
	return n, err
}

var errCountWriterNotImplemented = stdErrors.New("unsupported seek operation")

// implements io.Seeker
func (c *countWriter) Seek(offset int64, whence int) (int64, error) {
	if offset != 0 || !(whence == io.SeekCurrent || whence == io.SeekEnd) {
		return 0, errCountWriterNotImplemented
	}
	return c.count, nil
}

@riking
Contributor Author

riking commented Feb 28, 2018

update: reading more of the gzip stuff, I think Flush is not sufficient - it needs a Close / Reset.

@b5
Member

b5 commented Mar 1, 2018

Thx for the update @riking, I'm hoping to take some time this weekend to sit down with your proposed interface change & understand your use case. Hopefully I'll be able to add constructive input, as this sounds like another exciting update!

riking added a commit to riking/warc that referenced this issue Mar 3, 2018