Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duckdbvfs: Forward to DuckDB file system via vfs mechanism #66

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

carlopi
Copy link
Contributor

@carlopi carlopi commented Sep 19, 2023

Sqlite supports virtual file systems, this is a simple implementation, on top of demovfs, of the duckdb file system.

This allows to read sqlite database, for example over http(s):// or s3://. Fixes #39 and it's a big step towards solving also duckdb/duckdb-wasm#1213.

FROM sqlite_scan('https://raw.githubusercontent.com/duckdblabs/sqlite_scanner/main/data/db/sakila.db', 'film');

Implementation is mostly copy-pasted from demovfs (a demo virtual file system provided with sqlite) to use DuckDB FileHandles and their methods.

Implementation is in src/sqlite/duckdbvfs.cpp, places I modified are the one with mention of duckdb::FileHandle or FIXME.

Performance

One concern is that currently performance for sqlite_attach over the wire is not amazing, like 20 seconds for

CALL sqlite_attach('https://raw.githubusercontent.com/duckdblabs/sqlite_scanner/main/data/db/sakila.db');

and I believe this is due to 2 factors:

  • read maximum size is 4096
  • there are no assumptions that the file is immutable, so many reads are performed over the same segments over and over (I guess this makes sense if you attach to something that's currently modified)

Behaviour seems to be the same in native and over the wire, as in the same segments are read in the same order, so I am not sure if that's something that should be solved at at the virtual file system layer or in general.

Local performances seems unchanged, and scan_sqlite performances seems good also over the wire.

Minor modifications on top of demovfs, available as part of sqlite3 source code
Copy link
Contributor

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Some comments below. I wonder if it is possible to only use the DuckDB file system for e.g. https or s3 files (or in case of WASM) and fallback to the regular SQLite file system otherwise?

We could switch to the DuckDB FS entirely but that would at least require all the methods there to be implemented fully.

@@ -17,6 +17,14 @@

namespace duckdb {

DatabaseInstance *_db;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to cause concurrency issues if we have multiple DuckDB's in the same process, no?

The solution that I discussed with @Maxxen before for spatial is to make this a thread-local variable that is set prior to calling into any SQLite functions that might trigger the VFS. Can that be done here as well?

*/
static int duckdbTruncate(sqlite3_file *pFile, sqlite_int64 size){
#if 0
if( ftruncate(((DemoFile *)pFile)->fd, size) ) return SQLITE_IOERR_TRUNCATE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably shouldn't be a no-op for our file system no?

src/sqlite/sqlite3.c Show resolved Hide resolved
int *pResOut
){
//FIXME: This might need to be properly implemented
*pResOut = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with removing this is that we break file locking of SQLite - which means the SQLite scanner is no longer safe to use with other SQLite instances

** file has been synced to disk before returning.
*/
static int duckdbDelete(sqlite3_vfs *pVfs, const char *zPath, int dirSync){
//FIXME: This might need to be properly implemented
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably also needs to be implemented

@carlopi
Copy link
Contributor Author

carlopi commented Sep 20, 2023

Thanks for the review, it should be possible to have multiple file systems at the same time, I took a shortcut there mostly to increase testing surface, will give it another proper look at it later (and ping Max on the lock).

@carlopi
Copy link
Contributor Author

carlopi commented Oct 12, 2023

Update: I have a branch with most problem solved, last one is the concurrency issue, that's also the most serious.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable the use of SQLite data sources from remote locations
2 participants