Distinguishing between XML files and HTML files #37

Open
jggautier opened this issue Jan 21, 2020 · 2 comments

Comments

@jggautier

Dataverse categorizes some uploaded XML files as HTML, such as the two XML files in this dataset: https://doi.org/10.7910/DVN/ERQWPH. In these cases, the preview of the XML file is poor: the tags and structure are stripped out.

Other times it categorizes uploaded XML files as XML, like the XML files in this dataset: https://doi.org/10.7910/DVN/BF2VNK. In these cases, there is no option to preview the file, which I would expect since it isn't listed in this repo's readme as a filetype that can be displayed.

Is it right that the dataverse-previewers determine which files to preview and how to preview them based on the file type or mimetype that Dataverse assigns the file when the file is uploaded? If so, could you imagine any reasons why one set of XML files was categorized as XML and the other as HTML? I looked at the content of XML files from both datasets, but nothing seemed obvious. Thanks :)

@jggautier changed the title from "Distinguishing between XML file and HTML files" to "Distinguishing between XML files and HTML files" on Jan 21, 2020
@qqmyers
Member

qqmyers commented Jan 21, 2020

Yep - exactly. The previewers are actually ~generic - they'll try to do their thing on any data you give them, and the result may or may not be useful depending on the data. However, the triggering is based solely on the manifests sent to Dataverse to register the previewer tool. (And since each manifest can only specify one mimetype, tools like the image or audio previewers get registered multiple times for different mimetypes.)

The challenge with mime-types in general is that the source of the determination can come from several places. The browser may send that info when uploading, Dataverse can check based on the file extension, and, for some types, it will even look inside the file to determine the type.
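To make the three sources concrete, here is a minimal sketch in Python of how a resolver might combine them. It is not Dataverse's actual logic: the function names, the priority order (content first, then extension, then the browser's claim), the small extension table, and the content check are all assumptions chosen for illustration.

```python
from typing import Optional

# Hypothetical extension table; a real system would use a much larger one.
EXTENSION_TYPES = {
    ".xml": "text/xml",
    ".html": "text/html",
    ".csv": "text/csv",
}


def sniff_content(head: bytes) -> Optional[str]:
    """Guess a type from the first bytes of a file (illustrative only)."""
    text = head.lstrip().lower()
    if text.startswith(b"<?xml"):
        return "text/xml"
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return "text/html"
    return None


def resolve_mime_type(filename: str, head: bytes,
                      browser_type: Optional[str] = None) -> str:
    """Assumed priority: file content, then extension, then the browser."""
    sniffed = sniff_content(head)
    if sniffed:
        return sniffed
    dot = filename.rfind(".")
    if dot != -1:
        by_extension = EXTENSION_TYPES.get(filename[dot:].lower())
        if by_extension:
            return by_extension
    # Fall back to whatever the browser sent, or a generic binary type.
    return browser_type or "application/octet-stream"


# A browser that mislabels an XML upload is overridden by the content check.
print(resolve_mime_type("codebook.xml", b'<?xml version="1.0"?><a/>', "text/html"))
```

With a different priority order, for example trusting the browser first, the same file would come out as text/html, which is the kind of mismatch described in this issue.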

I've seen this happen with csv, where different browsers send either text/csv or text/comma-separated-values, so the tabular previewer needs to be registered for both.
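Because each manifest names a single mimetype, covering both CSV variants means registering the same tool twice. A rough sketch of generating those manifests in Python follows. The field names are assumptions modeled on the Dataverse external tool manifest format and should be checked against the Dataverse guides; the helper `build_manifests`, the display name, and the tool URL are hypothetical placeholders.

```python
import json


def build_manifests(display_name, tool_url, mime_types):
    """Return one manifest dict per mimetype (field names are assumptions)."""
    return [
        {
            "displayName": display_name,
            "scope": "file",
            "toolUrl": tool_url,
            "contentType": mime_type,
        }
        for mime_type in mime_types
    ]


manifests = build_manifests(
    "Tabular preview (example)",
    "https://example.org/previewers/tabular.html",  # placeholder URL
    ["text/csv", "text/comma-separated-values"],
)

# Each manifest would be submitted as a separate registration.
for manifest in manifests:
    print(json.dumps(manifest))
```

The same loop would cover the image or audio previewers by passing their longer lists of mimetypes.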

I'm not sure what makes sense when the mime-type sent is wrong - whether Dataverse should just impose its own mimetype based on the extension (or override for the ones it 'knows') or if the UI should allow it to be changed.

A smarter HTML previewer might be able to show tags as an option. It's a security risk to just display them as is, but replacing the < and > with &lt; and &gt; codes would show them without making them real HTML tags. That won't solve the problem of two mimetypes unless you just want to let the HTML previewer view XML as well (once someone has added an option like the one I mention).
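The escaping idea can be sketched in a few lines of Python using the standard library's `html.escape`. This is only an illustration of the technique, not code from the previewers; the function name `show_markup_as_text` and the choice to wrap the result in a `<pre>` element are assumptions.

```python
import html


def show_markup_as_text(source: str) -> str:
    """Escape markup so a browser displays the tags instead of interpreting them."""
    # html.escape replaces &, <, > (and quotes) with character references.
    return "<pre>" + html.escape(source) + "</pre>"


print(show_markup_as_text('<note id="1">Tom & Ann</note>'))
# → <pre>&lt;note id=&quot;1&quot;&gt;Tom &amp; Ann&lt;/note&gt;</pre>
```

Since the escaped text contains no live tags, the same function would work whether the file was labeled HTML or XML.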

@jggautier
Author

Thanks for the quick and helpful reply, as always!


2 participants