Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove IN BOOLEAN MODE #14

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Remove IN BOOLEAN MODE #14

wants to merge 1 commit into from

Conversation

WebDevEtc
Copy link

@WebDevEtc WebDevEtc commented Oct 22, 2018

Removed IN BOOLEAN MODE, as it does not seem to add any features.

I like this package, have used it on several projects and also on https://github.com/WebDevEtc/BlogEtc

However, I've had a problem with some searches not showing any results. I had a play around, and it works if I remove the boolean mode parts of the query.

for example, if I change the (generated) SQL query from:

select * from `laravel_fulltext` where
MATCH (indexed_title, indexed_content) AGAINST ('+the +between  +difference ' IN BOOLEAN MODE)
order by (1.5 * (MATCH (indexed_title) AGAINST ('What is the difference between')) +
          1 * (MATCH (indexed_title, indexed_content) AGAINST ('What is the difference between')))  
DESC  limit 100

to

select * from `laravel_fulltext` where
MATCH (indexed_title, indexed_content) AGAINST ('the between  difference ')
order by (1.5 * (MATCH (indexed_title) AGAINST ('What is the difference between')) +
          1 * (MATCH (indexed_title, indexed_content) AGAINST ('What is the difference between')))  
DESC  limit 100

(the first query shows zero results, the second shows results, as it should as there is content in indexed_title with the exact text "what is the difference between something and something else". I don't really why know it wasn't working).

I did some digging in the source code, and as far as I can tell, it isn't possible for the user to add any boolean search queries, as in TermBuilder::terms() it will remove any boolean characters (such as - or +).

So, I was wondering - why is the boolean mode stuff there (the + before each word, and the IN BOOLEAN MODE)?

I thought that maybe it used to be possible to do boolean searches (such as restaurant -pizza, but then with this fix: #9 that functionality got removed, but the "IN BOOLEAN MODE" parts stayed?

Anyway, I've removed the IN BOOLEAN MODE. Maybe there was some other reason that I missed, and it should stay. If that is the case, I'd be interested in the reason why.

I've attached a pull request.

Description

Removed $termsBool and IN BOOLEAN MODE.
Tested on local machine and a production machine, it works fine.

Motivation and context

See above

How has this been tested?

Tested, seems ok.

Screenshots (if appropriate)

n/a

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • I have read the CONTRIBUTING document.
  • My pull request addresses exactly one patch/feature.
  • I have created a branch for this patch/feature.
  • Each individual commit in the pull request is meaningful.
  • I have added tests to cover my changes. - none added
  • If my change requires a change to the documentation, I have updated it accordingly. - none required

Removed IN BOOLEAN MODE, as it does not seem to add any features.
@bbrala
Copy link
Member

bbrala commented Oct 23, 2018

Hi there, thanks for the contribution, nice to see you use this library.

There is quite a few reasons why things could not result in data when using boolean mode. Things like index token size can sometimes make results unpredictbable. Also InnoDB and MyISAM handle fulltext rather differently. From what i remember the boolean mode is also a performance consideration.

I will need to test this against a few of our running applications to see how this affects results. Hopefully i can find the time later this week and see if the results are still sane. Also i'll try and dig in to see if i can find out what the considerations where to use boolean mode.

@WebDevEtc
Copy link
Author

Hi,
Ok thanks for the reply. I think the problem I was initially facing was just due to my own mysql settings (I'm not too sure, I'm going to try and find time later this week to really debug it, but for now I got the search showing results correctly by removing the boolean mode bits).

Ah, if it is for performance then ignore my pull request.
Like I said, I wasn't sure if i was missing a reason why it was like this.

@oestrada1001
Copy link

oestrada1001 commented Apr 14, 2022

Hi, @WebDevEtc,

I spend the last day figuring out this issue. The reason why this is happening is that by default, MySQL/MariaDB collations treat space(" "), periods("."), and commas(",") as punctuation. Long story short, collations "weight" characters to determine how to filter or sort them. The punctuations mentioned above are considered EOL or 'stopwords.'

We need to have MySQL/MariaDB treat those punctuations as characters rather than punctuations to solve this issue.

We are presented with three solutions in the MySQL documentation. The first one requires changing the source code and recompiling, which isn't a very viable option for me. The second and third options are good and aren't too hard to follow.

  • Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes. For information about the array format, see Section 10.13.1, “Character Definition Arrays”.

First things first:

We need to know which character we're trying to fix. Take a look link below and find the HEX equivalent to the character you're trying to fix. In my case, it was 2E, the period.
https://www.eso.org/~ndelmott/ascii.html

Now, we need to find the collation files in the database server.

  1. SSH into your server.
  2. Login into your MySQL/MariaDB: mysql -u root -p
  3. Run Show VARIABLES LIKE 'character_sets_dir'

The result should return a table with a value of a directory path. I was using docker, so mine came back as usr/share/mysql/charsets.

At this point, I opened a second terminal, but this is necessary.

Back in the server, outside of the MySQL/MariaDB command line:

  1. Navigate to the directory path the previous query returned. You'll find an Index.xml as well as other XML files.
  2. Follow the first step in the MySQL Documentation

NOTE: Before continuing the second step, open latin1.xml and look closely at the <map> nested in <lower> and <upper>. Find the HEX equivalent character to the one you want to fix, in my case, 2E. We can then map the correct spot in the <map> nested inside <ctype>.

  1. Continue to the second step in the MySQL Documentation
  2. After the changes, Restart your server.

Assign the User-defined Collation to our database/table/colum.

All we need to do is assign our collation to our database, table, or column. In my case, I just needed to assign it to two columns, so I ran the following command:
ALTER TABLE laravel_fulltext MODIFY indexed_title TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY indexed_content TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;

Here are some links that might be helpful:
https://mariadb.com/kb/en/setting-character-sets-and-collations/
https://dev.mysql.com/doc/refman/8.0/en/charset-syntax.html

This should solve your problem if you don't have any existing data in the table.

If you do have existing data and you try to run the query above, you might have gotten an error similar to the one below:

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xE2\x80\x93 fr...' for column.

The issue here is due to attempting to convert a 4byte character into a 3byte character. To solve this, we need to convert our data from 4bytes to binary, then to 3bytes(latin1). For more info, check out this link.

Run the following query in the mysql/mariadb command line:
UPDATE laravel_fulltext SET indexed_content = CONVERT(CAST(CONVERT(indexed_content USING utf8) AS BINARY) USING latin1);

Then follow it with:
ALTER TABLE laravel_fulltext MODIFY indexed_title TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY indexed_content TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;

We are done. We can now search a term with our character, and our database engine will match against it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants