Results 1 to 2 of 2

Thread: Search Engine Database (PHP)

  1. #1
    Join Date
    Mar 2010
    Posts
    24

    Search Engine Database (PHP)

    I am trying to write a search program for files and I do not know which is the best data structure to store the files in. I want to type keywords and get files that have it in their name and (for some files) in the content. What is the best data structure to use for this?
    (speed wise)

    thx

    (for example, when the database gets huge, searching through it linearly will take longer, are there any better search methods than simply reading each and return matching results?)
    Last edited by Jonnyb42; 09 Apr 2010 at 01:03 AM.

  2. #2
    Join Date
    Feb 2007
    Location
    Ireland
    Posts
    1,007
    You need to store a word and map it to a list of all the files that contain that word. Naturally you should filter out very common words and order results based on the number of times the word appears within the text.

    As for storing it, putting them all into one database table would be extremely inefficient. You need to split it up. A simple example would be to put a word into a table based on it's first letter. So you would end up with 26 tables. The problem with this is that the table for "a" and the table for "z" words would be extremely unbalanced.

    A better solution would be a hash table. You'll need a good hash function though. Have a look at http://en.wikipedia.org/wiki/Hash_table for more information.

    Edit:

    Another [very simple] example would be to pick a prime number (the number would reflect the number of tables you will have in your database). You could have a counter and each time a new word is added, assign it an id. You could then divide the id by the prime number and that would tell you what table it would go into. The only problem with this is that you will need to maintain a list of id's mapped to the words. Another idea would be to convert the words to their numerical values and then sum them (and then divide by the prime number).

    Anyway, those are just some suggestions, I'm not really an expert on this, but just make sure you keep the number of records in your database tables relatively small, as the database won't be very efficient otherwise.
    “The best thing about a boolean is even if you are wrong, you are only off by a bit.”

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •