Long User-Agent result in empty Access table rows #427
This was previously discussed in #200. As a result, EPrints 3.4.4+ now uses LONGTEXT instead of VARCHAR(255). Obviously this does not help EPrints repositories created before 3.4.4. Also, it probably takes quite some time to update the access table to use the different type, and this would need to be done manually. Whether this is the best solution, given the space and indexing issues it may create, is debatable.

We have for some time tried to move away from relying on the access table being a record of all usage from the beginning of the repository's life, and towards storing this in files and only using the access table as a most-recent cache. Even so, creating hash file(s) seems an overly complex solution to the problem. Either telling people to update their database table after they have upgraded to 3.4.4+, or truncating the values if the column is still a VARCHAR, feels like a more straightforward approach that does not add a long-term overhead of managing these hashes. I am not sure that using a database table rather than file(s) to store these hashes improves the overhead that much. I appreciate that this is not that dissimilar to using access files rather than a full access history in the database. However, I think the benefits outweigh the drawbacks, and I don't think most would mind if long User-Agent strings get truncated, as long as there is a way to avoid that being needed by changing the column type.
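For repositories created before 3.4.4, the manual table update mentioned above might look something like the following (a sketch assuming MySQL/MariaDB and the default `access` table name; verify the column name against your own schema first, and expect the operation to take a while on a large table):

```sql
-- Widen the User-Agent column from VARCHAR(255) to LONGTEXT,
-- matching what EPrints 3.4.4+ creates for new repositories.
ALTER TABLE `access` MODIFY `requester_user_agent` LONGTEXT;
```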
Thanks for the info - I looked in the LogHandler code for any recent changes, but the change to column type isn't in there, so I didn't spot it. I was trying to find a method in

This gets what's needed:

```perl
my $db = $session->get_db;
my $ds = $session->dataset( "access" );
my $field = $ds->get_field( "requester_user_agent" );
my $sth = $db->{dbh}->column_info(
	undef, # catalogue
	undef, # schema
	$ds->get_sql_table_name,
	$field->get_name
);
my $r_u_a_size = $sth->fetchall_arrayref( { COLUMN_SIZE => 1 } );
if( defined $r_u_a_size )
{
	my $s = @$r_u_a_size[0]->{COLUMN_SIZE};
	# stash it somewhere to allow LogHandler to access it
	$session->{access_max_requester_user_agent_length} = $s if defined $s;
}
```

I was thinking of wrapping the above in an

The above would need to run after database connect, but ideally only once, e.g. on Apache load, rather than per request. Is there an adopted naming convention for this type of thing?

Alternatively, the return value here:

could be checked - and the value truncated if the error was 'data too long for column requester_user_agent' - but that feels dirty.
A much simpler alternative to the above 'calculated' reckoning would be to support a hard-coded config option, e.g.
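Such an option might look something like this in an archive's cfg.d (the option name here is purely hypothetical, reusing the key stashed on `$session` in the snippet above; it is not an existing EPrints setting):

```perl
# Hypothetical cfg.d option: cap on requester_user_agent length.
# 255 matches the VARCHAR(255) column in pre-3.4.4 repositories;
# leave unset to disable truncation.
$c->{access_max_requester_user_agent_length} = 255;
```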
@jesusbagpuss I am happy to implement the more complex solution, as long as it is self-contained, which this looks to be.
The `access.requester_user_agent` column is a VARCHAR(255). If someone/something with a User-Agent longer than that visits an eprint/document, it results in an empty row (only the accessid is populated) in the access table.
The metrics about what was accessed, and by whom are lost, and IRStats complains quite a bit when processing the empty row.
One possible solution would be to md5sum long User-Agents to allow clean insertion into the database. The full value of the User-Agent could be stored in either:

- `var/long-user-agents/[md5hash]` files (one file per hash)
- `var/long-user-agents.txt`, in a `hash [tab] value` format

The first option may end up with a lot of individual files. The second may end up with a large file to scan (or hold in memory).
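The proposed scheme can be sketched as follows (in Python purely for illustration - EPrints itself is Perl - with the `var/long-user-agents.txt` path and tab-separated layout taken from the second option above; the function name is hypothetical):

```python
import hashlib

MAX_UA_LENGTH = 255  # matches the VARCHAR(255) access table column


def store_user_agent(ua, log_path="var/long-user-agents.txt"):
    """Return the value to insert into the access table.

    Short User-Agents are stored verbatim; long ones are replaced by
    their md5 hex digest (32 characters, so it always fits), with the
    full value appended to a tab-separated lookup file.
    """
    if len(ua) <= MAX_UA_LENGTH:
        return ua
    digest = hashlib.md5(ua.encode("utf-8")).hexdigest()
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(digest + "\t" + ua + "\n")
    return digest
```

Looking a hash back up is then a linear scan of the tab-separated file, which is exactly the "large file to scan (or hold in memory)" trade-off noted above.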
From our Apache logs, last week we had ~100 unique 'long' UAs out of 25k unique UAs.